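This post presents the latest papers fetched daily from arXiv, grouped by category: natural language processing, information retrieval, computer vision, and others.

Statistics

680 papers were updated today, including:

  • Natural language processing: 126
  • Information retrieval: 27
  • Computer vision: 119

Natural Language Processing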

1. 【2606.05165】STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Link: https://arxiv.org/abs/2606.05165

Authors: Rishit Dagli, Abir Harrasse, Luke Zhang, Florent Draye, Amirali Abdullah, Bernhard Schölkopf, Zhijing Jin

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, seeks to trace, Large Language, model predictions back, Data

Comments: project page: [this https URL](https://stride-tda.github.io/)

Abstract:Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLMs). Consequently, most approaches approximate this effect in the parameter space using gradients. However, tracking gradients across billions of parameters is not only prohibitively expensive but relies on local approximations. In this work, we propose a shift: rather than estimating parameter changes, we model the functional effect of training data in the activation space. We introduce STRIDE (Steering-based Training Data Influence Decomposition), a framework that formulates TDA as a sparse recovery problem in the spirit of compressive sensing. STRIDE learns lightweight "steering operators" that mimic the behavioral shift caused by training on data subsets. By measuring how these operators perturb test predictions, we recover individual training example influences via sparse linear decomposition. STRIDE achieves state-of-the-art for LLM pre-training attribution while being an order of magnitude ($13\times$) faster than previous art. We further validate its practical utility through downstream applications including data selection, data contamination, and qualitative analysis.

2. 【2606.05161】Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

Link: https://arxiv.org/abs/2606.05161

Authors: Yichen Gao, Yiqun Zhang, Zijing Wang, Yujia Li, Heng Guo, Xi Wu, Xiaocui Yang, Shi Feng, Yifei Zhang, Daling Wang

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: Audio-language models, conflicting text, follow text, evidence is clear, Gated Audio Counterfactual

Abstract:Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
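The sign-flip diagnostic and the score interpolation described in this abstract can be illustrated with a small sketch. This is not the paper's GACL rule: the abstract does not say how the gate is computed, so `gate` is a hypothetical fixed scalar here, and the function names, candidate labels, and scores are all made up for illustration.

```python
def sign_flip(joint, same_audio, audio_answer, text_answer):
    """Return True when the same-audio branch prefers the audio-supported
    answer while the joint (audio + conflicting text) branch prefers the
    text-supported answer."""
    audio_margin = same_audio[audio_answer] - same_audio[text_answer]
    joint_margin = joint[audio_answer] - joint[text_answer]
    return audio_margin > 0 and joint_margin < 0


def interpolate_scores(joint, same_audio, gate):
    """Blend per-candidate scores. gate=0 keeps the joint scores,
    gate=1 uses the same-audio scores. The gate value is an assumption."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return {c: (1.0 - gate) * joint[c] + gate * same_audio[c] for c in joint}


# Hypothetical candidate scores for one conflict sample.
joint = {"dog": 1.0, "cat": 2.0}        # conflicting text says "cat"
same_audio = {"dog": 3.0, "cat": 0.5}   # audio alone supports "dog"

flipped = sign_flip(joint, same_audio, audio_answer="dog", text_answer="cat")
corrected = interpolate_scores(joint, same_audio, gate=0.5)
best = max(corrected, key=corrected.get)
```

With these toy numbers the sample counts as a sign flip, and blending the two score sets moves the preferred candidate back to the audio-supported one.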

3. 【2606.05158】Streaming Communication in Multi-Agent Reasoning

Link: https://arxiv.org/abs/2606.05158

Authors: Zhen Yang, Xiaogang Xu, Wen Wang, Cong Chen, Xander Xu, Ying-Cong Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: Multi-agent reasoning systems, multi-agent reasoning system, reasoning systems adopt, paradigm that forces, Multi-agent reasoning

Comments: project page: [this https URL](https://zhenyangcs.github.io/StreamMA-website/)

Abstract:Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
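To see why streaming steps downstream shortens end-to-end latency in a chain, a toy pipeline model helps. The sketch below is not the paper's closed-form analysis. It assumes a chain topology, a uniform per-step time, and that each downstream step can start as soon as the matching upstream step is available. Function names and numbers are illustrative.

```python
def serial_latency(num_agents: int, steps_per_agent: int, step_time: float) -> float:
    """Generate-then-transfer: every agent waits for the full upstream chain,
    so latency grows linearly with pipeline depth."""
    return num_agents * steps_per_agent * step_time


def streaming_latency(num_agents: int, steps_per_agent: int, step_time: float) -> float:
    """Streaming (toy assumption): agent k starts one step after agent k-1,
    so adjacent agents overlap like stages of a classic pipeline."""
    return (steps_per_agent + num_agents - 1) * step_time


serial = serial_latency(num_agents=3, steps_per_agent=4, step_time=1.0)
stream = streaming_latency(num_agents=3, steps_per_agent=4, step_time=1.0)
speedup = serial / stream
```

Under these assumptions the speedup is `n * s / (n + s - 1)` for `n` agents and `s` steps per agent, which approaches `n` as `s` grows.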

4. 【2606.05152】Reinforcement Learning from Rich Feedback with Distributional DAgger

Link: https://arxiv.org/abs/2606.05152

Authors: Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: recipe remains surprisingly, remains surprisingly narrow, single bit indicating, dominant reinforcement learning, advanced rapidly

Abstract:Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
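The two objectives contrasted in this abstract differ in which distribution takes the expectation. The sketch below computes both for a single decision with a small discrete action set. It only illustrates the textbook definitions, not the paper's sequence-level objective, and it assumes all probabilities are strictly positive. The example distributions are made up.

```python
import math


def forward_cross_entropy(expert, student):
    """-sum_a expert(a) * log student(a): expectation under the expert."""
    return -sum(p * math.log(q) for p, q in zip(expert, student))


def reverse_kl(expert, student):
    """sum_a student(a) * log(student(a) / expert(a)): expectation under the student."""
    return sum(q * math.log(q / p) for p, q in zip(expert, student))


# Hypothetical next-action distributions over three actions.
expert = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]

fce = forward_cross_entropy(expert, student)
rkl = reverse_kl(expert, student)
expert_entropy = forward_cross_entropy(expert, expert)
```

Forward cross-entropy is minimized, at the expert's entropy, when the student matches the expert. Reverse KL is zero in the same case but weights disagreement by the student's own probabilities.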

5. 【2606.05145】Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)

Link: https://arxiv.org/abs/2606.05145

Authors: Nizar Islah, Istabrak Abbes, Irina Rish, Sarath Chandar, Eilif B. Muller

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: post-trained language models, language models fail, failed traces play, reasoning problems, additional attempts

Abstract:When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget. We propose that failed traces encode recoverability structure: the inference-time signature of which test-time interventions can rescue a given failure. Three problem-level trajectory features, derived from the structure of available interventions, recover this structure from the distributional signature of failed rollouts, not their text. They cluster failures into stable regimes, characterize the failure topography of different post-training methods ($84.3{\pm}4.3\%$ accuracy, $+20\%$ over a majority-class baseline), and support a training-free routing rule that lifts rescue by $+12.2\%$ on the deployment-relevant Steerable-Hard subset (failures where retry is insufficient and a bounded intervention is reachable). The features and the routing rule transfer across two cross-family probes. The same three features thus convert failed traces from discarded data into a diagnostic object, supporting test-time routing and post-training analysis without training-time or weight-space access.

6. 【2606.05134】Activation-Based Active Learning for In-Context Learning: Challenges and Insights

Link: https://arxiv.org/abs/2606.05134

Authors: Yaseen M. Osman, Geoff V. Merrett, Stuart E. Middleton

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: LLM in-context sample, utilise recent advances, explored for LLM, Deep active learning, LLM in-context

Comments: 9 pages, 3 figures

Abstract:Deep active learning has previously been explored for LLM in-context sample selection, but not with methods that utilise recent advances in understanding of transformer activations. In this paper, we test the hypothesis that model activations could provide a fine-grained signal to optimise the selection of in-context examples. We present the most comprehensive analysis to date of MLP activation-based deep active learning methods applied to in-context learning, including how different attention masking strategies impact active learning across diverse classification and generative datasets, using both Llama-3.2-3B and Qwen2.5-3B base models. However, we find a negative result: MLP outputs, viewed through the lenses of massive activations or the first four moments, do not correlate with example quality or task performance. Specifically, the absolute Spearman correlation coefficient is at most 0.33 for all tasks and models we tested, showing that such activation-based sampling should not be used for in-context learning. We hypothesise that this may be due to superposition, whereby models represent more features than they have dimensionality, suggesting that methods like Sparse Autoencoders (SAEs) may be a promising future direction.
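The negative result above is stated in terms of the Spearman rank correlation. For readers who want to reproduce that kind of check, here is a minimal standard-library sketch: Spearman's coefficient computed as the Pearson correlation of average ranks. The variable names and sample values are illustrative and do not come from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-example activation statistic vs. downstream score.
activation_stat = [0.1, 0.4, 0.35, 0.8, 0.2]
task_score = [0.55, 0.50, 0.61, 0.58, 0.52]
rho = spearman(activation_stat, task_score)
```

A coefficient near zero means the statistic gives little information about which examples help, which is the situation the abstract reports.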

7. 【2606.05122】Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Link: https://arxiv.org/abs/2606.05122

Authors: XiuYu Zhang, Yi Shan, Junfeng Fang, Zhenkai Liang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, raising a natural, natural question, increasingly evaluated

Abstract:Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.

8. 【2606.05121】Audio Interaction Model

Link: https://arxiv.org/abs/2606.05121

Authors: Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu, Yue Liao, Ziyang Ma, Dongchao Yang, Mingbao Lin, Deheng Ye, Shuicheng Yan, Chunyan Miao

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

Keywords: Large Audio Language, today Large Audio, Audio Language Models, inherently interactive modality, today Large

Comments: Next generation of LALMs, work in progress

Abstract:Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.

9. 【2606.05115】Continual Visual and Verbal Learning Through a Child's Egocentric Input

Link: https://arxiv.org/abs/2606.05115

Authors: Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake, Mengye Ren

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: temporally structured stream, temporally structured, meanings of words, Children learn, egocentric video recordings

Comments: 15 pages, 4 figures

Abstract:Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.

10. 【2606.05112】Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

Link: https://arxiv.org/abs/2606.05112

Authors: Cheng Liang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Chaoyi Wu, Weidi Xie

Categories: Computation and Language (cs.CL)

Keywords: Large language models, dynamically delivers care, adapting longitudinal management, Large language, successive patient states

Abstract:Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubric. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.

11. 【2606.05106】Arithmetic Pedagogy for Language Models

Link: https://arxiv.org/abs/2606.05106

Authors: Andhika Bernard Lumbantobing, Hokky Situngkir

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: human mathematics pedagogy, human mathematics, mathematics pedagogy, arithmetic reasoning, GASING method

Comments: 18 pages, 6 figures

Abstract:We investigate whether methods of human mathematics pedagogy can guide the training of language models toward arithmetic reasoning. Building on the GASING method -- an Indonesian pedagogy that solves basic arithmetic through a left-to-right procedure aligned with the causal order of token generation -- we operationalize each operation as a computational procedure whose execution trace is serialized into natural-language Chain-of-Thought (CoT) supervision. A small GPT-2 decoder (86M parameters) with a syllabic-agglutinative TOBA tokenizer for Indonesian is trained from scratch on this data using only a next-token prediction objective, without reinforcement learning or reward-based optimization. Monitoring training reveals three distinct learning phases, and mechanistic analyses -- attention-masking interventions on the CoT information graph, residual-stream probing, and logit-lens inspection -- show that the model first internalizes a procedural pathway and subsequently develops an associative, "mental-arithmetic" capacity that retrieves intermediate results without explicit step-by-step computation. The trained model reaches over 80% accuracy on held-out problems and attains competitive performance against substantially larger language models, indicating that targeted, pedagogically grounded training can yield strong and economical arithmetic capability at small scale.

12. 【2606.05087】Light or Full Verb? A Minimal-Pair Dataset for Probing Phraseological Competence in Language Models

Link: https://arxiv.org/abs/2606.05087

Authors: Francesca Franzon, Nicolas Rosàs Gómez, Leo Wanner

Categories: Computation and Language (cs.CL)

Keywords: full lexical predicates, Frequent English verbs, Frequent English, make a decision, make a cake

Abstract:Frequent English verbs such as 'have' and 'make' can function either as collocates in light-verb constructions or as full lexical predicates, as in 'make a decision' vs. 'make a cake'. Whether language models represent this distinction remains unclear. We introduce a large-scale controlled dataset of minimally varying English sentence series in which the same context contains the same verb in light-verb and full-verb uses. Two probing experiments show that language models differentiate between these uses even in minimal contexts and exhibit separable patterns across object types. We release the dataset, generation code, and materials as a reusable resource. The framework supports extensions to broader contexts, additional verbs, and other languages.

13. 【2606.05085】Automatic Generation of Titles for Research Papers Using Language Models

Link: https://arxiv.org/abs/2606.05085

Authors: Tohida Rehman, Debarshi Kumar Sanyal, Samiran Chattopadhyay

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: research paper conveys, concise manner, conveys its primary, primary idea, clear and concise

Comments: 24 pages, 24 tables, 01 figure

Abstract:The title of a research paper conveys its primary idea and, occasionally, its conclusions in a clear and concise manner. Choosing an appropriate title is often challenging, and automated title generation can assist authors in this task. In this work, we propose a technique to generate paper titles from abstracts using open-weight pre-trained and large language models. We use the CSPubSum and LREC-COLING-2024 datasets and introduce a new dataset, SpringerSSAT, curated from four Springer journals in the social sciences. Additionally, we use GPT-3.5-turbo in a zero-shot setting to generate titles. Model performance is evaluated with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore metrics. Our experiments show that fine-tuned PEGASUS-large outperforms other models, including fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo, across most metrics. We further demonstrate that ChatGPT can generate creative paper titles. Overall, AI-generated titles are generally appropriate and reliable.

14. 【2606.05079】Fast Faithful Function Vectors

Link: https://arxiv.org/abs/2606.05079

Authors: Minh An Pham, Anton Segeler, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin, Patrick Kahardipraja, Reduan Achtibat

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, steer Large Language, Language Models, Large Language, Function vectors

Abstract:Function vectors (FVs) are task representations elicited during in-context learning that can be used to steer Large Language Models (LLMs). However, design choices in their formulation remain underexplored. In this work, we study the impact of varying FV definitions for instructions along two degrees of freedom: attention head selection and steering. For head selection, using gradient-based attributions with Layer-wise Relevance Propagation (LRP) substantially improves efficiency as well as accuracy. For FV steering, applying it in a distributed manner yields a higher accuracy compared to simple aggregation. Our code is publicly available.

15. 【2606.05054】Boosting Self-Consistency with Ranking

Link: https://arxiv.org/abs/2606.05054

Authors: Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Salnikov, Alexander Panchenko, Viktor Moskvoretskii

Categories: Computation and Language (cs.CL)

Keywords: recover correct answers, improves large language, sampling multiple reasoning, multiple reasoning paths, reasoning paths

Comments: 16 pages, 13 figures, accepted at ACL Student Research Workshop 2026

Abstract:Self-consistency improves large language models by sampling multiple reasoning paths and selecting the most frequent answer, but majority voting often fails to recover correct answers that are already present among the samples. We address this limitation with Ranking-Improved Self-Consistency (RISC), which reformulates answer selection in self-consistency as a ranking problem. Instead of relying on a single uncertainty or confidence signal, RISC uses a lightweight LambdaRank model to score candidate answers with five carefully designed features that capture answer frequency, semantic centrality, and reasoning-trace consistency. We evaluate RISC on three datasets under a range of test-time budgets. Across datasets, RISC consistently achieves a better accuracy-efficiency trade-off than standard self-consistency and strong baselines, with particularly large gains on question answering benchmarks. Further analysis shows that the proposed features are individually useful and, more importantly, complementary, highlighting the value of learning to combine multiple informative signals for test-time answer selection.
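The difference between majority voting and ranking-based selection can be shown with a small sketch. The paper uses a LambdaRank model over five features. The code below swaps in a hand-weighted linear scorer over two hypothetical features (`frequency`, `centrality`) purely to illustrate how a ranker can pick an answer that majority voting misses. All names, weights, and values are assumptions.

```python
from collections import Counter


def majority_vote(answers):
    """Standard self-consistency: return the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]


def rank_answers(candidates, weights):
    """Pick the candidate with the highest weighted feature sum.
    This linear scorer is a stand-in for a learned ranker, not LambdaRank."""
    def score(features):
        return sum(weights[name] * value for name, value in features.items())
    return max(candidates, key=lambda answer: score(candidates[answer]))


# Five sampled answers to one question (hypothetical).
samples = ["12", "12", "15", "12", "15"]
voted = majority_vote(samples)

# Hypothetical per-answer features: vote share and a semantic-centrality score.
candidates = {
    "12": {"frequency": 0.6, "centrality": 0.2},
    "15": {"frequency": 0.4, "centrality": 0.9},
}
weights = {"frequency": 1.0, "centrality": 1.0}
ranked = rank_answers(candidates, weights)
```

Here majority voting returns "12", while the combined score prefers "15" because the extra signal outweighs the smaller vote share.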

16. 【2606.05042】In-Context Graphical Inference

Link: https://arxiv.org/abs/2606.05042

Authors: Zehua Cheng, Wei Dai, Jiahao Sun

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Symbolic Computation (cs.SC)

Keywords: Belief Propagation, discrete graphical models, graphical models forces, sacrifice convergence guarantees, Marginal inference

Comments: 19 Pages

Abstract:Marginal inference in discrete graphical models forces a choice between exactness and scalability: exact algorithms are intractable for high-treewidth graphs, while iterative approximations (Belief Propagation, variational methods) sacrifice convergence guarantees on frustrated topologies. We argue that this dichotomy stems from a mismatched inductive bias: iterative methods abandon the sequential elimination structure that makes exact inference correct. We introduce In-Context Graphical Inference (ICG-I), an autoregressive Graph Transformer that restores this structure by mimicking Variable Elimination with learned, Tensor-Train-compressed intermediate factors, paired with a Dirichlet output layer and Weighted Conformal Prediction for calibrated, distribution-free coverage guarantees under topological shift. We prove that TT compression errors propagate at most linearly through the autoregressive chain, that the Dirichlet-Multinomial loss is a proper scoring rule, and that WCP maintains coverage with a quantifiable degradation under estimated density ratios. We conducted intensive experiments to evaluate ICG-I and achieved state-of-the-art performance across all benchmarks. ICG-I reduces MAE from 0.041 (best baseline) to 0.020 on standard instances and achieves 0.048 on N=500 frustrated spin glasses where BP diverges entirely.

17. 【2606.05030】Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair

Link: https://arxiv.org/abs/2606.05030

Authors: Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

Categories: Computation and Language (cs.CL); Symbolic Computation (cs.SC)

Keywords: large language models, Teleological Reasoning Infilling, fundamentally forward-directed, large language, introduce Teleological Reasoning

Comments: 25 Pages

Abstract:Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is fundamentally forward-directed: each step conditions only on prior tokens. This unidirectional inductive bias renders even capable models susceptible to error snowballing, wherein a single logical or arithmetic mistake in an early step irreversibly corrupts the entire reasoning chain. We introduce Teleological Reasoning Infilling (TRI), a training framework that endows decoder-only transformers with a native *goal-conditioned bridging* capability. The key insight is to reframe erroneous reasoning segments as fill-in-the-middle (FIM) tasks: given a verified prefix premise $P$, a verified downstream milestone $S$, and the original query $Q$, the model must synthesise the logical bridge $M$ that connects $P$ to $S$ rigorously and completely. To achieve this with standard causal architectures, we introduce a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, enabling $M$ to attend to both $P$ and $S$ without any structural modification to the self-attention mechanism. Training proceeds in two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified $(P, S, M)$ triples extracted from formal mathematics corpora, and (ii) Direct Preference Optimisation (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, eliminating LLM-judge sycophancy. At inference, TRI operates as a surgical repair module within a dual-system loop: a causal draft model generates an initial trace, the verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact. Comprehensive experiments on three benchmarks demonstrate that TRI achieves state-of-the-art performance across all tasks, while reducing per-problem token expenditure by 31.2%.
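The Prefix-Suffix-Middle rearrangement mentioned in this abstract can be sketched as plain string assembly. The abstract does not give the sentinel token strings or say where the query $Q$ is placed, so the `<PRE>`, `<SUF>`, `<MID>` names and the query-first layout below are assumptions borrowed from common fill-in-the-middle setups. The example problem is also made up.

```python
# Hypothetical sentinel strings; the paper's actual tokens are not given in the abstract.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def build_psm_prompt(query: str, prefix: str, suffix: str) -> str:
    """Place the verified prefix P and milestone S before the middle slot,
    so a causal model generating M can attend to both."""
    return f"{query}\n{PRE}{prefix}{SUF}{suffix}{MID}"


def build_psm_training_example(query: str, prefix: str, suffix: str, middle: str) -> str:
    """Training sequence: the PSM prompt followed by the target bridge M."""
    return build_psm_prompt(query, prefix, suffix) + middle


prompt = build_psm_prompt(
    query="Solve 2x + 6 = 10.",
    prefix="Subtract 6 from both sides.",
    suffix="Therefore x = 2.",
)
example = build_psm_training_example(
    query="Solve 2x + 6 = 10.",
    prefix="Subtract 6 from both sides.",
    suffix="Therefore x = 2.",
    middle="This gives 2x = 4; divide both sides by 2.",
)
```

Because the middle segment comes last in the sequence, ordinary left-to-right attention already covers both the prefix and the suffix, with no change to the attention mask.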

18. 【2606.05029】Validity Threats for Foundation Model Research

Link: https://arxiv.org/abs/2606.05029

Authors: Gunnar König, Martin Pawelczyk, Ulrike von Luxburg, Sebastian Bordt

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Controlled experiments, machine learning research, prohibitively expensive, backbone of machine, machine learning

Abstract:Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats -- hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences -- statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research designs.

19. 【2606.05016】TaDA: Calibrated Probe Gating for Task-Domain LoRA Merging

Link: https://arxiv.org/abs/2606.05016

Authors: Huy Quoc To, Fuyi Li, Guangyan Huang, Ming Liu

Categories: Computation and Language (cs.CL)

Keywords: largely unexplored challenge, single unified model, unexplored challenge, single unified, unified model

Abstract:Combining a task LoRA adapter with a domain LoRA adapter into a single unified model is a practical yet largely unexplored challenge. Existing methods treat both adapters as symmetric peers, applying uniform weights across all layers. We argue that task and domain adapters exhibit a consistent depth-dependent asymmetry across transformer architectures. Domain dominance increases with layer depth, while shallower layers retain stronger task-relevant signals. Motivated by this observation, we propose **TaDA** (**Ta**sk-**D**omain LoR**A** Merging), a training-free algorithm that exploits this structure through calibrated probe-guided per-layer gating and per-component subspace-aware merging. The gating assigns individual weights per layer and projection type using a probe signal proved invariant to adapter weight magnitude. The merging discards conflicting singular directions before combining the remaining components. **TaDA** produces a standard rank-$r$ LoRA adapter with zero inference overhead. On six scientific QA benchmarks with Llama-2-7B, TaDA achieves an average accuracy of 0.452, outperforming DARE-TIES by +3.6 percentage points and obtaining the best result on all six benchmarks. On six image classification benchmarks with ViT-L/16, TaDA reaches 85.9% average accuracy, improving over the strongest merging baseline while leading in three of the six individual benchmarks.

20. 【2606.05014】Depth-Attention: Cross-Layer Value Mixing for Language Models

Link: https://arxiv.org/abs/2606.05014

Authors: Boyi Zeng, Yiqin Hao, Zitong Wang, Shixiang Song, He Li, Feichen Song, Yifan Liu, Ziwei He, Xinbing Wang, Zhouhan Lin

Categories: Computation and Language (cs.CL)

Keywords: selects information freely, Self-attention selects information, reuse earlier-layer representations, selectively reuse earlier-layer, residual stream

Comments: 21 pages, 4 figures, 9 tables

Abstract:Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference--a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache--the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.
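
The depth-wise selection step described in this abstract can be sketched for a single token position and a single attention head. This is a minimal NumPy illustration, not the authors' implementation: the function name `depth_mix_value`, the scaled dot-product softmax, and the choice of which layers appear in `layer_keys` are assumptions that the abstract does not specify.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def depth_mix_value(query, layer_keys, layer_values):
    """Mix values across depth for ONE token position and ONE head.

    query:        (d,)     query of the current layer at this position
    layer_keys:   (L, d)   keys of the layers being attended over,
                           all taken at the same token position
    layer_values: (L, d_v) values of those same layers
    Returns the depth-mixed value (d_v,) and the depth weights (L,).
    """
    d = query.shape[-1]
    # Scaled dot-product scores over the depth axis (assumed scaling).
    scores = layer_keys @ query / np.sqrt(d)
    weights = softmax(scores)
    # Convex combination of the per-layer values.
    mixed_value = weights @ layer_values
    return mixed_value, weights
```

In a full decoder, the mixed value would be written into the layer's value-cache slot in place of the original value, which is how the abstract accounts for the unchanged cache size.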

21. 【2606.05009】DAR: Deontic Reasoning with Agentic Harnesses

Link: https://arxiv.org/abs/2606.05009

Authors: Guangyao Dou, William Jurayj, Nils Holzenberger, Benjamin Van Durme

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: computing tax liability, applying explicit rules, Deontic reasoning, case-specific facts, immigration appeal

Abstract:Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.

22. 【2606.05008】M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Link: https://arxiv.org/abs/2606.05008

Authors: Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yixin Zhu, Yiwu Zhong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: long-form video understanding, multi-modal models advance, multi-modal models, advance towards long-form, memory

Comments: We present an evaluation designed for multi-modal memory in multi-modal models

Abstract:As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at this https URL.

23. 【2606.05002】GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation

Link: https://arxiv.org/abs/2606.05002

Authors: Yuxiao Ye, Yiwen Zhang, Huiyuan Xie, Yuqin Huang, Zhiyuan Liu

Categories: Computation and Language (cs.CL)

Keywords: LLM-based multi-agent systems, systems are increasingly, GARL, strategic, strategic decision-making tasks

Abstract:LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. Multi-agent reinforcement learning can optimise these interaction policies, but its reward design often remains task-specific and weakly grounded in interaction structure. To address this gap, we propose GARL, a GAme-theoretic Reinforcement Learning framework for multi-agent strategic prioritisation. GARL formalises strategic prioritisation as a two-stage game: competing agents first allocate strategic resources over a shared candidate set, and a higher-level arbiter then produces the final ranking. The resulting game-theoretic utilities are converted into role-specific reinforcement signals, allowing policy optimisation to be guided by structured interaction. We instantiate GARL on issues-in-dispute ranking, where the goal is to prioritise core issues in legal proceedings. Experiments show that GARL improves ranking performance, enables small open-source LLMs to become competitive with a strong closed-source LLM under the same candidate-ranking setting, and yields gains in legal-domain competence and broader strategic decision-making. Overall, GARL demonstrates how game-theoretic interaction structure can be turned into reinforcement-learning objectives, providing a principled approach to policy optimisation in multi-agent strategic prioritisation.

24. 【2606.04987】DeliChess: A Multi-party Dialogue Dataset for Deliberation in Chess Puzzle Solving

Link: https://arxiv.org/abs/2606.04987

Authors: Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: in-depth complex reasoning, complex reasoning tasks, existing datasets rarely, datasets rarely focus, studying collaborative reasoning

Abstract:Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on structured, in-depth complex reasoning tasks. We introduce DeliChess, a novel dataset of group deliberation dialogues in which participants collaboratively solve multiple-choice chess puzzles. Each group first completes the puzzle individually, then engages in a multi-party discussion before submitting a revised collective answer. The dataset includes 107 dialogues with full transcripts, pre- and post-discussion choices, and metadata on puzzle difficulty and move quality. We evaluate performance using three metrics based on chess engine evaluations, and find that deliberation significantly improves group accuracy. We further analyse the role of probing utterances (i.e., messages that elicit proposals, justifications, or strategic reflection) using a classifier trained on prior deliberation data. While probing makes group performance more variable after discussion, it does not consistently lead to better performance. Our dataset offers a rich testbed for modelling group reasoning, dialogue dynamics, and the resolution of differing perspectives and opinions in a well-defined strategic domain.

25. 【2606.04978】Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

Link: https://arxiv.org/abs/2606.04978

Authors: Chensong Huang, Changyu Chen, Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); General Economics (econ.GN)

Keywords: cautious-looking outputs, original game, risk decision-making tasks, Petersburg game, risk

Abstract:LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a controlled testbed, a classical paradox in which the expected payoff is infinite, yet humans typically report low, finite willingness to pay. We evaluate 28 LLMs with a structured prompt suite that includes the original game; controlled decision variants that perturb truncation, repeated play, numeric endowment, and occupational identity; a human-perspective prompt that asks models to reason as human decision makers; and paired comparisons between base models and their instruction-tuned counterparts. In the original game, most models generate finite bids, creating the appearance of human-like risk behavior. However, this outcome-level resemblance masks substantial mechanism-level differences. The controlled variants reveal that rather than maintaining human-like behavior seen in the original game, models often shift to conditionally and computationally rational behavior. Human-cue prompting and instruction tuning often lower bids and reduce some visible pathologies, but most mechanism-level response patterns remain largely unchanged. These findings show that behavioral alignment in risk decision-making can be surface-level: LLMs may produce human-like risk decisions without exhibiting human-consistent mechanisms. High-stakes evaluations of LLM decision-making should therefore move beyond outcome similarity and examine whether the alignment is supported by mechanism-level consistency.
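
For reference, the infinite expectation mentioned in this abstract follows from the standard formulation of the game, which is assumed here because the abstract does not restate it: a fair coin is tossed until the first head, and a first head on toss $k$ pays $2^k$. The capped variant with at most $N$ tosses (and no payout if no head appears) is an illustrative assumption and not necessarily the truncation design used in the paper.

$$
\mathbb{E}[\text{payoff}] = \sum_{k=1}^{\infty} \Pr(\text{first head on toss } k)\cdot 2^{k} = \sum_{k=1}^{\infty} 2^{-k}\cdot 2^{k} = \sum_{k=1}^{\infty} 1 = \infty,
\qquad
\mathbb{E}_N[\text{payoff}] = \sum_{k=1}^{N} 2^{-k}\cdot 2^{k} = N.
$$

Under this capped version, a limit of 20 tosses gives an expected payoff of exactly 20, so the expected value becomes finite as soon as the game is truncated.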

26. 【2606.04974】SAID: Accelerating Diffusion-Based Language Models via Scaffold-Aware Iterative Decoding

Link: https://arxiv.org/abs/2606.04974

Authors: Na Li, Chengda Wang, Mingju Gao, Hao Tang

Categories: Computation and Language (cs.CL)

Keywords: large language models, enable non-autoregressive generation, Diffusion large language, iteratively denoising corrupted, corrupted token sequences

Comments: Code: [this https URL](https://github.com/TH-AI-Lab-PKU/SAID)

Abstract:Diffusion large language models (DLLMs) enable non-autoregressive generation by iteratively denoising corrupted token sequences with bidirectional context. Despite their ability to update multiple positions in parallel, inference remains costly due to the many denoising steps required for high-quality generation. We propose SAID, a Scaffold-Aware Iterative Decoding framework that accelerates DLLMs by reallocating computation across tokens. SAID first spends denoising computation on scaffold tokens to establish the coarse semantic structure, and then completes predictable detail tokens with fewer steps. We further adapt SAID to block-wise diffusion decoding and introduce Confidence-Hierarchical Layered Generation (CHLG), which assigns additional steps only to low-confidence tokens. Experiments on LLaDA-8B and LLaDA 1.5 across math, coding, and knowledge benchmarks show that SAID significantly accelerates DLLM inference with a maximum speedup of 9.1x while maintaining competitive performance. Our code is publicly available at https://github.com/TH-AI-Lab-PKU/SAID.

27. 【2606.04964】SemBlock: Semantic Boundary Dynamic Blocks for Diffusion LLMs

Link: https://arxiv.org/abs/2606.04964

Authors: Xinrui Song, Zhuoran Wang, Mingju Gao, Hao Tang

Categories: Computation and Language (cs.CL)

Keywords: Diffusion language models, generate text, iterative denoising, text through iterative, practicality by committing

Comments: Code: [this https URL](https://github.com/TH-AI-Lab-PKU/SemBlock)

Abstract:Diffusion language models (DLMs) generate text through iterative denoising, and blockwise decoding improves their practicality by committing tokens in local blocks. However, existing blockwise methods typically rely on fixed block sizes or delimiter-based runtime signals, which do not necessarily align with semantic boundaries. In this paper, we propose SemBlock, a semantic-boundary-driven dynamic block decoding framework for diffusion LLMs. SemBlock formulates dynamic block construction as semantic boundary prediction and trains lightweight predictors on frozen LLaDA hidden states. To provide supervision, we construct SemBound, a semantic-boundary dataset that derives boundary labels from discourse units, reasoning steps, and implementation spans across natural language, math, and code tasks. During inference, SemBlock uses predicted boundary probabilities to select the ending position of each dynamic block. Experiments on GSM8K, IFEval, MATH, and HumanEval show that SemBlock consistently improves over fixed-block decoding and AdaBlock. Our code is publicly available at https://github.com/TH-AI-Lab-PKU/SemBlock.

28. 【2606.04952】Clinical Assistant for Remote Engagement Link (CARE-link): A Web-Based Electronic Health Records Software for Managing Diabetes

Link: https://arxiv.org/abs/2606.04952

Authors: Prince Ebenezer Adjei, Joshua Teye Tettey, Toufiq Musah, Audrey Agbeve, John Amuasi

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

Keywords: LLM-mediated workflow, web-based clinical support, designed to improve, gestational diabetes, diabetes by linking

Abstract:CARE-link is an open-source, web-based clinical support platform designed to improve the management of gestational diabetes by linking clinicians and patients through an LLM-mediated workflow. The system aggregates patient-generated data outside the hospital, summarizes relevant clinical information, and delivers context-aware decision support to clinicians. For patients, CARE-link provides clear explanations of management plans and delivers timely lifestyle guidance through a WhatsApp interface. The integrated dual-facing design aims to promote continuous monitoring, support individualized care, and reduce the burden of in-clinic follow-ups. Built with a modular architecture, the platform can be adapted to other chronic conditions requiring longitudinal tracking and behavioral support. CARE-link has the potential to enhance clinical oversight, promote patient compliance, and strengthen continuity of care particularly in resource-constrained settings.

29. 【2606.04928】Data Attribution in Large Language Models via Bidirectional Gradient Optimization

Link: https://arxiv.org/abs/2606.04928

Authors: Frédéric Berdoz, Luca A. Lanzendörfer, Kaan Bayraktar, Roger Wattenhofer

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, raising critical questions, diverse applications, raising critical

Comments: Presented at the AI Governance (AIGOV) Workshop at AAAI 2026

Abstract:Large Language Models (LLMs) are increasingly deployed across diverse applications, raising critical questions for governance, accountability, and data provenance. Understanding which training data most influenced a model's output remains a fundamental open problem. We address this challenge through training data attribution (TDA) for auto-regressive LLMs by expanding upon the inverse formulation: How would training data be affected if the model had seen the generated output during training? Our method perturbs the base model using bidirectional gradient optimization (gradient ascent and descent) on a generated text sample and measures the resulting change in loss across training samples. Our framework supports attribution at arbitrary data granularity, enabling both factual and stylistic attribution. We evaluate our method against baselines on pretrained models with known datasets, and show that it outperforms previous work on influence metrics, thereby enhancing model interpretability, an essential requirement for accountable AI systems.

30. 【2606.04924】Can Crowdsourcing Survive the LLM Era? A Community Survey on Human Data Collection

Link: https://arxiv.org/abs/2606.04924

Authors: Aswathy Velutharambath, Neele Falk, Sofie Labat, Tarun Tater, Amelie Wuehrl

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, writing tools challenges, tasks to models

Abstract:The widespread use of Large Language Models (LLMs) as writing tools challenges the validity of crowdsourced data, as crowdworkers may outsource tasks to models. To better understand how this is addressed, we surveyed 155 researchers in NLP and related disciplines about their experiences and opinions on collecting free-text responses via crowdsourcing. This paper provides an overview of practitioners' challenges, mitigation strategies, and the foreseen implications on data quality. 44% of respondents reported observing LLM usage in their crowdsourced data. While 93% of them had anticipated this, half were unsure what precautions to take. The most prevalent detection strategies are distinctive textual style patterns and unusually fast completion times. Overall, survey responses show that the research community is aware of the problem and taking measures, but existing efforts remain insufficient to fully address it. Finally, we derive a set of considerations to guide future crowdsourced free-text data collection in the era of LLMs.

31. 【2606.04923】Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Link: https://arxiv.org/abs/2606.04923

Authors: Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng, Juanzi Li, Xiaozhi Wang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Rubric-based reinforcement learning, score model outputs, reinforcement learning, reward hacking, hacking

Comments: 23 pages, 7 figures

Abstract:Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at this https URL.

32. 【2606.04915】Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

Link: https://arxiv.org/abs/2606.04915

Authors: Zhenyu Yu, Shuigeng Zhou

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large language models, Large language, language models reach, lexical pattern matching, causal reasoning benchmarks

Abstract:Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than recovering P1. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.
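
The lexical anonymization described in this abstract can be illustrated with a small string-rewriting sketch. It is a minimal example under stated assumptions, not the paper's perturbation code: the function name `anonymize_variables`, the `X1`, `X2`, ... placeholder scheme, and the case-insensitive whole-word matching are all hypothetical choices.

```python
import re


def anonymize_variables(question, variable_names):
    """Replace semantic variable names with neutral placeholders.

    The causal wording around the names is left untouched, so the
    relations stated in the question are preserved while the lexical
    cues are removed. Returns the rewritten text and the mapping used.
    """
    if not variable_names:
        return question, {}
    # Placeholders follow the order in which names are supplied.
    mapping = {
        name.lower(): f"X{i}" for i, name in enumerate(variable_names, start=1)
    }
    # Try longer names first so "lung cancer" wins over "cancer".
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    rewritten = pattern.sub(lambda m: mapping[m.group(0).lower()], question)
    return rewritten, mapping
```

Comparing a model's accuracy on the original and the rewritten question is the kind of gap the abstract reports; which prompt set corresponds to P0 and which to P1 is not defined in the abstract.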

33. 【2606.04911】BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine

Link: https://arxiv.org/abs/2606.04911

Authors: Yang Liu, Jiajin Zhang, Danyang Tu, Yaojun Hu, Jiao Qu, Jiuyu Zhang, Yu Shi, Wei Fang, Shi Gu, Ling Zhang, Yingda Xia

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: mortality among women, Breast cancer remains, remains a leading, cancer-related mortality

Abstract:Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis and treatment planning, where each stage involves distinct imaging modalities, task objectives, and reasoning patterns. However, constrained by data scarcity and model versatility, existing medical MLLMs are typically evaluated on isolated modalities or narrow task families, limiting their ability to support workflow-level clinical reasoning. In this work, we first introduce BreastStage, a workflow-aligned breast imaging instruction corpus comprising 1.86M instruction-following pairs curated from 17 sub-datasets across 5 imaging modalities and 136 task templates. Its held-out split, BreastStage-Bench, provides a comprehensive benchmark for evaluating multimodal reasoning across the breast cancer care continuum. Building on this corpus, we propose BreastGPT, a unified MLLM equipped with a dual-branch visual encoder and concept-preserving token compression to bridge the scale gap between standard radiology and gigapixel pathology. On BreastStage-Bench, BreastGPT achieves 75.66% closed-ended accuracy and 89.92% open-ended score, outperforming both general-purpose and medical-specific MLLMs across clinical stages and task formats. These results suggest that workflow-aligned data and cross-scale visual modeling are critical for clinically grounded medical MLLMs. All data, code, and model checkpoints are released at this https URL.

34. 【2606.04909】BEATS: Bootstrapping E-commerce Attribute Taxonomies for Search through Iterative Human-AI Collaboration

Link: https://arxiv.org/abs/2606.04909

Authors: Yung-Yu Shih, Shang-Yu Su, Tzu-I Ho, Dongzhe Wang, Yun-Nung Chen

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: E-commerce platforms, platforms in emerging, emerging markets, E-commerce, structured attribute schemas

Comments: 6 pages, 1 figure, 5 tables. Accepted to SIGIR 2026 Industry Track. Official version: [this https URL](https://doi.org/10.1145/3805712.3808520)

Abstract:E-commerce platforms in emerging markets often operate with underdeveloped product catalogs that contain only category taxonomies but lack structured attribute schemas. This absence of fine-grained product attributes limits search capabilities -- preventing faceted filtering, degrading query understanding, and weakening semantic representations used by search systems. We present BEATS, a human-in-the-loop LLM framework for bootstrapping product attribute taxonomies entirely from scratch. Our approach extends a multi-stage LLM generation pipeline with two critical production stages: (1) proactive quality checking by model developers to filter erroneous outputs, and (2) human annotation by domain-expert local staff to validate generated attributes. The framework operates iteratively -- prompts at each generation stage are refined based on quality check observations and annotator feedback across successive rounds, progressively improving attribute quality. Once the attribute taxonomy is established, we employ LLMs to perform structured attribute tagging on individual product items, enriching their contextual representations. The enriched catalog directly benefits multiple components of the search system: enabling granular attribute-based filtering, providing structured features for ranking models, and improving semantic representations for dense retrieval. We validate the generated taxonomy by training dense retrieval models on attribute-enriched product data, demonstrating consistent improvements over baselines using original catalog information. Our system has been deployed at Rakuten Taiwan, enriching 9 major categories spanning 2,694 sub-categories with 67,277 generated attributes, and over 5.4 million products have been tagged with the generated attributes, with plans to enrich the entire product catalog.

35. 【2606.04906】'Your AI Text is not Mine': Redefining and Evaluating AI-generated Text Detection under Realistic Assumptions

Link: https://arxiv.org/abs/2606.04906

Authors: Nils Dycke, Marina Sakharova, Nico Daheim, Iryna Gurevych

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: broad societal risk, text detection literature, AI-generated text poses, AI-generated text detection, societal risk

Abstract:Although it is generally agreed that AI-generated text poses a broad societal risk, there is no common understanding in the AI-generated text detection literature on what constitutes harmful use. Rather, existing datasets and approaches often define their own criteria and make their own assumptions, sometimes implicitly, and often only loosely related to real-world needs and applications. To address this gap, we here systematically define various notions of AI-generated text and their characteristics. To study these, we collect AITDNA - a new benchmark of human-machine co-constructed texts that is annotated with detailed genesis information, such as the entire edit and AI-interaction history. We benchmark various machine-generated text detectors and find that they often only perform well for specific notions but not as broad detectors. We release code and data publicly.

36. 【2606.04889】GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

Link: https://arxiv.org/abs/2606.04889

Authors: Tej Deep Pala, Vernon Toh, Soujanya Poria

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Reinforcement learning, Language Models, improve mathematical reasoning

Abstract:Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, since flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL achieved an average improvement of 3.60% in accuracy and 3.05% in Pass@3, demonstrating that fine-grained reasoning alignment can be achieved without process-level supervision.
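
The contrast between broadcasting one sequence-level advantage and reweighting it per token can be sketched as follows. This is an illustration under stated assumptions rather than the GRAIL implementation: the saliency scores are taken as given inputs (the abstract does not say how gradient-activation saliency is computed), and the mean-one normalization in `reweight_advantage` is a hypothetical choice. The group-normalized advantage in `grpo_advantages` follows the usual GRPO formulation.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized sequence-level advantages, as commonly used in GRPO."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


def reweight_advantage(seq_advantage, saliency, eps=1e-8):
    """Spread one sequence-level advantage over tokens by saliency.

    Weights are scaled to have mean 1 (an assumption), so uniform
    saliency reproduces the plain broadcast of `seq_advantage`.
    """
    s = np.abs(np.asarray(saliency, dtype=float))
    total = s.sum()
    if total <= eps:
        # No usable signal: fall back to the uniform broadcast.
        return np.full(len(s), float(seq_advantage))
    weights = s * len(s) / total
    return seq_advantage * weights
```

With this normalization, the average token-level advantage equals the sequence-level one, so only the distribution of credit across tokens changes.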

37. 【2606.04883】Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean

Link: https://arxiv.org/abs/2606.04883

Authors: Kári Rögnvaldsson, Chenhao Sun, Jasper Dekoninck, Martin Vechev

Categories: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: Large language models, Large language, generating formal proofs, language models, generating formal

Abstract:Large language models (LLMs) are increasingly used in workflows for generating formal proofs in Lean. These workflows often decompose problems into smaller lemmas, sample many proof attempts, and use compiler feedback to guide search. However, they can be prohibitively expensive, often spending substantial compute on attempts that ultimately fail. In this work, we address this problem with an action routing agent that consists of a data plane and a control plane. The data plane generates natural-language lemma decompositions, formalizes them in Lean, and samples proof attempts for the resulting theorem and lemma targets. The control plane observes previous failed Lean attempts, estimates both the likelihood of success and cost of another attempt, and decides whether to continue proving the current target or restart from a new breakdown. On a subset of PutnamBench, our agent decreases the cost by 25.8% over a fixed-step baseline on average, preserving performance while using substantially less compute. These results suggest that failed Lean trajectories provide actionable signals for cost-aware resource allocation in agentic theorem proving.
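
One simple way to turn the two estimates named in this abstract (likelihood of success and cost of another attempt) into a continue-or-restart decision is to compare expected cost per success. The sketch below is purely illustrative: the `Option` type, the `choose_action` rule, and the independence assumption behind `cost / p` are hypothetical and are not claimed to be the paper's control-plane policy.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """Estimates for one more attempt along a given path (hypothetical)."""
    p_success: float  # estimated probability that the next attempt succeeds
    cost: float       # estimated cost of that attempt, in any fixed unit


def expected_cost_per_success(opt: Option) -> float:
    """Expected spend until the first success.

    If attempts were independent with fixed probability p, the number of
    attempts until success is geometric with mean 1/p, giving cost / p.
    """
    if opt.p_success <= 0.0:
        return float("inf")
    return opt.cost / opt.p_success


def choose_action(continue_opt: Option, restart_opt: Option) -> str:
    """Pick the path with the lower expected cost per success."""
    if expected_cost_per_success(continue_opt) <= expected_cost_per_success(restart_opt):
        return "continue"
    return "restart"
```

A rule like this stops spending on a target once its estimated success probability falls far enough that a fresh decomposition is cheaper in expectation, even if the restart itself costs more per attempt.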

38. 【2606.04874】Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

Link: https://arxiv.org/abs/2606.04874

Authors: Haoyu Sun, Wenxuan Wang, Mingyang Song, Jujie He, Weinan Zhang, Yang Liu, Yang Yang, Yu Cheng

Categories: Computation and Language (cs.CL)

Keywords: central to LLM, LLM agents, reason over constraints, decompose goals, select tools

Abstract:Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 $\tau^2$-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks.

39. 【2606.04847】MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

Link: https://arxiv.org/abs/2606.04847

Authors: Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: efficient low-level code, turns high-level tensor, high-level tensor programs, Native GPU kernel, generation turns high-level

Abstract:Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.

40. 【2606.04846】Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas

Link: https://arxiv.org/abs/2606.04846

Authors: Lisa Korver, Tomo Lazovich, Sherief Reda

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, raise important questions, Language Models, increasingly popular

Abstract:As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use. Publicly available online chatbots are quickly improving in capability and accuracy leading to more widespread use, including among students looking for help with their homework. This makes it crucial to consider whether these models are aligned with educational standards. Because curriculum standards in the United States are set at the state level, they differ significantly in required content, emphasis, and narrative focus. In this work, we develop an LLM-based pipeline to identify variations in U.S. History curricula across states and evaluate the extent to which different LLMs reflect these state-specific curricular differences. In addition, we conduct controlled experiments that vary user personas by stating user attributes such as geographic location, grade level, gender and race to evaluate the sensitivity of LLM responses to user characteristics. We find that while models are able to adjust their presentation of historical topics, these shifts may come from the perceived political leanings of states and do not necessarily reflect actual curriculum content. Additionally, models successfully adapt to a student's grade level while showing minimal sensitivity to race or gender, suggesting they are capable of useful adaptation to student personas with limited demographic bias. Together, these findings highlight potential risks that open access to LLM chatbots may cause to student learning outcomes stemming from misalignment with state curriculum standards and highlight the need for more robust alignment techniques.

41. 【2606.04828】A French Corpus Annotated for Multiword Expressions with Adverbial Function

Link: https://arxiv.org/abs/2606.04828

Authors: Eric Laporte, Takuya Nakamura, Stavroula Voyatzi

Categories: Computation and Language (cs.CL)

Keywords: presents a French, French corpus annotated, multiword expressions, adverbial function, French corpus

Abstract:This paper presents a French corpus annotated for multiword expressions (MWEs) with adverbial function. This corpus is designed for research on information retrieval and extraction, as well as on deep and shallow syntactic parsing. We delimit which kinds of MWEs we annotated, describe the resources and methods we used for the annotation, and briefly comment on the results. The annotated corpus is available at this http URL under the LGPLLR license.

42. 【2606.04823】R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

Link: https://arxiv.org/abs/2606.04823

Authors: João Pedro Gandarela, Thiago Rios, Stefan Menzel, André Freitas

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: ensure reliable delivery, Large language models, Large language, open-ended tasks, agentic settings

Abstract:Large language models (LLMs) are fluent on open-ended tasks, yet in agentic settings, where a system must plan, use tools, and act over extended horizons, fluency does not ensure reliable delivery. We trace this gap to three coupled structural failures: errors propagate without localization, worst-case perturbations go unevaluated, and accumulated knowledge is never invalidated. We argue these share a root cause: abductive, counterfactual, meta-inductive, corrective, and inductive reasoning pull a shared context in incompatible directions. We introduce Reflective Adversarial Pareto Search (R-APS), to our knowledge the first method addressing all three failures jointly via reasoning-mode decomposition, allocating each reasoning mode its own context and orchestrating interaction across three timescales: staged compositional reasoning with a typed validation critic (failure localization), sensitivity-guided counterfactual stress-testing as a first-class Pareto objective (robustness), and meta-inductive rule extraction with explicit invalidation (persistent memory). R-APS requires no fine-tuning and operates on a frozen LLM purely via structured protocol design. We evaluate on planar mechanism synthesis (robotics, prosthetics, mechanical design), with every candidate checked by a kinematic solver. On 32 target trajectories, R-APS delivers robustness certificates 3.5x tighter than uniform-perturbation baselines, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over Enum+GA while jointly controlling bar-count and worst-case robustness. Small 4B reasoning-specialized models prove competitive with general-purpose 70B backbones inside the protocol, suggesting structured protocols can partially offset model scale.

43. 【2606.04807】BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

Link: https://arxiv.org/abs/2606.04807

Authors: Saket Reddy, Ke Yang, ChengXiang Zhai

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

Keywords: Large Language Models, unlike verifiable tasks, single ground truth, Large Language, Mitigating social bias

Notes: Accepted to Findings of the ACL

Abstract:Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training instability due to potentially unreliable critic estimates. In this paper, we propose BiasGRPO, a framework using Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions. By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online training. We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks, indicating its effectiveness. To adapt GRPO, we synthetically extend a dataset spanning multiple domains and contexts. We also create and release a custom bias reward model that effectively guides generation while being highly compute-efficient and avoiding knowledge degradation, providing a valuable resource that can be seamlessly integrated into multi-objective RLHF pipelines.
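
The group-relative baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common GRPO formulation in which each completion's reward is standardized against the other completions sampled for the same prompt; the function name, the `eps` constant, and the example rewards are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for the same prompt.

    The group mean plays the role of the baseline that a learned value
    function (critic) would otherwise provide. `eps` is an assumed constant
    that avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four sampled completions of one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Completions scoring above the group mean receive positive advantages and those below receive negative ones, so no critic estimate is needed.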

44. 【2606.04780】PersonaTree: Structured Lifecycle Memory for Person Understanding in LLM Agents

Link: https://arxiv.org/abs/2606.04780

Authors: Yubo Hou, Jingwei Song, Hongbo Zhang, Zhisheng Chen, Bang Xiao, Tao Wan, Zengchang Qin

Categories: Computation and Language (cs.CL)

Keywords: LLM agents require, Persistent LLM agents, long term interaction, LLM agents, Persistent LLM

Abstract:Persistent LLM agents require memory representations that make the formation of person understanding explicit across long term interaction. Existing agent memory methods emphasize information retention and retrieval, yet give limited account of how accumulated interaction evidence is abstracted into person understanding. We view this process as schema formation, where situated evidence is abstracted into reusable patterns and stable person level claims. We introduce PersonaTree, a structured lifecycle memory framework that realizes this view as a three level persona tree with explicit support paths from evidence to claims. PersonaTree maintains the tree through conservative writing, confidence guided consolidation, and query conditioned path retrieval, returning only the evidence depth required by each query. Across six person understanding and persistent memory benchmarks with three answer backbones, PersonaTree ranks first in 12 of 18 compact scores and reaches the top two in 16 settings. Ablations show that hierarchy improves abstract person understanding on KnowMe, while support path retrieval improves RealPref alignment under a comparable context budget.

45. 【2606.04778】Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

Link: https://arxiv.org/abs/2606.04778

Authors: Kyungmin Park, Taesup Kim

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Safety-aligned Large Language, Large Language Models, Safety-aligned Large, Large Language, Language Models

Abstract:Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens. We show that shallow safety is a special case of a broader inference-time vulnerability, in which short token injections at any generation step can substantially alter subsequent safety behavior. We also find that a model's alignment with refusal directions in its hidden states does not predict its robustness to such injection, revealing that internal state alone does not determine generation behavior under perturbation. To address this, we align models directly on generation trajectories constructed by simulating mid-sequence perturbation, and show that this improves robustness to mid-sequence injection and generalizes to attacks that exploit early-token generation. Our work argues that robust safety alignment requires training on the generation process itself, not only its outputs.

46. 【2606.04773】NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

Link: https://arxiv.org/abs/2606.04773

Authors: Yong Cao, Chuqiao Li, Xianghui Xie, Gerard Pons-Moll, Andreas Geiger

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: human motion understanding, Reliable evaluation, human motion, motion understanding, understanding is fundamental

Notes: 23 pages, 8 figures, 9 tables

Abstract:Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models (VLMs) for semi-automated, expert-verified dataset construction. NextMotionQA features three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction. Each task is systematically structured across three core semantic axes and stratified into three task complexity levels. Our extensive evaluation of twelve representative VLMs uncovers critical capability gaps and weaknesses that remain invisible under conventional, single-task evaluations. In a complementary direction, recent work has begun using VLMs as judges for text-to-motion evaluation; we ask whether they show the same degradation under harder tasks. We find that VLMs align strongly with expert ratings on coarse criteria (Cohen's $\kappa=0.70$) but break down on fine-grained, part-level judgment ($\kappa=0.10$), validating the paradigm in its strong regime while clarifying its limits.
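
For readers unfamiliar with the agreement statistic quoted above, the following is a small sketch of Cohen's kappa for two raters. It uses the standard definition (observed agreement corrected for chance agreement); the rating lists are made-up illustrations and not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equal-length rating lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary judgments (1 = motion matches text) from a VLM and an expert.
vlm_ratings = [1, 1, 0, 0]
expert_ratings = [1, 0, 0, 0]
kappa = cohens_kappa(vlm_ratings, expert_ratings)
```

Here the raters agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, giving a kappa of 0.5.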

47. 【2606.04743】TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

Link: https://arxiv.org/abs/2606.04743

Authors: Soyeong Jeong, Jinheon Baek, Minki Kang, Sung Ju Hwang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Agents are widely, assistants over documents, widely deployed, deployed as assistants, broader user context

Abstract:Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which surface only the problems the user has noticed, while many other important problems coexist, hidden in plain sight, within the broader user context, with their total number unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template-guided iterative framework with two complementary mechanisms. Specifically, motivated by the observation that single-pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so subsequent rounds extend coverage; and thought templates, reusable schemas distilled from previously solved cases that specify what contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE on two realistic settings, personal workspaces and software repositories, across four model backbones, showing substantial gains over single-shot and parallel multi-agent baselines on task coverage, identification, and resolution.

48. 【2606.04730】Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026

Link: https://arxiv.org/abs/2606.04730

Authors: Enes Yavuz Ugan, Maike Züfle, Yuka Ko, Supriti Sinhamahapatra, Fabian Retkowski, Seymanur Akti, Jan Niehues, Alexander Waibel

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Large Language Models, token-based multi-task models, natural language prompts, advent of Large, target language implicitly

Notes: 9 pages main paper, IWSLT 2026 Instruction Following track

Abstract:With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT's Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT's submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach combines a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.
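
As a rough illustration of the Minimum Bayes Risk (MBR) decoding named in the abstract, the sketch below picks the candidate with the highest total similarity to the other candidates. The token-level F1 utility and the candidate strings are assumptions chosen for brevity; the abstract does not specify the utility function or how the likelihood term is combined with MBR, so that combination is not shown.

```python
from collections import Counter


def token_f1(hypothesis, reference):
    """Token-overlap F1 between two whitespace-tokenized strings (assumed utility)."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=token_f1):
    """Return the candidate with the highest summed utility against all others."""
    def expected_utility(i):
        return sum(
            utility(candidates[i], other)
            for j, other in enumerate(candidates)
            if j != i
        )
    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Hypothetical outputs sampled for one utterance.
candidates = ["the cat sat", "the cat sat down", "the cat sat there", "a dog ran"]
chosen = mbr_select(candidates)
```

The outlier "a dog ran" shares no tokens with the rest and scores zero, while "the cat sat" is the consensus choice because it overlaps most with every other plausible candidate.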

49. 【2606.04719】Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

Link: https://arxiv.org/abs/2606.04719

Authors: SooHwan Eom, Jay Shim, Gwanhyeong Koo, Haebin Na, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo

Categories: Computation and Language (cs.CL)

Keywords: Transformer quadratic complexity, large language models, unsustainable computational load, Transformer quadratic, Structured State-Space Model

Notes: Accepted to EMNLP 2024 Findings

Abstract:The Transformer's quadratic complexity with input length imposes an unsustainable computational load on large language models (LLMs). In contrast, the Selective Scan Structured State-Space Model, or Mamba, addresses this computational challenge effectively. This paper explores a query-based cross-modal projector designed to bolster Mamba's efficiency for vision-language modeling by compressing visual tokens based on input through the cross-attention mechanism. This innovative projector also removes the need for manually designing the 2D scan order of original image features when converting them into an input sequence for Mamba LLM. Experimental results across various vision-language understanding benchmarks show that the proposed cross-modal projector enhances Mamba-based multimodal LLMs, boosting both performance and throughput.

50. 【2606.04703】Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

Link: https://arxiv.org/abs/2606.04703

Authors: Jingwen Chen, Wenkai Yang, Shengda Fan, Wenbo Nie, Chenxing Sun, Shaodong Zheng, Yangen Hu, Lu Pan, Ke Zeng, Yankai Lin

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, reusable parametric capability, internalization converts contextual, converts contextual experience, Experience internalization converts

Notes: 10 pages, 8 figures

Abstract:Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has predominantly focused on single-iteration transfer, we discover that under multi-iteration experience learning, existing methods suffer from a progressive capability collapse rather than compounding improvement. We systematically examine this failure through three vital dimensions of experience internalization: (1) Experience Granularity: We find that principle-level experience is more durable than instance-level experience, as it effectively abstracts transferable strategies away from trajectory-specific details. (2) Experience Injection Pattern: Our analysis reveals that step-wise injection significantly outperforms global injection by aligning experience with intermediate decision states, a property that is critical for long-horizon tool use. (3) Internalization Regime: We demonstrate that off-policy context-distillation on high-quality teacher trajectories provides a substantially more stable training signal than on-policy context-distillation, which is inherently limited by local corrections on student-induced flawed states. Together, these insights yield a simple yet robust recipe for stable and sustainable experience internalization, providing concrete guidance for engineering self-evolving and continually learning LLMs.

51. 【2606.04701】Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

Link: https://arxiv.org/abs/2606.04701

Authors: Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: agents today assume, GUI agents today, static screen, GUI agents, today assume

Notes: preprint

Abstract:GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short-video applications violate this assumption, as their content keeps playing, and a competent user must decide what to watch and for how long. We formalize this task as Living-Screen-Native GUI agents and introduce LivingScreen, the first benchmark instantiating it on short-video platforms, with a faithful browser-based environment, a three-tier task suite, and metrics that jointly score accuracy and information efficiency. Evaluating extensive frontier models, we find that none reaches the human cost-accuracy performance, and that their dominant failure mode is over- and under-observation, pointing to observation control as a missing capability axis for future GUI agents. All data and code will be available at this https URL.

52. 【2606.04694】DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer

Link: https://arxiv.org/abs/2606.04694

Authors: Patomporn Payoungkhamdee, Tinnakit Udsa, Jian Gang Ngui, Sarana Nutanong, Alham Fikri Aji, Peerat Limkonchotiwat

Categories: Computation and Language (cs.CL)

Keywords: Small language models, Southeast Asian, capabilities degrade severely, Small language, multilingual capabilities degrade

Abstract:Small language models (SLMs) are efficient and scalable, but their multilingual capabilities degrade severely at sub-billion scales, especially for Southeast Asian (SEA) languages. We introduce DuDi, a dual-signal multilingual distillation framework that combines an online sequence-level signal with off-policy and on-policy token-level signals. DuDi further uses a cross-lingual verbalizer to refine teacher feedback and improve teacher-student transferability in multilingual settings. Experiments on SEA-HELM across multiple model families, scales, and teacher-student settings show that DuDi consistently outperforms competitive distillation baselines. Ablations and analyses confirm that sequence-level optimization, token-level supervision, and cross-lingual verbalization provide complementary and transferable learning signals for multilingual SLMs.

53. 【2606.04691】SMADE-IE: Sparse Multi-Agent Framework with Evidence-Driven Debate for Zero-Shot Information Extraction

Link: https://arxiv.org/abs/2606.04691

Authors: Kenfeng Huang, Yi Cai, Xin Wu, Zikun Deng, Li Yuan

Categories: Computation and Language (cs.CL)

Keywords: large language models, attracted increasing attention, increasing attention due, Zero-shot information extraction, language models

Notes: 21 pages, 9 figures

Abstract:Zero-shot information extraction (IE) with large language models (LLMs) has attracted increasing attention due to its flexibility in adapting to new schemas and domains without task-specific training. Existing approaches mainly rely on monolithic prompting, each-type prompting, or multi-agent debate. However, monolithic prompting often suffers from boundary and type errors, while each-type prompting and multi-agent debate introduce cross-type conflicts, redundant agent interactions, and substantial token overhead. To address these challenges, we propose SMADE-IE, a sparse and evidence-driven multi-agent framework for zero-shot IE. SMADE-IE first employs an Adaptive Mode Selector to dynamically route inputs into either a lightweight Global Extraction Mode or a Type-Centric Extraction Mode, reducing unnecessary type selection and reasoning noise. For conflicting predictions, we further introduce an Evidence-Driven Debate mechanism that structures arguments into Toulmin-style components and performs confidence aggregation through external evidence scoring and Bayesian updates. Experimental results on 9 benchmark datasets across NER, RE, and JERE tasks show that SMADE-IE consistently outperforms existing zero-shot IE baselines while also improving token efficiency through sparse agent selection and early-stopping debate.

54. 【2606.04661】CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts

Link: https://arxiv.org/abs/2606.04661

Authors: Shanu Kumar, Shubhanshu Khandelwal, Akhila Yesantarao Venkata, Parag Agrawal, Yova Kementchedjhieva, Manish Gupta

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: raising inference cost, grow long, raising inference, Prompts tuned, inference cost

Abstract:Prompts tuned for accuracy often grow long, raising inference cost on every model call. The best accuracy-cost trade-off depends on the task and the budget, so prompt optimization is a search over the Pareto front of accuracy and prompt-token cost rather than for one prompt. The usual shortcut, collapsing the objectives into a weighted sum, fixes the trade-off weight before search and often recovers only a narrow region of the front, a failure we call scalarization collapse. We present CRAFT (Cost-aware Refinement And Front-aware Tuning), a Pareto-front prompt optimizer that treats target-LLM validation calls as the scarce resource and allocates them to candidates near the optimistic candidate front. Each round, complementary accuracy-oriented and cost-oriented generators propose edits, Pareto-gap acquisition spends the per-round validation budget, and NSGA-II retention keeps a spread-out population. Across six classification and reasoning benchmarks, CRAFT's retained fronts reach both high-accuracy and low-cost regions, while accuracy-only, cost-only, and weighted-sum baselines each concentrate in narrower regions. The accuracy-cost trade-off becomes a post-search choice, not a pre-search weight.
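
The idea of keeping a Pareto front over accuracy and prompt-token cost, rather than a single weighted-sum winner, can be sketched as follows. This shows only generic non-dominated filtering, not CRAFT's acquisition or NSGA-II retention steps; the `Candidate` type and all numbers are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float      # validation accuracy, higher is better
    prompt_tokens: int   # prompt-token cost, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on both objectives and better on one."""
    no_worse = a.accuracy >= b.accuracy and a.prompt_tokens <= b.prompt_tokens
    strictly_better = a.accuracy > b.accuracy or a.prompt_tokens < b.prompt_tokens
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical prompt variants with made-up measurements.
CANDIDATES = [
    Candidate("long", 0.90, 800),
    Candidate("medium", 0.85, 300),
    Candidate("padded", 0.80, 400),
    Candidate("short", 0.70, 100),
]
FRONT = pareto_front(CANDIDATES)
```

Only "padded" is dropped, since "medium" is both more accurate and cheaper. The remaining three prompts are all defensible choices depending on budget, which is the sense in which the trade-off becomes a post-search decision.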

55. 【2606.04660】LifeSide: Benchmarking Agents as Lifelong Digital Companions

Link: https://arxiv.org/abs/2606.04660

Authors: Yuqian Wu, Zhijie Deng, Wei Chen, Junwei Li, Yutian Jiang, Junle Chen, Zhengjun Huang, Qingxiang Liu, Jing Tang, Jiaheng Wei, Yuxuan Liang

Categories: Computation and Language (cs.CL)

Keywords: Lifelong digital companions, integrate cross-session cues, Lifelong digital, shifting privacy boundaries, cross-session cues

Notes: 28 pages, 23 figures, 7 tables

Abstract:Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and short-term empathy in isolation. To bridge this gap, we introduce LifeSide, a benchmark centered on multi-session Memory-Emotion-Environment loops. By modeling users as persistent worlds with layered profiles and event trajectories, LifeSide uses multi-agent simulation to project environmental dynamics into dialogue, preserving the critical gap between latent thoughts and observable expressions. Evaluating 2,000 personas and 111K tasks across memory tracking, user understanding, privacy control, and emotional companionship, our experimental results reveal a stark reality: even models that saturate current memory benchmarks fail to sustain accurate user understanding and true companionship over long horizons.

56. 【2606.04646】QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

Link: https://arxiv.org/abs/2606.04646

Authors: Mengao Zhang, Xiang Yang, Chang Liu, Tianhui Tan, Ke-wei Huang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: scientific corpora, corpora are natural-language, natural-language versions, versions of database-style, database-style queries

Notes: 14 pages

Abstract:Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework -- index-time preservation versus query-time execution -- predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution -- not retrieval alone -- is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.
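
A small sketch can make the scoring setup concrete: a gold answer is computed deterministically from typed event tuples with a set operator, and a system answer is scored by exact-match recall. The `Event` schema, the sample events, and the convention for an empty gold set are all assumptions for illustration; the benchmark's actual tuple format and templates are not given in the abstract.

```python
from typing import NamedTuple


class Event(NamedTuple):
    company: str
    event_type: str
    year: int


# Hypothetical typed event tuples extracted from news text.
EVENTS = [
    Event("Acme", "acquisition", 2023),
    Event("Acme", "layoff", 2023),
    Event("Birch", "acquisition", 2023),
    Event("Cobalt", "layoff", 2024),
]


def companies_with(events, event_type):
    """Filter/project operator: companies that have an event of the given type."""
    return {e.company for e in events if e.event_type == event_type}


def gold_intersection(events, type_a, type_b):
    """Intersection operator: companies that have both event types."""
    return companies_with(events, type_a) & companies_with(events, type_b)


def recall_exact_match(predicted, gold):
    """Fraction of gold items that appear verbatim in the prediction."""
    if not gold:
        return 1.0  # assumed convention for an empty gold set
    return len(set(predicted) & gold) / len(gold)


GOLD = gold_intersection(EVENTS, "acquisition", "layoff")
```

A system that retrieves passages about acquisitions alone might answer "Birch", which is topically relevant yet scores zero recall here, because the intersection operator requires both event types.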

57. 【2606.04645】CYGNET: Cypher Gate for Neural Execution Triage and Cost Containment

Link: https://arxiv.org/abs/2606.04645

Authors: Nikodem Tomczak

Categories: Computation and Language (cs.CL); Databases (cs.DB)

Keywords: graphs generate Cypher, returning wrong results, generate Cypher queries, generate Cypher, knowledge graphs generate

Abstract:Language models acting as agents over knowledge graphs generate Cypher queries that fail structurally (crashing at the database) or semantically (executing but returning wrong results). We place a pre-execution gate between query generation and a production Neo4j database. The gate validates structure through a four-backend chain culminating in execution against a mirror graph at 5.6 ms median latency. Structurally broken queries are routed to a corrector that iterates structured error feedback through a language model. On seven CypherBench schemas (2348 questions, ACL 2025) the pipeline maintains generation accuracy on every model tested, confirming it operates as a safe defensive layer. The corrector achieves 81% to 95% success across five models (mean 89%). On a template-generated corpus across nine schemas the gate catches 100% of parse errors, 100% of constraint violations, and 100% of schema-reference errors in path queries with labelled endpoints, at zero false positives across 1135 queries. Property sibling-swaps where the substituted name is valid on the target label score 0%, marking the formal boundary where structural validation ends and semantic validation must begin. A planner-based cost gate flags catastrophic plan structures before execution.

58. 【2606.04632】VentAgent: When LLMs Learn to Breathe -- Multi-Objective Arbitration for ARDS Ventilation

Link: https://arxiv.org/abs/2606.04632

Authors: Teqi Hao, Yuxuan Fu, Xiaoyu Tan, Shaojie Shi, Bohao Lv, Yinghui Xu, Xihe Qiu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Respiratory Distress Syndrome, Acute Respiratory Distress, Distress Syndrome, requires balancing competing, Acute Respiratory

Abstract:Mechanical ventilation for Acute Respiratory Distress Syndrome (ARDS) requires balancing competing physiological goals, including oxygenation, lung protection, and acid-base homeostasis. However, current data-driven methods, especially those imitating retrospective Electronic Health Records (EHR), often suffer from imitation bias. They may capture superficial correlations from inconsistent clinical demonstrations, such as associating passive ventilator settings with survival because such settings are common in stable patients, and thus fail to generalize to volatile or out-of-distribution phenotypes. Standard Reinforcement Learning (RL) methods also struggle with the adversarial trade-offs of critical care and often produce opaque policies with limited clinical interpretability. To address these limitations, we introduce VentAgent, a hierarchical framework in which Large Language Models (LLMs) act as transparent arbitrators for mechanical ventilation. We reformulate ventilation control as a dynamic Multi-Objective Arbitration process rather than single-objective optimization. VentAgent decomposes decision-making into three interpretable stages: Perception, Planning, and Orchestration. By leveraging the semantic reasoning capabilities of LLMs, it synthesizes strategies from heterogeneous experts and resolves conflicting clinical priorities through an explicit coordination mechanism. Evaluations on a high-fidelity physiological simulator show that VentAgent outperforms state-of-the-art RL and classical control baselines. Moreover, it converts control decisions into human-readable reasoning chains, offering a safer, more interpretable, and adaptable paradigm for critical care automation.

59. 【2606.04628】RAMPART: Registry-based Agentic Memory with Priority-Aware Runtime Transformation

Link: https://arxiv.org/abs/2606.04628

Authors: Nikodem Tomczak

Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: pure in-RAM block, LLM-based agents, in-RAM block registry, pure in-RAM, RAMPART

Comments

Click to view abstract

Abstract:RAMPART is a compile-time memory model and pure in-RAM block registry for LLM-based agents. Context assembly is a programmable runtime operation where content is compiled from a structured registry under explicit policy for ordering, inclusion, and eviction. Five composable primitives (promote, gate, write, evict, rollback) act on named addressable blocks before compilation at zero prompt-token cost. Provenance tags and non-evictable authorship flags implement a permissioned memory model with block-level ownership. Controlled probes with Qwen3-8B Q4 show that compile-time placement and the structural relationship between blocks and the task query affect task success, with the cliff falling at roughly the seventh block position when the task follows the registry and the twelfth when it precedes. Grouping the critical block with content-adjacent neighbours and promoting the group as a unit lifts task success by tens of percentage points at positions where single-block placement fails. Cross-model replication on Qwen2.5-7B, Llama-3.1-8B, Mistral-7B-v0.3, and Qwen3-14B shows the content-priming effect appears at the same absolute positions across families, with magnitude varying with model strength. Block grouping raises Mistral's mean pass rate roughly fivefold at the hardest registry size, and a smaller model with the intervention can outperform a larger model without it in the mid-registry zone. Relevance gating reduces prompt cost by 67.8% while recovering 83% of the promoted-condition success rate. Schema eviction produces 0% invocations against 100% with the schema present, a property policy-based approaches cannot guarantee by construction. Shared-registry coordination reduces inter-agent communication to a method call at zero coordination token cost.
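
To make the five primitives concrete, here is a minimal toy sketch of a block registry that compiles a prompt from named blocks. It is not RAMPART's implementation: the class names `Block` and `BlockRegistry`, the integer `priority` field, the snapshot-based `rollback`, and the `compile` ordering rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    text: str
    priority: int = 0        # assumed rule: higher priority compiles earlier
    evictable: bool = True   # False models a non-evictable authorship flag
    enabled: bool = True     # toggled by gate()


@dataclass
class BlockRegistry:
    blocks: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # snapshots used by rollback()

    def _snapshot(self):
        # Copy every block so later mutations do not alter the snapshot.
        self.history.append({n: Block(**vars(b)) for n, b in self.blocks.items()})

    def write(self, name, text, **kwargs):
        self._snapshot()
        self.blocks[name] = Block(name, text, **kwargs)

    def promote(self, name, priority):
        self._snapshot()
        self.blocks[name].priority = priority

    def gate(self, name, enabled):
        self._snapshot()
        self.blocks[name].enabled = enabled

    def evict(self, name):
        if not self.blocks[name].evictable:
            raise PermissionError(f"block {name!r} is not evictable")
        self._snapshot()
        del self.blocks[name]

    def rollback(self):
        # Restore the state that existed before the most recent mutation.
        self.blocks = self.history.pop()

    def compile(self):
        # Stable sort: blocks with equal priority keep insertion order.
        active = [b for b in self.blocks.values() if b.enabled]
        active.sort(key=lambda b: -b.priority)
        return "\n".join(b.text for b in active)


reg = BlockRegistry()
reg.write("system", "You are a helpful agent.", evictable=False)
reg.write("tool_schema", "TOOL: search(query)")
reg.write("task", "Find the paper.")
reg.promote("task", 10)           # move the task block to the front
reg.gate("tool_schema", False)    # exclude the schema from the prompt
prompt = reg.compile()
```

All operations here run on the registry before the prompt string exists, which is the sense in which the abstract describes them as costing no prompt tokens.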

60. 【2606.04612】Hybrid Adversarial Defence for Natural Language Understanding Tasks

Link: https://arxiv.org/abs/2606.04612

Authors: Manar Abouzaid, Yang Wang, Chenghua Lin, Stuart E. Middleton

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Natural Language Understanding, Language Understanding datasets, Large

Comments

Click to view abstract

Abstract:Large Language Models (LLMs) are vulnerable both to hallucination and adversarial manipulation. Although these problems are closely related, existing defences typically address them separately. We investigate a hybrid defence framework that combines entropy-based models, designed to reduce hallucinations, with uncertainty-based models and geometric-based models, designed to reduce vulnerability. Under in-domain tests on Natural Language Understanding datasets (FEVER, HotpotQA, CSQA, SIQA) we find our hybrid model improves both clean-task performance (up to 43.34% increase in accuracy) and adversarial robustness (up to 64.92% improvement in accuracy and 62.27% reduction in attack success rate). For out-of-distribution datasets (AeroEngQA, CPIQA) we see similar adversarial robustness from our hybrid model (up to 57.14% improvement in accuracy). For prompt injection (SafeGuard) and jailbreak detection (AdvBench, DAN) datasets our hybrid model is also very strong (up to 51% reduction in attack success rate compared to state of the art baseline models). Overall, our results show that combining entropy, uncertainty and geometric features provides a more effective defence strategy than using any single feature alone for both in-domain and out-of-distribution tasks.

61. 【2606.04596】A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

Link: https://arxiv.org/abs/2606.04596

Authors: Huangchen Xu, Yuan Wu, Yi Chang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, remains poorly understood

Comments

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.

62. 【2606.04591】Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

Link: https://arxiv.org/abs/2606.04591

Authors: Hanbo Bi, Zhiqiang Yuan, Chongyang Li, Qiwei Yan, Zexi Jia, Jiapei Zhang, Xiaoyue Duan, Yingchao Feng, Jinchao Zhang, Jie Zhou

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-modal communication platforms, dialogues interleaving text, communication platforms, increasingly common, widespread adoption

Comments

Click to view abstract

Abstract:With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.
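
The two-stage pattern described for FFRS (fast Top-K recall from a vector index, then fine-grained selection over the shortlist) can be sketched roughly as follows. This is only an illustration under stated assumptions: the 2-d vectors, the in-memory `index` list, and the `keyword_overlap` scorer are hypothetical stand-ins, not the paper's FEM or F2RVLM models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_top_k(query_vec, index, k):
    """Stage 1: rank stored fragments by cosine similarity and keep the top k."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]


def rerank(query, candidates, scorer):
    """Stage 2: pick the best candidate using a slower, finer-grained scorer."""
    return max(candidates, key=lambda item: scorer(query, item["text"]))


def keyword_overlap(query, text):
    # Hypothetical stand-in for a fine-grained reasoning model.
    return len(set(query.lower().split()) & set(text.lower().split()))


# Toy fragment index; in a real system the vectors come from an embedding model.
index = [
    {"id": "f1", "vec": [1.0, 0.0], "text": "trip photos from the beach"},
    {"id": "f2", "vec": [0.9, 0.1], "text": "beach trip budget and hotel booking"},
    {"id": "f3", "vec": [0.0, 1.0], "text": "quarterly report draft"},
]

shortlist = recall_top_k([1.0, 0.05], index, k=2)
best = rerank("hotel booking for the beach trip", shortlist, keyword_overlap)
```

The design point is that the expensive scorer only ever sees `k` candidates, so its cost does not grow with the size of the corpus.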

63. 【2606.04588】VCIFBench: Evaluating Complex Instruction Following for Video Understanding

Link: https://arxiv.org/abs/2606.04588

Authors: Huangchen Xu, Yuan Wu, Yi Chang

Categories: Computation and Language (cs.CL)

Keywords: Multimodal large language, made rapid progress, provide limited evidence, Multimodal large, existing benchmarks largely

Comments

Click to view abstract

Abstract:Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating complex instruction following in video understanding. VCIFBench constructs constraint-rich instructions from both benchmark-adapted and directly video-grounded prompts, covering content, format, style, and structure requirements, and evaluates model outputs with a hybrid verification pipeline. The benchmark contains 306 satisfiable test instructions, a 540-pair DPO preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging. We further show that DPO training on VCIFBench data can improve instruction-following performance.

64. 【2606.04557】Cartridges at Scale: Training Modular KV Caches over Large Document Collections

Link: https://arxiv.org/abs/2606.04557

Authors: Momchil Hardalov, Gonzalo Iglesias, Adrià de Gispert

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, Language Models, content remains static, Models can reason

Comments: 21 pages, 5 figures, 17 tables

Click to view abstract

Abstract:Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-value (KV) caches that eliminate prefilling while preserving accuracy. A critical limitation of this approach is that cartridges are monolithic and non-compositional: encoding an entire collection into a single KV block does not scale, and naively mixing cartridges trained in isolation collapses performance to near chance. We introduce Cartridges at Scale (CAS), a training framework for scalable multi-cartridge learning with dynamic distractor mixing and a memory-efficient budget manager that rotates hundreds of per-document cartridges between GPU and persistent storage. Our approach scales to collections exceeding a million tokens, improving over a monolithic cartridge by 10-31 points at comparable token budgets. Oracle cartridge accuracy falls within 2-6 points of full in-context learning even at high compression. When paired with retrieval for cartridge selection, CAS matches or exceeds conventional RAG accuracy while consuming 3-4x fewer prompt tokens.

65. 【2606.04555】Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents

Link: https://arxiv.org/abs/2606.04555

Authors: Yifan Simon Liu, Liam Gallagher, Faeze Moradi Kalarde, Jiazhou Liang, Armin Toroghi, Scott Sanner

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Long-horizon conversational agents, conversational agents, interact with users, users through evolving, Segment Tree

Comments

Click to view abstract

Abstract:Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity and may ignore the order in which events occur. We introduce Segment Tree Memory, or SegTreeMem, a memory architecture that represents conversation history as a temporally ordered Segment Tree over utterances. SegTreeMem incrementally inserts new utterances through an online rightmost-frontier update rule, preserving chronological order while forming hierarchical memory segments. For retrieval, SegTreeMem propagates relevance scores through the tree to combine local semantic matching with hierarchical temporal context. Across three long-horizon memory benchmarks and two LLM backbones, SegTreeMem improves answer quality over flat retrieval, graph-structured memory, and tree-structured memory baselines. Additional temporal-order permutation analysis shows that the performance gain depends on preserving temporal order during memory construction, supporting the claim that temporal order is a key structure for agentic memory.
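
As a rough intuition for combining local matching with hierarchical temporal context, the toy sketch below builds a binary tree over per-utterance relevance scores kept in chronological order and adds a fraction of each leaf's ancestor scores to its own. It does not reproduce SegTreeMem's rightmost-frontier insertion or its actual propagation rule; the averaging scheme, the `alpha` weight, and the function names are assumptions made for illustration.

```python
def build_levels(leaf_scores):
    """Build a binary tree bottom-up over scores kept in chronological order.

    levels[0] holds the leaves; each higher level averages adjacent nodes,
    so every internal node summarizes a contiguous span of the conversation.
    """
    levels = [list(leaf_scores)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sum(prev[i:i + 2]) / len(prev[i:i + 2])
                       for i in range(0, len(prev), 2)])
    return levels


def propagate(leaf_scores, alpha=0.5):
    """Return each leaf's score plus alpha times its mean ancestor score."""
    levels = build_levels(leaf_scores)
    out = []
    for i, score in enumerate(leaf_scores):
        ancestors, idx = [], i
        for level in levels[1:]:
            idx //= 2                      # index of the parent at this level
            ancestors.append(level[idx])
        context = sum(ancestors) / len(ancestors) if ancestors else 0.0
        out.append(score + alpha * context)
    return out


# Utterances 2 and 6 have the same local score, but utterance 2 sits in a
# time span that also contains a strongly matching utterance (index 1).
scores = [0.2, 0.9, 0.5, 0.1, 0.1, 0.1, 0.5, 0.1]
boosted = propagate(scores)
```

In this example the two utterances with equal local scores end up ranked differently because of where they fall in time, which is the kind of effect a flat retriever cannot express.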

66. 【2606.04552】LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

Link: https://arxiv.org/abs/2606.04552

Authors: Daria Ledneva, Denis Kuznetsov

Categories: Computation and Language (cs.CL); Genomics (q-bio.GN)

Keywords: biologically relevant structure, increasingly adopt large, impose arbitrary sequence, obscure biologically relevant, adopt large language

Comments

Click to view abstract

Abstract:Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as $k$-mers, BPE, or single nucleotides, which impose arbitrary sequence boundaries that may obscure biologically relevant structure. We present LDARNet, a 120M-parameter hierarchical genomic foundation model that adapts H-Net-style dynamic chunking from autoregressive generation to masked language modeling, combining BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-based regularizer to induce adaptive token boundaries without supervision. Fine-tuned on 27 tasks from the Nucleotide Transformer and Genomic Benchmarks suites, LDARNet achieves 11/18 wins among compact models ($<$300M parameters) and state-of-the-art results on 5 histone modification tasks, outperforming models up to 20$\times$ larger. A FLOPs-matched controlled experiment isolates learned routing as the source of these gains: learned boundaries beat fixed-grid boundaries by up to 14 percentage points on histone tasks at identical compute. Nucleotide-resolution analysis further shows that the learned boundaries align with canonical promoter motifs and splice junctions without supervision, providing a biological interpretation for adaptive tokenization in genomic foundation models.

67. 【2606.04547】Beyond Retrieval: Learning Compact User Representations for Scalable LLM Personalization

Link: https://arxiv.org/abs/2606.04547

Authors: Heng Cao, Fan Zhang, Jian Yao, Yujie Zheng, Changlin Zhao, Lu Hao, Yuxuan Wei, Wangze Ni, Huaiyu Fu, Yuqian Sun, Xuyan Mo

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Personalizing large language, large language models, language models requires, models requires adapting, requires adapting model

Comments: 16 pages, 6 figures

Click to view abstract

Abstract:Personalizing large language models requires adapting model behavior to individual users while preserving robustness and deployment-scale efficiency. Existing approaches typically personalize LLMs either at the input level, by retrieving user histories or constructing profile prompts, or at the parameter level, by maintaining user-specific parameter-efficient modules. The former makes personalization sensitive to retrieval quality and prompt design, whereas the latter incurs storage and maintenance costs that grow with the user population. To address these limitations, we propose TAP-PER (Temporal Attentive Prefix for PERsonalization), a prefix-based framework that encodes user preferences as learnable representations, eliminating explicit prompt construction and replacing heavy per-user adapters with lightweight user-state prefix embeddings. Inspired by personalized recommendation systems, TAP-PER decomposes user modeling into user-state and query-conditioned components, and incorporates temporal signals to capture the evolving nature of user interests. Experiments on six LaMP tasks show that TAP-PER consistently outperforms prompt-based and model-based baselines across classification, rating, and generation settings. Moreover, TAP-PER uses 130x fewer per-user parameters than OPPU and roughly half the total parameter footprint of PER-PCS at the 1,000-user scale, demonstrating that scalable LLM personalization can be achieved without explicit prompt construction or heavy per-user adapters.

68. 【2606.04535】Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models

Link: https://arxiv.org/abs/2606.04535

Authors: Boyan Han, Yiwei Wang, Yi Song, Yujun Cai, Chi Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Diffusion large language, large language models, offer bidirectional attention, exploit global context, naturally support format-constrained

Comments: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Click to view abstract

Abstract:Diffusion large language models (dLLMs) offer bidirectional attention and parallel generation, enabling them to exploit global context and naturally support format-constrained tasks like parseable JSON or reasoning templates. While straightforward fixed anchors can enforce such constraints, they often impose rigid spans, leading to truncated reasoning or redundant content. To overcome this, we propose Dynamic Infilling Anchors (DIA), a training-free method that dynamically estimates end-anchor positions to adjust generation length before iterative infilling. This flexible mechanism ensures structural correctness and semantic coherence, avoiding the inefficiencies of fixed-span methods. Experiments on reasoning benchmarks demonstrate that DIA substantially improves format compliance and answer accuracy, achieving significant zero-shot gains on GSM8K and MATH. These results establish DIA as a robust pathway toward reliable, structure-aware generation.

69. 【2606.04525】GENEB: Why Genomic Models Are Hard to Compare

Link: https://arxiv.org/abs/2606.04525

Authors: Daria Ledneva, Mikhail Nuridinov, Denis Kuznetsov

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Genomics (q-bio.GN)

Keywords: incompatible evaluation protocols, genomic foundation models, task-specific reporting, difficult to assess, assess due

Comments

Click to view abstract

Abstract:Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.

70. 【2606.04511】SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

Link: https://arxiv.org/abs/2606.04511

Authors: Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: long-context LLM inference, LLM inference, long-context LLM, attention reduces compute, Sparse attention reduces

Comments

Click to view abstract

Abstract:Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains $O(T^2)$ complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi-head selector. SparDA adds $<$0.5% parameters and trains only the Forecast projections by matching the original selector's attention distribution. On two sparse-pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25$\times$ prefill speedup and 1.7$\times$ decode speedup over the sparse-attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3$\times$ higher decode throughput than the non-offload sparse baseline. Our source code is available at this https URL.

71. 【2606.04507】Self-Evolving Deep Research via Joint Generation and Evaluation

Link: https://arxiv.org/abs/2606.04507

Authors: Han Zhu, Chengkun Cai, Yuanfeng Song, Xing Chen, Sirui Han, Yike Guo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, deep research standing, daily applications

Comments

Click to view abstract

Abstract:Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground-truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a self-evolving co-evolutionary training framework for deep research evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To restrict this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.

72. 【2606.04500】SANE Schema-aware Natural-language Evaluation of Biological Data

Link: https://arxiv.org/abs/2606.04500

Authors: Rolf Gattung, Martin Krueger, Markus Reischl

Categories: Computation and Language (cs.CL)

Keywords: High-throughput microscopy generates, datasets capturing cellular, datasets typically requires, capturing cellular responses, requires SQL expertise

Comments: 5 pages, 3 figures, submitted but not yet reviewed by BMT2026

Click to view abstract

Abstract:High-throughput microscopy generates large, structured datasets capturing cellular responses to pharmacological perturbations, but accessing these datasets typically requires SQL expertise. Large language models offer a natural-language alternative, yet their tendency to hallucinate raises concerns about result reliability. We present SANE (Schema-Aware Natural-language Evaluation), a novel paradigm for domain-specific text-to-SQL evaluation: schema-grounded, automatically generated benchmarks tied to real and specific experimental structure. SANE makes evaluation more scalable, systematic, and reproducible. Using SANE, we evaluate a few-shot large language model and show that, under constrained schemas with structured prompting and guardrails, accurate query generation is achievable without any model training or fine-tuning. Most failures stem from ambiguous or underspecified inputs and manifest as overly cautious clarification requests or answers to queries that should first be disambiguated, rather than incorrect SQL generation. These results indicate that few-shot large language models can provide reliable database access in well-defined domains when combined with schema-aware prompting.

73. 【2606.04486】Global Sketch-Based Watermarking for Diffusion Language Models

Link: https://arxiv.org/abs/2606.04486

Authors: Daniel Zhao

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

Keywords: autoregressive setting, generated sequentially, diffusion language models, studied extensively, language models

Comments

Click to view abstract

Abstract:Watermarking methods for language models have been studied extensively in the autoregressive setting, where tokens are generated sequentially. These works largely focus on local-context schemes that perturb the next token's distribution as a function of its preceding tokens. In diffusion language models, distributions over many unresolved positions are jointly sampled, allowing additive statistics of the entire sequence to be tractable during generation. We propose a watermark for masked diffusion language models that controls a global, vector-valued sketch representation of the text. Compared to context-dependent watermarking, the sketch formulation decouples detection from the local contexts seen during generation, resulting in an order-agnostic statistic and a watermarking rule which does not manifest as a simple token bias. We analyze the distortion, soundness, and robustness properties of the method.

74. 【2606.04483】Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

Link: https://arxiv.org/abs/2606.04483

Authors: Zhongze Luo, Ruihe Shi, Zhenshuai Yin, Haoyue Liu, Weixuan Wan, Xiaoying Tang

Categories: Computation and Language (cs.CL)

Keywords: fingerprint and patch, discrete artifacts, artifacts whose surface, surface forms, forms are easy

Comments: 23 pages

Click to view abstract

Abstract:Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.

75. 【2606.04479】Evaluating Reasoning Fidelity in Visual Text Generation

Link: https://arxiv.org/abs/2606.04479

Authors: Jiajun Hong, Jiawei Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: enabling applications including, render highly legible, applications including document, including document generation, enabling applications

Comments: Peer reviewed and accepted at CVPR 2026 at the GRAIL-V (Grounded Retrieval and Agentic Intelligence for Vision-Language) workshop (non-archival track)

Click to view abstract

Abstract:Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

76. 【2606.04474】Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

Link: https://arxiv.org/abs/2606.04474

Authors: Ming-Hao Hsu, Xiaohai Tian, Jun Zhang, Zhizheng Wu

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Large Language Models, Speech Large Language, Large Language, Speech Large, Language Models

Comments

Click to view abstract

Abstract:Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this localized degradation as an entity binding failure: continuous speech features cause models to lose precise entity-property associations during implicit reasoning. To resolve this, we propose Entity-Aware Chain-of-Thought (EA-CoT), forcing SLLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4% absolute accuracy improvement. Ablations confirm these gains stem entirely from explicit semantic binding, reframing the gap as a resolvable bottleneck.

77. 【2606.04466】Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

Link: https://arxiv.org/abs/2606.04466

Authors: Chongyang He, Rui Zhang, Zixuan Wang, Xin Li

Categories: Computation and Language (cs.CL)

Keywords: Small Language Models, Post-training Small Language, Small Language, existing work rarely, Language Models

Comments: 25 pages, 12 figures

Click to view abstract

Abstract:Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the distinct roles of SFT and RL: SFT is better suited for acquiring not-yet-mastered reasoning skills, while RL is better suited for consolidating skills that the model can already partially access. Based on this principle, we propose a difficulty-aware SFT-then-RL framework that organizes training data into stage-specific sets. For hard samples in the SFT stage, we introduce a Bridge mechanism that transforms raw teacher-generated reasoning traces into more learnable supervision for SLMs. For hard samples that remain unsolved during RL, we apply Critique Fine-Tuning by converting all-zero-reward failures into diagnostic, repair, and new reasoning trace supervision for the next SFT stage. Experiments on two SLMs across five reasoning benchmarks show that our method consistently improves over representative SFT, distillation, and RL baselines. Our results highlight the importance of coordinating data difficulty across SFT and RL for effective SLM reasoning post-training.

78. 【2606.04465】SePO: Self-Evolving Prompt Agent for System Prompt Optimization

Link: https://arxiv.org/abs/2606.04465

Authors: Wangcheng Tao, Han Wu, Weng-Fai Wong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: agents' system prompts, task agents' system, System prompt, system prompts, prompt agent

Comments: 26 pages. Code: [this https URL](https://github.com/taowangcheng/SePO)

Click to view abstract

Abstract:System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside task agents' system prompts. SePO adopts a self-referential design. A single prompt agent improves both task agents' system prompts and its own under an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual-CoT. The prompt optimization skill from pre-training also generalizes to tasks beyond the pre-training mixture, rather than memorizing per-task prompts.

79. 【2606.04459】Token Rankings are Unforgeable Language Model Signatures

Link: https://arxiv.org/abs/2606.04459

Authors: Matthew Finlayson, Andreas Grivas, Xiang Ren, Swabha Swayamdipta

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)

Keywords: Language model parameters, API distributes logits, Language model, geometric constraints, model

Abstract:Language model parameters are known to impose unique (to each model) geometric constraints on their logit outputs, which serves as a signature that identifies the model, but also leaks the model's final layer parameters when an API distributes logits. We investigate more restrictive APIs that expose token rankings (i.e., their ordering by probability, but not the probability values) and find that rankings also constitute a signature: every model has a unique set of feasible top-$k$ rankings for sufficiently large $k$. Furthermore, the ranking signature is the first known (polynomially) unforgeable signature, since finding a model with the same set of feasible rankings is NP-hard. On the security front, we find that token rankings are already sufficient to approximately steal the final layer of the model, similar to logits, though the approximation is too coarse to forge the signature, and can be effectively countered by restricting the API to top-$k$ tokens with sufficiently small $k$. Since the top-$k$ required to present the model signature is generally smaller than the $k$ required to prevent stealing, it is possible for an API to present an unforgeable signature without leaking model parameters.

80. 【2606.04455】The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Link: https://arxiv.org/abs/2606.04455

Authors: Xinyu Lu, Tianshu Wang, Pengbo Wang, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: human-designed workflows, task execution, execution within human-designed, Current, Current AI benchmarks

Remarks: Website: https://meta-agent-challenge.com/

Abstract:Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. The benchmark is publicly available via the website listed above.

81. 【2606.04454】Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation

Link: https://arxiv.org/abs/2606.04454

Authors: Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

Categories: Computation and Language (cs.CL)

Keywords: Large language models, natural language generation, downstream reasoning tasks, Large language, shown strong performance

Abstract:Large language models have shown strong performance in natural language generation and downstream reasoning tasks, but they still struggle with logical consistency, factual grounding, and interpretability in complex multi-step reasoning. To address these limitations, this paper proposes SGR, a stepwise reasoning enhancement framework that integrates large language models with external knowledge graphs through query-relevant subgraph generation. Given an input question, SGR first extracts key entities, relations, and constraints to construct a structured schema, then retrieves compact subgraphs from a knowledge graph using schema-guided querying. The generated subgraphs provide explicit relational evidence that guides the language model through step-by-step reasoning. In addition, SGR combines direct Cypher-based reasoning with collaborative reasoning integration, allowing candidate answers from multiple reasoning paths to be validated and aggregated according to both model confidence and graph consistency. Experiments on benchmark datasets including CWQ, WebQSP, GrailQA, and KQA Pro demonstrate that SGR improves reasoning accuracy and Hits@1 performance over standard prompting and several knowledge-enhanced baselines. Ablation studies further show that schema guidance and Neo4j-based retrieval are both crucial to the effectiveness of the framework. These results indicate that dynamically generated external subgraphs can improve the accuracy, robustness, and interpretability of LLM-based reasoning.

82. 【2606.04450】Listening to the Workforce: Measuring Construction Worker Safety Attitudes from Social Media Discourse Using LLMs

Link: https://arxiv.org/abs/2606.04450

Authors: Farouq Sammour, Yuxin Zhang, Zhenyu Zhang

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: safety attitudes, key determinants, Worker safety attitudes, Construction Safety Attitude, attitudes

Abstract:Worker safety attitudes are key determinants of whether protective practices are applied or bypassed on construction sites. Yet measuring them at scale has remained out of reach. Safety attitudes are multidimensional, vary across topics, and surface most candidly in workers' own conversations. This study created and validated the Construction Safety Attitude Framework (CSAF), which integrates two components: a theory-grounded structure that characterizes safety attitudes along eight dimensions, and an operational codebook for measuring them in workers' naturalistic discourse. Applying CSAF to 250 posts and comments from the r/Construction community on Reddit, trained coders reached strong agreement (Krippendorff's $\alpha = 0.85$). Pairwise lift and conditional probability confirmed that the eight dimensions are related yet distinct. To apply the framework across large volumes of discourse, CSAF was operationalized through a large language model (LLM) classifier. On 450 r/Construction contributions, the classifier reproduced expert human coding (Cohen's $\kappa = 0.90$, precision = 0.98, recall = 0.98), and on 400 contributions from r/Roofing it retained that accuracy after transfer to a different trade community ($\kappa = 0.89$, precision = 0.98, recall = 0.97). A proof-of-value case study then applied the validated classifier to 10,346 contributions from r/Roofing, demonstrating that CSAF can distinguish multidimensional attitudes by safety topic, track how they shift over time, and trace the reasoning behind unfavorable ones. The study therefore provides a theoretically grounded, empirically vetted instrument for examining safety attitudes, offering a basis for targeted interventions that address the attitudes underlying unsafe practices.
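
The classifier-versus-expert agreement in this abstract is reported as Cohen's kappa. As a rough illustration of what that statistic measures, here is a minimal Python sketch of the standard two-rater formula. The function name `cohens_kappa` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both raters give the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy labels (illustration only, not data from the paper).
human = ["fav", "fav", "unfav", "unfav", "fav", "unfav", "fav", "unfav"]
model = ["fav", "fav", "unfav", "fav", "fav", "unfav", "fav", "unfav"]
kappa = cohens_kappa(human, model)
```

Kappa subtracts the agreement expected by chance from the observed agreement, which is why it is usually preferred over raw accuracy when label frequencies are unbalanced.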

83. 【2606.04442】MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

Link: https://arxiv.org/abs/2606.04442

Authors: Qiyang Xie, Jialun Wu, Xinjie He, Su Liu, Shuai Xiao, Zhiyuan Lin, Weikai Zhou

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: performing deep reading, deep reading comprehension, Caselaw Access Project, navigating multi-session conversation, demanding capabilities

Remarks: 17 pages, 2 figures, 8 tables. Submitted for peer review

Abstract:AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep reading comprehension within long documents. Yet no existing benchmark evaluates both simultaneously. We introduce MemoryDocDataSet, a synthetic benchmark of 50 micro-worlds and 1,000 QA pairs in which each instance comprises 3-5 personas, a temporal event graph spanning months of activity, 3-5 real long documents (20,000-50,000 tokens each sourced from the Caselaw Access Project), multi-session conversations grounded on those documents, and 20 question-answer pairs across five reasoning categories. The defining feature is the Hybrid source tag: questions requiring a system to first navigate conversation history to identify which document is relevant, then extract the answer from within that document. Hybrid questions account for 75.1% of the dataset. Dataset quality is characterised through a prompt-sensitivity self-consistency analysis using LLM-as-judge, yielding a median Cohen's $\kappa = 0.634$ across all 50 micro-worlds. We evaluate six baseline configurations spanning truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The best baseline (RAG-Both) achieves 0.358 overall F1 and 0.342 on Hybrid. Document-only retrieval (RAG-Doc) collapses to 0.267 on Hybrid despite achieving 0.453 on Doc-only questions, demonstrating a clear joint-retrieval gap that motivates architectures unifying conversational memory with long-document navigation. We release the dataset, generation pipeline, and all baseline implementations.

84. 【2606.04435】Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Link: https://arxiv.org/abs/2606.04435

Authors: Saroj Mishra

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: mechanisms systematically miss, incorrect final outputs, demonstrated significant capability, factually incorrect final, Cascading Hallucination Aware

Abstract:Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/- 18 ms average latency overhead per stage, achieving an error propagation reduction of 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

85. 【2606.04433】Stateful Visual Encoders for Vision-Language Models

Link: https://arxiv.org/abs/2606.04433

Authors: Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: multi-turn agentic settings, Vision-language models, multi-turn agentic, visual, agentic settings

Remarks: Project page: https://statefulvisualencoders.github.io/

Abstract:Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains.

86. 【2606.04418】CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

Link: https://arxiv.org/abs/2606.04418

Authors: Eugene Kwek, Feng Liu, Rui Zhang, Wenpeng Yin

Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Neural audio codecs, speech processing pipelines, Neural audio, processing pipelines, key component

Abstract:Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency, often encoding perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec which learns to encode only perceptually important features and discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference, highlighting significant efficiency gains.

87. 【2606.04396】Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

Link: https://arxiv.org/abs/2606.04396

Authors: Anant Khandelwal, Manish Gupta

Categories: Computation and Language (cs.CL)

Keywords: Diffusion large language, large language models, Diffusion large, language models, positions in parallel

Remarks: 19 pages, 10 figures, 7 tables

Abstract:Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory. Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute. We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.
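
The abstract says CAPR redistributes the final outcome reward across blocks according to the tokens revealed in each block. One plausible reading, sketched below purely as an assumption, is a proportional split. The function name `redistribute_reward` and the token counts are hypothetical, and the paper may weight blocks differently.

```python
def redistribute_reward(final_reward, tokens_revealed_per_block):
    """Split one scalar reward across blocks in proportion to tokens revealed.

    Assumption: a proportional split. The abstract does not give the exact rule.
    """
    total_tokens = sum(tokens_revealed_per_block)
    if total_tokens == 0:
        # Nothing was revealed, so there is no credit to assign.
        return [0.0 for _ in tokens_revealed_per_block]
    return [final_reward * count / total_tokens
            for count in tokens_revealed_per_block]


# Hypothetical rollout: three blocks revealing 8, 4, and 4 tokens.
block_rewards = redistribute_reward(1.0, [8, 4, 4])
```

Under this reading the per-block rewards always sum to the original outcome reward, so the value head would be trained on a decomposition of the same signal instead of a new one.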

88. 【2606.04392】Physics-Informed Neural Network Modeling of Biodegradable Contaminant Transport through GCL/SL Composite Liners

Link: https://arxiv.org/abs/2606.04392

Authors: Dong Li, Yapeng Cao, Haiping Zhao, Shutong Han

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: thin GCL layer, composite liner system, underlying soil liner, two-domain physics-informed neural, transient transport domain

Abstract:This study develops a two-domain physics-informed neural network framework for contaminant transport through a GCL/SL composite liner system, in which the thin GCL layer is treated using a steady-state advection-dispersion-biodegradation formulation and the underlying soil liner is modeled as a transient transport domain. Two formulations are evaluated against analytical and finite-element reference solutions under different leachate-head conditions: a standard PINN with soft constraint enforcement (Std-PINN) and a hard-constrained PINN (H-PINN), in which selected boundary and initial conditions are embedded directly into the trial solutions. The Std-PINN captures the overall breakthrough behavior but shows larger errors during the early transport stage, particularly under higher leachate heads where advective transport becomes more pronounced. The H-PINN reduces the optimization burden associated with penalty-based constraint enforcement and provides more accurate and stable concentration predictions, lowering the MAE from approximately 0.058-0.067 for the Std-PINN to about 0.011-0.023 for the H-PINN, while reducing the MRE from approximately 9.10%-19.16% to about 2.08%-3.14%. Parametric analyses confirm that the H-PINN with the tanh activation function and an optimized network structure provides the best predictive accuracy. The H-PINN is further extended to inverse modeling for identifying the SL degradation half-life from limited concentration observations, showing reliable convergence toward prescribed values and acceptable robustness under low-to-moderate observation noise.
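
The hard-constrained idea in this abstract, embedding conditions directly into the trial solution, can be illustrated with a common construction: write the solution as the initial profile plus a term that vanishes at t = 0. The sketch below is an assumption-laden toy example. The initial profile, the stand-in network, and all names are hypothetical and are not the paper's formulation.

```python
import math


def initial_concentration(z):
    """Hypothetical initial profile c(z, 0); not the paper's condition."""
    return math.exp(-z)


def raw_network(z, t):
    """Stand-in for a neural network output, using fixed toy weights."""
    return math.tanh(0.7 * z - 1.3 * t + 0.2)


def hard_constrained_solution(z, t):
    """Trial solution c(z, t) = c0(z) + t * N(z, t).

    The factor t forces the network term to vanish at t = 0, so the
    initial condition holds exactly for any network weights.
    """
    return initial_concentration(z) + t * raw_network(z, t)
```

Because the initial condition is satisfied by construction, the loss no longer needs a penalty term for it, which matches the abstract's point about reducing the optimization burden of penalty-based enforcement.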

89. 【2606.04389】When Clients Stop Following: A Cognitive Conceptualization Diagram-driven Framework for Strategic Counseling

Link: https://arxiv.org/abs/2606.04389

Authors: Yihao Qin, Junyi Zhao, Changsheng Ma, Yongfeng Tao, Minqiang Yang, Chang Liu, Bin Hu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, existing benchmarks rely, benchmarks rely heavily, highly cooperative simulated

Abstract:Large Language Models (LLMs) show promise in psychological counseling, yet existing benchmarks rely heavily on highly cooperative simulated clients. We observe a critical counselor-following phenomenon: these clients often rapidly shift from resistance to compliance after only a few turns, creating an illusion of therapeutic progress and inflating scores under current evaluation protocols through superficial empathy. To address this evaluation mismatch, we propose a Cognitive Behavioral Therapy (CBT)-grounded resistance-aware framework. We introduce CARS, a client simulator that explicitly models dynamic resistance via Cognitive Conceptualization Diagrams (CCDs). We present STREAMS, a dual-module framework that decouples strategic reasoning (Thinker) from response generation (Presenter) and optimizes it via reinforcement learning. We further propose EWTS-MI, an entropy-weighted metric for evaluating responsiveness under high-friction interactions. Experiments across resistant and non-resistant counseling settings validate our findings on evaluation mismatch and demonstrate the effectiveness of resistance-aware training for improving strategic robustness under challenging counseling interactions.

90. 【2606.04378】DLLG: Dynamic Logit-Level Gating of LLM Experts

Link: https://arxiv.org/abs/2606.04378

Authors: Bingnan Li, Zhaoyang Zhang, Xiaoze Liu, Yantao Shen, Shuli Jiang, Shuo Yang, Wei Xia, Zhuowen Tu, Stefano Soatto

Categories: Computation and Language (cs.CL)

Keywords: combine complementary strengths, merging introduces interference, Leveraging multiple specialized, existing approaches trade, approaches trade adaptability

Abstract:Leveraging multiple specialized LLMs can combine complementary strengths, but existing approaches trade adaptability for stability: routing commits prematurely, heuristic ensembling depends on fragile proxies, and parameter merging introduces interference. We propose DLLG (Dynamic Logit-Level Gating), a dynamic logit-level ensembling framework that learns token-level expert fusion from sparse response-level supervision. A lightweight gating module predicts step-wise fusion weights, linking trajectory-level correctness to generation without token-level labels or expert retraining. Across diverse reasoning and code benchmarks, DLLG consistently outperforms strong routing, heuristic ensembling, and parameter-merging baselines across model scales, highlighting learned logit-level fusion as a robust and scalable paradigm for integrating specialized experts.
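
Logit-level fusion with step-wise weights can be pictured as a weighted sum of each expert's next-token logits at every decoding step. The sketch below shows only that fusion step, under the assumption that a gate supplies one score per expert. The gating module itself is not described in the abstract, so `gate_scores` is a hypothetical input and the numbers are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(expert_logits, gate_scores):
    """Combine per-expert logits for one decoding step.

    expert_logits: one list of vocabulary logits per expert.
    gate_scores: one unnormalised score per expert (hypothetical gate output).
    """
    weights = softmax(gate_scores)
    vocab_size = len(expert_logits[0])
    fused = [
        sum(w * logits[v] for w, logits in zip(weights, expert_logits))
        for v in range(vocab_size)
    ]
    return fused, weights


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Two toy experts over a three-token vocabulary.
experts = [[2.0, 0.0, -1.0], [0.0, 3.0, -1.0]]
fused_equal, weights_equal = fuse_logits(experts, [0.0, 0.0])
fused_skewed, _ = fuse_logits(experts, [10.0, -10.0])
```

With equal gate scores the fused logits are a plain average of the experts. Skewing the gate toward the first expert makes the fused prediction follow that expert, which is the token-level behaviour a learned gate would control.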

91. 【2606.04367】GlossAssist -- A Tool to Simplify Corpus Creation and Study the Effect of NLP Models in Low-Resource Documentation Settings

Link: https://arxiv.org/abs/2606.04367

Authors: Bhargav Shandilya, Matt Buchholz, Alexis Palmer

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Interlinear glossed text, Interlinear glossed, glossed text, language documentation, standard format

Remarks: 6 pages, 3 figures

Abstract:Interlinear glossed text (IGT) is the standard format for linguistic annotation in language documentation. Producing it manually, however, is often slow and costly. Automated glossing systems have improved substantially in recent years, but adoption among field linguists remains limited. Existing tools are designed to be evaluated rather than used, offering no interpretable path for correction or the incorporation of linguistic expertise back into model behavior. We present GlossAssist, a glossing tool built around the retrieval-based architecture of CWoMP (Contrastive Word-Morpheme Pre-training), which grounds predictions in a mutable lexicon of learned morpheme representations. In conjunction with CWoMP, our system treats each correction by an annotator as part of an active learning setting, which expands the lexicon and improves future predictions without having to retrain the model. In this paper, we present our interface and argue that this feedback loop should be treated as a design requirement for NLP tools aimed at documentary linguists.

92. 【2606.04362】Disentangling Answer Engine Optimization from Platform Growth: A Log-Based Natural Experiment on ChatGPT Referral Traffic

Link: https://arxiv.org/abs/2606.04362

Authors: Keisuke Watanabe, Kazuki Nakayashiki

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Answer Engine Optimization, search engine optimization, called Answer Engine, engine optimization, send measurable referral

Remarks: 9 pages, 4 figures, 1 table

Abstract:Large language model (LLM) "answer engines" such as ChatGPT now send measurable referral traffic to the open web, and a practice analogous to search engine optimization, here called Answer Engine Optimization (AEO), has emerged. Public AEO success stories typically quote large raw growth multiples, but raw referral growth is confounded by the rapid platform-level growth of the answer engines themselves. We report a longitudinal field study on a single high-traffic domain (this http URL) whose corpus of hundreds of thousands of YouTube question-and-answer pages received a defined bundle of AEO interventions in January 2026 (detailed in Section 4). Because the interventions were concentrated on one subset of the site, the untreated remainder of the same domain acts as a contemporaneous control that absorbs the platform tailwind. Using first-party analytics and server logs rather than probabilistic third-party estimators, we find: (1) raw growth is dominated by the platform tailwind: on monthly aggregates total ChatGPT referrals grew 5.7x while untreated pages on the same domain grew 3.5x over the same window; (2) an interrupted time-series model on the weekly treated/control ratio estimates a discrete, intervention-aligned level increase of 1.82x (95% CI 1.31-2.54, HAC p=0.001), robust across engagement-filtered traffic (2.27x) and alternative specifications; (3) however, a conservative placebo-in-time permutation test yields p=0.16, so the effect is suggestive, not conclusive, given a short and noisy pre-period; and (4) Google organic clicks to treated pages did not fall beyond the ambient site-wide trend and indexation was preserved, consistent with the SEO-protection rule. The methodological message, separating treatment from platform tailwind with an on-domain control, matters more than any single multiple, and implies that headline AEO multiples substantially overstate causal effect.

93. 【2606.04360】Deliberate Evolution: Agentic Reasoning for Sample-Efficient Symbolic Regression with LLMs

Link: https://arxiv.org/abs/2606.04360

Authors: Xinyu Pang, Zhanke Zhou, Xuan Li, Fangrui Lv, Shanshan Wei, Sen Cui, Bo Han, Changshui Zhang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: discovers compact mathematical, compact mathematical expressions, evolutionary methods remain, methods remain sample-inefficient, discovers compact

Remarks: ICML 2026

Abstract:Symbolic regression (SR) discovers compact mathematical expressions from data, yet recent LLM-based evolutionary methods remain sample-inefficient because they rely mainly on scalar feedback such as MSE. We identify a core limitation: existing methods conflate candidate proposal with search guidance, requiring the LLM to infer how to evolve an expression, diagnose its errors, and reuse past experience from a single score. To address this, we propose Deliberate Evolution (DE), an agentic framework that decouples symbolic generation from search control. DE guides LLM proposals with adaptive operators for search direction, analytical tools for structural diagnosis, and reflective memory for trajectory-level experience. Experiments on LLM-SRBench show that DE consistently outperforms representative LLM-based SR baselines across diverse scientific domains while using only 40% of the standard sample budget.

94. 【2606.04351】Video2LoRA: Parametric Video Internalization for Vision-Language Models

Link: https://arxiv.org/abs/2606.04351

Authors: Manan Suri, Sarvesh Baskar, Dinesh Manocha

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: frame occupies hundreds, Processing video, occupies hundreds, Processing, video

Abstract:Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Video2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Video2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Video2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Video2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query TTFT by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.

95. 【2606.04340】Noisy memory encoding explains negative polarity illusions

Link: https://arxiv.org/abs/2606.04340

Authors: Yuhan Zhang, Edward Gibson

Categories: Computation and Language (cs.CL)

Keywords: negative polarity word, strictly speaking, negative polarity, rated as acceptable, polarity word

Remarks: 21 pages, 5 figures, submitted for journal publication

Abstract:A sentence like "The authors that no critics recommended have ever received acknowledgment for a best-selling novel" is sometimes rated as acceptable even though, strictly speaking, it is ungrammatical because the negative polarity word "ever" is not licensed where it is. This behavioral effect is sometimes called a "negative polarity illusion". Here we propose that the lossy context surprisal theory of Hahn et al. (2022) -- whereby people have an imperfect encoding of complex sentences -- might explain this effect. We hypothesize that people have poor memory representation of the determiners in the main-clause and embedded-clause subjects and could entertain a determiner exchange that licenses ever. We propose that more similar determiners in those positions would trigger stronger illusion effects. Acceptability judgment tasks with six novel determiner pairs (e.g., "few" and "many", "few" and "most") support our proposal, showing, specifically, that a novel sentence, "Many authors that few critics recommended have ever received acknowledgment for a best-selling novel", triggered a much stronger illusion than the canonical one even without time pressure. These results offer further support for the suggestion that human language processing is imperfect and resource-rational: in face of working memory limitations, humans rationally reconstruct what is most likely from noisy linguistic input to facilitate downstream processing.

96. 【2606.04325】Parameter-Efficient Fine-Tuning with Learnable Rank

Link: https://arxiv.org/abs/2606.04325

Authors: Arpit Garg, Simon Lucey, Hemanth Saratchandran

Categories: Computation and Language (cs.CL)

Keywords: restricts weight updates, fixed low-rank inductive, popular parameter-efficient fine-tuning, low-rank inductive bias, effective inductive bias

Comments: In Submission

Abstract:Low-Rank Adaptation (LoRA) is a popular parameter-efficient fine-tuning (PEFT) method that restricts weight updates to low-rank adapters, introducing a fixed low-rank inductive bias by optimizing in a low-dimensional subspace. In this work, we question whether a fixed-rank constraint is the most effective inductive bias for parameter-efficient fine-tuning. We introduce *Learnable Rank LoRA (LR-LoRA)*, a PEFT method in which the adapter rank is learned during the training process. Instead of prescribing a uniform rank for all adapter layers, LR-LoRA allows the optimizer to determine the appropriate rank for each layer. Using this approach, we find substantial layer-wise variation in the learned ranks, with the attention and MLP layers in the transformer models exhibiting systematically different rank preferences. Across a range of language understanding and commonsense reasoning benchmarks, LR-LoRA achieves state-of-the-art performance in most settings and consistently outperforms strong PEFT baselines, demonstrating that a learnable rank provides a more flexible and effective inductive bias than fixed-rank adaptations.
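
The fixed-rank constraint discussed in this abstract can be made concrete with a parameter count. A rank-r LoRA update factors the weight change as B·A, with B of shape d_out × r and A of shape r × d_in, so it trains r·(d_out + d_in) parameters instead of d_out·d_in. The sketch below only does this arithmetic; the layer names, shapes, and per-layer ranks are made up for illustration, and it does not implement LR-LoRA's rank-learning procedure, which the abstract does not detail.

```python
def lora_params(d_out, d_in, rank):
    """Trainable parameters of a rank-`rank` update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)

# Hypothetical layer shapes as (d_out, d_in); not taken from the paper.
layers = {"attn.q_proj": (1024, 1024), "mlp.up_proj": (4096, 1024)}

# One rank for every layer versus a different rank per layer (made-up values).
uniform = {name: 8 for name in layers}
per_layer = {"attn.q_proj": 4, "mlp.up_proj": 16}

def total(ranks):
    """Sum adapter parameters over all layers for a given rank assignment."""
    return sum(lora_params(*layers[name], r) for name, r in ranks.items())

# Parameters touched by full fine-tuning of the same two matrices.
full_params = sum(d_out * d_in for d_out, d_in in layers.values())

print(total(uniform), total(per_layer), full_params)
```

Both adapter configurations stay far below the full-matrix count; the per-layer assignment simply moves capacity between layers.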

97. 【2606.04302】LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

Link: https://arxiv.org/abs/2606.04302

Authors: Haocheng Xia, Mihir Pamnani, Hanxi Fang, Supawit Chockchowwat, Yongjoo Park

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, reusing past computations, language models, generated tokens, large language

Comments: ICML 2026

Abstract:Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation (RAG) and in-context learning (ICL). However, conventional KV caching embeds positional information directly into the cache, limiting its reusability. Existing solutions either restrict reuse to prefixes or require expensive memory materialization for positional re-encoding. We introduce LazyAttention, a novel attention mechanism that kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV reuse. By adjusting positional encoding within attention kernels on-the-fly, LazyAttention resolves the materialization bottleneck, allowing a single physical KV copy to serve multiple logical requests at arbitrary positions. Leveraging attention kernels tailored for prefilling and decoding, our system achieves significant efficiency improvements: under skewed document distributions, it reduces time-to-first-token (TTFT) by 1.37$\times$ and increases inference throughput by 1.40$\times$ compared to the state-of-the-art Block-Attention, while maintaining comparable output quality.
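
To see why deferring positional encoding can make a cached key reusable at any position, consider a rotary-style encoding, where position enters as a rotation. The sketch below assumes such a scheme on a single 2-D pair with a hypothetical frequency `theta`; the abstract does not say which encoding LazyAttention targets, and this is not its attention kernel.

```python
import math

def rotate(vec, pos, theta=0.1):
    """Rotate a 2-D vector by the angle pos * theta (one rotary pair)."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# The cache holds the key WITHOUT any position applied (toy values).
cached_key = (0.3, -1.2)
query = (0.7, 0.5)

def score(query_pos, key_pos):
    """Apply positions only at attention time, then take the dot product."""
    return dot(rotate(query, query_pos), rotate(cached_key, key_pos))

# The same cached key is served at two different logical positions.
a = score(query_pos=10, key_pos=4)
b = score(query_pos=110, key_pos=104)
print(abs(a - b) < 1e-9)
```

Because rotations preserve dot products up to the relative angle, the score depends only on the offset between query and key positions, so a single unrotated copy of the key can be reused wherever the document lands in the prompt.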

98. 【2606.04286】Using Text-Based Causal Inference to Disentangle Factors Influencing Online Review Ratings

Link: https://arxiv.org/abs/2606.04286

Authors: Linsen Li, Aron Culotta, Nicholas Mattei

Categories: Computation and Language (cs.CL)

Keywords: Online reviews provide, provide valuable insights, reviews provide valuable, product or service, Online reviews

Comments: HLT/NAACL 2025

Abstract:Online reviews provide valuable insights into the perceived quality of facets of a product or service. While aspect-based sentiment analysis has focused on extracting these facets from reviews, there is less work understanding the impact of each aspect on overall perception. This is particularly challenging given correlations among aspects, making it difficult to isolate the effects of each. This paper introduces a methodology based on recent advances in text-based causal analysis, specifically CausalBERT, to disentangle the effect of each factor on overall review ratings. We enhance CausalBERT with three key improvements: temperature scaling for better calibrated treatment assignment estimates; hyperparameter optimization to reduce confound overadjustment; and interpretability methods to characterize discovered confounds. In this work, we treat the textual mentions in reviews as proxies for real-world attributes. We validate our approach on real and semi-synthetic data from over 600K reviews of U.S. K-12 schools. We find that the proposed enhancements result in more reliable estimates, and that perception of school administration and performance on benchmarks are significant drivers of overall school ratings.
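
Temperature scaling, the first enhancement listed in this abstract, divides a model's logits by a scalar T before the softmax and picks T on held-out data. The sketch below is a generic version using a small grid search over T; the function names, toy logits, and grid are assumptions for illustration, and how the authors integrate the step into CausalBERT is not described in the abstract.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens, T < 1 sharpens)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_list, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [
        -math.log(softmax(logits, temperature)[label])
        for logits, label in zip(logits_list, labels)
    ]
    return sum(losses) / len(losses)

def fit_temperature(logits_list, labels, grid):
    """Pick the temperature from `grid` that minimizes held-out NLL."""
    return min(grid, key=lambda t: nll(logits_list, labels, t))

# Toy held-out set: the model is very confident but wrong on one of four items.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 1, 1, 0]
best_t = fit_temperature(val_logits, val_labels, grid=[0.5, 1.0, 2.0, 4.0, 8.0])
print(best_t, softmax([4.0, 0.0], best_t))
```

On this toy set the search selects a temperature above 1, which pulls the overconfident probabilities toward the observed accuracy without changing which class is predicted.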

99. 【2606.04284】Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

Link: https://arxiv.org/abs/2606.04284

Authors: Yifan Wang, Jinyi Mu, Mayank Jobanputra, Yu Wang, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: enabling large language, Preference modeling plays, large language models, enabling large, modeling plays

Abstract:Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to model individual preferences. Nevertheless, these components often fail to capture coherent and disentangled patterns, limiting their interpretability and effectiveness for personalization. In this work, we propose a sparse Mixture-of-Experts (MoE) reward model that encourages sparse routing and expert diversity during training on binary preference data. Across controlled and real-world experiments, sparse MoE learns interpretable routing patterns and specialized experts. It also improves test-time personalization, and post-adaptation shifts in expert weights provide a qualitative lens for analyzing how the model adapts to personalized preferences.

100. 【2606.04274】Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit

Link: https://arxiv.org/abs/2606.04274

Authors: JooYoung Lee, Lin Tian, Angela Brillantes, Adriana-Simona Mihăiţă, Marian-Andrei Rizoiu

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: online information verification, Gemini Flash Lite, Claude Sonnet, default tools, tools for online

Abstract:As large language models (LLMs) become default tools for online information verification, an implicit assumption follows them: that scale and general capability are sufficient for nuanced classification of misinformation discourse. We test this assumption directly on 900 Reddit comments spanning three PolitiFact-verified misinformation claims (environment, health, immigration), labelled as belief (propagates the claim), fact-check (corrects it), or other. We compare nine models across three paradigms -- BART-MNLI, three Llama variants, three commercial frontier LLMs (Claude Haiku 4.5, Gemini Flash Lite 2.5, Claude Sonnet 4.6), and fine-tuned DistilBERT and RoBERTa -- under universal and topic-specific label schemas. The assumption does not hold. Fine-tuned RoBERTa reaches 0.62 macro-$F_1$ against a best zero-shot result of 0.50 (Claude Haiku 4.5), at a fraction of the per-query cost; the supervised advantage is concentrated on the belief class, the implicit, affective category every zero-shot model under-detects. Scaling does not help: Llama-3-8B matches Llama-3-70B, and Claude Sonnet 4.6 underperforms the smaller Haiku under generic labels, collapsing belief detection to 0.17 and refusing outright on a subset of comments flagged as sensitive. This is a safety-alignment artefact, not a capacity limit. Label schema and topic jointly shape zero-shot performance, with the same model varying by more than 0.13 macro-$F_1$ across topics under matched labels. In a verification context, where missing belief is the costlier error, task-specific fine-tuning remains the more reliable choice despite the proliferation of large generative models.

101. 【2606.04262】Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

Link: https://arxiv.org/abs/2606.04262

Authors: Maroof Kousar, Yibo Hu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: everyday health questions, Large language models, Large language, health questions, everyday health

Comments: 16 pages, 7 figures

Abstract:Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Yet this common safety-relevant setting remains underexplored in existing medical QA evaluations, where correct answers require tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories. We introduce DOSEBENCH, a focused benchmark of 81 curated OTC dosing scenarios focused on adult acetaminophen and ibuprofen use, with manually annotated gold references. We evaluate four LLMs across repeated runs using metrics for decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals, resulting in 1,620 model responses. Our results show that models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases and that stable or confident-looking responses can still violate dosing constraints. These findings suggest that OTC dosing QA provides a narrow yet practical testbed for evaluating temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.
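
The rolling 24-hour reasoning this abstract refers to differs from a calendar-day reset: what counts is everything taken in the 24 hours before the proposed dose. A minimal sketch of that check follows. The function names and all numeric limits are placeholders chosen for illustration; they are not taken from DOSEBENCH or from any product label and are not dosing guidance.

```python
from datetime import datetime, timedelta

def rolling_intake_mg(doses, now, window=timedelta(hours=24)):
    """Total mg taken within the window that ends at `now`."""
    return sum(mg for taken_at, mg in doses if now - window < taken_at <= now)

def can_take(doses, now, dose_mg, max_mg_per_24h, min_gap):
    """Check a proposed dose against a rolling cap and a minimum interval."""
    if doses:
        last = max(taken_at for taken_at, _ in doses)
        if now - last < min_gap:
            return False  # too soon after the previous dose
    return rolling_intake_mg(doses, now) + dose_mg <= max_mg_per_24h

# Placeholder history and limits (illustrative only).
doses = [
    (datetime(2025, 1, 1, 8), 500),
    (datetime(2025, 1, 1, 14), 500),
    (datetime(2025, 1, 1, 20), 500),
]
cap, gap = 1500, timedelta(hours=4)

early = can_take(doses, datetime(2025, 1, 2, 7), 500, cap, gap)  # 08:00 dose still counts
later = can_take(doses, datetime(2025, 1, 2, 9), 500, cap, gap)  # 08:00 dose has aged out
print(early, later)
```

The two queries fall on the same calendar day yet get different answers, which is the kind of case a midnight-reset rule would get wrong.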

102. 【2606.04261】Can Generalist Agents Automate Data Curation?

Link: https://arxiv.org/abs/2606.04261

Authors: Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Keywords: practitioners iteratively propose, Curating training data, noisy benchmark feedback, Curating training, modern AI development

Comments: Preprint

Abstract:Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

103. 【2606.04246】StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

Link: https://arxiv.org/abs/2606.04246

Authors: Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Ehsan Degan, Vandana Mukherjee

Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL)

Keywords: remains challenging due, strict correctness constraints, Automatic generation, designs remains challenging, RTL code generation

Comments: 6 pages, 2 figures, DAC'2026

Abstract:Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, and strict correctness constraints in Verilog and VHDL. We present StepPRM-RTL, a novel framework that combines stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT) to enhance both the functional correctness and reasoning fidelity of LLM-based RTL code generation. StepPRM-RTL constructs stepwise reasoning trajectories from canonical solutions, where each step contains a rationale and incremental code modification. A Process Reward Model (PRM) evaluates intermediate steps, providing dense feedback that guides reinforcement-style updates during RAFT fine-tuning. Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, enriching the training dataset with high-quality trajectories. This integration of stepwise and outcome-aware rewards allows the model to learn both how and why to construct correct RTL, improving long-horizon reasoning beyond standard supervised or outcome-based training. Experimental evaluation on benchmark Verilog and VHDL datasets demonstrates that StepPRM-RTL outperforms the best prior methods by over 10\% in functional correctness and reasoning fidelity metrics. Ablation studies confirm that the combination of PRM-guided rewards and stepwise trajectory exploration is key to its performance. StepPRM-RTL generalizes across RTL languages and provides a scalable framework for high-fidelity, interpretable code generation, establishing a new standard for LLM-assisted hardware design automation.

104. 【2606.04244】VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Link: https://arxiv.org/abs/2606.04244

Authors: Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Multimodal large language, large language models, large language, increasingly capable, capable of complex

Abstract:Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

105. 【2606.04240】Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Link: https://arxiv.org/abs/2606.04240

Authors: Jingbiao Mei

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Multimodal Document Retrieval, multimodal retrieval-augmented generation, document page retrieval, Document Retrieval Challenge, closed-set document page

Comments: MDR Challenge Report at WWW2025

Abstract:Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The *Multimodal Document Retrieval Challenge*, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a *single* retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@$\{1,3,5\}$ over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within $0.1$ point of the fine-tuned winner.
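
The ranking metric named in this abstract, the macro-average over the two tasks of mean Recall@{1,3,5}, can be sketched as follows. This is a minimal illustration that assumes exactly one relevant item per query; the helper names and the toy rankings are hypothetical, and the official scoring script may differ in its details.

```python
from statistics import mean

def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant item is among the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def task_score(runs, ks=(1, 3, 5)):
    """Mean over k of Recall@k, where each Recall@k is averaged over queries.

    runs: list of (ranked_ids, relevant_id) pairs, one per query.
    """
    return mean(
        mean(recall_at_k(ranked, relevant, k) for ranked, relevant in runs)
        for k in ks
    )

def challenge_score(task_runs):
    """Macro-average of per-task scores, so each task weighs equally."""
    return mean(task_score(runs) for runs in task_runs.values())

# Toy rankings, made up for illustration.
task_runs = {
    "MMDocIR": [(["p3", "p1", "p7"], "p1"), (["p2", "p9", "p4"], "p2")],
    "M2KR": [(["d5", "d8", "d1", "d6"], "d6"), (["d2", "d3", "d4"], "d2")],
}
score = challenge_score(task_runs)
print(round(score, 4))
```

Macro-averaging means a task with fewer queries counts as much as a larger one, so a system cannot rank well by excelling at only one of the two regimes.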

106. 【2606.04236】Supportive Token Revealing for Fast Diffusion Language Model Decoding

Link: https://arxiv.org/abs/2606.04236

Authors: Giries Abu Ayoub, Mario Barbara, Lluís Pastor-Pérez, Tanja Bien, Aneesh Barthakur, Alaa Maalouf, Loay Mualem

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: generate text efficiently, Discrete diffusion language, Discrete diffusion, generate text, text efficiently

Abstract:Discrete diffusion language models can generate text efficiently by updating multiple masked positions in parallel, but this parallelism introduces a quality-latency trade-off. Aggressive decoding may commit mutually dependent tokens too early, while conservative decoding requires many denoising steps. Existing methods address this tension by deciding which tokens are safe to reveal using confidence or dependency criteria. However, avoiding unsafe commits does not necessarily make the remaining masked sequence easy to decode, since uncertain tokens may depend on masked tokens, creating a bottleneck for denoising steps. We propose AXON, a training-free module that can be added on top of existing parallel decoding strategies for diffusion language models. Rather than replacing the base decoder, AXON monitors the remaining uncertain masked tokens and intervenes only when their current state suggests that additional context is needed. It then shifts the criterion from which tokens are safest to reveal to which confident reveals would best support later denoising. AXON selects anchors, confident masked tokens that uncertain positions attend to, using attention, uncertainty, and confidence signals. Experiments on reasoning and code-generation benchmarks across multiple diffusion language models show that AXON improves the quality-latency trade-off of existing parallel decoders, often reducing the number of function evaluations while maintaining or improving accuracy.

107. 【2606.04231】MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise QA

Link: https://arxiv.org/abs/2606.04231

Authors: Hanoz Bhathena, Parin Rajesh Jhaveri, Rohan Mittal, Prateek Singh, Aymen Kallala, Rachneet Kaur, Yiqiao Jin, Zhen Zeng, Adwait Ratnaparkhi, Denis Kochedykov

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Recent advances, producing retriever embeddings, shifted toward minimal, images for producing, producing retriever

Comments: Accepted at ACL 2026 (Industry Track)

Abstract:Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects explicit handling of the rich, structured information in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure. In this work, we take a more direct approach: MM-BizRAG proactively extracts and represents document structure via a document structure-aware split that dynamically routes documents through orientation-specific ingestion pipelines, applying explicit layout-aware parsing for vertically structured documents (e.g., reports) and holistic page-level representations for horizontally structured documents (e.g., slide decks). A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, while inference-time multimodal assembly decouples retrieval representations from generation context, enabling richer, more grounded answers without any finetuning requirement. Through experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms state-of-the-art vision-centric baselines by up to 32% points, with especially strong gains on report-style layouts. Furthermore, we introduce FastRAGEval, a single-call LLM Judge metric for fine-grained generative recall that halves RAGChecker's cost while achieving stronger human alignment.

108. 【2606.04205】DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

Link: https://arxiv.org/abs/2606.04205

Authors: Sajad Ebrahimi, Nima Jamali, Bardia Shirsalimian, Kelly McConvey, Wentao Zhang, Jalehsadat Mahdavimoghaddam, Maksym Taranukhin, Maura Grossman, Vered Shwartz, Yuntian Deng, Ebrahim Bagheri

Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: growing popularity, growing body, popularity and capacity, capacity of generative, eroded the distinction

Abstract:The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at this https URL, and the package can be installed via pip install detectzoo.

109. 【2606.04199】Cross-Prompt Generalization in Detecting AI-Generated Fake News Using Interpretable Linguistic Features

Link: https://arxiv.org/abs/2606.04199

Authors: Aya Vera-Jimenez, Samuel Jaeger, Calvin Ibenye, Dhrubajyoti Ghosh

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, large language, raised concerns, varying prompting strategies, AI-generated

Abstract:The increasing use of large language models has raised concerns about the spread of AI-generated fake news, particularly under varying prompting strategies. Most existing detection models are trained and evaluated under a single generation setting, leaving their ability to generalize across unseen prompts unclear. In this study, we investigate cross-prompt generalization in fake news detection using three datasets of AI-generated articles produced under distinct prompts, combined with real news articles. We extract interpretable linguistic features capturing lexical diversity, readability, and emotion-based characteristics and evaluate a random forest classifier under a cross-prompt framework, where models trained on one prompt are tested on another. Across all six train-test combinations, performance remains consistently high, with AUC values ranging from 0.988 to 1.000. Analysis of feature distributions shows that AI-generated text exhibits increased lexical diversity, reduced readability, and substantially lower emotional intensity compared to the overall dataset, with variations across prompts. Despite these distributional shifts, the classifier maintains strong performance, indicating that these features capture stable properties of AI-generated text that generalize across prompting strategies. These findings suggest that feature-based approaches can provide robust detection of AI-generated fake news under prompt variability.

110. 【2606.04197】Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Settle When Forming Conventions

Link: https://arxiv.org/abs/2606.04197

Authors: Aliakbar Mehdizadeh, Martin Hilbert

Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)

Keywords: LLM agent remember, LLM agent, networked Naming Game, LLM, multi-agent systems

Comments: Submitted to the Journal of Artificial Societies and Social Simulation (JASSS)

Abstract:How much should an LLM agent remember, and how should multi-agent systems be connected when trying to reach consensus? We show these two design choices interact in a way that flips the sign of memory's effect on coordination. Across 432 simulation runs of a networked Naming Game on eight fixed 16-agent topologies, we vary memory depth and network structure. Longer memory slows the time to reach steady state in decentralized networks but accelerates it in centralized ones; the same parameter pushes the system in opposite directions depending on topology. Critically, "faster settling" in centralized networks means locking in to a fragmented plateau more quickly, not reaching system-wide consensus, which can be used to generate diverging opinions. We further document a memory-mediated speed-unity trade-off: centralized networks consistently preserve more competing conventions than decentralized networks, but their settling speed depends sharply on memory. At the agent level, within-network analyses show that high-betweenness bridges suffer a brokerage penalty while agents in locally clustered neighborhoods achieve higher coordination success. Finally, in search of analytically tractable generative mechanisms, we find that agents' choices are well captured by Fictitious Play, indicating belief-based rather than reward-based adaptation. The practical implication: memory depth and communication topology should be co-designed, not optimized in isolation.

111. 【2606.04194】Training-Free Lexical-Dense Fusion for Conversational-Memory Retrieval

Link: https://arxiv.org/abs/2606.04194

Authors: Christian Lysenstøen

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: long multi-session histories, Turn Isolation Retrieval, long-term conversational memory, past turns, query across long

Comments: 9 pages, 3 figures, 10 tables. Code, data, and per-table receipts: [this https URL](https://github.com/Chrislysen/opsem)

Abstract:Retrieving the few past turns that answer a new query across long multi-session histories is the retrieval bottleneck behind long-term conversational memory (LoCoMo, LongMemEval). Recent concurrent work, Nano-Memory, shows that scoring a session by the maximum query-turn similarity (late interaction, "Turn Isolation Retrieval") beats mean-pooled session embeddings. We do not claim that effect; we replicate it and ask what a training-free, CPU-only retrieval stage should add around it. We report four findings. (1) Fuse: score-level fusion of the late-interaction dense score with BM25, under a single leave-one-conversation-out weight, adds +8.8 to +17.2 points of LoCoMo Hit@1 over late interaction alone across six encoders (all p < 1e-4), reaching Hit@1 0.752 / NDCG@5 0.829 (e5-large-v2), +11.2 pp over BM25. (2) An off-the-shelf web-search cross-encoder reranker over the fused top-10 hurts here, degrading Hit@1 by 6.9 pp (one reranker, one configuration). (3) A pooling-operator ablation shows top-k late interaction matches max-similarity, but a naive smooth-max (log-sum-exp) collapses for half the encoders. (4) The late-minus-early gap is large for all six encoders and tends to be larger for larger ones, while the marginal fusion gain shrinks; on LongMemEval-S, a lexical regime where BM25 saturates, the net fusion gain over BM25 is small and not significant. A per-category analysis frames the gain as a division of labor: dense late interaction helps most on multi-hop and temporal questions but trails BM25 on adversarial ones. The contribution is a controlled, reproducible account of a strong training-free retrieval recipe, not the late-interaction retriever itself (Nano-Memory's). We make no claim to a complete memory architecture; this is a retrieval-stage study.
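
The two ingredients named in this abstract, scoring a session by its best-matching turn and fusing that dense score with BM25 at the score level, can be sketched in a few lines. The sketch assumes min-max normalization and a convex combination with weight `w`, and it takes precomputed embeddings and BM25 scores as given; the abstract specifies none of these details, and the vectors and scores below are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def late_interaction_score(query_vec, turn_vecs):
    """Score a session by its single best-matching turn (max similarity)."""
    return max(cosine(query_vec, turn) for turn in turn_vecs)

def min_max(scores):
    """Rescale scores to [0, 1] so dense and lexical scores are comparable."""
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def fuse(dense_scores, bm25_scores, w=0.5):
    """Convex combination of normalized dense and BM25 scores."""
    dense, lexical = min_max(dense_scores), min_max(bm25_scores)
    return [w * d + (1 - w) * b for d, b in zip(dense, lexical)]

# Toy query and three sessions of two turn embeddings each (made up).
query = [1.0, 0.0]
sessions = [
    [[0.9, 0.1], [0.0, 1.0]],
    [[0.5, 0.5], [0.4, 0.6]],
    [[0.0, 1.0], [0.1, 0.9]],
]
bm25 = [1.2, 3.5, 0.3]  # hypothetical precomputed lexical scores

dense = [late_interaction_score(query, turns) for turns in sessions]
fused = fuse(dense, bm25, w=0.5)
best = max(range(len(fused)), key=fused.__getitem__)
print(dense, fused, best)
```

In this toy case the dense score alone prefers the first session, while the fused score prefers the second because of its strong lexical match. In the paper the weight is selected under a leave-one-conversation-out protocol rather than fixed by hand.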

112. 【2606.04189】ACAT: A Collaborative Platform for Efficient Aspect-Based Sentiment Dataset Annotation

Link: https://arxiv.org/abs/2606.04189

Authors: Ana-Maria Luisa Mocanu, Ciprian-Octavian Truica, Elena-Simona Apostol

Categories: Computation and Language (cs.CL)

Keywords: Aspect-Based Sentiment Analysis, train reliable models, Sentiment Analysis, sentiment analysis Collaborative, requires high-quality datasets

Comments: Accepted at The 28th International Conference on Big Data Analytics and Knowledge Discovery (DaWak 2026)

Abstract:Aspect-Based Sentiment Analysis (ABSA) requires high-quality datasets to train reliable models. However, existing annotation tools treat output as flat files, leaving researchers to manually consolidate multi-annotator data, reconstruct relational structures, and compute reliability metrics through custom scripts. This paper introduces ACAT (Aspect-based sentiment analysis Collaborative Annotation Tool), a web-based platform natively supporting four ABSA workflows: (1) Aspect-Category Sentiment Analysis, (2) Clause-Level Segmentation, (3) Aspect-Term Sentiment Analysis with character-level position tracking, and (4) Aspect Sentiment Triplet Extraction with dual span offset preservation. Its core contribution is an automated Extract, Transform, Load (ETL) pipeline that aligns collaborative annotations and computes Inter-Annotator Agreement (IAA) metrics directly at export, yielding training-ready datasets. In a preliminary validation on 1,002 restaurant reviews with two annotators of differing expertise, ACAT achieves a median annotation time of 31.58 seconds and a raw IAA ranging from 0.78 to 0.86 across all tasks.

113. 【2606.04177】A Systematic Analysis of Linguistic Features in AI-Generated Text Detection Across Domains and Models

Link: https://arxiv.org/abs/2606.04177

Authors: Yassir El Attar, Esra Dönmez, Maximilian Maurer, Agnieszka Falenska

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: non-expert users, Interpretable linguistic features, offer a promising, promising approach, approach for explaining

Comments: preprint

Abstract:Interpretable linguistic features offer a promising approach for explaining why a given text appears machine-generated, particularly for non-expert users. However, existing findings on which features reliably indicate LLM-generated text remain fragmented across feature sets, models, and text domains. To address this gap, we conduct a large-scale empirical study assessing the robustness of linguistic signals for characterizing AI-generated text. Our analysis covers 284 interpretable linguistic features across outputs from 27 LLMs and ten text domains under cross-model and cross-domain generalization settings. We show that classifiers based solely on linguistic features can reliably distinguish AI-generated from human-written text. However, many previously proposed indicators prove strongly context-dependent, with the exception of measures of lexical richness, which remain robust signals across model families and text domains. These results demonstrate which linguistic signals generalize across contexts and provide a foundation for more reliable, interpretable analyses of AI-generated language.

114. 【2606.04160】Expert-Aware Refusal Steering

Link: https://arxiv.org/abs/2606.04160

Authors: Anna C. Marbut, Daniel R. Olson, Travis J. Wheeler

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, instruction-tuned large language, Safety alignment, language models, model ability

Comments: Under review for COLM 2026

Abstract:Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing response to harmful requests. We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is uninhibited by the complex routing patterns inherent to the MoE architecture. We then propose two expert-aware refusal steering methods that leverage refusal-specific expert routing patterns and expert-specific steering directions to suppress normal refusal behavior. We find that refusal behavior can be effectively steered based on the output of a single expert. Our results show that refusal signals captured by steering methods differ from expert routing behavior, suggesting a substantial role for attention in MoE refusal behavior.

115. 【2606.04155】SocialCoach: Personalized Social Skill Learning with RL-based Agentic Tutoring and Practice

Link: https://arxiv.org/abs/2606.04155

Authors: Tianfu Wang, Max Xiong, Jianxun Lian, Hongyuan Zhu, Zhengyu Hu, Yuxuan Lei, Linxiao Gong, Xiaofang Li, Peiting Tsai, Nicholas Jing Yuan, Qi Zhang

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: today interconnected world, interconnected world, negotiation and leadership, leadership are crucial, crucial for personal

Abstract:Social skills such as negotiation and leadership are crucial for personal and professional success in today's interconnected world. However, scalable and effective training remains a significant challenge due to the scarcity of expert coaching. In this paper, we introduce SocialCoach, a holistic LLM-powered agentic tutoring system for personalized social skill development at scale. First, SocialCoach automatically constructs a pedagogically-grounded, theory-to-practice knowledge corpus from diverse expert sources, leveraging a multi-agent pipeline. Second, to personalize the learning journey, it employs an adaptive practice scheduling module that follows a prescription-retrieval-adaptation process. To maximize the long-term learning experience while overcoming the cold-start problem, this policy is optimized within a learner simulation environment through reinforcement learning. Finally, SocialCoach integrates immersive, goal-driven practice, causality-driven proficiency assessment and knowledge-grounded, reflective tutoring to help address the knowing-doing gap. We deploy it in our product, EQoach, and conduct extensive experiments. The results show that SocialCoach improves simulated pathway quality and judge-rated tutoring quality over baseline approaches, while early user feedback indicates strong perceived engagement and usefulness. These findings suggest a practical architecture for personalized and gamified pedagogical platforms on soft skill learning.

116. 【2606.04127】When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

Link: https://arxiv.org/abs/2606.04127

Authors: Erfan Nourbakhsh, Rocky Slavin, Ke Yang, Anthony Rios

Categories: Computation and Language (cs.CL)

Keywords: Medical question answering, question answering, factual errors, Medical question, Medical

Comments: 9 Pages, accepted to BioNLP Workshop at ACL 2026

Abstract:Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models. We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1-2 points. In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model's limited ability to use retrieved evidence effectively.

117. 【2606.04120】SaliMory: Orchestrating Cognitive Memory for Conversational Agents

Link: https://arxiv.org/abs/2606.04120

Authors: Kai Zhang, Xinyuan Zhang, Hongda Jiang, Shiun-Zu Kuo, Hyokun Yun, Ejaz Ahmed, Shereen Oraby, Ziyun Li, Sanat Sharma, Ann Lee, Ahmed A Aly, Anuj Kumar, Raffay Hamid, Xin Luna Dong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: maintain persistent memory, Conversational agents, serve as lifelong, lifelong companions, companions must maintain

Abstract:Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SALIMORY, a framework that trains a single language model to manage a cognitively-structured memory spanning user facts, preferences, and working memory. By introducing a hierarchical stage-wise process reward and reward-decomposed contrastive refinement, SALIMORY provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) end-to-end. SALIMORY cuts memory-attributed failures by one-third, outperforms the state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.

118. 【2606.04118】Computational conceptual history of scientific concepts: From early digital methods to LLMs

Link: https://arxiv.org/abs/2606.04118

Authors: Michael Zichert, Arno Simons

Categories: Computation and Language (cs.CL)

Keywords: article situates large, situates large language, large language models, sociology of science, article situates

Comments: 19 pages, chapter in the book Understanding Science with Large Language Models? (pp. 383-412). transcript. Edited by Arno Simons, Adrian Wüthrich, Michael Zichert, Gerd Graßhoff (eds.)

Abstract:This article situates large language models (LLMs) within the longer history of computational approaches to concept analysis in the history, philosophy, and sociology of science (HPSS). We examine what LLMs add to existing methods, how they inherit longstanding problems, and review recent case studies that employ them. In the first part, we reconstruct computational conceptual history before LLMs by bringing together three strands of work: early digital methods in HPSS, distributional approaches from digital history and related research, and lexical semantic change detection. We provide an overview of the main challenges and opportunities, focusing on corpus construction, operationalization and modelling choices, and evaluation and interpretation. In the second part, we turn to the era of LLMs, starting with a short introduction to LLMs before reviewing LLM-based work on lexical semantic change detection and relevant case studies in HPSS. We then revisit the earlier methodological questions, showing how issues of corpus construction, model choice and training data, operationalization trade-offs, and evaluation and interpretation play out in LLM-based workflows.

119. 【2606.04109】Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

Link: https://arxiv.org/abs/2606.04109

Authors: Jianguo Zhu

Categories: Computation and Language (cs.CL)

Keywords: Context-augmented language model, behavior remains underexplored, reader-model behavior remains, Context-augmented language, language model systems

Comments: Preprint. 1 figure, 9 tables

Abstract:Context-augmented language model systems often wrap supplied content with labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored. We introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs the injected wrong option. Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, Misleading Adoption Rate shifts by 56-84 percentage points. Binding or source-like labels such as Instruction: and Reference: produce high adoption, whereas Example: consistently suppresses it. Paired tests, bootstrap intervals, final-instruction ablations, and Qwen final-step log-probability probes support a label-conditioned candidate preference. Boundary probes show where the effect weakens or persists: arithmetic tasks reduce adoption, passage-shaped external context preserves smaller label gaps, short-answer evaluation rules out option-letter copying, and nested-label conflicts suggest that illustrative framing can delimit adoption scope. A 200-case single-author manual audit confirms that the short-answer contrasts are stable under conservative adjudication. The resulting claim is bounded but practical: context-utilization and reader-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context.
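
The adoption measure described in this abstract, whether the model outputs the injected wrong option, reduces to a simple proportion per label. The sketch below is a hedged illustration only; the function name, the toy outputs, and the four-item sample are hypothetical and are not data from the paper.

```python
def misleading_adoption_rate(model_outputs, injected_wrong_options):
    """Fraction of items on which the model answered with the injected wrong option."""
    if not model_outputs or len(model_outputs) != len(injected_wrong_options):
        raise ValueError("need two non-empty lists of equal length")
    adopted = sum(out == wrong for out, wrong in zip(model_outputs, injected_wrong_options))
    return adopted / len(model_outputs)

# Hypothetical paired probe: the same injected wrong options under two labels.
injected = ["B", "C", "A", "D"]
outputs_by_label = {
    "Instruction:": ["B", "C", "A", "A"],  # model follows the injection on 3 of 4 items
    "Example:": ["A", "C", "D", "B"],      # model follows the injection on 1 of 4 items
}
rates = {label: misleading_adoption_rate(outs, injected)
         for label, outs in outputs_by_label.items()}
gap_pp = 100 * (rates["Instruction:"] - rates["Example:"])
```

Because the injected content is held fixed and only the label changes, the gap in percentage points isolates the effect of the wrapper label on the toy data.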

120. 【2606.04095】POLARIS: Guiding Small Models to Write Long Stories

Link: https://arxiv.org/abs/2606.04095

Authors: Rishanth Rajendhran, Jenna Russell, Mohit Iyyer, John Frederick Wieting

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: long-form creative writing, Small open-weight models, Small open-weight, quality significantly degrades, creative writing

Abstract:Small open-weight models struggle at long-form creative writing: their generated stories either fall far short of the requested length, or their quality significantly degrades as length increases, especially when compared to frontier models. We present POLARIS (Policy Optimization with LLM-as-a-judge rewards and Anchored-Reference Injection for Storywriting), a lower-compute GRPO recipe with two key ingredients: a frontier LLM judge with a structured Story Quality rubric as the online reward, and human-reference injection (HRI), where a teacher-forced human-written story serves as a high-reward anchor within each GRPO group. By applying our training recipe to Qwen3.5-9B, using a dataset of approximately 1.4K prompt-story pairs derived from 100 short-story anthologies and 4 A100 GPUs, we obtain POLARIS-9B. Across five benchmarks spanning in-distribution and out-of-distribution prompts and rubrics, POLARIS-9B is competitive with much larger open-weight models while following length instructions more closely. A blinded human evaluation confirms that POLARIS-9B is preferred to the base Qwen3.5-9B and on par with Qwen3.5-27B. Despite training only on stories up to 4k words, POLARIS-9B preserves quality on prompts requesting stories up to 3 times the training length, a regime where most open-weight models degrade substantially in quality, length adherence, or both. More broadly, our results suggest that length generalization is a meaningful stress test for creative-writing models and a useful lens for distinguishing otherwise close models.

121. 【2606.04075】Large Language Models Hack Rewards, and Society

Link: https://arxiv.org/abs/2606.04075

Authors: Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei, Yulan He

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)

Keywords: enabling large language, Reinforcement learning, large language models, enabling large, large language

Comments: 14 pages, 9 figures, 7 tables

Abstract:Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.

122. 【2606.04071】Covert Influence Between Language Models

Link: https://arxiv.org/abs/2606.04071

Authors: Avidan Shah, Jay Chooi, Jinghua Ou, Shi Feng

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: language models increasingly, models increasingly consume, conditioned to propagate, increasingly consume, behavioral disposition

Abstract:As language models increasingly consume one another's outputs, covert influence -- a phenomenon where a sender's payload (the behavioral disposition it is conditioned to propagate) transfers to a receiver through carriers undetectable by humans -- becomes a growing risk. We characterize this risk across three interfaces: supervised fine-tuning, on-policy distillation, and in-context learning, and find that they vary in the scale of influence achievable without leaving behind human-visible traces. Using inference-time per-sample attribution scores, we study covert influence across all three interfaces with the ability to select carriers that amplify training-time influence, unlocking payload transfers that prior work could not achieve. We further provide evidence that covert influence with natural-language carriers is a distinct phenomenon from prior studies using number carriers, as the latter is more resistant to human detection and less portable across model families. Together, these results suggest that the risk surface for covert influence is broader than previously recognized, and we study pointwise attribution scoring methods as a tool to investigate and mitigate it.

123. 【2606.04046】Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

Link: https://arxiv.org/abs/2606.04046

Authors: Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: vision-language decision making, embodied vision-language decision, decision making tasks, manipulation and navigation, vision-language decision

Comments: Accepted at ICML 2026

Abstract:In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs and VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to break this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: this https URL.

124. 【2606.04032】Do Transformers Need Three Projections? Systematic Study of QKV Variants

Link: https://arxiv.org/abs/2606.04032

Authors: Ali Kayyam, Anusha Madan Gopal, M Anthony Lewis

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Performance (cs.PF)

Keywords: attention formulation playing, central role, standard solution, formulation playing, playing a central

Comments: Accepted at ICML 2026 (PMLR vol. 306). 26 pages, 12 figures, 16 tables. Code: [this https URL](https://github.com/anushamadan02/Do-Transformers-Need-3-Projections)

Abstract:Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we discovered that our transformers perform on par or occasionally better than the QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at this https URL
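
The cache-reduction percentages quoted in this abstract can be checked with simple arithmetic. The sketch below assumes 16 query heads and reads GQA-4 as 4 key-value heads; neither number is stated in the abstract, and both were chosen only because they reproduce the reported 50%, 87.5%, and 96.9% figures.

```python
def kv_cache_reduction(num_query_heads, num_kv_heads, shared_kv):
    """Fraction of KV cache saved relative to standard multi-head attention.

    The baseline caches one K and one V tensor per query head. Sharing the
    key and value projection (K = V) caches a single tensor per KV head.
    """
    tensors_per_kv_head = 1 if shared_kv else 2
    baseline_tensors = 2 * num_query_heads
    return 1 - (tensors_per_kv_head * num_kv_heads) / baseline_tensors

HEADS = 16  # assumption: head count is not given in the abstract

shared_only = kv_cache_reduction(HEADS, HEADS, shared_kv=True)   # K = V, all heads kept
shared_gqa4 = kv_cache_reduction(HEADS, 4, shared_kv=True)       # K = V plus 4 KV heads
shared_mqa = kv_cache_reduction(HEADS, 1, shared_kv=True)        # K = V plus 1 KV head
```

Under these assumptions the three values are 0.5, 0.875, and 0.96875, which round to the percentages in the abstract. The two savings multiply because head sharing reduces how many KV heads are cached, while projection sharing halves what each cached head stores.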

125. 【2601.18777】PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

Link: https://arxiv.org/abs/2601.18777

Authors: Abhishek Divekar, Anirban Majumder

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Applications (stat.AP)

Keywords: RAG systems traditionally, Large Language Models, ranking and RAG, Evaluating the quality, RAG systems

Comments: Accepted at AAAI 2026 - Innovative Applications of AI (IAAI-26)

Abstract:Evaluating the quality of search, ranking and RAG systems traditionally requires a significant number of human relevance annotations. In recent times, several deployed systems have explored the usage of Large Language Models (LLMs) as automated judges for this task while their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics which require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift for an LLM-based query reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduced the computational complexity from O(2^|C|) to O(2^K), where |C| represents corpus size (in order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric, while effectively correcting for LLM bias in low-resource settings.
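
For orientation, the basic Prediction-Powered Inference estimate of a mean combines the judge's average on a large unlabeled pool with a bias correction measured on the small human-labeled set. The sketch below shows only that standard building block, not the PRECISE extension to sub-instance annotations; the function name and toy numbers are hypothetical.

```python
from statistics import mean

def ppi_mean(labeled_human, labeled_judge, unlabeled_judge):
    """Basic prediction-powered estimate of a mean metric.

    estimate = mean(judge on unlabeled pool) + mean(human - judge on labeled set)
    The second term (the rectifier) corrects the judge's systematic bias.
    """
    if len(labeled_human) != len(labeled_judge):
        raise ValueError("labeled human and judge scores must be paired")
    rectifier = mean(h - j for h, j in zip(labeled_human, labeled_judge))
    return mean(unlabeled_judge) + rectifier

# Hypothetical relevance labels (1 = relevant, 0 = not relevant).
human = [1.0, 0.0, 1.0, 0.0]           # human labels on the small annotated set
judge_on_labeled = [1.0, 1.0, 1.0, 1.0]  # an over-generous LLM judge on the same items
judge_on_unlabeled = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]

estimate = ppi_mean(human, judge_on_labeled, judge_on_unlabeled)
```

In this toy case the judge's raw average on the unlabeled pool is 0.875, and the rectifier of -0.5 pulls the estimate down to 0.375, reflecting how often the judge overrated items that humans marked as not relevant.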

126. 【2606.04680】Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

Link: https://arxiv.org/abs/2606.04680

Authors: Zhihan Li, Hankun Wang, Yiwei Guo, Bohan Li, Xie Chen, Kai Yu

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: systems commonly rely, internal confidence estimation, Automatic speech recognition, Automatic speech, auxiliary language models

Comments: Submitted to Interspeech 2026. 6 pages, 4 figures

Abstract:Automatic speech recognition systems commonly rely on reference transcriptions for evaluation, while reference-free approaches often depend on internal confidence estimation or auxiliary language models. We propose READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a novel metric that evaluates ASR hypotheses directly from the speech signal. READ emphasizes the acoustic grounding of hypotheses. It uses a pretrained auto-regressive TTS model to compute the conditional likelihood of speech tokens given a text hypothesis, to measure fine-grained acoustic discrepancy between speech and text. Without additional training, READ can be applied for hypothesis refinement. Experiments show that READ correlates with specific recognition errors and improves ASR outputs, achieving up to 20% relative error rate reduction, with particularly strong gains under noisy conditions.

Information Retrieval

1. 【2606.05040】SearchLog: A Web Browser Extension for Capturing Search Logs in Laboratory Studies

Link: https://arxiv.org/abs/2606.05040

Authors: Jiaman He, Riccardo Xia, Dana McKay, Damiano Spina, Johanne R. Trippas

Categories: Information Retrieval (cs.IR)

Keywords: information seeking settings, Natural search logs, seeking settings, search, valuable for studying

Abstract:Natural search logs are valuable for studying search behavior in information seeking settings. We present SearchLog, an easy-to-install web browser extension for collecting natural search logs during lab-based studies. SearchLog allows participants to search the open web using a browser while recording structured interaction data across mouse, keyboard, search activity, and browser state modules. The extension captures clicks, scrolling, hovered text, typed words, search queries, result rankings, AI-generated summaries when available, tab activity, and window changes. A local Flask backend stores each session as an ordered JSON event stream, with HTML snapshots and preprocessed search result data for later analysis. These logs can be used to derive measures such as query reformulation, page visits, dwell time, scroll behavior, tab switching, search path complexity, and exposure to AI-generated search content. By supporting natural browser-based search with structured experimental metadata, SearchLog provides a reusable resource to study search behavior across traditional and AI-enhanced search interfaces.

2. 【2606.04957】NLLog: Lightweight, Explainable SOC Anomaly Detection via Log-to-Language Rewriting

Link: https://arxiv.org/abs/2606.04957

Authors: Samuel Ndichu, Tao Ban, Seiichi Ozawa, Takeshi Takahashi, Daisuke Inoue

Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: System-generated logs underpin, rigid template-based format, template-based format hinders, underpin security monitoring, logs underpin security

Comments: 15 pages, 11 figures, 12 tables; submitted to ACSAC 2026

Abstract:System-generated logs underpin security monitoring, yet their rigid template-based format hinders both automated analysis and human comprehension. We present NLLog (Natural-Language Log), a lightweight pipeline that deterministically rewrites parsed templates into WHO-WHAT-SEVERITY sentences, pools them with term-frequency-inverse-document-frequency weighting, classifies sessions with tree ensembles, and back-projects evidence with TreeSHAP for analyst review. On Hadoop Distributed File System (HDFS) and Blue Gene/L (BGL) corpora, NLLog exceeds two reproduced matched-protocol baselines; across HDFS, BGL, and the AIT Alert Data Set, it sustains low false-positive rates with commodity-hardware latency suitable for security operations center triage. Coverage, sparse-versus-dense, faithfulness, and adversarial ablations show that fallback sufficiency is corpus-dependent, that an enrollment-time coverage check can surface refinement requirements before deployment, and that an auditable deterministic rewrite combined with lightweight dense encoding provides a measurable representation layer for log-anomaly detection and triage.

3. 【2606.04944】Dual-Stream MLP is All You Need for CTR Prediction

Link: https://arxiv.org/abs/2606.04944

Authors: Kesha Ou, Zhen Tian, Wayne Xin Zhao, Long Zhang, Sheng Chen, Ji-Rong Wen

Categories: Information Retrieval (cs.IR)

Keywords: significantly boost revenue, Click-through rate, boost revenue, holds a pivotal, pivotal role

Comments: Accepted by TKDD

Abstract:Click-through rate (CTR) prediction holds a pivotal role in online advertising and recommendation systems, where even small improvements can significantly boost revenue. Existing research primarily focuses on designing dual-stream architectures to capture effective complex feature interactions from both explicit and implicit perspectives. However, these approaches are faced with two major challenges: 1) the high complexity of feature interaction learning, which increases computational demands and the overfitting risk, and 2) the imbalance between explicit and implicit modules, where one module's output may dominate the final prediction. To address these issues, in this paper, we propose Dual-Stream MLP (DS-MLP), a novel feature interaction framework for the CTR prediction task. Specially, it leverages knowledge distillation to consolidate the capacity of learning explicit feature interaction into a main MLP network, while a parallel MLP simultaneously captures implicit feature interactions as a complement. To effectively optimize the dual-stream MLP architecture, we further design a specific learning approach with two alignment strategies for enhancing the compatibility of the two MLP components. Experiments demonstrate that DS-MLP, though merely a vanilla MLP structure (the final model), can achieve state-of-the-art performance across three widely used benchmarks, offering a scalable and efficient solution for large-scale recommendation systems. Our code is available at this https URL.

4. 【2606.04915】Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

Link: https://arxiv.org/abs/2606.04915

Authors: Zhenyu Yu, Shuigeng Zhou

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Large language models, Large language, language models reach, lexical pattern matching, causal reasoning benchmarks

Abstract:Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than recovering P1. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.
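
The perturbation described in this abstract, swapping semantic variable names for placeholder tokens while leaving the rest of the question intact, can be sketched as a single-pass substitution. This is an illustrative guess at the mechanics, not the Caliper implementation; the function name, the `X1, X2, ...` placeholder scheme, and the example question are assumptions.

```python
import re

def anonymize(question, variable_names, prefix="X"):
    """Replace each named variable with a placeholder, leaving all other text untouched."""
    if not variable_names:
        return question
    # Placeholders are numbered in the order the variables are listed.
    mapping = {name.lower(): f"{prefix}{i}"
               for i, name in enumerate(variable_names, start=1)}
    # Longest names first so a multi-word name wins over any shorter name it contains.
    alternation = "|".join(re.escape(name)
                           for name in sorted(mapping, key=len, reverse=True))
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    # One pass over the text, so inserted placeholders are never rewritten again.
    return pattern.sub(lambda m: mapping[m.group(0).lower()], question)

original = "Smoking causes tar deposits. Tar deposits cause lung cancer."
masked = anonymize(original, ["smoking", "tar deposits", "lung cancer"])
```

Every occurrence of a variable maps to the same placeholder, so the chain of causal links in the text is preserved while the lexical cues are removed.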

5. 【2606.04909】BEATS: Bootstrapping E-commerce Attribute Taxonomies for Search through Iterative Human-AI Collaboration

Link: https://arxiv.org/abs/2606.04909

Authors: Yung-Yu Shih, Shang-Yu Su, Tzu-I Ho, Dongzhe Wang, Yun-Nung Chen

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: E-commerce platforms, platforms in emerging, emerging markets, E-commerce, structured attribute schemas

Comments: 6 pages, 1 figure, 5 tables. Accepted to SIGIR 2026 Industry Track. Official version: [this https URL](https://doi.org/10.1145/3805712.3808520)

Abstract:E-commerce platforms in emerging markets often operate with underdeveloped product catalogs that contain only category taxonomies but lack structured attribute schemas. This absence of fine-grained product attributes limits search capabilities -- preventing faceted filtering, degrading query understanding, and weakening semantic representations used by search systems. We present BEATS, a human-in-the-loop LLM framework for bootstrapping product attribute taxonomies entirely from scratch. Our approach extends a multi-stage LLM generation pipeline with two critical production stages: (1) proactive quality checking by model developers to filter erroneous outputs, and (2) human annotation by domain-expert local staff to validate generated attributes. The framework operates iteratively -- prompts at each generation stage are refined based on quality check observations and annotator feedback across successive rounds, progressively improving attribute quality. Once the attribute taxonomy is established, we employ LLMs to perform structured attribute tagging on individual product items, enriching their contextual representations. The enriched catalog directly benefits multiple components of the search system: enabling granular attribute-based filtering, providing structured features for ranking models, and improving semantic representations for dense retrieval. We validate the generated taxonomy by training dense retrieval models on attribute-enriched product data, demonstrating consistent improvements over baselines using original catalog information. Our system has been deployed at Rakuten Taiwan, enriching 9 major categories spanning 2,694 sub-categories with 67,277 generated attributes, and over 5.4 million products have been tagged with the generated attributes, with plans to enrich the entire product catalog.

6. 【2606.04727】EviRank: Evidence-Based Confidence Estimation for LLM-Based Ranking

Link: https://arxiv.org/abs/2606.04727

Authors: Meng Yan,Cai Xv,Xujing Wang,Ziyu Guan,Wei Zhao

Categories: Information Retrieval (cs.IR)

Keywords: Large Language Models, Language Models show, Large Language, raise reliability concerns, reliability concerns due

Comments:

Click to view abstract

Abstract:Large Language Models show promise for recommendation, but they raise reliability concerns due to limited domain coverage and inherent stochasticity. Existing uncertainty quantification methods face two fundamental challenges: (1) the global confidence score designed for question answering fails to reveal which positions are unreliable in a ranking list; (2) fine-grained confidence extracted from model internals exhibits uniformly low values across all positions, making it impossible to filter unreliable predictions. To tackle these challenges, we propose an evidence-based confidence estimation method for LLM-based ranking (EviRank). We extract three complementary pieces of evidence from a single forward pass and aggregate them via reliable opinion aggregation. Furthermore, we recognize that ranking positions are inherently unequal, and introduce a position-aware calibration. Lastly, the calibrated confidence guides ranking optimization. Experiments on three datasets demonstrate that our method achieves state-of-the-art performance on both recommendation and uncertainty quantification.

7. 【2606.04650】Improving the Efficiency and Effectiveness of LLM Knowledge Distillation for Conversational Search

Link: https://arxiv.org/abs/2606.04650

Authors: Stan Fris,Jan Hutter,Jan Henrik Bertrand,Simon Lupart,Mohammad Aliannejadi

Categories: Information Retrieval (cs.IR)

Keywords: relevant documents based, relevant documents, documents based, KLD, Large Language Models

Comments: SCAI Workshop at SIGIR '26, July 20-24, 2026, Melbourne, Naarm, Australia

Click to view abstract

Abstract:Conversational Search (CS) considers retrieval of relevant documents based on conversational context. Large Language Models (LLMs) have significantly enhanced CS by enabling effective query rewriting. However, employing LLMs during inference poses efficiency challenges. A method to balance effectiveness and efficiency is the use of knowledge distillation from LLM-based query rewriting. Recent work applies the Kullback-Leibler Divergence (KLD) for distillation, relaxing the alignment with the teacher signal compared to previous methods. Despite these gains, several aspects of KLD-based distillation for conversational search remain understudied, and we investigate them in this work. Prior work in related fields suggests that adding a contrastive loss to the KLD objective can improve performance; we confirm this and observe significant gains in precision-oriented ranking metrics. We also find that contrastive sampling strategies for the KLD loss have a non-trivial impact and must be chosen carefully. Although theory suggests that more samples improve the KLD estimate, experiments show diminishing returns on the number of used samples. Finally, we address the phenomenon of decreased sparsity in longer conversations, which limits computational efficiency across sparse retrieval methods. We find that the representations from the model distilled with the KLD loss can be strongly regularized with a regularization loss, substantially improving sparsity and inference efficiency without significantly harming retrieval effectiveness. We achieve a $2\times$ decrease in FLOPS on TopiOCQA with negligible loss in effectiveness, corresponding to a $\leq 2\%$ drop in Recall@100. Our results provide insights into distillation objectives for learned sparse conversational retrievers and offer practical guidelines for improving effectiveness and efficiency in first-stage retrieval.

8. 【2606.04646】QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

Link: https://arxiv.org/abs/2606.04646

Authors: Mengao Zhang,Xiang Yang,Chang Liu,Tianhui Tan,Ke-wei Huang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: scientific corpora, corpora are natural-language, natural-language versions, versions of database-style, database-style queries

Comments: 14 pages

Click to view abstract

Abstract:Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework -- index-time preservation versus query-time execution -- predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution -- not retrieval alone -- is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.

9. 【2606.04603】Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval

Link: https://arxiv.org/abs/2606.04603

Authors: Olivier Jeunen

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)

Keywords: real-world recommender systems, Approximate Nearest Neighbour, Nearest Neighbour search, search indices form, Neighbour search indices

Comments:

Click to view abstract

Abstract:Approximate Nearest Neighbour search indices form the backbone of real-world recommender systems, enabling real-time candidate retrieval over million-item catalogues. Typically, a single point estimate embedding is learnt for every user and every item. At serving time, the user embedding queries the index for relevant items. Since these representations are learnt from sparse interaction data, they are noisy and might fail to capture all the nuances that contribute to "relevance" -- ignoring the fundamental uncertainty that is inherent to them. The result is a retrieval pipeline that is systematically biased toward the small minority of popular head items with well-estimated embeddings, at the expense of the long-tail majority of niche, diverse, and serendipitous content. We propose DINOSAUR (Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval): a simple and infrastructure-compatible framework to incorporate embedding uncertainty into candidate generation. Rather than indexing point estimates, DINOSAUR samples $S_i$ embeddings per item and constructs an index on this augmented set. Analogously, at query time, a user embedding is sampled. This two-sided stochastic retrieval process implicitly marginalises over embedding uncertainty, without requiring changes to model architecture or ANN index infrastructure. On the analytical side, we show that DINOSAUR recovers standard point-estimate retrieval as uncertainty vanishes, and we characterise how increased embedding variance expands the regions of latent space in which uncertain items are retrievable. Reproducible empirical observations align with these expectations, showing large coverage gains with small losses in offline recall.
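
As a reading aid, the two-sided sampling described in this abstract can be sketched roughly as follows. This is not the authors' implementation: the isotropic Gaussian noise model, the brute-force dot-product search standing in for a real ANN index, and every function name here are assumptions made for illustration only.

```python
import random


def sample_embedding(mean, std, rng):
    """Draw one sample around `mean` (assumed isotropic Gaussian noise)."""
    return [m + rng.gauss(0.0, std) for m in mean]


def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_augmented_index(item_means, item_stds, samples_per_item, rng):
    """Index S_i sampled vectors per item instead of one point estimate."""
    index = []
    for item_id, mean in item_means.items():
        for _ in range(samples_per_item[item_id]):
            vector = sample_embedding(mean, item_stds[item_id], rng)
            index.append((item_id, vector))
    return index


def retrieve(index, user_mean, user_std, k, rng):
    """Sample a user embedding, then return the top-k distinct items.

    Brute-force scoring is a stand-in for the ANN lookup (assumption).
    """
    query = sample_embedding(user_mean, user_std, rng)
    ranked = sorted(index, key=lambda pair: dot(query, pair[1]), reverse=True)
    results = []
    for item_id, _ in ranked:
        if item_id not in results:
            results.append(item_id)
        if len(results) == k:
            break
    return results


if __name__ == "__main__":
    rng = random.Random(0)
    means = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
    stds = {"a": 0.05, "b": 0.8, "c": 0.3}  # hypothetical uncertainty levels
    counts = {"a": 2, "b": 4, "c": 3}       # hypothetical S_i values
    idx = build_augmented_index(means, stds, counts, rng)
    print(retrieve(idx, [1.0, 0.0], 0.1, k=2, rng=rng))
```

With all standard deviations set to zero, every sample collapses to its mean and the sketch reduces to ordinary point-estimate retrieval, which matches the limiting behaviour the abstract states.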

10. 【2606.04557】Cartridges at Scale: Training Modular KV Caches over Large Document Collections

Link: https://arxiv.org/abs/2606.04557

Authors: Momchil Hardalov,Gonzalo Iglesias,Adrià de Gispert

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, Language Models, content remains static, Models can reason

Comments: 21 pages, 5 figures, 17 tables

Click to view abstract

Abstract:Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable key-value (KV) caches that eliminate prefilling while preserving accuracy. A critical limitation of this approach is that cartridges are monolithic and non-compositional: encoding an entire collection into a single KV block does not scale, and naively mixing cartridges trained in isolation collapses performance to near chance. We introduce Cartridges at Scale (CAS), a training framework for scalable multi-cartridge learning with dynamic distractor mixing and a memory-efficient budget manager that rotates hundreds of per-document cartridges between GPU and persistent storage. Our approach scales to collections exceeding a million tokens, improving over a monolithic cartridge by 10-31 points at comparable token budgets. Oracle cartridge accuracy falls within 2-6 points of full in-context learning even at high compression. When paired with retrieval for cartridge selection, CAS matches or exceeds conventional RAG accuracy while consuming 3-4x fewer prompt tokens.

11. 【2606.04550】Trading Engagement for Sustainability: Carbon-Aware Re-ranking for E-commerce Recommendations

Link: https://arxiv.org/abs/2606.04550

Authors: Noah Lund Syrdal,Anders Vestrum,Jorgen Bergh

Categories: Information Retrieval (cs.IR)

Keywords: recommender systems strongly, systems strongly influence, E-commerce recommender systems, Product Carbon Footprint, recommender systems

Comments: 23 pages, 30 figures. Code available at [this https URL](https://github.com/andersvestrum/carbon-aware-recsys)

Click to view abstract

Abstract:E-commerce recommender systems strongly influence which products users consider and purchase, yet sustainability signals such as Product Carbon Footprint (PCF) are almost never available at catalog scale. We study carbon-aware product recommendation in the realistic setting where PCF labels are missing for most items and must be inferred. We first estimate product-level carbon footprints via a retrieval-augmented PCF estimation pipeline that transfers supervision from the Carbon Catalogue, a small set of life-cycle-assessed products, to a large unlabeled e-commerce catalog using semantic similarity search, few-shot LLM prompting, and a nearest-neighbour fallback. We then apply a carbon-aware post-hoc re-ranking strategy on top of relevance scores produced by three established recommendation models: BPR, NeuMF, and LightGCN. The method trades off predicted user-item engagement against estimated carbon footprint through a single tunable parameter, lambda. In this offline study, engagement is operationalized through Amazon review interactions, which serve as implicit feedback and as a proxy for user interest or purchase behavior. We evaluate the framework on the Amazon Reviews dataset across three product categories: Home and Kitchen, Sports and Outdoors, and Electronics. By sweeping lambda, we construct Pareto frontiers that characterize the achievable engagement and carbon trade-off for each model and category. Substantial carbon reductions are achievable at minimal engagement cost across all models and categories. However, the available carbon headroom varies by model and category, underscoring the importance of model choice and domain context.
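
One plausible shape for a post-hoc re-ranker controlled by a single lambda is sketched below. The abstract does not give the scoring formula, so the min-max normalisation and the linear blend `(1 - lam) * relevance - lam * carbon` are assumptions, as are the function names and the toy numbers.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def carbon_aware_rerank(items, relevance, carbon, lam):
    """Re-rank items by a blend of relevance and estimated carbon footprint.

    lam = 0 keeps the pure relevance order; lam = 1 orders by lowest carbon.
    The linear blend is an illustrative assumption, not the paper's formula.
    """
    rel = min_max(relevance)
    co2 = min_max(carbon)
    combined = [(1.0 - lam) * r - lam * c for r, c in zip(rel, co2)]
    order = sorted(range(len(items)), key=lambda i: combined[i], reverse=True)
    return [items[i] for i in order]


if __name__ == "__main__":
    items = ["kettle", "blender", "toaster"]
    relevance = [0.9, 0.8, 0.3]   # hypothetical model scores
    carbon = [50.0, 10.0, 5.0]    # hypothetical kg CO2e estimates
    for lam in (0.0, 0.5, 1.0):
        print(lam, carbon_aware_rerank(items, relevance, carbon, lam))
```

Sweeping `lam` over a grid and recording an engagement metric and the total estimated carbon at each setting would trace out the kind of Pareto frontier the abstract describes.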

12. 【2606.04547】Beyond Retrieval: Learning Compact User Representations for Scalable LLM Personalization

Link: https://arxiv.org/abs/2606.04547

Authors: Heng Cao,Fan Zhang,Jian Yao,Yujie Zheng,Changlin Zhao,Lu Hao,Yuxuan Wei,Wangze Ni,Huaiyu Fu,Yuqian Sun,Xuyan Mo

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Personalizing large language, large language models, language models requires, models requires adapting, requires adapting model

Comments: 16 pages, 6 figures

Click to view abstract

Abstract:Personalizing large language models requires adapting model behavior to individual users while preserving robustness and deployment-scale efficiency. Existing approaches typically personalize LLMs either at the input level, by retrieving user histories or constructing profile prompts, or at the parameter level, by maintaining user-specific parameter-efficient modules. The former makes personalization sensitive to retrieval quality and prompt design, whereas the latter incurs storage and maintenance costs that grow with the user population. To address these limitations, we propose TAP-PER (Temporal Attentive Prefix for PERsonalization), a prefix-based framework that encodes user preferences as learnable representations, eliminating explicit prompt construction and replacing heavy per-user adapters with lightweight user-state prefix embeddings. Inspired by personalized recommendation systems, TAP-PER decomposes user modeling into user-state and query-conditioned components, and incorporates temporal signals to capture the evolving nature of user interests. Experiments on six LaMP tasks show that TAP-PER consistently outperforms prompt-based and model-based baselines across classification, rating, and generation settings. Moreover, TAP-PER uses 130x fewer per-user parameters than OPPU and roughly half the total parameter footprint of PER-PCS at the 1,000-user scale, demonstrating that scalable LLM personalization can be achieved without explicit prompt construction or heavy per-user adapters.

13. 【2606.04522】ANN Search: Recall What Matters

Link: https://arxiv.org/abs/2606.04522

Authors: Dimitris Dimitropoulos,Nikos Mamoulis

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)

Keywords: Approximate nearest neighbor, modern machine learning, Approximate nearest, machine learning tasks, Ratio

Comments:

Click to view abstract

Abstract:Approximate nearest neighbor (ANN) search has become a core primitive in information retrieval and modern machine learning tasks, from classification to retrieval-augmented generation. The community evaluates and tunes ANN algorithms primarily on their throughput at a given Recall@k, the fraction of true exact neighbors retrieved. We argue that what really matters in ANN search is the quality of the retrieved results and not their overlap with the true kNN set. We show that using Recall@k to assess retrieval quality forces unnecessary computational overhead and investigate replacing it by 1/Ratio@k, the inverse approximation ratio. 1/Ratio@k evaluates the differences between the distances of the retrieved and true neighbors. It is judge-free, hyperparameter-free, and computable from standard ANN benchmark inputs alone. We benchmark state-of-the-art ANN algorithms across diverse datasets spanning a wide range of intrinsic dimensionalities, evaluating the two metrics comprehensively across efficiency, downstream classification, and retrieval-augmented generation. On the efficiency axis, optimizing for 1/Ratio@k reaches operational quality thresholds at a substantially lower computational cost than Recall@k. In downstream tasks, performance indicators (label precision, semantic similarity, BERTScore, and LLM-graded quality) remain highly stable even when Recall@k drops significantly. The inverse approximation ratio, on the other hand, closely mirrors this stability, tracking true utility much better than Recall@k. Ultimately, while Recall@k overstates the true cost of approximation, 1/Ratio@k offers a more accurate, deployable proxy for actual ANN quality.
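
To make the contrast between the two metrics concrete, here is a small sketch. Recall@k follows the definition in the abstract; for Ratio@k the sketch assumes the commonly used form, the mean over ranks of retrieved distance divided by true distance, which may differ in detail from the paper's definition. It also assumes all true distances are strictly positive.

```python
def recall_at_k(retrieved_ids, true_ids):
    """Fraction of the true k nearest neighbours that were retrieved."""
    return len(set(retrieved_ids) & set(true_ids)) / len(true_ids)


def inverse_ratio_at_k(retrieved_dists, true_dists):
    """Inverse of the mean rank-wise distance ratio (assumed definition).

    Both lists are sorted ascending and compared rank by rank, so a value
    of 1.0 means the retrieved set is as close as the exact k-NN set.
    """
    retrieved = sorted(retrieved_dists)
    true = sorted(true_dists)
    ratio = sum(r / t for r, t in zip(retrieved, true)) / len(true)
    return 1.0 / ratio


if __name__ == "__main__":
    # Hypothetical query: the third true neighbour is missed, but the
    # substitute lies only slightly farther away.
    true_ids, true_dists = [1, 2, 3], [1.0, 2.0, 4.0]
    got_ids, got_dists = [1, 2, 9], [1.0, 2.0, 4.2]
    print(recall_at_k(got_ids, true_ids))
    print(inverse_ratio_at_k(got_dists, true_dists))
```

In this toy case one of three neighbours is missed, so recall drops by a third, while the inverse ratio stays close to 1 because the substitute is nearly as close as the true neighbour. That is the gap between overlap and result quality that the abstract argues about.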

14. 【2606.04514】SAILRec: Steering LLM Attention to Dual-Side Semantically Aligned Collaborative Embeddings for Recommendation

Link: https://arxiv.org/abs/2606.04514

Authors: Xi Wu,Jiale Wang,Zihan Wang,Yichen Gao,Xiaocui Yang,Shi Feng,Daling Wang,Yifei Zhang

Categories: Information Retrieval (cs.IR)

Keywords: enhance language models, Recent LLM-based recommenders, recommenders enhance language, Recent LLM-based, LLM-based recommenders enhance

Comments: 17 pages, including appendices

Click to view abstract

Abstract:Recent LLM-based recommenders enhance language models with collaborative embeddings from user-item interactions, but making such embeddings available does not ensure their proper use during inference. Through a diagnostic attention analysis, we find that the utilization of collaborative embeddings is depth-dependent and alignment-sensitive, suggesting that LLMs need to balance their internal semantic knowledge with external collaborative knowledge. To address this issue, we propose SAILRec, an LLM-based recommender that improves this balance through dual-side semantic alignment and hierarchical attention steering. The former aligns item-side embeddings with item-text semantics and user-side embeddings with codebook-based semantic profiles, while the latter suppresses premature shallow-layer collaborative interference and strengthens collaborative evidence in deeper decision layers. Experiments on MovieLens-1M and Amazon-Book show that SAILRec consistently outperforms representative baselines, with ablation and masking analyses validating its key designs.

15. 【2606.04448】Bridging Short Videos and Live Streams: Reasoning-Guided Multimodal LLMs for Cross-Domain Representation Learning

Link: https://arxiv.org/abs/2606.04448

Authors: Le Zhang,Xiaolan Zhu,Yuchen Wang,Shilong Kang,Jiaqi Xue,Xiaoyu Zhang,Xiang Chen,Yalong Guan,Xiangyu Wu,Shijun Wang,Lantao Hu,Kun Gai

Categories: Information Retrieval (cs.IR)

Keywords: streaming services grow, platforms offer short, short videos, offer short videos, live streaming

Comments: 9 pages

Click to view abstract

Abstract:As live streaming services grow, many platforms offer short videos and live streams to meet diverse needs. Short videos carry substantial traffic and rich behavior signals, whereas live streaming is a core conversion scenario with sparse behavior data, making cold start severe. Transferring user interests from short videos to live streaming recommendation can alleviate these issues. Meanwhile, short videos and live streams are complex multimodal items, and integrating multimodal signals improves recommendation performance. Although Multimodal Large Language Models (MLLMs) show strong multimodal understanding and reasoning, their application to cross-domain recommendation remains underexplored. To this end, we propose Reasoning-Guided Cross-Domain Representation Learning (RGCD-Rep), a reasoning-guided framework for cross-domain recommendation from short videos to live streams. RGCD-Rep introduces MLLM reasoning resource-efficiently and learns transferable item representations guided by behavioral collaboration via two-stage training. First, reasoning-aware distillation lets a frozen teacher MLLM generate structured cross-domain reasoning knowledge and distills it into a lightweight student MLLM. Second, transferability-guided cross-domain representation learning decomposes item representations into transferable and domain residual representations. The resulting representations are computed offline and integrated into downstream retrieval tasks, enabling low-cost industrial deployment. Extensive offline experiments demonstrate RGCD-Rep's superiority. After deployment in Kuaishou's live streaming recommendation system, A/B tests show significant gains across multiple core business metrics, confirming its effectiveness and practicality in real industrial scenarios. RGCD-Rep is fully deployed and serves over 400 million users daily.

16. 【2606.04435】Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Link: https://arxiv.org/abs/2606.04435

Authors: Saroj Mishra

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: mechanisms systematically miss, incorrect final outputs, demonstrated significant capability, factually incorrect final, Cascading Hallucination Aware

Comments:

Click to view abstract

Abstract:Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/- 18 ms average latency overhead per stage, achieving an error propagation reduction of 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

17. 【2606.04397】Context-as-a-Service: Surfacing Cross-File Dependency Chains for LLM-Generated Developer Documentation

Link: https://arxiv.org/abs/2606.04397

Authors: Ameya Gawde,Vyzantinos Repantis,Harshvardhan Singh,Lucy Moys

Categories: Software Engineering (cs.SE); Information Retrieval (cs.IR)

Keywords: LLM agents increasingly, maintain developer documentation, agents increasingly write, LLM agents, LLM agents query

Comments: 8 pages, 2 figures, 4 tables

Click to view abstract

Abstract:LLM agents increasingly write and maintain developer documentation, but usefulness and accuracy often rely on dependency chains that are not obvious to follow. Even with more files in context, the agent must still decide which cross-file dependencies to trace. We present Context-as-a-Service (CaaS), a retrieval layer that LLM agents query to find evidence across the codebase as they review or generate documentation. CaaS indexes source code, API references, and upstream documentation, then enables agents to query the index through tool calls that combine keyword and semantic search. We evaluate CaaS in two case studies using Claude Sonnet 4.6 on a production SDK: improving API reference comments in a core source file and validating an LLM-generated tutorial. In both studies, the baseline already had ordinary repository tools such as file reads, keyword search, and symbol navigation. CaaS adds a retrieval layer on top, so the comparison isolates added retrieval rather than basic repository access. In the API-reference review, the CaaS-augmented agent produced the same 5 missing-documentation fixes as the baseline and surfaced 4 findings the baseline missed: 2 cross-file factual errors and 2 underspecified API comments. In the tutorial validation, it surfaced 1 executable bug, 1 API-usage improvement, and 2 missing prerequisites that the baseline pipeline did not catch. These findings required tracing non-obvious dependency chains across utility files, framework internals, usage examples, tests, and component-creation logic. Over five runs per condition, adding CaaS reduced wall-clock time by 22% to 34% across the two tasks and lowered input-token usage.

18. 【2606.04387】Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking

Link: https://arxiv.org/abs/2606.04387

Authors: Chenyu Zhang,Yiwen Liu,Yin Sun,Xinyuan Zhang,Yuji Cao,Junming Jiao,Juyi Qiao

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: e-commerce recommendation due, prolonged decision cycles, real estate, Large Language Models, high-stakes domains

Comments:

Click to view abstract

Abstract:Sales lead conversion in high-stakes domains (e.g., automotive, real estate) differs fundamentally from e-commerce recommendation due to prolonged decision cycles and multi-stage funnels. Traditional lead scoring methods (rule-based scorecards, machine learning, or pointwise CTR models) face severe challenges: sparse supervision, a semantic gap in unstructured CRM logs, and inability to capture relative lead priority. While Large Language Models (LLMs) offer superior semantic understanding of customer interactions, general-purpose LLMs are ill-suited for lead ranking: they generate text rather than comparable scores, and lack alignment with the hierarchical priorities of sales funnels. We introduce an LLM-based discriminative framework for sales lead scoring, which supports joint modeling of structured CRM features and unstructured customer interactions. On top of this framework, we propose HPRO (Hierarchical Preference Ranking Optimization), which augments sales lead scoring with a hierarchical preference ranking objective. HPRO employs a margin-aware Bradley-Terry formulation to transform sparse binary labels into dense, funnel-aware preference pairs, enabling lead scoring to leverage both pointwise and pairwise supervision. Experiments on large-scale data from a leading NEV brand demonstrate state-of-the-art classification (AUC 0.8161) and ranking performance (+39.7% precision among top-ranked leads). A 132-day online A/B test validates 9.5% sales volume uplift, confirming real-world commercial impact.
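
A margin-aware Bradley-Terry objective is usually written as `-log sigmoid(s_hi - s_lo - margin)`. The sketch below pairs that loss with one hypothetical way of turning funnel stages into preference pairs. The integer stage encoding, the margin proportional to the stage gap, and all names are assumptions; the abstract does not spell out HPRO's exact construction.

```python
import math


def bt_margin_loss(score_hi, score_lo, margin):
    """-log sigmoid(score_hi - score_lo - margin), computed stably."""
    z = score_hi - score_lo - margin
    # softplus(-z) = max(-z, 0) + log(1 + exp(-|z|))
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def funnel_pairs(leads, margin_per_stage=0.5):
    """Build (preferred, other, margin) triples from funnel stages.

    `leads` is a list of (lead_id, stage) where a larger stage means the
    lead progressed further down the funnel (hypothetical encoding).
    The margin grows with the stage gap (assumption).
    """
    pairs = []
    for hi_id, hi_stage in leads:
        for lo_id, lo_stage in leads:
            if hi_stage > lo_stage:
                gap = hi_stage - lo_stage
                pairs.append((hi_id, lo_id, margin_per_stage * gap))
    return pairs


if __name__ == "__main__":
    leads = [("a", 0), ("b", 2), ("c", 1)]   # hypothetical funnel stages
    scores = {"a": 0.1, "b": 1.4, "c": 0.6}  # hypothetical model scores
    for hi, lo, m in funnel_pairs(leads):
        print(hi, lo, m, round(bt_margin_loss(scores[hi], scores[lo], m), 4))
```

Three leads with distinct stages yield three pairs here, which illustrates how a handful of sparse labels can expand into denser pairwise supervision.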

19. 【2606.04382】LCSHBench: A Multilingual, Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment

Link: https://arxiv.org/abs/2606.04382

Authors: Kwok Leong Tang

Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Automated subject cataloging, standard public benchmark, subject cataloging assigns, cataloging assigns controlled-vocabulary, Automated subject

Comments:

Click to view abstract

Abstract:Automated subject cataloging assigns controlled-vocabulary headings to bibliographic records, but LCSH has no standard public benchmark. We introduce LCSHBench: 22,346 books in 15 languages from the openly licensed Harvard, Columbia, and Princeton catalogs. Records enter only when at least two independent cataloging agencies assigned LCSH; we release per-catalog provenance plus union and unanimous answer views. A concordance study of 465,187 works cataloged by all three libraries shows why this design matters: libraries usually agree on the underlying topic (93.3% share a concept-level heading) but often differ in exact expression (39.4% have identical heading sets). LCSHBench therefore scores both exact and concept matches, with set and rank metrics broken down by language and heading type, across open-vocabulary generation and full-vocabulary retrieval. As a first demonstration, a low-rank fine-tune of a 300M on-device embedder improves cross-lingual retrieval and beats a 3,072-dimensional hosted embedder on development exact recall@200 (0.659 vs 0.623). The language panel shows the gain is not uniform, and held-out-test and end-to-end confirmation remain future work.

20. 【2606.04374】DSIRM: Learning Query-Bridged Discrete Semantic Identifiers for E-commerce Relevance Modeling

Link: https://arxiv.org/abs/2606.04374

Authors: Bokang Wang,Xing Fang,Mingmin Jin,Jing Wang,Zhentao Song,Guangxin Song,Jianbo Zhu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: fine-grained attribute distinctions, long-standing open problem, capturing fine-grained attribute, e-commerce search relevance, attribute distinctions

Comments: Jing Wang (Corresponding Author)

Click to view abstract

Abstract:Despite rapid progress of continuous embeddings for e-commerce search relevance, a long-standing open problem is the difficulty in capturing fine-grained attribute distinctions. While discrete Semantic Identifiers (SIDs) have been widely adopted as a promising alternative, existing SID generation methods rely heavily on unsupervised quantization. In realistic scenarios, the lack of explicit supervision often makes it more difficult to dictate which items should share an SID, resulting in limited capability for query-dependent ranking. To address the issue of unsupervised SIDs, we propose to explicitly model discrete relevance features and develop a Discrete Semantic Identifier Relevance Model (DSIRM). Specifically, we present a query-bridged contrastive quantization approach on the item side, injecting query-item interaction supervision into Residual Quantization to actively learn relevance-aware semantic partitions. On the other hand, we explore generative LLMs on the query side to explicitly predict item SIDs from text, resolving tail queries and intent ambiguity. Hierarchical prefix matching between query and item SIDs yields discriminative features that perfectly complement dense signals. Extensive experimental results on Tmall's production data show that our proposed approach has achieved better results, improving offline AUC by +1.54%. Deployed via an efficient hybrid architecture, it achieves significant online lifts (+0.13% UCTR, +0.25% UCTCVR), proving its massive industrial value.

21. 【2606.04362】Disentangling Answer Engine Optimization from Platform Growth: A Log-Based Natural Experiment on ChatGPT Referral Traffic

Link: https://arxiv.org/abs/2606.04362

Authors: Keisuke Watanabe,Kazuki Nakayashiki

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Answer Engine Optimization, search engine optimization, called Answer Engine, engine optimization, send measurable referral

Comments: 9 pages, 4 figures, 1 table

Click to view abstract

Abstract:Large language model (LLM) "answer engines" such as ChatGPT now send measurable referral traffic to the open web, and a practice analogous to search engine optimization, here called Answer Engine Optimization (AEO), has emerged. Public AEO success stories typically quote large raw growth multiples, but raw referral growth is confounded by the rapid platform-level growth of the answer engines themselves. We report a longitudinal field study on a single high-traffic domain (this http URL) whose corpus of hundreds of thousands of YouTube question-and-answer pages received a defined bundle of AEO interventions in January 2026 (detailed in Section 4). Because the interventions were concentrated on one subset of the site, the untreated remainder of the same domain acts as a contemporaneous control that absorbs the platform tailwind. Using first-party analytics and server logs rather than probabilistic third-party estimators, we find: (1) raw growth is dominated by the platform tailwind: on monthly aggregates total ChatGPT referrals grew 5.7x while untreated pages on the same domain grew 3.5x over the same window; (2) an interrupted time-series model on the weekly treated/control ratio estimates a discrete, intervention-aligned level increase of 1.82x (95% CI 1.31-2.54, HAC p=0.001), robust across engagement-filtered traffic (2.27x) and alternative specifications; (3) however, a conservative placebo-in-time permutation test yields p=0.16, so the effect is suggestive, not conclusive, given a short and noisy pre-period; and (4) Google organic clicks to treated pages did not fall beyond the ambient site-wide trend and indexation was preserved, consistent with the SEO-protection rule. The methodological message, separating treatment from platform tailwind with an on-domain control, matters more than any single multiple, and implies that headline AEO multiples substantially overstate causal effect.

22. 【2606.04308】Creative Reading: Scaffolding Reading for Transformation

Link: https://arxiv.org/abs/2606.04308

Authors: Sophia Liu, Sarah Abowitz, Yijun Liu, Sarah Sterman, Shm Garanganao Almeda, Max Kreminski

Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Keywords: augmentation systems increasingly, Reading augmentation systems, Reading, systems increasingly, Reading augmentation

Abstract:Reading augmentation systems increasingly help readers process text at scale. While these tools address real constraints of time and cognitive load, they often implicitly frame reading as information transmission, or "reading to discard," delegating interpretation and effort to the machine. Yet this delegation changes the outcome of reading. For example, in scholarly reading, deciding what a research text implies and why it matters is central to the work of scholarly production. We propose creative reading as an alternative goal: reading augmentation that supports readers in creating both readings and themselves as readers. By putting literary and narrative theories into conversation with scholarly sensemaking and creativity support, we present a provocation-oriented design space for valuing the process of reading as a way of preserving a plurality of readings and transforming readers over time.

23. 【2606.04300】Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval

Link: https://arxiv.org/abs/2606.04300

Authors: Abdelrahman Abdallah, Mahmoud Abdalla, Mohammed Ali, Adam Jatowt

Categories: Information Retrieval (cs.IR)

Keywords: visual token embeddings, vision-language retrievers represent, Late-interaction vision-language retrievers, visual token, score queries

Abstract:Late-interaction vision-language retrievers represent each document page as many visual token embeddings and score queries with MaxSim. In systems such as ColPali, ColQwen, ColNomic, and Nemotron ColEmbed, the document embeddings are produced without seeing the query, so the same page is represented identically for a table lookup, a chart question, and a layout-sensitive evidence request. We introduce Argus, a family of query-conditioned late-interaction retrievers built on Qwen3.5-VL. Argus adds a region-aware Mixture-of-Experts module: the query encoder produces both retrieval embeddings and a compact context vector, the document page is pooled into spatial regions, and a query-aware router selects latent experts per region before MaxSim. The output remains a multi-vector index compatible with ColPali-style retrieval, but the document representation is now dependent on the query (i.e., $\mathbf{D}(q)$). All Argus models use a 1024-dimensional retrieval head, compared with the 2560-dimensional and 4096-dimensional heads of recent state-of-the-art systems, and are trained on roughly 9% of the available public supervision rather than the full pool. The 9B model reaches 92.67 NDCG@5 on ViDoRe V1 and 86.0 NDCG@5 on the combined V1+V2 leaderboard, the highest reported value for an open late-interaction model on the combined leaderboard. Wrapped in a Qwen3.6-27B agentic retrieval pipeline on ViDoRe V3, Argus-9B further improves its NDCG@10 from 60.28 to 64.80 over public tasks, showing that the same retriever serves both as a strong standalone system and as a search primitive for iterative LLM agents.
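
For readers unfamiliar with MaxSim, the query-independent scoring that the abstract attributes to ColPali-style retrievers can be sketched as follows. This is a minimal illustration in plain Python, not the Argus implementation: the function names (`dot`, `maxsim_score`, `rank_pages`) and the toy vectors are hypothetical, and the query-conditioned Mixture-of-Experts step that Argus adds before MaxSim is not reproduced here.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """MaxSim: for each query token, take its best-matching document token,
    then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_pages(query_tokens: List[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    """Return page ids sorted by descending MaxSim score (hypothetical helper)."""
    scores = {pid: maxsim_score(query_tokens, toks) for pid, toks in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy example: two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens exactly
    "page_b": [[0.5, 0.5]],              # one blended token
}
ranking = rank_pages(query, pages)
```

In this sketch the page embeddings are fixed regardless of the query, which is exactly the limitation the abstract describes; a query-conditioned variant would transform `doc_tokens` as a function of the query before scoring.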

24. 【2606.04280】The Loss Is Not Enough: Sampling Conditions and Inductive Bias in Contrastive Representation Learning

Link: https://arxiv.org/abs/2606.04280

Authors: Justinas Zaliaduonis, Patrick Putzky, Till Richter, Sergios Gatidis

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: remain incompletely understood, geometry remain incompletely, incompletely understood, latent geometry remain, meaningful latent geometry

Abstract:Contrastive learning has become a leading paradigm for self-supervised representation learning, yet the conditions under which it recovers meaningful latent geometry remain incompletely understood. We develop a measure-theoretic framework formalizing the diversity condition, a support requirement on positive-pair sampling that is necessary for isometric latent recovery. We show that the standard full-support von Mises-Fisher setting implies the satisfaction of the diversity condition and as a consequence global contrastive loss minimizers recover latent geometry up to orthogonal transformation, while restricted conditionals can make non-orthogonal maps attain strictly lower asymptotic contrastive loss. We introduce a support-corrected Information Noise Contrastive Estimation (InfoNCE) variant as a theoretical fix: this correction makes orthogonal latent space recovery achievable but does not uniquely select it. Experiments on synthetic benchmarks validate the identifiability predictions, and CIFAR-10 experiments are consistent with the qualitative prediction that architectural inductive bias becomes more important when sampling diversity is limited. Together, our results clarify how sampling mechanisms and encoder inductive bias interact in contrastive representation learning.

25. 【2606.04194】Training-Free Lexical-Dense Fusion for Conversational-Memory Retrieval

Link: https://arxiv.org/abs/2606.04194

Authors: Christian Lysenstøen

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: long multi-session histories, Turn Isolation Retrieval, long-term conversational memory, past turns, query across long

Comments: 9 pages, 3 figures, 10 tables. Code, data, and per-table receipts: [this https URL](https://github.com/Chrislysen/opsem)

Abstract:Retrieving the few past turns that answer a new query across long multi-session histories is the retrieval bottleneck behind long-term conversational memory (LoCoMo, LongMemEval). Recent concurrent work, Nano-Memory, shows that scoring a session by the maximum query-turn similarity (late interaction, "Turn Isolation Retrieval") beats mean-pooled session embeddings. We do not claim that effect; we replicate it and ask what a training-free, CPU-only retrieval stage should add around it. We report four findings. (1) Fuse: score-level fusion of the late-interaction dense score with BM25, under a single leave-one-conversation-out weight, adds +8.8 to +17.2 points of LoCoMo Hit@1 over late interaction alone across six encoders (all p < 1e-4), reaching Hit@1 0.752 / NDCG@5 0.829 (e5-large-v2), +11.2 pp over BM25. (2) An off-the-shelf web-search cross-encoder reranker over the fused top-10 hurts here, degrading Hit@1 by 6.9 pp (one reranker, one configuration). (3) A pooling-operator ablation shows top-k late interaction matches max-similarity, but a naive smooth-max (log-sum-exp) collapses for half the encoders. (4) The late-minus-early gap is large for all six encoders and tends to be larger for larger ones, while the marginal fusion gain shrinks; on LongMemEval-S, a lexical regime where BM25 saturates, the net fusion gain over BM25 is small and not significant. A per-category analysis frames the gain as a division of labor: dense late interaction helps most on multi-hop and temporal questions but trails BM25 on adversarial ones. The contribution is a controlled, reproducible account of a strong training-free retrieval recipe, not the late-interaction retriever itself (Nano-Memory's). We make no claim to a complete memory architecture; this is a retrieval-stage study.
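
The two ingredients named in the abstract, max-over-turns session scoring and score-level fusion with BM25 under a single weight, can be sketched roughly as below. This is only an illustration under stated assumptions: the min-max normalization, the convex-combination form `weight * dense + (1 - weight) * bm25`, and all function names are hypothetical choices, not details confirmed by the abstract, and the BM25 scores are taken as given inputs.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vec: Vector, turn_vecs: List[Vector]) -> float:
    """Score a session by its single best-matching turn (max similarity)."""
    return max(dot(query_vec, t) for t in turn_vecs)


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; assumed here so the two scales are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(dense: List[float], bm25: List[float], weight: float) -> List[float]:
    """Convex combination of normalized dense and BM25 scores (assumed form)."""
    d, b = min_max(dense), min_max(bm25)
    return [weight * x + (1.0 - weight) * y for x, y in zip(d, b)]


# Toy example: three candidate sessions with precomputed scores.
session_score = late_interaction_score([1.0, 0.0], [[0.2, 0.9], [0.7, 0.1]])
fused = fuse(dense=[0.0, 1.0, 0.5], bm25=[4.0, 0.0, 2.0], weight=0.75)
best_session = max(range(len(fused)), key=fused.__getitem__)
```

In the paper the single weight is chosen leave-one-conversation-out; here `weight=0.75` is an arbitrary placeholder.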

26. 【2601.18777】PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

Link: https://arxiv.org/abs/2601.18777

Authors: Abhishek Divekar, Anirban Majumder

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Applications (stat.AP)

Keywords: RAG systems traditionally, Large Language Models, ranking and RAG, Evaluating the quality, RAG systems

Comments: Accepted at AAAI 2026 - Innovative Applications of AI (IAAI-26)

Abstract:Evaluating the quality of search, ranking and RAG systems traditionally requires a significant number of human relevance annotations. In recent times, several deployed systems have explored the usage of Large Language Models (LLMs) as automated judges for this task while their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics which require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift for an LLM-based query reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduced the computational complexity from O(2^|C|) to O(2^K), where |C| represents corpus size (in order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric, while effectively correcting for LLM bias in low-resource settings.
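
As background, the basic Prediction-Powered Inference estimate of a mean combines LLM-judge scores on a large unlabeled pool with a bias correction measured on a small human-labeled set. The sketch below shows only that standard building block; it is not the PRECISE estimator, whose extension to sub-instance (query-document) annotations and Precision@K is not described in enough detail here to reproduce. The function name and the toy data are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def ppi_mean(
    judge_unlabeled: Sequence[float],
    judge_labeled: Sequence[float],
    human_labeled: Sequence[float],
) -> float:
    """Prediction-powered mean: judge average on the unlabeled pool plus a
    'rectifier' equal to the mean (human - judge) gap on the labeled set."""
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("labeled judge and human scores must align one-to-one")
    rectifier = fmean(h - j for h, j in zip(human_labeled, judge_labeled))
    return fmean(judge_unlabeled) + rectifier


# Toy example: the judge over-predicts relevance on one of four labeled items.
estimate = ppi_mean(
    judge_unlabeled=[1.0, 1.0, 0.0, 1.0],  # judge mean = 0.75
    judge_labeled=[1.0, 1.0, 1.0, 0.0],
    human_labeled=[1.0, 0.0, 1.0, 0.0],    # mean gap = -0.25
)
```

The rectifier term is what corrects for systematic judge bias: if the judge agreed with humans on every labeled item, the estimate would reduce to the judge average alone.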

27. 【2606.04755】Archi: Agentic Operations at the CMS Experiment

Link: https://arxiv.org/abs/2606.04755

Authors: Pietro Lugato, Luca Lavezzo, Jason Mohoney, Hasan Ozturk, Muhammad Hassan Ahmed, Juan Pablo Salas, Viphava Ohm, Krittin Phornsiricharoenphant, Gabriele Benelli, Mariarosaria D'Alfonso, Manasvita Joshi, Warren Nam, Aron Soha, Samantha Sunnarborg, Austin Swinney, Jack Tucker, Dmytro Kovalskyi, Tim Kraska, Christoph Paus

Categories: High Energy Physics - Experiment (hep-ex); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: heterogeneous data sources, Computing Operations team, framework for scientific, deployment of configurable, present Archi

Abstract:We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN's LHC since February 2026 as a support agent for technical operators, offering retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems. We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels. The system proves effective at operational tasks, resolving real-world queries posed by CMS operators. We also observe that locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.

Computer Vision

1. 【2606.05162】Controllable Dynamic 3D Shape Generation via 3D Trajectories and Text

Link: https://arxiv.org/abs/2606.05162

Authors: Jaeyeong Kim, Ines Kim, Jahyeok Koo, Seungryong Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shape generation conditioned, feed-forward framework, controllable dynamic, shape generation, generation conditioned

Comments: Project page: [this https URL](https://cvlab-kaist.github.io/T2Mo/)

Abstract:We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Due to the inherent ambiguity of language, generating precisely intended motions using text alone remains challenging. To address this, we adopt 3D trajectories as controllable spatial guidance, specifying the exact paths along which selected points should move. By combining both, T2Mo generates object motions that spatially adhere to the given trajectories while globally reflecting the text semantics. To robustly handle trajectory inputs with arbitrary configurations, ranging from dense to sparse and unevenly distributed, we further propose a shape-grounded trajectory embedding that maps an input trajectory set into a shape-aware token set covering the entire object. We conduct extensive comparisons against text-based baselines and cascaded video-based baselines that combine trajectory-guided video generation with video-to-dynamic mesh generation. Quantitative and qualitative evaluations, along with user studies, demonstrate that our approach produces motions that more faithfully follow the given prompts with higher expressiveness while preserving motion quality.

2. 【2606.05149】An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers

Link: https://arxiv.org/abs/2606.05149

Authors: Gandhimathi Padmanaban, Fred Feng

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: cyclist injury severity, Vehicle body type, naturalistic roadway video, body type, significant determinant

Comments: 24 pages, 10 figures, venue TBD

Abstract:Vehicle body type is a significant determinant of cyclist injury severity in overtaking crashes, yet automated tools for classifying vehicles into injury-risk-relevant categories from naturalistic roadway video do not exist in the open literature. Standard object detection benchmarks provide only coarse vehicle labels (car, truck, bus, motorcycle), while existing fine-grained recognition systems are trained on controlled imagery and lack evaluation for deployment robustness across recording sites. This paper presents an open-source two-stage computer vision pipeline combining a pre-trained RT-DETR detector for coarse vehicle localization with a fine-tuned Vision Transformer (ViT-Base/16) for six-category body-type classification: passenger car, SUV, pickup truck, minivan, large van, and commercial truck. A confidence-based abstention mechanism withholds Stage 2 predictions when softmax output falls below 0.60, producing unknown labels rather than silent misclassifications. Evaluated on 3,805 annotated overtaking events from a bicycle-lane corridor in Ann Arbor, Michigan (in-distribution), the pipeline achieved 0.94 accuracy with per-class F1 scores from 0.91 (minivan) to 0.97 (SUV). On an independent out-of-distribution evaluation of 311 events from an open cycling dataset without retraining, accuracy was 0.89. Three of four well-represented categories maintained F1 at or above 0.90 under domain shift. The largest degradation was observed for minivan (F1 = 0.72), driven by abstention rate rising from 2.4% to 25.0% rather than active misclassification, consistent with the mechanism propagating genuine model uncertainty. The full pipeline, including inference scripts, training code, evaluation utilities, and model weights, is released as open-source software to support reproducibility and reuse across roadside video archives and cycling safety research.
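
The confidence-based abstention step described above can be sketched in a few lines. This is a minimal illustration, assuming that "softmax output" means the maximum class probability; the function names and the toy logits are hypothetical, and only the six category names and the 0.60 threshold come from the abstract.

```python
import math
from typing import List, Tuple

# The six body-type categories listed in the abstract.
LABELS = [
    "passenger car", "SUV", "pickup truck",
    "minivan", "large van", "commercial truck",
]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_abstention(logits: List[float], threshold: float = 0.60) -> Tuple[str, float]:
    """Return (label, confidence); emit 'unknown' when the top softmax
    probability falls below the threshold instead of forcing a class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return LABELS[best], probs[best]


# Toy logits: one confident prediction, one maximally uncertain one.
confident = classify_with_abstention([0.0, 3.0, 0.0, 0.0, 0.0, 0.0])
uncertain = classify_with_abstention([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

Under this rule an increase in abstentions lowers per-class F1 without adding wrong labels, which matches the abstract's reading of the minivan result under domain shift.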

3. 【2606.05142】GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

Link: https://arxiv.org/abs/2606.05142

Authors: Josef Bengtson, Yaroslava Lochman, Fredrik Kahl

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent developments, content generation, generation and customization, generative models, models have brought

Comments: Project page: [this https URL](https://gem-nr.github.io/)

Abstract:Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these methods to edits that preserve the underlying scene structure. Other approaches are trained for specific image editing tasks, such as object removal and addition. Despite this progress, general nonrigid edits, i.e., edits that substantially change the scene geometry, remain challenging for existing methods. We propose GeM-NR, a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor (such as FLUX, Qwen, BrushNet) and a query unedited image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: (i) depth map estimation, where we propose a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, (ii) projection onto a query viewpoint, and (iii) refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object. We demonstrate the ability of our method to handle edits with significant changes in geometry and appearance, something that existing methods struggle with. We perform an extensive evaluation showing that our method improves consistency for a wide variety of edit tasks, including generating 3D representations of the edited scene. Both quantitative and qualitative results indicate the state-of-the-art performance of our method in terms of edit quality as well as geometric and photometric consistency across multiple views.

4. 【2606.05124】Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting

Link: https://arxiv.org/abs/2606.05124

Authors: Hongyu Zhou, Zorah Lähner

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Gaussian Splatting, geometric surface representation, view synthesis, surface representation, Gaussian

Abstract:After the success of 3D Gaussian Splatting (3DGS) for novel view synthesis, many works have explored how to also use it for geometric surface representation. However, extracting accurate geometric information directly from 3DGS remains challenging and can often reduce the appearance rendering quality. In this work, we show that 3DGS in its default form is inherently unsuited to represent texture and geometry at the same time, by training with complete ground-truth texture and geometry information. We also propose a simple solution by applying a single additional geometry opacity parameter to each splat, together with an optional transparency-curated optimization pipeline. Our experiments, both with ground-truth and vision foundation model geometric input, show that this change leads to improved rendering and geometry performance on a wide variety of datasets, and especially complex scenes with transparent objects benefit significantly from our method.

5. 【2606.05115】Continual Visual and Verbal Learning Through a Child's Egocentric Input

Link: https://arxiv.org/abs/2606.05115

Authors: Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake, Mengye Ren

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: temporally structured stream, temporally structured, meanings of words, Children learn, egocentric video recordings

Comments: 15 pages, 4 figures

Abstract:Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.

6. 【2606.05107】Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have

Link: https://arxiv.org/abs/2606.05107

Authors: Elouan Gardès, Seung Eun Yi, Kartik Ahuja, Théo Moutakanni, Huy V. Vo, Piotr Bojanowski, Wolfgang M. Pernice, Loïc Landrieu, Camille Couprie

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: generic vision foundation, vision foundation models, specialized scientific domains, propose a label-free, label-free approach

Abstract:We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and hurt robustness. We instead leverage metadata to adapt representations to new domains in a self-supervised manner. Our method, FINO, combines a standard self-supervised objective with flexible metadata guidance that handles both highly granular discrete metadata and continuous metadata. It encourages the representation to preserve informative factors while suppressing spurious ones. Across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO consistently outperforms standard unsupervised domain adaptation and fully supervised adaptation. It also exceeds highly-specialized domain-specific state of the art, while using no task labels for backbone adaptation and only lightweight probes for supervision.

7. 【2606.05103】Identifying Gems from Roman RAPIDly

Link: https://arxiv.org/abs/2606.05103

Authors: Karan Gandhi, Ashish A. Mahabal, Jacob E. Jencson, Russ R. Laher, Ben Rusholme, Lin Yan, Ryan M. Lau, Schuyler D. Van Dyk, Mansi M. Kasliwal

Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: Nancy Grace Roman, Grace Roman Space, Roman Space Telescope, Nancy Grace, conduct wide-field infrared

Comments: 15 pages, 10 figures, Submitted to the Publications of the Astronomical Society of the Pacific

Abstract:The Nancy Grace Roman Space Telescope (Roman), set for launch as early as September 2026, will conduct wide-field infrared imaging surveys with unprecedented spatial resolution and cadence, enabling the discovery of millions of astronomical transients. Hence, it is necessary to have automated pipelines for generating alerts in place so that the telescope can begin discovering reliable transients and variable objects soon after it is launched. However, no real Roman data currently exist, making the development of such pipelines difficult. In this work, we present a machine learning model $RuBR$ and a general methodology for distinguishing genuine transient and variable detections from spurious (bogus) detections within the RAPID pipeline. In particular, we present three models using this methodology: $RuBR_{comb}$ trained and tested on combined locally injected and OpenUniverse2024 transients, $RuBR_{loc}$ trained on locally injected transients and tested on OpenUniverse2024 transients, and $RuBR_{DA}$ that combines locally injected transients with a fraction of OpenUniverse2024 transients in domain-adaptation mode for training. This paves the way for strategies to adapt the $RuBR_{comb}$ model to real observations in the absence of any ground-truth labels during the early phases of the Roman mission. While the image differencing pipeline continues to be improved, our experimental results demonstrate the effectiveness of the proposed approach and its promise for robust real-bogus classification in the Roman era.

8. 【2606.05102】ZipSplat: Fewer Gaussians, Better Splats

Link: https://arxiv.org/abs/2606.05102

Authors: Alexander Veicht, Sunghwan Hong, Dániel Baráth, Marc Pollefeys

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: current approaches predict, Splatting methods reconstruct, Gaussian Splatting methods, single forward pass, Gaussian Splatting

Abstract:Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forward pass, yet current approaches predict one Gaussian per input pixel, tying the representation budget to camera resolution rather than scene complexity. A flat wall and a richly textured object thus produce equally many Gaussians despite very different geometric needs. We propose ZipSplat, a token-based feed-forward model that decouples Gaussian placement from the pixel grid. A multi-view backbone extracts dense visual tokens, and k-means clustering compresses them into a compact set of scene tokens. Cross- and self-attention refine these tokens, and a lightweight MLP decodes each into a group of Gaussians with unconstrained 3D positions. Because clustering is applied at inference, a single trained model spans the quality-efficiency curve without retraining. ZipSplat operates without ground-truth poses or intrinsics, yet sets a new state of the art on DL3DV and RealEstate10K with ${\sim}6{\times}$ fewer Gaussians than pixel-aligned methods, surpassing the best pose-free baseline by 2.1dB and 1.2dB PSNR, respectively. It further generalizes zero-shot to Mip-NeRF360 and ScanNet++, outperforming all comparable baselines. Our project page is at this https URL.

9. 【2606.05071】InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space

Link: https://arxiv.org/abs/2606.05071

Authors: Jiarui Wu, Yujin Wang, Ruikang Li, Fan Zhang, Mingde Yao, Tianfan Xue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Language-guided photo retouching, photo retouching aims, Language-guided photo, geometry and texture, aims to adjust

Comments: Computer Vision and Pattern Recognition (CVPR), 2026

Abstract:Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Diffusion-based retouching has recently shown superior visual quality, but it often struggles with fidelity, because of its generative nature, and with efficiency, because of its iterative sampling process. In this work, we propose an efficient and fidelity-preserving retouching method using bilateral space manipulation, which is both compact and content-decoupled. Specifically, instead of directly editing pixels or image latents, our model predicts a low-resolution bilateral grid of affine transforms, which are sliced using a learned guidance map and then applied to the full-resolution image. This approach yields both high fidelity and improved efficiency. To retain strong priors of a pretrained generative model, we distill a multi-step diffusion model into our bilateral grid framework using Variational Score Distillation, complemented by a prompt alignment loss to guide instruction-following behavior. Additionally, we introduce a new benchmark and evaluate our method across multiple dimensions: fidelity, instruction following, and efficiency. Compared with recent retouching methods such as Gemini-2.5-Flash (Nano-Banana), our method avoids content drift, significantly reduces latency, and generates visually pleasing edits while maintaining a high level of fidelity. Project page: this https URL.

10. 【2606.05068】MaCo-GAN: Manifold-Contrastive Adversarial Learning for Single Image Super-Resolution

Link: https://arxiv.org/abs/2606.05068

Authors: Daeyoung Han, Seongmin Hwang, Moongu Jeon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative Adversarial Networks, Single Image Super-Resolution, Conventional Generative Adversarial, Conventional Generative, strict conditional realism

Abstract:Conventional Generative Adversarial Networks (GANs) for Single Image Super-Resolution (SISR) often struggle with hallucinated artifacts, largely because standard discriminators evaluate overall image naturalness rather than strict conditional realism. To address this, we propose MaCo-GAN, a novel manifold-contrastive GAN framework that replaces the conventional adversarial loss with a supervised contrastive objective. A core component of our method is a dynamic fake sample synthesizer that transforms ground truth (GT) data into a spectrum of challenging, perceptually plausible fake images that strictly maintain low-resolution (LR) correspondence. Utilizing these synthesized samples, we establish a robust contrastive minimax game: the generator is trained to attract its predictions toward on-manifold fakes (low distortion) and repel them from off-manifold fakes (high distortion), while the discriminator optimizes the exact opposite. By simply replacing the adversarial loss of a baseline SR model with our proposed objective, we demonstrate consistent improvements in the perception-distortion trade-off across various benchmarks. Extensive ablation studies validate the effectiveness of our framework and provide deep insights into the dynamics of this conditional contrastive game.

11. 【2606.05058】UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

Link: https://arxiv.org/abs/2606.05058

Authors: Jingyuan Chen, Sheng Jin, Haopeng Sun, Wentao Liu, Chen Qian

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: underpins modern engineering, Computer-Aided Design, underpins modern, creation of precise, modern engineering

Abstract:Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that ingests text, images, sketches, and point clouds and performs these heterogeneous tasks in an end-to-end fashion within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks demonstrate that UniCAD-MLLM achieves state-of-the-art performance across all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.

12. 【2606.05035】Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping

Link: https://arxiv.org/abs/2606.05035

Authors: Peilin Tao, Chong Cheng, Yuansen Du, Caiwei Song, Zhengqing Chen, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Hainan Cui, Shuhan Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requiring continuous camera-motion, online visual mapping, robot perception, requiring continuous, visual mapping

Abstract:Long-horizon online visual mapping is a core capability for robot perception, requiring continuous camera-motion and scene-geometry estimation from visual streams under bounded memory and computation. Recent feed-forward 3D reconstruction models provide strong geometric priors, but their streaming variants often predict poses in a fixed coordinate system tied to the first frame or a persistent scene memory. This fixed-gauge design leads to train-test mismatch, attention bias toward early anchors, and accumulated drift on sequences much longer than those seen during training. We propose Anchor3R, a streaming 3D reconstruction framework that treats feed-forward reconstruction as current-centric local measurement prediction rather than persistent global-gauge regression. At each time step, Anchor3R predicts window-relative poses and a local pointmap in the current-frame coordinate system, turning streaming reconstruction into relative-pose measurement generation. These measurements support online pose updates, while loop-closure reinsertion and motion averaging align the trajectory and transform local pointmaps into a coherent global reconstruction. Experiments on indoor, outdoor, driving, and RGB-D benchmarks show that Anchor3R improves long-horizon pose accuracy and dense reconstruction quality over existing streaming baselines, while supporting bounded-memory online inference.

13. 【2606.05031】MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation

链接https://arxiv.org/abs/2606.05031

作者:Dewei Zhou,Xinyu Huang,Xun Wang,Ji Xie,Yabo Zhang,Liang Li,Kunchang Li,Zongxin Yang,Yi Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:visual models fundamentally, models fundamentally struggle, Generative visual models, fundamentally struggle, visual models

备注

点击查看摘要

Abstract:Generative visual models fundamentally struggle with precise spatial control. This arises from a core disconnect: models can process textual descriptions of space but cannot directly map numerical coordinates onto the 2D image canvas. We introduce MetaPoint, a method that bridges this gap by representing a continuous 2D coordinate as a single, special token. Crucially, MetaPoint requires no new architectural components; it directly leverages the model's inherent positional encoding schemes to interpret these coordinates, treating our token as a virtual point on the canvas. This lightweight approach enables pixel-level control of an object's position with one token or its bounding box with two, all without requiring architectural changes or bespoke attention masking. The MetaPoint tokens are designed to be compositional, serving as spatial primitives. This allows a planner agent to decompose a high-level user request into a structured sequence of primitives for the generator. By providing a simple, precise, and scalable building block for spatial control, MetaPoint unlocks more powerful compositional generative agents and enables intuitive, interactive editing systems.

14. 【2606.05018】Handwriting Extraction and Analysis of Signature Lists in Swiss Popular Initiatives

链接https://arxiv.org/abs/2606.05018

作者:Marco Peer,Thomas Gorges,Mathias Seuret,Vincent Christlein,Andreas Fischer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:labor-intensive manual process, Swiss democracy, Popular initiatives, central to Swiss, manual process

备注: Accepted for presentation at ICCST 2026

点击查看摘要

Abstract:Popular initiatives and referendums are central to Swiss democracy, yet the validation of handwritten signature lists remains a labor-intensive manual process. This paper investigates the potential of automated document analysis methods, including OCR and AI-based handwriting analysis, to support this task. We propose a pipeline combining template-based line segmentation with text recognition and writer retrieval techniques, evaluated on a dataset of 443 handwritten entries from 418 writers. Results show that OCR struggles with out-of-vocabulary handwriting, with a CER of 29.6% for first names. In contrast, writer retrieval performs more robustly, reaching an mAP of 50.6%. Furthermore, our experiments indicate that off-the-shelf OCR systems are not sufficiently reliable for transcription of handwritten signature data, particularly for short, out-of-vocabulary entries such as names or addresses. However, writer retrieval methods can effectively identify visually similar entries across signature lists, making them a suitable tool for supporting the detection of potential duplicate submissions based on handwriting similarity.
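
The character error rate (CER) quoted in the abstract above is conventionally defined as the character-level edit distance between the OCR output and the reference transcription, divided by the number of reference characters. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names and the sample names in the usage note are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character edits turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r
                curr[j - 1] + 1,     # insert h
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars
```

For instance, `cer(["Müller"], ["Muller"])` counts one substitution over six reference characters. Short entries such as first names are penalized heavily by this metric, because a single wrong character is a large fraction of the string.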

15. 【2606.05011】CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

链接https://arxiv.org/abs/2606.05011

作者:Yurim Jeon,Dongseong Seo,Seung-Woo Seo

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:aerial image database, image database, Cross-view geo-localization estimates, estimates the geographic, geographic location

备注: 16 pages, 5 figures

点击查看摘要

Abstract:Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at this https URL.

16. 【2606.05008】M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

链接https://arxiv.org/abs/2606.05008

作者:Jie Huang,Ruixun Liu,Sirui Sun,Xinyi Yang,Yin Li,Yixin Zhu,Yiwu Zhong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:long-form video understanding, multi-modal models advance, multi-modal models, advance towards long-form, memory

备注: We present an evaluation designed for multi-modal memory in multi-modal models

点击查看摘要

Abstract:As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at this https URL.

17. 【2606.04992】Multi-Camera AR Guidance System for Surgical Instrument Handling and Assembly: Investigating Workload and Efficiency

链接https://arxiv.org/abs/2606.04992

作者:Shiyu Li,Julian Kreimeier,Hannah Schieber,Dirk Müller,Bernhard Kainz,Rüdiger von Eisenhart-Rothe,Daniel Roth

类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:surgery imposes high, imposes high cognitive, high cognitive demands, pose estimation, surgery imposes

备注: 11 pages

点击查看摘要

Abstract:The handling and assembly of instruments during surgery imposes high cognitive demands on scrub nurses, particularly when instruments are unfamiliar. We present a supporting guidance system for surgical instrumentation that combines multi-camera 6D pose estimation with augmented reality in-situ visualization on a head-mounted display without the requirement for additional markers. Pose estimation and consecutive camera calibration are achieved through known objects. The 6D pose estimation network is trained purely on synthetic data, aiming for better generalizability and real-world applicability. The AR guidance displays tooltip localization cues and step-wise assembly animations. Via gaze-based selection and a foot pedal, users can switch between assembly steps in intraoperative use. In a technical evaluation, our approach outperforms state-of-the-art 6D pose estimation. A user study with 29 scrub nurses was conducted in a surgical simulation of knee arthroplasty, comparing the system against a paper manual. AR guidance significantly reduced the perceived workload compared with the paper manual. Objectively, AR guidance reduced task completion time by 21.3% (4.76 minutes). Specifically, scrub nurses less experienced with the instrument set benefited when using the system. Error frequencies were comparable between conditions. Qualitative feedback highlighted improved process clarity, reduced information overload, and perceived independence. To summarize, our marker-free multi-camera AR guidance approach for surgical instruments can, subjectively and objectively, improve intraoperative instrumentation performance, particularly for untrained scrub nurses.

18. 【2606.04986】Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning

链接https://arxiv.org/abs/2606.04986

作者:Yu Zhu,Yongkang Li,Wenjie Zhu,Haoyi Jiang,Wenyu Liu,Wei Yang,Bin Li,Xinggang Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent studies, explored Vision-Language Models, studies have explored, explored Vision-Language, Recent

备注

点击查看摘要

Abstract:Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
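
The abstract above names Group Relative Policy Optimization (GRPO) as the reinforcement fine-tuning method. The core idea of GRPO is to sample a group of responses for the same prompt and score each one relative to the group, instead of training a separate value model. The sketch below shows only that group-relative advantage step under stated assumptions: the reward values are hypothetical, the abstract does not describe Food-R1's reward design, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    rewards: scalar rewards for several sampled responses to ONE prompt.
    eps: small constant (an assumption here) guarding against a zero std
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With hypothetical rewards `[1.0, 0.0, 0.0, 1.0]`, the two rewarded responses receive advantages close to +1 and the other two close to -1. If all responses score the same, every advantage is zero and the group contributes no learning signal.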

19. 【2606.04970】Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

链接https://arxiv.org/abs/2606.04970

作者:Kaustav Kundu,Ritvik Shrivastava,Maxim Arap,Nanshu Wang,Xianhui Zhu,Quintin Fettes,Gautam Tiwari,Parth Suresh,Théo Moutakanni,Alejandro Castillejo Munoz,Allen Bolourchi,Pascale Fung,Pinar Donmez,Babak Damavandi,Anuj Kumar,Seungwhan Moon

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:proactive multi-modal assistant, multi-modal assistant system, autonomously deciding, multi-modal assistant

备注: 53 pages, 14 figures

点击查看摘要

Abstract:We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt, and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight baselines (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

20. 【2606.04925】Scene-Centric Unsupervised Video Panoptic Segmentation

链接https://arxiv.org/abs/2606.04925

作者:Christoph Reich,Oliver Hahn,Nikita Araslanov,Laura Leal-Taixé,Christian Rupprecht,Daniel Cremers,Stefan Roth

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:semantically consistent regions, VPS, unsupervised VPS, aims to jointly, jointly detect

备注: CVPR 2026. Oliver Hahn and Christoph Reich - both authors contributed equally. Code: [this https URL](https://github.com/visinf/cups/tree/main/videocups) Project page: [this https URL](https://visinf.github.io/videocups/)

点击查看摘要

Abstract:Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing unsupervised scene understanding works mainly focused on image segmentation tasks; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo-labels from scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label-efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.

21. 【2606.04922】Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

链接https://arxiv.org/abs/2606.04922

作者:Tran Dinh Tien,Zhiqiang Shen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:clinical data sensitivity, data sensitivity favors, sensitivity favors frozen, favors frozen backbones, Current prompt-based

备注: Preprint. Code is available at [this https URL](https://github.com/tientrandinh/OGKD)

点击查看摘要

Abstract:Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at this https URL.

22. 【2606.04920】Toward Multi-Domain and Long-Tailed Quantization via Feature Alignment and Scaling

链接https://arxiv.org/abs/2606.04920

作者:Chin-Yuan Yeh,Ting-An Chen,De-Nian Yang,Ming-Syan Chen

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Quantizing deep neural, deep neural networks, Quantizing deep, resource-constrained devices, deep neural

备注

点击查看摘要

Abstract:Quantizing deep neural networks is essential for efficient inference on resource-constrained devices. However, most existing methods are designed for single-domain and class-balanced data, leaving practical settings with domain shifts or severe class imbalance underexplored. We address these challenges with Efficient Multi-Domain Alignment Quantization (EmaQ), which aligns domain distributions through a CDF-based projection and uses sensitivity-aware weight aggregation to stabilize multi-domain quantization. We further extend EmaQ to EmaQ-LT for long-tailed quantization by introducing class-conditioned variance scaling and confidence-based logit adjustment to mitigate majority-class overconfidence. Theoretical analyses establish convergence guarantees and motivate the proposed sensitivity and scaling mechanisms. Experiments on standard, multi-domain (Office-31, Digits), and long-tailed (SynDigits-LT, CIFAR-10-LT, CIFAR-100-LT) benchmarks show that EmaQ and EmaQ-LT achieve strong low-bit performance under domain shift and class imbalance.

23. 【2606.04911】BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine

链接https://arxiv.org/abs/2606.04911

作者:Yang Liu,Jiajin Zhang,Danyang Tu,Yaojun Hu,Jiao Qu,Jiuyu Zhang,Yu Shi,Wei Fang,Shi Gu,Ling Zhang,Yingda Xia

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:mortality among women, Breast cancer remains, remains a leading, cancer-related mortality

备注

点击查看摘要

Abstract:Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis, and treatment planning, where each stage involves distinct imaging modalities, task objectives, and reasoning patterns. However, constrained by data scarcity and limited model versatility, existing medical MLLMs are typically evaluated on isolated modalities or narrow task families, limiting their ability to support workflow-level clinical reasoning. In this work, we first introduce BreastStage, a workflow-aligned breast imaging instruction corpus comprising 1.86M instruction-following pairs curated from 17 sub-datasets across 5 imaging modalities and 136 task templates. Its held-out split, BreastStage-Bench, provides a comprehensive benchmark for evaluating multimodal reasoning across the breast cancer care continuum. Building on this corpus, we propose BreastGPT, a unified MLLM equipped with a dual-branch visual encoder and concept-preserving token compression to bridge the scale gap between standard radiology and gigapixel pathology. On BreastStage-Bench, BreastGPT achieves 75.66% closed-ended accuracy and 89.92% open-ended score, outperforming both general-purpose and medical-specific MLLMs across clinical stages and task formats. These results suggest that workflow-aligned data and cross-scale visual modeling are critical for clinically grounded medical MLLMs. All data, code, and model checkpoints are released at this https URL.

24. 【2606.04898】CDPM-Align: Multi-Scale Guidance-Aligned Diffusion Pretraining for Robust Few-Shot Anatomical Landmark Detection

链接https://arxiv.org/abs/2606.04898

作者:Roberto Di Via,Irina Voiculescu,Francesca Odone,Vito Paolo Pastore

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:medical image analysis, image analysis supporting, interventional workflows, Anatomical landmark detection, analysis supporting

备注: Accepted MICCAI 2026

点击查看摘要

Abstract:Anatomical landmark detection is a fundamental task in medical image analysis supporting a wide range of diagnostic and interventional workflows. Although recent methods have achieved sub-millimetric localisation, accuracy alone is not sufficient for clinical deployment, which also requires reliable and robust predictions. Despite its clinical relevance, the impact of representation learning in this context is still underexplored. In this work, we introduce CDPM-Align, a multi-scale guidance-aligned conditional diffusion pre-training for anatomical landmark detection. Our experimental setup focuses on few-image and few-annotation regimes. Specifically, we employ three popular heterogeneous small-scale benchmark datasets for representation learning via conditional generative pre-training. Furthermore, we consider low-annotation scenarios for the downstream task of landmark detection, with 10 and 25 annotated images, reflecting realistic trade-offs between clinical effort and resource constraints for annotations. Our results confirm that generative pre-training enables the model to learn a robust representation. This improves both accuracy and uncertainty on the downstream tasks, advancing towards safe and efficient clinical deployment.

25. 【2606.04891】Hierarchical Space Partition for Surface Reconstruction

链接https://arxiv.org/abs/2606.04891

作者:Minjie Tang,Xiangfei Li

类目:Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)

关键词:Generating compact polygonal, Generating compact, vision and computer, computer graphics, point clouds

备注: Published in 2026 International Conference on 3D Vision (3DV)

点击查看摘要

Abstract:Generating compact polygonal models from point clouds is a key problem in 3D vision and computer graphics. However, due to inherent limitations of LiDAR scanning (e.g. range constraints and occlusions), critical scene information is often missing, leading to degraded reconstruction accuracy. To address this, we propose a plane assembling strategy that effectively recovers missing details while maintaining model compactness. We classify all the planes extracted from the scene into three categories: highly visible, barely visible, and invisible. The invisible planes, which are recovered by scene structure analysis, indicate the missing details. The three types of planes correspond to the three growth priorities. Each plane grows according to the priority level, and the space is partitioned progressively, namely, the hierarchical partition. Subsequently, we generate a watertight polygonal mesh from the partition via a min-cut-based optimization. Finally, comparisons on public datasets show the effectiveness and superiority of our method against mainstream approaches. The project page is available at this https URL.

26. 【2606.04888】HD-DinoMoE: A Class-Aware Hierarchical Dual Mixture-of-Experts Network for Scleral Anomaly Segmentation in Complex Acquisition Scenarios

链接https://arxiv.org/abs/2606.04888

作者:Yinxiang Yu,Maoxiang Chu,Qi Niu,Guanghu Liu,Wei Xu,Haotian Wang,Zhi Chen,Yutian Zhu,Yuelong Fan,Guanghao Liao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Traditional Chinese Medicine, Traditional Chinese, Chinese Medicine, Artificial Intelligence Ocular, Intelligence Ocular Auxiliary

备注: Submitted to Medical Image Analysis; 47 pages, 31 figures, 14 tables

点击查看摘要

Abstract:Traditional Chinese Medicine (TCM) ocular inspection provides empirical cues for assessing scleral surface anomalies, but its clinical use remains subjective and difficult to quantify. To support intelligent and quantifiable ocular inspection, this study presents the TCM-inspired Artificial Intelligence Ocular Auxiliary Diagnosis System (TAO) and focuses on pixel-level scleral surface anomaly segmentation. For clinical and user-acquired images affected by multi-source distributional discrepancies, diverse anomaly morphologies, and scleral specular reflection (SSR), we propose HD-DinoMoE, a class-aware hierarchical dual mixture-of-experts network. HD-DinoMoE combines class-aware dual-stream DINOv3 feature fusion with class-specific multi-expert decoding to segment Vessels, Yellow and Black Spots, and Blood Spots. A three-stage backbone-frozen routing strategy stabilizes dual-backbone adaptation; Progressive Confidence Penalty (PCP) Loss reduces high-confidence false positives and segmentation leakage in SSR regions; and Class-Aware Adaptive Sample Weighting (CA-ASW) balances sample- and class-level training contributions. We further construct the Multi-label Scleral Anomaly Segmentation Dataset (ML-SASD), a new benchmark with Clinical, Wild, and Mix settings and pixel-wise annotations for three anomaly categories. On ML-SASD-Mix, HD-DinoMoE achieves a mean Dice of 72.11% and a mean Intersection-over-Union of 58.44%, while maintaining favorable boundary localization and specular-region false-positive control. It also shows competitive generalization on the Vessels subset of the public SBVPI dataset. These results indicate that HD-DinoMoE provides a feasible segmentation solution for TAO under complex acquisition scenarios. The code and data access information are available at this https URL.
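
The Dice and Intersection-over-Union (IoU) scores reported in the abstract above are standard overlap measures between a predicted and a ground-truth mask. The sketch below computes both for a single binary mask given as a flat sequence of 0/1 values. It illustrates the usual definitions only and is not the authors' evaluation code; the function name and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally long binary masks of 0/1 values."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)  # true-positive pixels
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    if union == 0:
        # Both masks are empty: treated as perfect agreement (an assumed convention).
        return 1.0, 1.0
    dice = 2 * inter / (pred_area + target_area)
    iou = inter / union
    return dice, iou
```

For a single mask pair the two scores are tied by Dice = 2·IoU / (1 + IoU). Reported means, such as the mean Dice and mean IoU above, are averaged over classes or images, so that identity need not hold between the averaged figures.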

27. 【2606.04881】DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance

链接https://arxiv.org/abs/2606.04881

作者:Yueying Zou,Peipei Li,Qianrui Teng,Dianyan Xu,Zekun Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:long-term biometric analysis, forensic identity analysis, biometric analysis, Face aging, Face aging plays

备注: 11 pages,10 figures, 5 tables

点击查看摘要

Abstract:Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces, but usually lack stochastic diversity. In contrast, pluralistic aging methods introduce local appearance variations, but often fail to explicitly regulate the identity evolution of the full aging sequence. In this paper, we propose DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce a Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and suppresses excessive cross-age identity drift through one-sided sampling-time guidance without modifying the training objective or introducing extra trainable parameters. Experiments demonstrate that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.

28. 【2606.04880】MAOAM: Unified Object and Material Selection with Vision-Language Models

链接https://arxiv.org/abs/2606.04880

作者:Jaden Park,Valentin Deschaintre,Jason Kuen,Kangning Liu,Iliyan Georgiev,Krishna Kumar Singh,Yong Jae Lee,Michael Fischer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Selection, core operation, operation in interactive, material, interactive image editing

备注: Accepted to SIGGRAPH 2026 Conference. Project page: [this https URL](https://jadenpark0.github.io/project_pages/maoam/)

点击查看摘要

Abstract:Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interactions, and the system should support selecting not only objects but also other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language-model (VLM) based selection methods are object-centric and typically support a single interaction modality, limiting their applicability. In this work, we thus present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual-semantics. We train MAOAM with a multi-task objective over click and text-based selection, along with an auxiliary VQA task derived from the material descriptions to facilitate deeper material understanding. Despite being trained with uni-modal prompts, our model exhibits an emergent improvement in selection when combining text and clicks at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.

29. 【2606.04871】Recent Advances and Trends in Learning-based 3D Representations

链接https://arxiv.org/abs/2606.04871

作者:Adrien Schockaert,Hamid Laga,Hazem Wannous,Vincent Magnier,Guillaume Dufaye,Jean-françois Witz

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fundamental design decision, modern computer vision, dictates the efficiency, novel-view synthesis, shape and motion

备注

点击查看摘要

Abstract:The selection of an appropriate 3D representation is a fundamental design decision that dictates the efficiency, quality, and capabilities of modern computer vision and graphics pipelines for tasks such as 3D reconstruction, novel-view synthesis and rendering, shape and motion analysis, recognition, and generation. While traditional representations (e.g., meshes, point clouds, and volumetric grids) remain standard outputs of 3D sensors (e.g., LiDAR and 3D scanners) and are widely used in downstream applications (e.g., editing and simulation), recent neural and primitive-based representations (e.g., 3D Gaussian Splatting) offer compact and differentiable alternatives opening a wide range of opportunities in applications such as games, AR/VR, autonomous driving, robot navigation, and medical imaging, to name a few. The goal of this paper is to survey the main families of 3D representations from discrete explicit formats to continuous implicit fields based either on neural rendering or primitive splatting. For each type of representation, we present the general formulation and its variants, discuss its benefits and limitations, and highlight key applications. We conclude the paper by outlining the open challenges and potential directions for future research. Distinct from recent surveys that broadly cover 3D object and scene reconstruction, this paper provides a focused analysis on the evolution of 3D representations themselves. We specifically emphasize the paradigm shift toward implicit representations, offering a novel perspective on how these emerging formats fundamentally alter 3D/4D workflows.

30. 【2606.04863】IRIS-GAN: Staged Specialist Detection of Deepfake Faces

Link: https://arxiv.org/abs/2606.04863

Authors: Jaume M. Trenchs,Veronica Sanz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cross-generator shift, synthetic face images, images under cross-generator, introduce IRIS-GAN

Comments: 20 pages, 10 figures

Click to view abstract

Abstract:We introduce IRIS-GAN, a specialist forensic detector for synthetic face images under cross-generator shift. Rather than addressing universal synthetic-image detection, we focus on faces generated by generative adversarial networks (GANs), which are state-of-the-art in deepfake content, and train the detector through staged exposure to increasingly demanding GAN families while retaining earlier generators. The final model reaches fake-detection rates above 99% across the GAN families considered and classifies an external real-face dataset with 98.9% accuracy. Grad-CAM analysis further reveals measurable generator-dependent spatial response patterns, which remain informative for a secondary heatmap-only classifier. Out-of-family tests on diffusion-generated faces confirm that IRIS-GAN is a specialist detector, with some capability to reach non-GAN deepfakes. These results establish staged training as an effective strategy for robust GAN-face forensics.

31. 【2606.04847】MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

Link: https://arxiv.org/abs/2606.04847

Authors: Kun Cheng,Songshuo Lu,Sicong Liao,Tankun Li,Yafei Zhang,Dong Yang,Qiheng Lv,Hua Wang,Zhi Chen,Yaohua Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: efficient low-level code, turns high-level tensor, high-level tensor programs, Native GPU kernel, generation turns high-level

Comments

Click to view abstract

Abstract:Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.

32. 【2606.04844】Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification

Link: https://arxiv.org/abs/2606.04844

Authors: Tu Vo,Sheir Zaheer,Chan Y. Park

Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Contrastive audio-language models, CLAP enable zero-shot, Contrastive audio-language, zero-shot audio classification, enable zero-shot audio

Comments

Click to view abstract

Abstract:Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding to text prompt embeddings, with no labelled audio. This matching breaks down under acoustic noise, where accuracy and mAP fall by 12-30 percentage points at 0 dB SNR on standard benchmarks. We propose Drift Augmented Scoring (DAS), a small per-class bonus added to the cosine score. The bonus rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict. It is derived from text alone, computed once and cached, and adds a single inner product per class at inference, with no gradients and no test-time batch. On a LAION CLAP backbone, we compare DAS against the four variants of Acevedo et al.'s concurrent method on UrbanSound8K and the full FSD50K eval set, mixing each clip with urban acoustic scene noise across a range of SNRs. DAS improves the metric on every test condition: by +2.60 to +5.75 accuracy points on UrbanSound8K and +1.50 to +1.74 mAP points on FSD50K.

33. 【2606.04836】3D Temporal Analysis for Autism Spectrum Disorder Screening During Attention Tasks

Link: https://arxiv.org/abs/2606.04836

Authors: Inam Qadir,Elizabeth B Varghese,Dena Al-Thani,Marwa Qaraqe

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate Autism Spectrum, Autism Spectrum Disorder, Accurate Autism, Spectrum Disorder, Autism Spectrum

Comments

Click to view abstract

Abstract: Accurate Autism Spectrum Disorder (ASD) screening for school-age children is crucial to identify cases that may have been missed earlier and to enable timely interventions supporting social, cognitive, and academic development. Current ASD screening relies on subjective assessments and 2D analysis methods that fail to capture spatial displacement patterns characteristic of ASD behaviors. In this study, a novel 3D temporal analysis framework is presented, built on top of DECA (Detailed Expression Capture and Animation), a 3D modeling framework, to extract comprehensive head pose parameters (including translational components $T_x, T_y, T_z$) and facial expressions independent of pose variations. LSTM and GRU-based temporal classifiers were trained on the extracted 3D features from video data collected from 39 participants (19 ASD, 20 TD) aged 7-12 years during Virtual Reality-Continuous Performance Test tasks. The GRU-based models demonstrated superior performance, with 3D head pose features achieving 83.9% accuracy and 3D facial features reaching 81.4% accuracy, outperforming 2D baseline approaches by 10.7% and 7.5%, respectively. Furthermore, multimodal fusion of 3D head pose and facial features with PCA-based dimensionality reduction achieved the highest accuracy of 84.6%, outperforming unimodal approaches. This work establishes a foundation for objective, automated screening tools addressing current diagnostic limitations in ASD identification for school-age populations.

34. 【2606.04820】OA-CutMix: Correcting the Label Bias of CutMix

Link: https://arxiv.org/abs/2606.04820

Authors: Tobias Christian Nauen,Stanislav Frolov,Federico Raue,Brian B. Moser,Andreas Dengel

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: pasted patch faithfully, patch faithfully reflects, label assignment rests, standard mixing augmentation, facto standard mixing

Comments

Click to view abstract

Abstract:CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: The area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label credit to classes whose objects are not visible. The mean discrepancy of the CutMix label and the semantic object area is $21.5\%$. In $17\%$ of samples an image contributes zero visible object pixels yet receives nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure is left entirely unchanged. We evaluate OA-CutMix against 10+ static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy over all tasks, outperforming even dynamic mixing methods, but at a fraction of the training-time cost. Improvements are largest for small objects, where the label bias from CutMix is greatest. Thus, correcting the label is sufficient to match or exceed the performance of methods modifying the image mixing algorithm.
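
To make the label bias concrete, here is a minimal sketch contrasting the area-based CutMix weight with a weight proportional to visible object pixels, as the abstract describes. The function name `cutmix_label_weights`, the binary-mask representation, the normalisation by the total visible object area, and the fallback for object-free mixes are all assumptions for illustration, not the authors' implementation.

```python
from typing import List, Tuple

Mask = List[List[int]]  # 1 = object pixel, 0 = background (hypothetical format)


def cutmix_label_weights(
    mask_a: Mask, mask_b: Mask, box: Tuple[int, int, int, int]
) -> Tuple[Tuple[float, float], Tuple[float, float]]:
    """Return (area_weights, object_aware_weights) for images A and B.

    `box` = (top, left, bottom, right), half-open, is the region of image B
    pasted onto image A at the same coordinates, as in standard CutMix.
    """
    height, width = len(mask_a), len(mask_a[0])
    top, left, bottom, right = box

    # Standard CutMix: the label weight of B is the patch-area fraction.
    lam_b = (bottom - top) * (right - left) / (height * width)
    area_weights = (1.0 - lam_b, lam_b)

    # Object-aware variant: count object pixels that remain visible.
    visible_a = 0
    visible_b = 0
    for r in range(height):
        for c in range(width):
            inside = top <= r < bottom and left <= c < right
            if inside:
                visible_b += mask_b[r][c]  # B is visible inside the patch
            else:
                visible_a += mask_a[r][c]  # A is visible outside the patch

    total = visible_a + visible_b
    if total == 0:
        # Assumption: fall back to area weights when no object is visible.
        return area_weights, area_weights
    return area_weights, (visible_a / total, visible_b / total)


# Both objects sit in the top-left corner; the patch covers the bottom-right.
mask_a = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
mask_b = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
area, aware = cutmix_label_weights(mask_a, mask_b, (2, 2, 4, 4))
print(area, aware)
```

In this toy case the pasted patch contains only background from image B, so the area-based rule still credits B with a quarter of the label while the object-aware rule gives it none.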

35. 【2606.04811】Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Link: https://arxiv.org/abs/2606.04811

Authors: Rui Zhao,Kaiming Yang,Jifeng Zhu,Siyang Chen,Ziqi Wang,Weijia Wu,Kevin Qinghong Lin,Heng Wang,Mike Zheng Shou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visually compelling content, made impressive strides, synthesizing visually compelling, outputs remain confined

Comments

Click to view abstract

Abstract: Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at this https URL.

36. 【2606.04806】NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

Link: https://arxiv.org/abs/2606.04806

Authors: Sichao Li,Sai Ma,Daniel Kilov,Secil Yanik Guyot,Zhuang Li,Seth Lazar

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: making normative competence, normative competence critical, LLMs and agentic, social environments, increasingly deployed

Comments

Click to view abstract

Abstract:LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

37. 【2606.04801】Fast Cubical Persistent Homology on 2D and 3D Images via Union-Find, Pruning, and Lookup Tables

Link: https://arxiv.org/abs/2606.04801

Authors: Titouan Le Breton,Karol Szustakowski,Marie Piraud

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: present Flash Cubical, present Flash, Flash Cubical, cubical complexes, Cubical

Comments

Click to view abstract

Abstract: We present Flash Cubical, a highly efficient computation of cubical persistence on a V-filtration for 2D and 3D images over $\mathbb{F}_2$. The implementation is built around three core ideas. First, cubical complexes satisfy properties that allow the persistence of the highest dimension to be computed via union-find and duality. Second, pruning certain edges allows for a fast and efficient implementation of union-find. Third, a lookup table exploits the regularity of cubical complexes to pre-compute local information, which avoids computing it at run time. To the best of our knowledge, this is the most efficient implementation of cubical persistence with a V-filtration, both in terms of time and memory costs. Although the paper focuses on persistence for V-filtration cubical complexes, the underlying ideas generalise naturally to T-filtrations on cubical complexes and suggest promising directions for other complexes.
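
As background for the union-find idea, the sketch below computes only 0-dimensional sublevel-set persistence of a 2D image, treating pixels as vertices with 4-connectivity (the usual reading of a V-filtration). It is a textbook illustration, not Flash Cubical: the pruning, the lookup table, and the duality step for the highest dimension are omitted, and the name `h0_persistence` is hypothetical.

```python
import math
from typing import Dict, List, Tuple


def h0_persistence(image: List[List[float]]) -> List[Tuple[float, float]]:
    """Return sorted (birth, death) pairs of connected components."""
    height, width = len(image), len(image[0])
    # Visit pixels by increasing value; ties are broken by pixel index.
    order = sorted(
        range(height * width),
        key=lambda i: (image[i // width][i % width], i),
    )
    parent: Dict[int, int] = {}   # union-find forest over visited pixels
    birth: Dict[int, float] = {}  # birth value stored at each root

    def find(i: int) -> int:
        # Find the root with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs: List[Tuple[float, float]] = []
    for idx in order:
        r, c = divmod(idx, width)
        value = image[r][c]
        parent[idx] = idx
        birth[idx] = value
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < height and 0 <= nc < width):
                continue
            neighbor = nr * width + nc
            if neighbor not in parent:
                continue  # neighbour has not entered the filtration yet
            young, old = find(idx), find(neighbor)
            if young == old:
                continue
            # Elder rule: the component born later dies at this merge.
            if birth[young] < birth[old]:
                young, old = old, young
            if birth[young] < value:
                pairs.append((birth[young], value))  # skip zero persistence
            parent[young] = old
    # The component of the global minimum never dies.
    first = order[0]
    pairs.append((image[first // width][first % width], math.inf))
    return sorted(pairs)


image = [[1, 4, 2],
         [4, 4, 4],
         [3, 4, 0]]
print(h0_persistence(image))
```

The four corner minima in the example are separated by a ridge of value 4, so three components die when the ridge pixels enter the filtration and the component of the global minimum survives.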

38. 【2606.04797】Crafting Your Evolving Dreams: Concept-Incremental Versatile Customization

Link: https://arxiv.org/abs/2606.04797

Authors: Jiahua Dong,Wenqi Liang,Hongliu Li,Yang Cong,Duzhen Zhang,Hanbin Zhao,Henghui Ding,Yulun Zhang,Salman Khan,Fahad Shahbaz Khan

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Custom diffusion models, garnered significant interest, significant interest owing, Custom diffusion, generating personalized concepts

Comments: Accepted to Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

Click to view abstract

Abstract:Custom diffusion models (CDMs) have garnered significant interest owing to their remarkable capacity for generating personalized concepts. However, the majority of CDMs unrealistically presume that the user's collection of personalized concepts is static and incapable of incremental growth over time. Furthermore, they exhibit significant catastrophic forgetting and concept neglect of previously learned concepts when incrementally learning a sequence of new ones. To resolve the above challenges, we develop a novel Continually Customizable Diffusion Model (CCDM), enabling users to perform concept-incremental versatile customization. Specifically, we design an attribute-decoupled LoRA (AD-LoRA) module and a relevance-guided AD-LoRA aggregation strategy to mitigate catastrophic forgetting. They can preserve concept-specific attributes of each task and leverage beneficial inter-task correlations to enhance the continual learning of new customization tasks. Additionally, to address the challenge of concept neglect, we propose a controllable regional context synthesis strategy that performs multi-concept composition in alignment with user-provided conditions. This strategy enhances the overall consistency in multi-concept synthesis by guaranteeing semantic independence between user-defined regions and their smooth boundary transitions. Experiments show our CCDM exhibits significant improvements over baseline methods.

39. 【2606.04792】A Pathology Foundation Model for Gastric Cancer with Real-World Validation

Link: https://arxiv.org/abs/2606.04792

Authors: Ling Liang,Jiabo Ma,Zhengyu Zhang,Fengtao Zhou,Yingxue Xu,Yihui Wang,Cheng Jin,Zhengrui Guo,On Ki Tang,Zhijian Cen,Zhen Wang,Qi Xie,Chengyu Lu,Chenglong Zhao,Feifei Wang,Yu Cai,Hongyi Wang,Jing Zhang,Yaping Ye,Shijun Sun,Shenglei Li,Yu Wang,Zhenhui Li,Ronald Cheong Kin Chan,Xiuming Zhang,Zhe Wang,Hao Chen,Li Liang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gastric cancer remains, molecular heterogeneity complicates, heterogeneity complicates diagnosis, gastric cancer care, Gastric cancer

Comments

Click to view abstract

Abstract:Gastric cancer remains a major cause of cancer mortality, yet its histological and molecular heterogeneity complicates diagnosis and risk stratification. General-purpose pathology foundation models (PFMs) often plateau on fine-grained endpoints central to gastric cancer care, and few have undergone rigorous prospective validation or clinical reader studies. We present GRACE, a Gastric-specific foundation model for Real-world Assessment and Clinical dEcision support. GRACE was developed from multicenter gastric pathology datasets totaling 48,364 primarily HE-stained whole-slide images from 37,493 patients. When evaluated on 28 clinically relevant tasks, GRACE consistently outperformed representative pancancer PFMs, achieving a macro-AUC of 0.9188, with strong performance for precancerous lesion diagnosis (macro-AUC 0.9322), tumor histopathological assessment (macro-AUC 0.9119), molecular profiling (macro-AUC 0.8682), and prognostic prediction. Beyond benchmarking, GRACE's translational value was substantiated through a rigorous evidence chain. Under safety-gated criteria requiring 100% NPV for rule-out and 100% PPV for rule-in, GRACE streamlined review for up to 69.6% of malignancy-diagnosis cases and triaged 46.8% of MMR-IHC follow-up requests. This translational feasibility was further strengthened by a randomized crossover reader study of pathologist-AI collaboration. With GRACE assistance, diagnostic accuracy improved from 82.0% to 89.9%, yielding nearly twofold higher adjusted odds of a correct diagnosis (OR 1.987) alongside concurrent gains in sensitivity and specificity. AI assistance also reduced diagnostic time by 14.9%, elevated diagnostic confidence by 9.0%, and markedly improved inter-rater agreement. When calibrated to maintain non-inferior performance to senior pathologists, the AI-assisted workflow could triage 60.7% of atrophy and 82.7% of intestinal metaplasia cases.

40. 【2606.04788】Z-FLoc: Zero-Shot Floorplan Localization via Geometric Primitives

Link: https://arxiv.org/abs/2606.04788

Authors: Ayumi Umemura,Toshinori Kuwahara,Marc Pollefeys,Daniel Barath

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords

Comments

Click to view abstract

Abstract:Visual localization -- estimating a camera pose within a pre-existing map -- is a fundamental problem in computer vision. Floorplans are an attractive map representation: they are readily available for most buildings, compact, and inherently invariant to visual appearance changes. However, bridging the severe domain gap between camera observations and floorplan geometry remains challenging. Existing methods address this gap through data-driven learning, yet they require large-scale training data and environment-specific retraining, limiting their practical deployment. We propose a zero-shot floorplan localization method that generalizes to novel environments without any retraining. Our key insight is that dominant geometric primitives -- lines and circles -- are ubiquitous in human-made environments and provide appearance-invariant structural constraints. We extract these primitives from a bird's-eye-view (BEV) projection of monocular 3D reconstructions and match them to the floorplan via dedicated minimal solvers within a robust estimation framework. Experiments on both simulated and real-world datasets show that our approach outperforms state-of-the-art learning-based methods on unseen environments, while using a single fixed set of hyperparameters across all experiments. The source code will be made publicly available.

41. 【2606.04775】Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Link: https://arxiv.org/abs/2606.04775

Authors: Jihoon Hong,Alice Chan,Qiyue Dai,Julian Skifstad,Glen Chou

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY); Optimization and Control (math.OC)

Keywords: large-scale web data, generate undesired content, reduce harmful outputs, models trained, trained on large-scale

Comments

Click to view abstract

Abstract:Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.
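
For readers unfamiliar with LQR, the following is a minimal sketch of finite-horizon discrete-time LQR on a scalar system, showing how closed-loop feedback gains trade off tracking error against control effort. The scalar dynamics `x[t+1] = a*x[t] + b*u[t]`, the cost weights `q` and `r`, and the function names are assumptions chosen for illustration; LA-LQR itself works in a learned low-dimensional latent space with estimated dynamics, which is not reproduced here.

```python
from typing import List


def lqr_gains(a: float, b: float, q: float, r: float, horizon: int) -> List[float]:
    """Backward Riccati recursion for cost sum(q*x^2 + r*u^2) + q*x_T^2.

    Returns gains k[0..horizon-1] for the feedback law u[t] = -k[t] * x[t],
    where x is the deviation from the desired setpoint.
    """
    p = q  # terminal cost-to-go weight
    gains = []
    for _ in range(horizon):
        k = a * b * p / (r + b * b * p)    # optimal gain for this step
        p = q + a * a * p - a * b * p * k  # Riccati update of cost-to-go
        gains.append(k)
    gains.reverse()  # the recursion runs backward in time
    return gains


def simulate(a: float, b: float, gains: List[float], x0: float) -> List[float]:
    """Roll out the closed-loop system and return the state trajectory."""
    states = [x0]
    x = x0
    for k in gains:
        u = -k * x  # feedback intervention
        x = a * x + b * u
        states.append(x)
    return states


gains = lqr_gains(a=1.0, b=1.0, q=1.0, r=1.0, horizon=20)
trajectory = simulate(1.0, 1.0, gains, x0=1.0)
print(gains[0], trajectory[-1])
```

Raising `r` relative to `q` makes the controller more conservative, which is the scalar analogue of penalizing unnecessary perturbations.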

42. 【2606.04773】NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

Link: https://arxiv.org/abs/2606.04773

Authors: Yong Cao,Chuqiao Li,Xianghui Xie,Gerard Pons-Moll,Andreas Geiger

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: human motion understanding, Reliable evaluation, human motion, motion understanding, understanding is fundamental

Comments: 23 pages, 8 figures, 9 tables

Click to view abstract

Abstract: Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models (VLMs) for semi-automated, expert-verified dataset construction. NextMotionQA features three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction. Each task is systematically structured across three core semantic axes and stratified into three task complexity levels. Our extensive evaluation of twelve representative VLMs uncovers critical capability gaps and weaknesses that remain invisible under conventional, single-task evaluations. In a complementary direction, recent work has begun using VLMs as judges for text-to-motion evaluation; we ask whether they show the same degradation under harder tasks. We find that VLMs align strongly with expert ratings on coarse criteria (Cohen's $\kappa=0.70$) but break down on fine-grained, part-level judgment ($\kappa=0.10$), validating the paradigm in its strong regime while clarifying its limits.
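
As a reminder of what the reported agreement statistic measures, here is a small sketch of unweighted Cohen's kappa for two raters. The abstract does not say whether a weighted variant was used, so the unweighted form is an assumption, and the rating lists are made-up toy data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Toy example: a VLM judge and an expert rating four motions as good (1) or bad (0).
expert = [1, 1, 0, 0]
vlm_judge = [1, 0, 0, 0]
print(cohens_kappa(expert, vlm_judge))
```

Kappa discounts the agreement expected by chance, so a value near 0.10 means the judge is barely better than guessing from label frequencies even if raw agreement looks moderate.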

43. 【2606.04772】Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction

Link: https://arxiv.org/abs/2606.04772

Authors: Hoang-Son Vo,Van-Hung Bui,Minh-Huy Mai-Duc,Tien-Dung Mai,Soo-Hyung Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Understanding the relationship, human visual system, computational neuroscience, human visual cortex, deep visual representations

Comments

Click to view abstract

Abstract:Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. While modern vision models achieve strong performance in image recognition, their correspondence with the hierarchical organization of the human visual cortex remains an open question. In this study, we propose CHASMBrain, a novel hierarchical two-stage framework for image-to-fMRI encoding. Our architecture leverages a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches, motivated by the functional organization of the visual cortex. A coarse-to-fine strategy is employed: Stage 1 predicts denoised ROI-level activations, while Stage 2 refines these coarse responses into full voxel-level predictions using a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) demonstrate that our method achieves a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically locked to early visual cortex (retinotopic regions), while the CLS stream contributes broader semantic context to higher-order areas -- a correspondence that holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting the model captures a shared, subject-agnostic visual representation.

44. 【2606.04767】Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms

Link: https://arxiv.org/abs/2606.04767

Authors: Chong Zhang,Xiang Li,Jia Wang,Qiufeng Wang,Xiaobo Jin

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep neural networks, Fisher Information Matrix, existing evaluation methods, safety-critical deployments, lack interpretability

Comments: 35 pages, 1 figure

Click to view abstract

Abstract:The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric based on the spectral norm of the Fisher Information Matrix (FIM), which quantifies the worst-case sensitivity of the model's output distribution to input perturbations. Theoretically, we establish that the FIM equals the variance of the input Jacobian and derive closed-form spectral bounds for common architectures, including VGG, ResNet, DenseNet, and Transformer, providing the first theoretical robustness ranking. To enable scalable evaluation, we develop efficient algorithms, including power iteration and Hutchinson-based estimation, that support both white-box and black-box settings. Extensive experiments across multiple datasets, including CIFAR, ImageNet, and medical images, and across multiple architectures show a strong correlation between our metric and adversarial vulnerability. Our framework serves as an interpretable diagnostic tool that complements attack-based evaluations, offering insights into architectural sensitivity and guiding the design of more robust models. Code is available at: this https URL.
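
The power-iteration idea can be sketched on a toy model. For a softmax classifier with logits `z = W x`, the Fisher information with respect to the input is `W^T (diag(p) - p p^T) W`, and its largest eigenvalue (the spectral norm) can be estimated using matrix-vector products only. The linear model, the fixed start vector, and the function names below are assumptions for illustration; the paper's algorithms for deep networks and black-box settings are not reproduced.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fim_vector_product(weights: Matrix, probs: List[float], v: List[float]) -> List[float]:
    """Compute W^T (diag(p) - p p^T) W v without forming the matrix."""
    u = [sum(w * x for w, x in zip(row, v)) for row in weights]  # u = W v
    mean_u = sum(p * ui for p, ui in zip(probs, u))              # p . u
    s = [p * (ui - mean_u) for p, ui in zip(probs, u)]           # (diag(p) - p p^T) u
    return [
        sum(weights[k][j] * s[k] for k in range(len(weights)))   # W^T s
        for j in range(len(v))
    ]


def fim_spectral_norm(weights: Matrix, x: List[float], iterations: int = 100) -> float:
    """Estimate the largest eigenvalue of the input-space FIM at x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    # Fixed start vector (assumption); a random start is the usual choice.
    v = [1.0 / (i + 1) for i in range(len(x))]
    for _ in range(iterations):
        fv = fim_vector_product(weights, probs, v)
        norm = math.sqrt(sum(c * c for c in fv))
        if norm == 0.0:
            return 0.0  # start vector lies in the null space
        v = [c / norm for c in fv]
    fv = fim_vector_product(weights, probs, v)
    return sum(a * b for a, b in zip(v, fv))  # Rayleigh quotient


identity = [[1.0, 0.0], [0.0, 1.0]]
print(fim_spectral_norm(identity, [0.0, 0.0]))
```

A larger value indicates that some input direction changes the predicted distribution more sharply, which is the sense in which the abstract ties the spectral norm to worst-case sensitivity.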

45. 【2606.04764】Do Foundation Models See Biology? Evaluating Attention Coherence with Spatial Transcriptomics in Glioblastoma

Link: https://arxiv.org/abs/2606.04764

Authors: Dilakshan Srikanthan,Amoon Jamzad,Paul Wilson,Nooshin Maghsoodi,Robert Policelli,Gabor Fichtinger,John F. Rudan,Parvin Mousavi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: biology remains unknown, genuine biology remains, pathology foundation models, capture genuine biology, remains unknown

Comments

Click to view abstract

Abstract:Whether attention maps from pathology foundation models capture genuine biology remains unknown, yet this question is critical for clinical trust and regulatory approval. We propose a spatial transcriptomics-based framework for orthogonal, hypothesis-free evaluation of attention and apply it to five pathology foundation models (CONCH v1.5, UNI v2, Virchow2, GigaPath, H-Optimus-1) and a ResNet50 baseline. Using attention-based multiple instance learning, we train single-task and multi-task models to predict five molecular alterations in glioblastoma on the CPTAC cohort, validate on an independent TCGA cohort, and evaluate biological coherence of attention maps against 87 transcriptional signatures using co-registered Visium spatial transcriptomics data from 18 samples. Internally, no single encoder dominates across all tasks, and external validation inverts internal performance rankings. Attention maps show a five-fold enrichment gradient from pathways (Cohen's d=0.329) to individual genes (d=0.055), indicating that attention captures emergent multi-gene transcriptional programs rather than individual molecular events. Spatially smooth attention maps do not imply biological coherence, and different encoders attend to distinct biological compartments. Our framework provides objective, quantitative assessment of what foundation models learn from histopathology, moving the field beyond qualitative saliency map review.

46. 【2606.04737】Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment

Link: https://arxiv.org/abs/2606.04737

Authors: Cong Wang, Hanxin Zhu, Jiayi Luo, Yonglin Tian, Xiaoqian Cheng, Peiyan Tu, Xin Jin, Long Chen, Zhibo Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large-scale video generation, made remarkable progress, Large-scale video, video generation models, visually convincing

Abstract:Large-scale video generation models have made remarkable progress in semantic consistency and visual quality, producing videos that are increasingly coherent and visually convincing. Nevertheless, the dynamics induced by pixel-level fitting do not naturally accommodate the regularities that govern real-world motion and interaction, resulting in persistent shortcomings in physical plausibility. To address this limitation, we propose \textbf{PILA} (Physics-Informed Latent Alignment), a framework that injects physics-structured latent guidance into the frozen flow-matching dynamics of pretrained video models. Specifically, PILA first employs anchored field estimation to map frozen-generator latents into an operational physical attribute bank organized by field-proxy slots, using observable motion as a kinematic anchor for constructing less directly observed proxies. To handle the heterogeneity of real-world dynamics, PILA adopts a mixture-of-experts design over physical categories. Label-prior masked expert routing selects category-specific operator experts, whose refinements are regularized by operational residuals abstracted from physical relations. Finally, the refined proxies are fused into the physical attribute bank and decoded into a correction to the flow-matching vector field, injecting physics-aware guidance while preserving the visual prior of the pretrained backbone. With staged adapter training on Wan 2.1-1.3B and direct transfer of the learned adapter to Wan 2.2-14B, PILA achieves state-of-the-art results on VBench-2.0, VideoPhy-2, and PhyGenBench in both visual quality and benchmark-measured physical plausibility.

47. 【2606.04722】StrokeTimer: Robust Representation Learning for Ischemic Stroke Onset-Time Estimation from Non-contrast CT

Link: https://arxiv.org/abs/2606.04722

Authors: Weiru Wang, Susanne G.H. Olthuis, Elizaveta Lavrova, Robert J. van Oostenbrugge, Charles B.L.M. Majoie, Wim H. van Zwam, Ruisheng Su

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: major global disease, global disease, major global, Ischemic stroke, acute ischemic stroke

Comments: Early accepted at MICCAI 2026

Abstract:Ischemic stroke is a major global disease. Treatment decisions are highly time-sensitive, as eligibility for reperfusion therapies relies on the interval between stroke onset and intervention. However, the true onset time is often uncertain in clinical practice, necessitating imaging-based assessment of tissue age as a surrogate marker. Early ischemic changes on routinely acquired non-contrast CT (NCCT) are often subtle, and real-world clinical datasets exhibit pronounced onset-time class imbalance and center-scanner-related heterogeneity. In this work, we propose StrokeTimer, a fully automated framework for onset-time estimation in acute ischemic stroke. StrokeTimer integrates self-supervised disentanglement learning with energy-guided contrastive learning to capture subtle ischemic patterns while addressing long-tailed data distributions under acquisition variability. Onset time is categorized into three clinically relevant windows: <4.5 h, 4.5-6 h, and >6 h. Experimental results on a large multi-center NCCT dataset from two national cohorts, MR CLEAN Registry and MR CLEAN LATE, show that StrokeTimer achieves a macro AUC of 0.69 and a macro F1-score of 0.57, improving the strongest baseline by nearly 50% (p < 0.005). In this realistic, challenging setting, representative baseline approaches exhibit near-chance macro performance. Model explanations further highlight subtle gray-white matter blurring and hypodense regions consistent with established radiological biomarkers. These findings demonstrate the potential of StrokeTimer to support treatment decision-making in acute ischemic stroke. Code is available at this https URL.

48. 【2606.04710】Data Efficient Complex Feature Fusion Network For Hyperspectral Image Classification

Link: https://arxiv.org/abs/2606.04710

Authors: Maitreya Shelare, Atharva Satam, Poonam Sonar, Sneha Burnase

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Complex Feature Fusion, Feature Fusion Network, Attention-Based Dual-Branch Complex, Dual-Branch Complex Feature, Fusion Network

Comments: 10 pages, 3 figures

Abstract:This work presents a data-efficient variant of the Attention-Based Dual-Branch Complex Feature Fusion Network (CFFN) for hyperspectral image classification. The proposed model, termed DE-CFFN, retains the original two-stream structure: the Real-Valued Neural Network (RVNN) processes standard hyperspectral patches, while the Complex-Valued Neural Network (CVNN) handles their Fourier-transformed counterparts. The main contribution of this work lies in the feature extraction process and architectural enhancement. Factor Analysis is used for dimensionality reduction, offering improved latent feature representation over Principal Component Analysis. Additionally, both the RVNN and CVNN streams are structurally modified by successively halving the number of filters in the 3D convolutional layers to reduce complexity. The outputs of both branches are concatenated and passed through a Squeeze and Excitation (SE) block to enhance joint feature representation. Evaluated on the Pavia University and Salinas datasets, DE-CFFN achieves classification performance comparable to CFFN, while significantly reducing model size, memory consumption, and inference latency, making it suitable for real-time hyperspectral imaging applications.

49. 【2606.04706】ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection

Link: https://arxiv.org/abs/2606.04706

Authors: Xiaojing Chen (1), Xinyu Lu (1), Changtao Miao (2), Yunfeng Diao (3) ((1) Anhui University, (2) Ant Group, (3) Hefei University of Technology)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: AI-generated video detection, content authenticity, increasingly realistic, raising serious concerns, concerns about misinformation

Abstract:AI-generated videos are becoming increasingly realistic, raising serious concerns about misinformation, content authenticity, and media trust. Reliable AI-generated video detection is therefore essential for multimedia forensics, yet remains challenging due to the need to capture spatial artifacts, temporal dynamics, and generalize to evolving generative models. In this paper, we explore reconstruction error as a discriminative forensic cue for AI-generated video detection. By reconstructing input videos with a pretrained WF-VAE, we observe that real and generated videos exhibit distinguishable frame-wise reconstruction error patterns, suggesting that reconstruction errors can reveal their distributional discrepancies. However, extending reconstruction-based image detection to videos is non-trivial, since video reconstruction errors are temporally organized across frames and require semantic context for effective interpretation. To address these challenges, we propose ReConFuse, a reconstruction-guided semantic fusion framework for video-level AI-generated video detection. ReConFuse extracts reconstruction error cues from WF-VAE reconstructed videos, aligns them with multi-frame semantic features, and uses a Mamba-based module to model temporal evolution for video-level classification. Experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse.

50. 【2606.04705】Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

Link: https://arxiv.org/abs/2606.04705

Authors: Amirhossein Movahedisefat, Amirreza Fateh, Mohammad Reza Mohammadi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: challenging task due, Box Predictor, Semantic segmentation, critical yet challenging, challenging task

Abstract:Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at this https URL

51. 【2606.04701】Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

Link: https://arxiv.org/abs/2606.04701

Authors: Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: agents today assume, GUI agents today, static screen, GUI agents, today assume

Comments: preprint

Abstract:GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short-video applications violate this assumption, as their content keeps playing, and a competent user must decide what to watch and for how long. We formalize this task as Living-Screen-Native GUI agents and introduce LivingScreen, the first benchmark instantiating it on short-video platforms, with a faithful browser-based environment, a three-tier task suite, and metrics that jointly score accuracy and information efficiency. Evaluating extensive frontier models, we find that none reaches the human cost-accuracy performance, and that their dominant failure mode is over- and under-observation, pointing to observation control as a missing capability axis for future GUI agents. All data and code will be available at this https URL.

52. 【2606.04700】A New Angle on Bones: Robust Pose Estimation in X-Ray and Ultrasound

Link: https://arxiv.org/abs/2606.04700

Authors: Ron Keuth, Christoph Großbröhmer, Franziska Halm, Miriam Johann, Anne-Nele Schröder, Ludger Tüshaus, Mattias P. Heinrich, Lasse Hansen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical image analysis, key quantitative parameter, treatment planning, medical image, image analysis

Comments: Code and annotations for fracture angle assessment in radiographs: [this https URL](https://github.com/multimodallearning/RobustBonePoseEstimation)

Abstract:Measuring the angle between bone structures is a routine task in medical image analysis and provides a key quantitative parameter for diagnosis and treatment planning. Automated methods can reduce time and cost while improving reproducibility. In this work, we address automatic bone pose estimation using a learning-based point candidate proposal followed by a line model to extract axis parameters. Since conventional line models such as least squares are sensitive to outliers, we incorporate false-positive reduction strategies and robust fitting techniques, such as RANSAC and Hough transforms, to improve robustness. We evaluate our method on three clinically relevant paediatric angle estimation tasks: fracture fragment assessment in radiographs and ultrasound and developmental dysplasia of the hip evaluation in ultrasound using the Graf method. Our approach achieves mean errors of $4.1^\circ$, $5.4^\circ$, and $5.51^\circ$, respectively, not only remaining within the expected clinical observer variability, but also significantly outperforming landmark-based methods. Our code and annotations for fracture angle assessment in radiographs are publicly available on GitHub.
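The abstract's idea of replacing least squares with a robust line model can be illustrated with a plain RANSAC fit followed by an angle computation. This is a generic sketch, not the authors' pipeline: the point candidates, the tolerance `tol`, and the helper names `fit_axis_ransac` and `angle_between` are assumptions made for illustration.

```python
import math
import random

def fit_axis_ransac(points, n_iters=200, tol=1.0, seed=0):
    """Return the orientation (radians) of the line supported by the most inliers."""
    rng = random.Random(seed)
    best = []
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        dx, dy = x2 - x1, y2 - y1
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue
        # Keep points whose perpendicular distance to the candidate line is small.
        inliers = [(x, y) for x, y in points
                   if abs(dy * (x - x1) - dx * (y - y1)) / length <= tol]
        if len(inliers) > len(best):
            best = inliers
    # Refit on the inliers with the principal axis (total least squares).
    mx = sum(x for x, _ in best) / len(best)
    my = sum(y for _, y in best) / len(best)
    sxx = sum((x - mx) ** 2 for x, _ in best)
    syy = sum((y - my) ** 2 for _, y in best)
    sxy = sum((x - mx) * (y - my) for x, y in best)
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)

def angle_between(theta_a, theta_b):
    """Acute angle in degrees between two undirected lines."""
    diff = abs(theta_a - theta_b) % math.pi
    return math.degrees(min(diff, math.pi - diff))

# Hypothetical point candidates: one axis with a gross outlier, one clean axis.
fragment_a = [(float(i), 0.0) for i in range(10)] + [(5.0, 50.0)]
fragment_b = [(float(i), float(i)) for i in range(10)]
angle = angle_between(fit_axis_ransac(fragment_a), fit_axis_ransac(fragment_b))
```

An ordinary least-squares fit on `fragment_a` would be tilted by the single outlier, whereas the consensus step discards it before the final refit.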

53. 【2606.04699】Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification

Link: https://arxiv.org/abs/2606.04699

Authors: Yogesh Kumar, Vrushank Ahire, Mudasir Ganaie

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Support Vector Machine, Alzheimer disease, Proximal Support Vector, Early and accurate, detection of Alzheimer

Abstract:Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and do not consider the geometric relationships among them. This paper proposes two graph-guided Universum learning models, namely UG-GEPSVM and IUG-GEPSVM, for AD versus cognitively normal (CN) classification using structural MRI data. In the proposed framework, mild cognitive impairment (MCI) subjects are used as Universum data to provide intermediate information between AD and CN classes. A graph is constructed over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization is incorporated into the learning process in place of the conventional independent Universum penalty term. UG-GEPSVM integrates this regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework using a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants using ICA- and PCA-based features at five different noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC of 88.07% and maintains stable performance under increasing noise levels. Statistical tests further confirm the significance of the observed improvements.
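The graph construction described in this abstract can be sketched in a reduced form: connect the Universum samples with a minimum spanning tree, weight each tree edge by a Gaussian similarity, and form the Laplacian L = D - W. The sketch below is an assumption-laden illustration, not the paper's method. It omits the multi-hop propagation step, and the bandwidth `sigma` and function names are hypothetical.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) index pairs."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def graph_laplacian(points, sigma=1.0):
    """Laplacian L = D - W with Gaussian weights on MST edges only."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i, j in mst_edges(points):
        w = math.exp(-math.dist(points[i], points[j]) ** 2 / (2.0 * sigma ** 2))
        W[i][j] = W[j][i] = w
    return [[(sum(W[i]) if i == j else -W[i][j]) for j in range(n)] for i in range(n)]

# Hypothetical 2-D feature vectors standing in for MCI (Universum) samples.
universum = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
L = graph_laplacian(universum)
```

A regularizer of the form `u^T L u` then penalizes predictions that differ across strongly connected Universum samples, which is the usual role of a graph Laplacian term.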

54. 【2606.04688】MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation

Link: https://arxiv.org/abs/2606.04688

Authors: Jiale Xu, Wang Zhao, Ying Shan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: language-modeling fashion, gained attention, attention by tokenizing, Autoregressive mesh generation, sequences and training

Comments: CVPR 2026

Abstract:Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in a language-modeling fashion. However, existing approaches suffer from two fundamental limitations: (i) low tokenization efficiency, which yields long token sequences and prevents scaling to high-poly meshes, and (ii) absence of geometry-aware guidance, as generation is conditioned only on global shape embeddings rather than local surface cues. We introduce MeshWeaver, an autoregressive framework that treats mesh generation as a surface weaving process by directly predicting the next vertex instead of independent coordinates. At its core is a multi-level sparse-voxel encoder that injects geometric context into the generative process in three complementary ways: providing voxel features as vertex representations, guiding token prediction via cross-attention to voxel features, and serving as a structural scaffold that constrains generation around the input surface. Our hierarchical design enables coarse-to-fine vertex prediction in a single decoding step, while tightly coupling the generative model with 3D geometry. Extensive experiments demonstrate that MeshWeaver achieves a state-of-the-art compression ratio of 18%, can generate meshes with up to 16K faces, and significantly improves geometric fidelity over prior approaches.

55. 【2606.04684】Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation

Link: https://arxiv.org/abs/2606.04684

Authors: Mirza Muhammad Mobeen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: traffic monitoring settings, dynamic traffic monitoring, Automatic License Plate, Optical Character Recognition, usage of Automatic

Comments: 7 pages. Code: [this https URL](https://github.com/mobeen-pmo/Automatic-License-Plate-Recognition)

Abstract:The real-time hardships of video processing seriously limit the usage of Automatic License Plate Recognition (ALPR) in dynamic traffic monitoring settings. High-fidelity recognition under unconstrained conditions, e.g. drastic variations in illumination, acute camera scans, high vehicle speeds, and harsh physical concealment, is a problem that often leads to disjointed tracking paths and poor Optical Character Recognition (OCR) rates. To mitigate these weaknesses, the study proposes a five-stage, end-to-end algorithmic pipeline that combines deep-learning-based object detection, kinematic multi-object tracking, and geometric temporal data interpolation. The suggested architecture uses a YOLOv8 nano model to localize the vehicle at the first stage, and then the Simple Online and Realtime Tracking (SORT) algorithm builds spatial-temporal links between frames. A second, more specialized YOLOv8 detector localizes the license plate area and passes the cropped array to an EasyOCR chain constrained by positional syntax verification. More importantly, an offline temporal bounding-box interpolation mechanism is used to repair fragmented paths.
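The offline bounding-box interpolation step mentioned at the end of this abstract can be illustrated with simple linear interpolation between the nearest detected frames of one track. This is a minimal sketch under that assumption. The abstract does not specify the interpolation scheme, and the function name and data layout are hypothetical.

```python
def interpolate_track(detections):
    """Fill missing frames of one track by linear interpolation.

    detections maps frame index -> (x1, y1, x2, y2) for frames where the
    object was detected; gaps between detected frames are filled in.
    """
    frames = sorted(detections)
    filled = dict(detections)
    for f0, f1 in zip(frames, frames[1:]):
        box0, box1 = detections[f0], detections[f1]
        for f in range(f0 + 1, f1):
            t = (f - f0) / (f1 - f0)  # fraction of the way from f0 to f1
            filled[f] = tuple(a + t * (b - a) for a, b in zip(box0, box1))
    return filled

# Hypothetical track that was lost between frames 0 and 4.
track = {0: (0, 0, 10, 10), 4: (40, 0, 50, 10)}
repaired = interpolate_track(track)
```

Because this pass needs both the earlier and the later detection, it only works offline or with a delay, which matches the abstract's description of the step as offline.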

56. 【2606.04656】Instance-Level Post Hoc Uncertainty Quantification in Object Detection

Link: https://arxiv.org/abs/2606.04656

Authors: Chongzhe Zhang, Zifan Zeng, Qunli Zhang, Feng Liu, Zheng Hu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Object detection, autonomous driving, Post hoc uncertainty, safety-critical component, component of autonomous

Comments: 7 pages, 2 figures

Abstract:Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose Monte-Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.

57. 【2606.04621】MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer

Link: https://arxiv.org/abs/2606.04621

Authors: Weiyu Li, Antoine Toisoul, Tom Monnier, Roman Shapovalov, Rakesh Ranjan, Ping Tan, Andrea Vedaldi

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: present MeshFlow, generating artist-like, mesh

Comments: CVPR 2026 Highlight. Homepage: [this https URL](https://mesh-flow.github.io/), Code: [this https URL](https://github.com/facebookresearch/meshflow)

Abstract:We present MeshFlow, a new method for generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh topology. However, AR methods scale poorly because the inference cost is quadratic in mesh size. They also require discretizing the vertex coordinates, which introduces quantization errors. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified Flow transformer, generating all mesh vertices and edges in parallel. Our model generates meshes 18x faster than the fastest AR generator while also achieving excellent accuracy across standard mesh-generation metrics. Homepage: this https URL, Code: this https URL

58. 【2606.04613】Beyond Symmetric Alignment: Spectral Diagnostics of Modality Imbalance in Vision-Language Models in the Medical Domain

Link: https://arxiv.org/abs/2606.04613

Authors: Alessandro Gambetti, Qiwei Han, Cláudia Soares, Hong Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: failure remain limited, Vision-Language Models, struggle when applied, remain limited, diagnose this failure

Comments: 10 pages, 3 figures, 9 tables

Abstract:Vision-Language Models (VLMs) struggle when applied to medical image-text data, yet the tools available to diagnose this failure remain limited. Existing representation alignment metrics are symmetric, collapsing both modalities into a single score and hiding which modality drives cross-modal degradation. We introduce the Spectral Alignment Score (SAS), an asymmetric metric that projects both modalities onto the principal eigenbasis of an anchor modality and computes eigenvalue-weighted per-eigenmode correlations, resulting in directional scores whose difference quantifies modality information imbalance. We embed SAS within a benchmarking framework evaluating 15 VLMs across natural and medical image-text datasets alongside 6 alignment metrics and bidirectional retrieval. Our experiments show that medical images retain richer structural information than their paired clinical reports, a directional asymmetry invisible to all competing metrics, and that SAS achieves the strongest zero-label correlation with retrieval performance in the medical domain, positioning it as a practical diagnostic tool for clinical deployment. Code is available at this URL: this https URL.

59. 【2606.04604】COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations

Link: https://arxiv.org/abs/2606.04604

Authors: Zixu Li, Yupeng Hu, Zhiwei Chen, Haokun Wen, Xuemeng Song, Liqiang Nie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: targets locating specific, locating specific images, Composed Image Retrieval, Composed Image, targets locating

Comments: Accepted by IEEE TIP 2026

Abstract:Composed Image Retrieval (CIR) represents a challenging retrieval task that targets locating specific images through multimodal inputs. Despite recent progress in CIR techniques, prior approaches often overlook cases where images appear visually alike yet differ in attributes, potentially undermining both multimodal feature fusion and similarity modeling. To mitigate this limitation, we design a unified representation of cross-modal features based on attribute prototypes. Nevertheless, the task is far from straightforward, owing to three core issues: (1) entanglement in attribute-level semantics, (2) inconsistency across modalities, and (3) supervised signal missing. To tackle the above obstacles, we introduce a COMposed image retrieval network guided By attrIbute-based NEighbor Relations (COMBINER). Specifically, we first design an Adaptive Semantic Disentanglement module, which is capable of disentangling attribute features based on multimodal primitive features. Secondly, we propose a Unified Prototype-based Composition module, which can construct cross-modal unified prototypes (CUP) and facilitate multimodal feature composition. Finally, we introduce a Dual Relations Modeling module, which can mine pairwise and neighbor relations based on attribute similarity. Compared to traditional neighbor relations modeling CIR methods, COMBINER represents the first study addressing the phenomenon of visually similar but attribute-unrelated samples. It achieves a more accurate understanding of the semantic relations among samples by employing an attribute prototype-based similarity metric. Comprehensive experiments conducted on three benchmark datasets confirm the effectiveness of our proposed COMBINER. The implementation of our method will be accessed at this https URL

60. 【2606.04593】4D Reconstruction from Sparse Dynamic Cameras

Link: https://arxiv.org/abs/2606.04593

Authors: Kazuki Ozeki, Shun Kenney, Yuto Shibata, Eisuke Takeuchi, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Yuki Mitsufuji, Yoshimitsu Aoki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains fundamentally limited, recently advanced, depth ambiguity, fundamentally limited, limited by depth

Comments: Accepted by 4DV Workshop at CVPR 2026

Abstract:Although dynamic 3D (i.e., 4D) reconstruction from a monocular dynamic camera has recently advanced, it remains fundamentally limited by depth ambiguity. In this paper, we focus on an alternative practical way, i.e., sparse dynamic camera setup, where a handful of independently moving cameras capture the same subjects. While keeping capture costs low, this setup introduces multi-view constraints and remains practical for real-world video production such as sports, concerts, and TV shows. Despite its potential, our experiments show that naive extensions of existing monocular or dense-fixed camera-based methods are insufficient since they fail to resolve the complex spatiotemporal inconsistencies across views and time. To fill this gap, we propose a simple yet effective 3D track initialization method designed to ensure spatiotemporal consistency by integrating inter-camera feature matching with intra-camera point tracking. Additionally, we incorporate a noise-robust depth-ordering regularization loss and a spatiotemporally diverse batch sampling strategy to enhance optimization stability and cross-view generalization. Furthermore, to address the lack of standardized benchmarks for this task, we introduce LetCamsGo, a new real-world video dataset with 5 sequences across 4 diverse environments, recorded by three independently moving cameras and one fixed camera. Comprehensive benchmarking on LetCamsGo demonstrated that our proposed framework improves 4D reconstruction quality in dynamic regions compared with baselines, paving the way for a low-cost 4D reconstruction paradigm in the wild.

61. 【2606.04591】Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

Link: https://arxiv.org/abs/2606.04591

Authors: Hanbo Bi, Zhiqiang Yuan, Chongyang Li, Qiwei Yan, Zexi Jia, Jiapei Zhang, Xiaoyue Duan, Yingchao Feng, Jinchao Zhang, Jie Zhou

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-modal communication platforms, dialogues interleaving text, communication platforms, increasingly common, widespread adoption

Abstract:With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.

62. 【2606.04545】Impostor: An Agent-Curated Benchmark for Realistic AIGC Manipulation Localization

Link: https://arxiv.org/abs/2606.04545

Authors: Zhenliang Li (1), Yutao Hu (1), Qixiong Wang (2), Wenpeng Du (1), Hongxiang Jiang (2), Jiasong Wu (1), Xiaolong Jiang (2), Jungong Han (3) ((1) Southeast University, (2) Xiaohongshu Inc., (3) Tsinghua University)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: localized image manipulation, image manipulation detection, image manipulation, advances in generative, controllability of localized

Comments: 10 pages, 3 figures, 5 tables

Abstract:Recent advances in generative image editing have improved the realism and controllability of localized image manipulation, raising new challenges for image manipulation detection and localization (IMDL). However, existing IMDL benchmarks still have limitations in visual realism, manipulation diversity, and generator coverage, making it difficult to reflect recent trends in image manipulation. To address these limitations, we introduce Impostor, a high-quality AI-edited image manipulation localization dataset containing 100K manipulated images. Impostor is constructed by CraftAgent, a closed-loop agent framework that integrates scene perception, editing planning, manipulation execution, quality validation, and iterative reflection to automatically generate diverse and visually realistic manipulated images. Moreover, Impostor contains images generated by seven recent AIGC models across three manipulation types and includes multiple manipulated regions, providing a more comprehensive benchmark for AIGC-based IMDL. Furthermore, we propose PhaseAware-Net (PANet), a semantic-forensic framework that introduces local phase modeling and semantic-forensic consistency learning to better localize semantically plausible yet forensically disrupted manipulated regions. Extensive experiments show that Impostor poses significant challenges to existing large vision-language models (LVLMs) and specialized IMDL methods, while PANet achieves superior performance on Impostor and multiple public benchmarks.

63. 【2606.04528】Optical-Guided Neural Collapse for SAR Few-Shot Class Incremental Learning

Link: https://arxiv.org/abs/2606.04528

Authors: Fan Zhang, Sijin Zheng, Fei Ma, Qiang Yin, Yongsheng Zhou, Fei Gao, Xian Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Few-shot class-incremental learning, synthetic aperture radar, aperture radar imagery, radar imagery presents, imagery presents unique

Comments: 16 pages, 6 figures

Abstract:Few-shot class-incremental learning (FSCIL) in synthetic aperture radar (SAR) imagery presents unique challenges due to severe data scarcity and SAR-specific variability. In particular, strong azimuth sensitivity in SAR induces large intra-class variation and inter-class confusion, and sequential FSCIL updates further lead to catastrophic forgetting of previously learned classes. Inspired by neural collapse, we propose an optical-guided SAR FSCIL framework, which derives orthogonal feature subspaces from a data-rich optical ATR dataset and uses them as geometric priors to guide SAR feature learning. SAR features are projected onto these orthogonal subspaces via principal angle constraints, effectively transferring discriminative structure from the optical to the SAR domain. Specifically, our projection loss and the classifier loss optimized with a frozen simplex-ETF geometry jointly induce neural collapse by concentrating features around class means while maintaining large inter-class angles. We evaluate the approach on a benchmark comprising an optical ATR dataset and a SAR ATR dataset with 24 target classes, organized into a base training session and seven incremental sessions. Compared with recent FSCIL methods such as NCFSCIL, our method achieves the highest final accuracy and a favorable trade-off between final performance and performance degradation. Moreover, neural collapse metrics show improved intra-class compactness and inter-class separability, indicating that the learned features more closely approximate the ideal simplex-ETF geometry.
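
For readers unfamiliar with the simplex-ETF geometry mentioned in this abstract, the sketch below builds the standard simplex equiangular tight frame in pure Python. It illustrates the general definition only and is not the authors' code: the names `simplex_etf` and `dot` are hypothetical, and setting the embedding dimension equal to the number of classes is an assumption made for simplicity.

```python
import math


def simplex_etf(num_classes):
    """Return num_classes unit vectors whose pairwise cosine is -1/(num_classes - 1)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    # Row i is the i-th basis vector minus the mean of all basis vectors, rescaled to unit norm.
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# 24 classes, matching the number of target classes quoted in the abstract.
prototypes = simplex_etf(24)
```

Each prototype has unit norm and every pair has cosine -1/23, the largest equal separation that 24 vectors can have. In a frozen-classifier setup such as the one the abstract describes, vectors like these would presumably stay fixed while the feature extractor is trained to align class means with them.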

64. 【2606.04527】Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

Link: https://arxiv.org/abs/2606.04527

Authors: Yuxuan Bian, Zeyue Xue, Songchun Zhang, Shiyi Zhang, Weiyang Jin, Yaowei Li, Junhao Zhuang, Haoran Li, Jie Huang, Haoyang Huang, Nan Duan, Qiang Xu

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: present Echo Infinity, Echo Infinity, compress any-length history, present Echo, dynamically filter

Comments: Website: [this https URL](https://echo-team-joy-future-academy-jd.github.io/Echo-Infinity/)

Abstract:We present Echo-Infinity, an autoregressive (AR) framework for real-time infinite video generation that employs a learnable evolving memory to dynamically filter, abstract, and compress any-length history at constant cost. Existing methods mainly curate memory with predefined KV-cache schedules, fixed-ratio heuristic compression, or inference-time RoPE adaptation. These designs inevitably lose historical information and amplify compounding errors due to their limited cache window and disregard for autoregressive generation noise. Inspired by human memory consolidation, Echo-Infinity replaces handcrafted memory curation with learnable Memory Queries, which are updated by attention and a gating mechanism when past frames are evicted from the local window. The queries are optimized end-to-end with the video diffusion transformers (DiTs), forming an evolving memory that supports arbitrary compression ratios with constant computation independent of video length. They also act as a generalizable generation prior, improving quality even when only the optimized initial state is used. We further introduce Unified Relative RoPE Recipe, which anchors the sink frames to start from id 0 and lets the newest frame id grow at most to the DiTs' pretrained maximum temporal RoPE id throughout training and inference, freeing the model from the finite RoPE constraint and closing the train-test RoPE extrapolation gap. In long and short video generation, Echo-Infinity achieves state-of-the-art performance, and, to our knowledge, demonstrates promising 24-hour (1.3M frames) real-time rollouts for the first time, suggesting a practical path toward infinite video generation.

65. 【2606.04493】SFMambaNet: Spectral-Frequency Enhanced Selective State Space Model for Correspondence Pruning

Link: https://arxiv.org/abs/2606.04493

Authors: Zhihua Wang, Yanping Li, Yizhang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Correspondence pruning aims, Graph Neural Network, aims to identify, initial set, correspondence pruning network

Comments: none

Abstract:Correspondence pruning aims to identify inliers from an initial set of correspondences. Most existing Graph Neural Network (GNN)-based methods rely on geometric features mapped from coarse Euclidean coordinates, which struggle to capture the subtle geometric consistencies presented by inliers. While Mamba-based methods possess global receptive fields and long sequence modeling capabilities, they tend to accumulate substantial inconsistent features within the hidden state space, making it difficult to distinguish inliers from outliers. In this paper, we integrate frequency domain perception into this task for the first time and propose SFMambaNet, a novel Spectral-Frequency enhanced Mamba-based two-view correspondence pruning network. Our method is collaboratively composed of two components: First, we design a Local Spectral-Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi-scale Mamba processing to enhance the capture of subtle geometric consistencies and improve local feature discriminability. Building upon this, we design a Spectral-Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism within the state space, utilizing the frequency information provided by LSGA to explicitly suppress high-frequency noise accumulation within hidden states and mitigate the propagation of inconsistent features. This enhances inlier-outlier separability and achieves robust global context modeling capabilities with nearly linear complexity. Extensive experiments demonstrate that SFMambaNet outperforms current state-of-the-art methods on several challenging tasks. The code is available at this https URL.

66. 【2606.04480】IMPose: Interactive Multi-person Pose Estimation with Dynamic Correction Propagation

Link: https://arxiv.org/abs/2606.04480

Authors: Haoyang Ge, Jian Ma, Ziwen Wang, Qihe Wang, Jianqi Fan, Hongzhi Yu, Xingyu Chen, Kun Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: High-quality dynamic human, High-quality dynamic, human behavior mastery, precise motion kinematics, enable human behavior

Comments: none

Abstract:High-quality dynamic human pose annotation equips AI with precise motion kinematics to enable human behavior mastery, yet remains labor-intensive and time-consuming. Current annotation tools either lack temporal correction propagation or fail in multi-person scenarios, necessitating excessive manual intervention. In this paper, we introduce IMPose, an interactive tool for multi-person dynamic pose annotation. It features a dual-level tracking mechanism that propagates one-frame multi-person pose corrections from annotators across entire videos. The keypoint level propagates corrections over time via sequential modeling, while the instance level employs keypoint-aware embedding with relative positional encoding to maintain multi-person cross-frame consistency. To further improve robustness, IMPose maintains historical pose and instance cues in a trajectory bank, which enhances long-range temporal association and stabilizes annotation in challenging cases such as occlusion and motion blur. By converting sparse human corrections into dense and coherent pose trajectories, our framework significantly reduces repeated manual refinement across frames. Extensive experiments show that IMPose consistently achieves a strong accuracy-efficiency trade-off under different interaction budgets, demonstrating particular advantages in low-click annotation settings. IMPose achieves high-precision annotation with high efficiency, requiring only 27 clicks per 1,050-frame video on 3DPW and 3 clicks per tracklet per 84 frames on PoseTrack21. We further expand PoseTrack21 with 188K pose instances (3.55M keypoints) at a minimal cost of 10 annotators in 10 hours. The annotation tool, code, and extended dataset will be open-sourced.

67. 【2606.04479】Evaluating Reasoning Fidelity in Visual Text Generation

Link: https://arxiv.org/abs/2606.04479

Authors: Jiajun Hong, Jiawei Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: enabling applications including, render highly legible, applications including document, including document generation, enabling applications

Comments: Peer reviewed and accepted at CVPR 2026 at the GRAIL-V (Grounded Retrieval and Agentic Intelligence for Vision-Language) workshop (non-archival track)

Abstract:Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

68. 【2606.04469】Adaptive Calibration for Fair and Performant Facial Recognition

Link: https://arxiv.org/abs/2606.04469

Authors: Ryan Brown, Chris Russell

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: introduce Adaptive Calibration, maps cosine similarity, Adaptive Calibration, introduce Adaptive, Adaptive Calibration corrects

Comments: none

Abstract:We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects for a fundamental mismatch in cosine similarity, whereby the same distance can correspond to different match probabilities in different embedding regions. Our approach both improves overall performance and yields fairer calibration without requiring demographic metadata. It consistently dominates existing methods on both accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC provides a practical solution for equitable facial recognition, without requiring demographic group annotations and while improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down", where fairness comes at the cost of degraded performance for some groups.

69. 【2606.04461】ChannelTok: Efficient Flexible-Length Vision Tokenization

Link: https://arxiv.org/abs/2606.04461

Authors: Sukriti Paul, Arpit Bansal, Tom Goldstein

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-step generative decoders, Leading flexible vision, tokenizers achieve SOTA, extreme cost, relying on parameter-heavy

Comments: none

Abstract:Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone. Furthermore, employing a stochastic tail-dropping paradigm during training naturally forces channels to organize by semantic importance. This allows for flexible compression at inference by simply retaining the first $k$ channels, and naturally enables variable-length autoregressive image generation. We validate our approach through extensive experiments on ImageNet, demonstrating consistent quality across diverse token budgets. The results establish a new quality-efficiency frontier: our model achieves state-of-the-art perceptual quality (rFID 2.92) while being $8.6\times$ faster in decoding and $2.1\times$ smaller (159M params) than the next-best alternative. Our work establishes channel-wise tokenization as a powerful and practical paradigm for efficient visual representation. Project page: this https URL
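
The tail-dropping idea in this abstract can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, each channel is modeled as a plain list of floats, dropped channels are zeroed, and the number of kept channels is drawn uniformly, none of which the abstract specifies.

```python
import random


def tail_drop(channels, keep):
    """Zero every channel after the first `keep` channels (assumed masking scheme)."""
    return [
        list(ch) if i < keep else [0.0] * len(ch)
        for i, ch in enumerate(channels)
    ]


def stochastic_tail_drop(channels, rng):
    """Training-time step: pick a random prefix length and drop the tail."""
    keep = rng.randint(1, len(channels))  # uniform choice is an assumption
    return tail_drop(channels, keep), keep


def truncate(channels, k):
    """Inference-time compression: retain only the first k channel tokens."""
    return [list(ch) for ch in channels[:k]]


latent = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 channel tokens, 2 values each
masked, kept = stochastic_tail_drop(latent, random.Random(0))
compressed = truncate(latent, 2)
```

Because later channels are dropped more often than earlier ones during training, a decoder trained this way has an incentive to place the most important information in the leading channels, which is what makes prefix truncation usable at inference.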

70. 【2606.04457】Imagine Before You Draw: Visual Prompt Engineering for Image Generation

Link: https://arxiv.org/abs/2606.04457

Authors: Liyu Jia, Fengda Zhang, Jiachun Pan, Kesen Zhao, Saining Zhang, Wang Lin, Weijia Wu, Yue Liao, Aojun Zhou, Hanwang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Incorporating visual semantic, Incorporating visual, visual semantic representations, difficulty between text, semantic

Comments: none

Abstract:Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.

71. 【2606.04453】Radiomic Feature Selection Using Gradient Loss of Deep Neural Network for Lung Cancer Stage Detection

Link: https://arxiv.org/abs/2606.04453

Authors: Hina Shakir, Mohammad Mohatram, Javeed Hussain, Syed Rizwan Ali, Muhammad Irfan Memon

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: quantitative imaging biomarkers, Radiomics enables extraction, computer-aided cancer diagnosis, enables extraction, extraction of quantitative

Comments: none

Abstract:Radiomics enables extraction of quantitative imaging biomarkers from medical images and has become an important tool for computer-aided cancer diagnosis. However, radiomics datasets are typically high-dimensional with limited samples, making feature selection a critical step for building reliable predictive models. This study proposes a Gradient-Loss Recursive Feature Elimination (GL-RFE) framework that integrates gradient sensitivity analysis from a deep neural network to identify the most influential radiomic features for lung cancer stage detection. A total of 106 radiomic features were extracted from chest Computed Tomography (CT) scans using the PyRadiomics extension of the 3D Slicer platform. The proposed method evaluates feature importance by computing gradients of the network loss with respect to input features and recursively eliminates features with minimal contribution. The resulting top-15 radiomic features are used to train a deep neural network classifier for distinguishing early-stage and advanced-stage lung cancer. The proposed framework achieves strong classification performance, with accuracy of 90.22%, precision of 90.10%, recall of 90.24%, and F1-score of 90.16% on the test dataset. Visualization analyses, including correlation heat maps and distribution plots, further confirm reduced feature redundancy and improved class separability. Compared to conventional feature selection techniques, GL-RFE effectively captures nonlinear feature interactions and enhances model generalization. The presented protocol provides a reproducible and interpretable methodology for radiomics-based cancer stage detection and is particularly suitable for high-dimensional, small-sample biomedical datasets, with potential applications in other domains such as genomics and multimodal clinical analysis.
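
The recursive elimination loop described in this abstract can be illustrated with a small pure-Python sketch. It is only a stand-in under loud assumptions: a logistic regression replaces the paper's deep neural network, the data are synthetic, and all function names and hyperparameters are hypothetical. What it keeps from the abstract is the core idea of scoring each feature by the gradient of the loss with respect to that input and repeatedly removing the lowest-scoring one.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=200, lr=0.5):
    """Full-batch gradient descent on logistic loss (stand-in for the DNN)."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def input_gradient_scores(rows, labels, weights, bias):
    """Mean |d loss / d x_j|; for logistic loss this equals |(p - y) * w_j|."""
    scores = [0.0] * len(weights)
    for x, y in zip(rows, labels):
        err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
        for j, w in enumerate(weights):
            scores[j] += abs(err * w)
    return [s / len(rows) for s in scores]


def gradient_rfe(rows, labels, keep):
    """Retrain, score, and drop the least influential feature until `keep` remain."""
    active = list(range(len(rows[0])))
    while len(active) > keep:
        sub = [[x[j] for j in active] for x in rows]
        weights, bias = train_logistic(sub, labels)
        scores = input_gradient_scores(sub, labels, weights, bias)
        active.pop(scores.index(min(scores)))
    return active


# Synthetic data: the sign of feature 0 determines the label; features 1 and 2 are distractors.
rows = [
    [-2.0, 1.0, 0.3], [-1.5, -1.0, -0.3], [-1.0, 1.0, -0.3], [-0.5, -1.0, 0.3],
    [0.5, 1.0, 0.3], [1.0, -1.0, -0.3], [1.5, 1.0, -0.3], [2.0, -1.0, 0.3],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
selected = gradient_rfe(rows, labels, keep=1)
```

On this toy data the loop retains feature 0. In the setting the abstract describes, the same loop would start from 106 radiomic features and stop at 15, with the gradients obtained by backpropagation through the network rather than from a closed-form expression.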

72. 【2606.04437】INTACT: Ego-Guided Typed Sparse Evidence Retrieval for Heterogeneous Collaborative Perception

Link: https://arxiv.org/abs/2606.04437

Authors: Chen Li, Shengrong Yuan, Jialong Zuo, Xinzhong Zhu, Nong Sang, Changxin Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: models make intermediate, intermediate feature fusion, feature fusion difficult, Collaborative perception extends, deploy at scale

Comments: none

Abstract:Collaborative perception extends the perceptual range of autonomous vehicles by sharing information across agents, but heterogeneous sensors and perception models make intermediate feature fusion difficult to deploy at scale. Existing heterogeneous collaboration methods typically follow a translation-first paradigm: collaborator features must be aligned, adapted, or projected into an ego-compatible space before fusion. Such feature-compatibility contracts improve fixed-system performance, but they couple deployment to collaborator-specific adaptation and make newly joined heterogeneous agents costly to integrate. To address this gap, we propose INTACT, an ego-guided typed sparse evidence retrieval framework for heterogeneous collaborative perception. Instead of translating an entire collaborator feature map, INTACT lets the ego vehicle issue typed evidence queries that express suspected objects and evidence-deficient regions. Collaborators respond only with local evidence at queried locations, and the ego selects useful responses through sparse per-query routing and injects them through gated residual write-back. This changes the compatibility requirement from global feature-map interpretability to local, typed response comparability under ego-issued queries, enabling a zero-training heterogeneous insertion protocol in which the ego interface is trained once and new collaborators join through checkpoint merging. Extensive experiments on simulated and real-world heterogeneous collaborative perception benchmarks validate the effectiveness and deployability of INTACT. On OPV2V-H, INTACT achieves 80.1 AP70 with only 0.52M additional parameters and 18.0 $\log_2$ communication volume, corresponding to about 16$\times$ compression over dense feature transmission. On DAIR-V2X, INTACT achieves 43.8 AP50 under challenging real-world conditions.

73. 【2606.04436】3DThinkVLA: Endowing Vision-Language-Action Models with Latent 3D Priors via 3D-Thinking-Guided Co-training

Link: https://arxiv.org/abs/2606.04436

Authors: Jiaxin Shi, Xidong Zhang, Fucai Zhu, Zhe Li, Siyu Zhu, Weihao Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: framework that enables, reasoning, geometry perception, spatial reasoning implicitly, spatial

Comments: none

Abstract:We propose a 3D-thinking-guided co-training framework that enables vision-language-action (VLA) models to perform 3D spatial reasoning implicitly during action prediction. Our core insight is that 3D geometry perception and 3D spatial reasoning are distinct capabilities that can be disentangled and injected at different feature hierarchies. During training, three tightly coupled components work in concert primarily within the latent space: (1) To gain geometric priors, a latent 3D geometry perception module aligns intermediate visual features with a 3D foundation model, acquiring low-level geometric cues without architectural modifications to the VLM backbone. (2) Complementing this, an online 3D reasoning distillation module mitigates the prompt-induced reasoning gap via a shared reasoning anchor token. During 3D VLM co-training, this anchor is emitted as the first output token to robustly encode spatial priors. During VLA training, it serves as an input token inserted between the task and action instructions, transferring high-level spatial thinking from explicit teacher reasoning prompts to student action prompts without chain-of-thought text generation. (3) These disentangled geometric and reasoning features are then united by a spatially augmented action integration, which jointly injects them into the action-query tokens as hierarchical spatial conditions to prevent action shortcuts. At deployment, our method retains only its lightweight adapters to perform implicit 3D reasoning, discarding the 3D foundation model and the teacher branch used for supervision. Consequently, it operates purely on 2D images without 3D sensors, external models, or explicit text generation while preventing catastrophic forgetting of the pretrained VLM, achieving state-of-the-art performance on LIBERO, LIBERO-PLUS, SimplerEnv, and real-world manipulation tasks.

74. 【2606.04434】Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning

Link: https://arxiv.org/abs/2606.04434

Authors: Niloufar Alipour Talemi, Hossein Kashiani, Fatemeh Afghah

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large Language Models, Multimodal Large Language, Multimodal In-Context Learning, Large Language, interleaved image-text In-Context

Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract:Multimodal In-Context Learning (ICL) has emerged as a practical inference paradigm for Multimodal Large Language Models, where a small set of interleaved image-text In-Context Demonstrations (ICDs) conditions the model to solve new tasks. Despite its flexibility, multimodal ICL incurs high inference latency and suffers from instability due to sensitivity to demonstration formatting, ordering, and content. To address these limitations, we propose Hyper-ICL, a lightweight, training-based framework for demonstration-free multimodal ICL that reconstructs demonstration effects directly without requiring ICDs at inference time. Hyper-ICL learns a parameter-efficient low-rank logit-level adapter that calibrates attention distributions to better match demonstration-induced attention redistribution. To capture how demonstration influence varies across queries, we introduce a query-adaptive modulation mechanism that adaptively controls intervention strength at token level across layers and heads based on the current query. Finally, we propose a layer-wise hyperbolic anchor distillation loss that aligns intermediate student features to a demonstration-conditioned teacher via Lorentz geodesic distance. This loss encourages the student to reconstruct the demonstration-query relationships induced by ICDs. Extensive experiments across six different multimodal benchmarks (including VQAv2, OK-VQA, and COCO Caption) demonstrate that Hyper-ICL consistently improves accuracy and stability over vanilla ICL and existing state-of-the-art methods.

75. 【2606.04433】Stateful Visual Encoders for Vision-Language Models

Link: https://arxiv.org/abs/2606.04433

Authors: Zirui Wang, Junwei Yu, Adam Yala, David M. Chan, Joseph E. Gonzalez, Trevor Darrell

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: multi-turn agentic settings, Vision-language models, multi-turn agentic, visual, agentic settings

Comments: Project page: [this https URL](https://statefulvisualencoders.github.io/)

Abstract:Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: this https URL

76. 【2606.04432】DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation

Link: https://arxiv.org/abs/2606.04432

Authors: Thanh-Tung Le, Yunhan Zhao, Menglei Chai, Zhengyang Shen, Zhe Cao, Danhang Tang, Xiaohui Xie, Deying Kong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high inference cost, inference cost remains, transformers have achieved, Video diffusion transformers, Video diffusion

Comments: CVPR 2026, Findings Track

Abstract:Video diffusion transformers have achieved state-of-the-art visual quality, but their high inference cost remains a major bottleneck for real-time applications. Recent distillation frameworks produce autoregressive video diffusion models with reduced latency, yet these models still use a fixed number of denoising steps per frame, wasting computation on predictable frames and under-refining challenging ones. We present DSA, a confidence-guided adaptive computation framework for AR video diffusion. DSA introduces a lightweight confidence head, trained jointly with the generator under a distribution-matching distillation objective, to estimate per-frame denoising reliability. At inference, this confidence signal dynamically adjusts the number of diffusion steps: simple frames terminate early for speed, while complex frames receive additional refinement. Our method requires no extra video data, no heuristics, and little architectural modification. Experiments show that DSA achieves real-time autoregressive video generation, reaching 22.63 FPS with sub-second latency on H100 GPUs, while maintaining competitive or superior VBench quality compared to recent autoregressive and bidirectional video diffusion models. Our results demonstrate that confidence-guided adaptive sampling provides an effective and practical path toward interactive video generation.
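
The control flow of confidence-guided step allocation can be sketched as a simple early-exit loop. The code below is a toy illustration, not the DSA implementation: `denoise_step` and `confidence_head` are hypothetical callables, the threshold and step limits are assumed values, and a single float stands in for a frame's latent.

```python
def generate_frame(latent, denoise_step, confidence_head,
                   threshold=0.9, min_steps=1, max_steps=4):
    """Run denoising steps until the confidence estimate clears the threshold."""
    steps = 0
    while steps < max_steps:
        latent = denoise_step(latent, steps)
        steps += 1
        # Early exit: frames judged reliable stop consuming compute.
        if steps >= min_steps and confidence_head(latent) >= threshold:
            break
    return latent, steps


# Toy stand-ins: the "latent" is a residual error that each step halves,
# and confidence is simply one minus that error.
def toy_denoise(error, step_index):
    return error * 0.5


def toy_confidence(error):
    return 1.0 - error


_, easy_steps = generate_frame(0.15, toy_denoise, toy_confidence)
_, hard_steps = generate_frame(0.6, toy_denoise, toy_confidence)
```

With these toy functions the easy frame stops after 1 step and the hard frame after 3, so the total compute adapts to frame difficulty while staying bounded by `max_steps`.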

77. 【2606.04427】Implicit Fuzzification via Bounded Noise Injection for Robust Medical Image Segmentation

Link: https://arxiv.org/abs/2606.04427

Authors: Bisheng Tang, Zhangfeng Ma, Chuchu Zhai, Feng Dong, Yaoqun Wu, Ammar Oad, Yifei Peng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains fundamentally limited, sampling-induced information loss, Image segmentation remains, segmentation remains fundamentally, pixel-wise labeling

Comments: Under review

Abstract:Image segmentation remains fundamentally limited by boundary ambiguity arising from sampling-induced information loss and inherent uncertainty in pixel-wise labeling. Although encoder-decoder architectures such as U-Net achieve strong performance, they often produce overconfident predictions that fail to capture transition-region ambiguity. To address this issue, we propose NoiseUNet, a simple yet effective framework that injects bounded perturbations into skip connections to regularize cross-scale feature fusion. This mechanism enforces robustness to local feature variations and promotes boundary-aware representations. Theoretically, the perturbation induces an implicit fuzzification effect, yielding soft, data-driven memberships without requiring explicit fuzzy modeling. We further introduce ThyR, a real-world thyroid ultrasound dataset with inherently ambiguous boundaries. Experiments demonstrate that NoiseUNet consistently improves both segmentation accuracy and boundary fidelity.
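
A minimal sketch of bounded noise injection on a skip connection follows. It is an assumption-laden illustration rather than the NoiseUNet code: the function name `noisy_skip` is hypothetical, the noise is drawn uniformly from [-epsilon, epsilon] (the abstract only says the perturbation is bounded), the features are a flat list of floats, and the noise is assumed to be switched off at inference.

```python
import random


def noisy_skip(skip_features, epsilon, training, rng):
    """Add bounded uniform noise to encoder features before they reach the decoder."""
    if not training:
        # Assumed behavior: the perturbation is a training-time regularizer only.
        return list(skip_features)
    return [v + rng.uniform(-epsilon, epsilon) for v in skip_features]


rng = random.Random(0)
features = [0.2, -1.0, 3.5, 0.0]
perturbed = noisy_skip(features, epsilon=0.1, training=True, rng=rng)
clean = noisy_skip(features, epsilon=0.1, training=False, rng=rng)
```

Since every perturbed value stays within `epsilon` of the original, the decoder sees a small neighborhood around each encoder feature instead of a single point, which is one way to read the "implicit fuzzification" the abstract mentions.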

78. 【2606.04414】Motion-Guided Causal Disentanglement for Robust Multi-View Cine Cardiac MRI Diagnosis

Link: https://arxiv.org/abs/2606.04414

Authors: Chuankai Xu, Cristiane De Carvalho Singulane, Mohammad Abuannadi, Stephen Chandler, Jeremy Slivnick, Karolina Zareba, Jane Cao, Vidya Nadig, Fabio Fernandes, Seth Uretsky, Diego Perez de Arenaza, Amit Patel, Jianxin Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Multi-view cardiac magnetic, complementary anatomical information, noninvasive disease assessment, cardiac magnetic resonance, magnetic resonance

Comments: none

Abstract:Multi-view cardiac magnetic resonance (CMR) imaging provides complementary anatomical information and is widely used for noninvasive disease assessment. Recent transformer-based models have demonstrated strong representation learning capabilities for CMR analysis; however, they typically learn unified latent embeddings that entangle view-specific anatomical variations with disease-related features. Such entanglement biases classifiers toward structural attributes rather than view-invariant pathological patterns. This issue is exacerbated in low-data regimes, particularly for underrepresented cardiac conditions, where limited samples increase the susceptibility to shortcut learning and view-dependent decision boundaries. To address this, we propose a Motion-Guided View-Disease Disentanglement framework, MoViD, built upon a ViT-MAE backbone. The model explicitly factorizes latent representations into view-specific and disease-discriminative components using dual-branch supervised contrastive objectives and a gradient-reversal adversarial constraint that minimizes disease leakage into the view embedding. Additionally, an annotation-free temporal motion feature, derived from inter-frame difference maps, is introduced to localize the beating heart region and suppress background artifacts. A focal reweighting mechanism is incorporated into the contrastive loss to mitigate class imbalance. We evaluate the framework on a private clinical venous thrombosis dataset and two public benchmarks (MMs, MMs2). Across disease classification and cardiac segmentation tasks, our approach consistently outperforms standard transformer baselines and demonstrates competitive performance against large-scale pretrained foundation models, validating the efficacy of structural disentanglement in medical image analysis.

79. 【2606.04410】Ultra-Fast Neural Video Compression

Link: https://arxiv.org/abs/2606.04410

Authors: Jiahao Li, Wenxuan Xie, Zhaoyang Jia, Bin Li, Zongyu Guo, Xiaoyi Zhang, Yan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: superior compression ratio, demonstrated superior compression, prohibitive computational complexity, computational complexity remains, neural video codecs

Comments: CVPR 2026

Abstract:While neural video codecs (NVCs) have demonstrated superior compression ratio, their prohibitive computational complexity remains a critical barrier to real-world deployment. This paper introduces a chunk-based coding framework designed to significantly improve the rate-distortion-complexity trade-off. Instead of processing frames sequentially, our approach encodes a chunk of multiple frames into a single compact latent representation and decodes them simultaneously. This is enabled by cross-frame interaction modules for joint spatial-temporal modeling and frame-specific decoders for parallel reconstruction. This paradigm not only dramatically enhances coding throughput but also facilitates more effective modeling of long-term temporal correlations. To further boost speed, we propose a streamlined entropy coding mechanism that consolidates bit-stream interactions into a single step, substantially reducing decoding overhead. Building on these innovations, we present DCVC-UF (Ultra-Fast), a new NVC that sets a new SOTA in performance. Our experiments show that DCVC-UF can achieve ultra-fast encoding and decoding speeds, significantly outperforming previous leading codecs. DCVC-UF serves as a notable landmark in the journey of NVC evolution. The code is at this https URL.

80. 【2606.04409】An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

Link: https://arxiv.org/abs/2606.04409

Authors: Luoyidi Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Modern deep neural, deep neural networks, Modern deep, large parameter scales, achieved strong performance

Comments: 12 pages, 9 figures, 4 tables

Abstract:Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: this https URL.

81. 【2606.04385】Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models

Link: https://arxiv.org/abs/2606.04385

Authors: Shuwen Yu, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision-language foundation models, driven rapid progress, vision-only foundation models, vision-language foundation, Foundation models

Comments: Accepted at ICML 2026

Abstract:Foundation models have driven rapid progress in computer vision, yet the two dominant paradigms, vision-language foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We propose GPUA (Geometry-Preserving Unsupervised Alignment), a framework that integrates the complementary strengths of VFMs and VLMs. Inspired by cross-lingual alignment, GPUA treats VFM features as a visual language and learns an orthogonal mapping that translates the VFM space into the VLM semantic space, preserving geometry and narrowing the modality gap without labels or model parameter updates. GPUA is task-agnostic and requires only feature-level access to pretrained models. Experiments across diverse benchmarks demonstrate improved cross-model compatibility and strong gains in downstream zero-shot recognition and segmentation with negligible overhead. Code is available at this https URL
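The geometry-preservation claim rests on a general property of orthogonal maps: they leave inner products, and therefore cosine similarities and distances, unchanged. The sketch below only illustrates that property with a hand-picked 2D rotation. It is not GPUA's learning procedure, which the abstract describes as unsupervised; the names `rotation`, `apply`, and `cosine` and the toy features are assumptions made for illustration.

```python
import math


def rotation(theta):
    """Return a 2x2 rotation matrix, a simple example of an orthogonal map."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def apply(matrix, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Toy "source space" features (hypothetical values, not real model outputs).
features = [[1.0, 0.0], [0.6, 0.8], [-0.5, 2.0]]

# A fixed orthogonal map standing in for a learned one (assumption).
W = rotation(0.7)
mapped = [apply(W, f) for f in features]

# Pairwise cosine similarities and distances agree before and after mapping,
# up to floating-point error.
for i in range(len(features)):
    for j in range(i + 1, len(features)):
        before = cosine(features[i], features[j])
        after = cosine(mapped[i], mapped[j])
        print(f"pair ({i},{j}): cosine before={before:.6f} after={after:.6f}")
```

Because the map cannot stretch or shear the space, neighborhood structure in the source features carries over to the target space, which is presumably why an orthogonal constraint suits alignment without labels.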

82. 【2606.04373】Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

Link: https://arxiv.org/abs/2606.04373

Authors: Biao Qian, Yang Wang, Yong Wu, Jungong Han

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: addresses data security, accessing real data, data security concerns, synthetic samples, addresses data

Comments: Accepted to appear at ICML 2026, Seoul, Korea

Abstract:Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism compared to classical convolutional operation. However, previous DFQ arts for ViTs often suffer from a distribution mismatch between synthetic samples and input distribution expected by quantized models Q, resulting in the suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism is predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To these ends, we incorporate differential entropy maximum over patch similarity of synthetic samples, to decouple informative regions from noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample refreshing strategy comes up to endow MaskAQ with the capacity to continually adapt to the evolving state of Q throughout the training process, to preserve desirable mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at this https URL.

83. 【2606.04369】VT-3DAD: Cross-Category 3D Anomaly Detection via Visual-Text Normal Space Alignment

Link: https://arxiv.org/abs/2606.04369

Authors: Zi Wang, Katsuya Hotta, Yawen Zou, Koichiro Kamide, Yijin Wei, Chao Zhang, Jun Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: anomaly detection aims, unknown point cloud, point cloud belongs, aims to determine, normal references

Abstract:Few-shot cross-category 3D anomaly detection aims to determine whether an unknown point cloud belongs to a target normal category using only a few normal references. Existing training-based methods usually require category-wise optimization, while recent training-free methods based on multi-view CLIP visual features mainly rely on visual similarity and may be confused by geometrically similar categories. In this paper, we propose VT-3DAD, a training-free framework for cross-category 3D anomaly detection via Visual-Text Normal Space Alignment. Given few-shot normal references and a test point cloud, VT-3DAD first generates realistic multi-view depth maps and extracts view-wise features using a frozen CLIP visual encoder. The visual branch measures reference-test deviation in the multi-view feature space. In parallel, depth-aware and 3D-aware prompts are encoded by the frozen CLIP text encoder to construct textual normal anchors, which provide semantic normality constraints for the target category. The final anomaly score is obtained by fusing visual deviation from normal references and semantic deviation from the textual normal space. Experiments on the ShapeNetPart dataset demonstrate that VT-3DAD achieves state-of-the-art performance. In particular, VT-3DAD improves the one-shot average AUC-ROC from 92.49% to 94.80% compared with the visual-only baseline, while also reducing the average standard deviation from 5.64 to 3.41.

84. 【2606.04365】Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes

Link: https://arxiv.org/abs/2606.04365

Authors: Renjie Liang, Zhengkang Fan, Jinqian Pan, Chenkun Sun, Jiang Bian, Russell Terry, Jie Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Radiology reports describe, describe kidney lesions, methods predict, organ level, reports describe kidney

Abstract: Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient or organ level. We reformulate kidney CT characterization as a per-lesion set-prediction task: one model emits a variable number of lesions per kidney, each with four clinical attributes. We curated 2,619 CT volumes from 788 patients at one academic medical center, with multi-granularity side- and per-lesion labels, and used KiTS23 (489 cases) for zero-shot external validation. We propose LesionDETR, a DETR-style architecture with size-distance Hungarian matching and a hierarchical loss that aggregates per-slot outputs to side-level objectives. Across four input representations and six encoder initializations, two design choices dominate: a segmentation mask as an input channel, and same-domain abdominal pretraining (SuPreM); generic large-corpus pretraining is no better than random initialization. LesionDETR reaches bilateral side-level abnormality AUC $0.799 \pm 0.009$ on UF-Health and $0.817 \pm 0.072$ on KiTS23. A count-conditioned variant reaches per-lesion mAP $0.190 \pm 0.083$ on cystic lesions; rare solid-lesion AP stays at the noise floor, pointing to targeted data collection, not architecture, as the next bottleneck. The framework yields verified per-lesion predictions for downstream structured report generation.

85. 【2606.04364】Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

Link: https://arxiv.org/abs/2606.04364

Authors: Dhanesh Ramachandram

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Concept bottleneck models, predict a layer, predicting a class, decisions auditable, Concept bottleneck

Abstract:Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to $2.9\%$.
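Adding a prior "in log space into the attention logits" amounts to multiplying the attention weights by that prior before normalization, since softmax(l + log p) is proportional to exp(l) * p. The sketch below shows the mechanism on a 3x3 patch grid. It assumes an isotropic, fixed Gaussian for brevity, whereas the abstract describes a learnable two-dimensional Gaussian; the grid size, `mean`, `sigma`, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def log_gaussian_prior(grid, mean, sigma):
    """Unnormalized isotropic log-Gaussian evaluated at each patch centre."""
    return [
        -((x - mean[0]) ** 2 + (y - mean[1]) ** 2) / (2 * sigma ** 2)
        for x, y in grid
    ]


def attention_with_prior(logits, log_prior):
    """Add the log-prior to the attention logits, then normalize."""
    return softmax([l + p for l, p in zip(logits, log_prior)])


# Patch centres of a 3x3 grid, row by row (hypothetical layout).
grid = [(x, y) for y in range(3) for x in range(3)]

# Uninformative content logits: without a prior, every patch gets weight 1/9.
logits = [0.0] * len(grid)

# A prior centred on the top-right patch, e.g. a dataset-average part location.
log_prior = log_gaussian_prior(grid, mean=(2, 0), sigma=1.0)
weights = attention_with_prior(logits, log_prior)

print(grid[weights.index(max(weights))])
```

With identical content logits, every part query would otherwise produce the same attention map; giving each query a different prior mean is one way the symmetry among queries can be broken.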

86. 【2606.04351】Video2LoRA: Parametric Video Internalization for Vision-Language Models

Link: https://arxiv.org/abs/2606.04351

Authors: Manan Suri, Sarvesh Baskar, Dinesh Manocha

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: frame occupies hundreds, Processing video, occupies hundreds, Processing, video

Abstract:Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Video2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Video2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Video2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Video2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query TTFT by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.
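The abstract does not spell out how adapters "compose in rank space". One natural reading, assumed here and not confirmed by the source, is concatenation along the rank dimension: stacking the factors of two LoRA adapters gives a single adapter of rank r1 + r2 whose weight update equals the sum of the two individual updates, because B1 A1 + B2 A2 = [B1 | B2] [A1 ; A2]. The sketch below checks this identity on tiny hand-written matrices; all helper names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def hconcat(a, b):
    """Concatenate two matrices side by side (along columns)."""
    return [ra + rb for ra, rb in zip(a, b)]


def vconcat(a, b):
    """Stack two matrices on top of each other (along rows)."""
    return a + b


# Two rank-1 adapters for a 2x3 weight matrix: update_i = B_i @ A_i.
B1, A1 = [[1.0], [2.0]], [[1.0, 0.0, -1.0]]
B2, A2 = [[0.5], [-1.0]], [[2.0, 1.0, 0.0]]

# Sum of the two separate weight updates.
summed = add(matmul(B1, A1), matmul(B2, A2))

# One rank-2 adapter built by concatenating factors along the rank axis.
stacked = matmul(hconcat(B1, B2), vconcat(A1, A2))

print(summed)
print(stacked)
```

Under this assumption, adapters generated for separate video segments could be merged without touching the base model, at the cost of a rank that grows with the number of segments.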

87. 【2606.04349】MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

Link: https://arxiv.org/abs/2606.04349

Authors: Yue Wu, Changyuan Wang, Zixuan Wang, Shilin Ma, Yansong Tang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Omni-modal Large Language, Large Language Models, Conventional Post-Training Quantization, Omni-modal Large, Large Language

Abstract:Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism safeguards outlier magnitudes while maintaining high-precision discretization for dense inliers, thereby preserving accurate discretization across diverse modal distribution. Complementing this, we propose Morphology-Directed Quantization Function Optimization (MDQFO) to co-optimize the quantization grid with the bias mask, ensuring fine-grained alignment across modalities. Extensive evaluations on Qwen2.5-Omni across benchmarks like MMMU and Video-MME demonstrate our approach's superiority. Notably, our W4A4 model achieves 76.63% on ScienceQA, significantly outperforming SOTA W4A4 methods and surprisingly surpassing the W4A16 baseline, which fully demonstrates the exceptional accuracy-efficiency trade-off of our framework.

88. 【2606.04345】HYolo: An Intelligent IoT-Based Object Detection System Using Hypergraph Learning

Link: https://arxiv.org/abs/2606.04345

Authors: Isha Abid, Fawad Khan, Muhammad Khuram Shahzad

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: paper presents HYolo, YOLO architecture, integrates hypergraph learning, paper presents, framework that integrates

Comments: 8 pages, multiple figures

Abstract:This paper presents HYolo, an intelligent IoT-based object detection framework that integrates hypergraph learning into the YOLO architecture. Traditional YOLO-based object detection models primarily capture pairwise feature interactions and may fail to model complex high-order relationships among objects and contextual features. To address this limitation, HYolo incorporates hypergraph learning to capture richer contextual dependencies and improve object representation. Experimental evaluation on the COCO dataset demonstrates significant performance improvements over baseline YOLO models. The proposed approach achieves approximately 12% improvement in mAP@50 while enhancing overall detection accuracy and robustness. By modeling high-order feature relationships, HYolo provides improved contextual understanding and more reliable object detection performance in IoT-based environments. The results indicate that integrating hypergraph learning into object detection pipelines offers a promising direction for intelligent and context-aware IoT vision systems.

89. 【2606.04343】Robust Multi-view Clustering against Imperfect Information

Link: https://arxiv.org/abs/2606.04343

Authors: Zhichao Huang, Haochen Zhou, Hao Wang, Mouxing Yang, Xi Peng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Real-world multi-view data, imperfect information problem, Noisy Correspondences, information problem, Incomplete Views

Comments: 19 pages, 11 figures

Abstract:Real-world multi-view data always suffer from imperfect information problem, where the view-specific observations are absent (i.e., Incomplete Views, IV) and cross-view correspondences are mismatched (i.e., Noisy Correspondences, NC) for certain instances. As a remedy, numerous IV- and NC-oriented multi-view clustering (MvC) methods have been proposed, which however require either reliable correspondences or sufficiently complete instances, thus stopping short of addressing the imperfect information problem. In contrast, we observe that both IV and NC challenges originate from the same issue of imperfect cross-view counterpart information, where the counterpart of an anchor instance in another view might be either unavailable or unreliable. Based on the observation, we propose a novel robust MvC framework, termed Posterior-guided Latent Counterpart Inference (PLCI), which could handle both IV and NC in a unified manner. Specifically, PLCI formulates the desired cross-view counterpart of each anchor instance as a latent variable, and integrates both instance-level reliability and prototype-level semantic transport to infer the posterior distribution of the latent counterpart. Extensive experiments on six widely-used multi-view datasets against 10 state-of-the-art MvC methods demonstrate the effectiveness of PLCI for tackling the imperfect information problem. The code will be released upon acceptance.

90. 【2606.04323】Answer Self-Consistency with Margin-Triggered Question Re-Arbitration for the CVPR 2026 VidLLMs Challenge

Link: https://arxiv.org/abs/2606.04323

Authors: Tomoya Miyazawa, Hiroyasu Okuno

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: VidLLMs Challenge, present our solution, track evaluates visual, average accuracy, CVPR

Abstract:In this report, we present our solution for Track 2 of the CVPR 2026 VidLLMs Challenge. This track evaluates visual relational reasoning in videos, where models must infer relations that are not always explicitly visible. We propose Answer Self-Consistency with Margin-Triggered Question Re-Arbitration (ASC-MQRA), a training-free test-time reasoning framework built on a multimodal reasoning model. The core ASC component performs multiple stochastic video question-answering runs and aggregates their answer choices through answer-level self-consistency. This substantially improves over single-pass inference and forms our final test submission. We further study MQRA, a conditional re-arbitration module for low-margin examples where the first-stage vote distribution indicates uncertainty. Our vote-margin analysis shows that low-margin examples often retain the ground-truth answer among the top candidates, motivating MQRA to narrow the candidate set and re-watch the video only over the retained candidates. On validation, MQRA further improves over ASC, indicating that low-margin vote distributions can provide a useful uncertainty signal. On test, however, MQRA slightly degrades performance relative to ASC, suggesting that re-arbitration is sensitive to the size and category distribution of the triggered subset. Our final test submission therefore uses ASC without re-arbitration, achieving 72.73 average accuracy and 78.34 category-wise macro average accuracy on validation, and 81.16 average accuracy and 80.91 category-wise macro average accuracy on test. This report details our prompting strategy, implementation setup, ablation studies, and diagnostic analyses. The code is available at this https URL
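The two stages described in the abstract, answer-level voting and a margin test that decides whether to re-arbitrate, can be sketched in a few lines. This is a minimal illustration under assumptions: the margin definition (vote-share gap between the top two answers), the `threshold` of 0.2, and keeping `k=2` candidates are all hypothetical choices, not values taken from the report.

```python
from collections import Counter


def self_consistency(answers):
    """Aggregate sampled answer choices by majority vote.

    Returns the winning answer, the vote-share margin between the top two
    answers, and the full (answer, count) ranking.
    """
    ranking = Counter(answers).most_common()
    winner, top_votes = ranking[0]
    runner_up_votes = ranking[1][1] if len(ranking) > 1 else 0
    margin = (top_votes - runner_up_votes) / len(answers)
    return winner, margin, ranking


def needs_rearbitration(margin, threshold=0.2):
    """Flag low-margin examples (threshold is a hypothetical value)."""
    return margin < threshold


def retained_candidates(ranking, k=2):
    """Keep only the k most-voted answers for a second look at the video."""
    return [answer for answer, _ in ranking[:k]]


# Five stochastic runs that mostly agree: the vote is accepted as is.
confident = ["B", "B", "A", "B", "C"]
winner, margin, ranking = self_consistency(confident)
print(winner, margin, needs_rearbitration(margin))

# Five runs that split between two answers: the example would be re-arbitrated
# over the retained candidates only.
uncertain = ["A", "B", "A", "B", "C"]
winner2, margin2, ranking2 = self_consistency(uncertain)
print(winner2, margin2, needs_rearbitration(margin2), retained_candidates(ranking2))
```

The report's own finding is that the second stage helped on validation but slightly hurt on test, so the trigger threshold and the size of the triggered subset are worth treating as tunable rather than fixed.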

91. 【2606.04319】PureLight: Learning Complex Luminaires with Light Tracing

Link: https://arxiv.org/abs/2606.04319

Authors: Pedro Figueiredo, Zixuan Li, Beibei Wang, Miloš Hašan, Nima Khademi Kalantari

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: propose a neural, exit surfaces, complex light transport, neural formulation, small emitters enclosed

Comments: 9 pages, 10 figures

Abstract:We propose a neural formulation for estimating the appearance of complex luminaires. We focus on challenging luminaires with complex light transport (e.g., small emitters enclosed by multiple specular layers) that are difficult for (bidirectional) path tracing. To this end, we use light tracing to construct paths from emitters to the exit surfaces and formulate appearance estimation as a distribution learning problem. Specifically, we model the probability density function (pdf) of outgoing radiance on the exit surfaces using a large normalizing flow network, and recover the outgoing radiance as the product of the estimated pdf and flux. To enable efficient inference, we distill the learned appearance into a lightweight MLP that directly estimates radiance on the exit surfaces. We additionally train a sampling network for effective direct illumination computation from the luminaire, and a blending network to composite the luminaire into the scene. Our formulation makes it feasible to render challenging luminaires using low sample counts in arbitrary scenes.

92. 【2606.04301】XSSR: Cross-Domain Self-Supervised Representative Selection for Efficient Annotation in Medical Image Segmentation

Link: https://arxiv.org/abs/2606.04301

Authors: Byunghyun Ko, Aleksei Anisimov, Kobe Ke, Suhas Bharthepude, Jeongkyu Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Acquiring labeled medical, medical image data, labeled medical image, clinical site, Acquiring labeled

Comments: Accepted to the Third International Conference on AI in Healthcare (AIiH 2026). This is the preprint version of the paper

Abstract:Acquiring labeled medical image data is resource-intensive and a challenge further exacerbated in cross-domain scenarios where source and target datasets differ in imaging equipment, population, or clinical site. This study introduces XSSR (Cross-Domain Self-Supervised Representative Selection), a framework designed to minimize annotation effort in the target domain while maintaining robust segmentation performance. XSSR comprises three stages: first, a Masked Autoencoder (MAE) is trained on unlabeled source data to establish a shared embedding space without requiring target labels; second, a greedy selection algorithm scores unlabeled target samples based on a composite density, novelty, and diversity criterion; and third, a U-Net segmentation model is trained exclusively on the selected subset. The novelty-diversity trade-off parameter, alpha, is automatically calibrated by minimizing embedding-space coverage, eliminating manual tuning. We evaluate XSSR on three public benchmarks: Chest X-ray, RIGA+ retinal fundus imaging, and multi-site Prostate MRI, each under a fixed 5% annotation budget. XSSR achieves 99.3% of full-data performance on Chest X-ray using only 22 labeled samples, surpasses random selection by up to 2.5 Dice points on Prostate MRI, and consistently outperforms the CoreSet baseline by 0.4 to 1.2 Dice points across all datasets. Ablation studies indicate that diversity is the most influential scoring component, and per-site analysis shows that performance correlates with scanner similarity to the source domain.

93. 【2606.04299】Efficient and Training-Free Single-Image Diffusion Models

Link: https://arxiv.org/abs/2606.04299

Authors: Haojun Qiu, Kiriakos N. Kutulakos, David B. Lindell

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: internal structure, single reference image, single reference, image, diffusion

Comments: CVPR 2026; Project Page: [this https URL](https://haojunqiu.github.io/efficient-SID/)

Abstract:We consider the problem of generating images whose internal structure -- defined by the distribution of patches across multiple scales -- matches that of a single reference image. Recent approaches address this problem by training a diffusion model on a single image. But even in this setting, training is computationally expensive and requires hours of optimization. Instead, we model the image using a dataset of its patches at different scales. As this dataset is finite and the dimensionality of its patches is small, the score function for a noisy patch can be computed tractably using an optimal, closed-form denoiser, eliminating the need for neural network training. We integrate this patch-based denoiser into an efficient, training-free image diffusion model, and we describe how our method connects to classical patch-based image restoration techniques. Our approach achieves state-of-the-art generation quality and diversity compared to trained single-image diffusion models, and we demonstrate applications, including unconditional image generation, text-guided stylization, image symmetrization, and retargeting. Further, we show that our approach is compatible with latent space diffusion, and we show multiple additional acceleration techniques to achieve megapixel single-image generation in one second, and gigapixel generation in minutes.
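For a finite set of patches, the optimal (minimum mean-squared-error) denoiser has a well-known closed form: the posterior mean is a softmax-weighted average of the dataset patches, with weights driven by squared distance to the noisy input. The sketch below assumes additive Gaussian noise of standard deviation `sigma` with no input scaling and a uniform prior over patches; it omits the multi-scale handling and acceleration techniques the abstract mentions. The function name and the toy patches are hypothetical.

```python
import math


def optimal_denoise(noisy, patches, sigma):
    """Posterior mean E[x0 | noisy] for x0 uniform over `patches`.

    Assumes noisy = x0 + sigma * eps with eps ~ N(0, I).
    """
    # Log-weights: -||noisy - patch||^2 / (2 sigma^2), one per dataset patch.
    logits = [
        -sum((a - b) ** 2 for a, b in zip(noisy, p)) / (2 * sigma ** 2)
        for p in patches
    ]
    # Stable softmax over the patch dataset.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    # Weighted average of the patches, dimension by dimension.
    return [
        sum(w * p[d] for w, p in zip(weights, patches)) / total
        for d in range(len(noisy))
    ]


# A toy "dataset" of two flattened patches (hypothetical values).
patches = [[0.0, 0.0], [1.0, 1.0]]

# Low noise: the estimate snaps to the nearest patch.
low_noise = optimal_denoise([0.9, 1.1], patches, sigma=0.1)

# High noise: the estimate falls back to roughly the dataset mean.
high_noise = optimal_denoise([0.9, 1.1], patches, sigma=100.0)

print(low_noise, high_noise)
```

Under the same assumptions, the score of the noisy distribution follows as (denoised - noisy) / sigma**2, which is what lets this estimator stand in for a trained network inside a diffusion sampler.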

94. 【2606.04291】A Cookbook of 3D Vision: Data, Learning Paradigms, and Application

Link: https://arxiv.org/abs/2606.04291

Authors: Hongyang Du, Zongxia Li, Dawei Liu, Runhao Li, Haoyuan Song, Qingyu Zhang, Yubo Wang, Jingcheng Ni, Shihang Gui, Congchao Dong, Tao Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasingly diverse data, rapidly evolved, driven by increasingly, diverse data representations, increasingly diverse

Comments: Accepted to the CVPR 2026 OpenSUN3D Workshop. Official version available at CVF Open Access. [this https URL](https://openaccess.thecvf.com/content/CVPR2026W/OpenSUN3D/html/Du_A_Cookbook_of_3D_Vision_Data_Learning_Paradigms_and_Application_CVPRW_2026_paper.html)

Abstract: 3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and modeling strategies. Yet the field remains fragmented across representations and benchmarks, making it difficult to develop unified perspectives on efficiency, fidelity, and scalability. This work provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within a single conceptual map. We begin by analysing the principal structural representations of 3D data (point clouds, meshes, voxels, and 3D Gaussians) along with their acquisition pipelines. We then examine how dataset design, benchmark construction, and supervision regimes shape recent advances, spanning 2D-supervised 3D learning, implicit neural representations, and 4D world modeling. Through this integrative lens, we clarify the relationships among representations, learning paradigms, and downstream tasks in reconstruction, generation, and video modeling, offering a consolidated view of emerging trends toward balancing efficiency and fidelity and toward multimodal geometric grounding.

95. 【2606.04282】FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

Link: https://arxiv.org/abs/2606.04282

Authors: Eshika Khandelwal, Jingjing Pan, Mingfang Zhang, Quan Kong, Lorenzo Garattoni, Hilde Kuehne

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual question answering, large language models, free-form vision-language tasks, Multimodal large language, question answering

Abstract:Multimodal large language models (MLLMs) are predominantly evaluated on free-form vision-language tasks such as visual question answering, captioning, and summarization. However, their practical use is rapidly expanding to more structured computer vision settings, where users prompt models to perform localization-centric tasks such as object detection, often within larger agentic or decision-making systems. Despite this shift, there is currently no standardized benchmark that systematically evaluates these capabilities at scale. In this work, we introduce the first comprehensive benchmark specifically designed to assess the promptable localization abilities of generalist MLLMs. Our benchmark spans four core task categories: object detection, referring expression detection, instance-level detection, and video-based detection. To enable consistent and fair evaluation, we develop a unified framework that standardizes inputs, enforces parsable bounding box outputs, and defines transparent evaluation protocols across tasks. Using this suite, we evaluate a diverse set of open-source and proprietary MLLMs, providing an in-depth analysis of their performance and limitations. Beyond accuracy, we examine models' ability to adhere to output format specifications, showing that current systems are highly sensitive to formatting constraints and often fail to generalize even to minor variations. Our results highlight both the strengths and shortcomings of state-of-the-art MLLMs in localization settings, and point toward important directions for improving multimodal model design and evaluation.
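Evaluating promptable localization needs two pieces: a parser that turns free-form model text into boxes, and an overlap metric to score them. The sketch below shows both, and also why format adherence matters, since a strict parser silently drops a correct box written in an unexpected notation. The `[x1, y1, x2, y2]` convention and every name here are assumptions for illustration; the abstract does not specify the benchmark's actual output format or protocol.

```python
import re

# One number: optional sign, digits, optional decimal part.
_NUM = r"(-?\d+(?:\.\d+)?)"

# Assumed output convention: [x1, y1, x2, y2] in square brackets.
BOX_PATTERN = re.compile(
    r"\[\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\]"
)


def parse_boxes(text):
    """Extract well-formed (x1, y1, x2, y2) boxes from model output text."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = map(float, match.groups())
        # Discard degenerate boxes with non-positive width or height.
        if x2 > x1 and y2 > y1:
            boxes.append((x1, y1, x2, y2))
    return boxes


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Output that follows the assumed format is parsed and can be scored.
followed = parse_boxes("The cat is at [10, 20, 110, 220].")

# The same box in parentheses is lost by the strict parser.
deviated = parse_boxes("The cat is at (10, 20, 110, 220).")

print(followed, deviated)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Reporting the parse-failure rate alongside accuracy separates localization errors from formatting errors, which is the distinction the abstract draws.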

96. 【2606.04271】StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

Link: https://arxiv.org/abs/2606.04271

Authors: Stepan Konev

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: map sensor inputs, sensor inputs directly, Autonomous driving, motion forecasting, shifted from modular

Abstract:Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at this https URL.

97. 【2606.04269】Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Link: https://arxiv.org/abs/2606.04269

Authors: Yilong Wang, Cheng Qian, Edward Johns

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deformable object manipulation, partially observable states, Deformable object, multiple valid manipulation, due to high-dimensional

Comments:

Abstract:Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at this https URL.

98. 【2606.04264】UniCanvas: A Diffusion-base Unified Model for Text-in-Image Joint Generation

Link: https://arxiv.org/abs/2606.04264

Authors: Zeyuan Yang, Hao-Wei Chen, Xueyang Yu, Yuncong Yang, Haoyu Zhen, Ziqiao Ma, Maohao Shen, Chuang Gan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision-language models handling, remarkable progress, models, Recent years, diffusion models

Comments:

Abstract:Recent years have seen remarkable progress in unified vision-language models handling both multimodal understanding and generation within a single architecture. While autoregressive VLMs can reason across modalities, they fail to generate high-quality images. In contrast, diffusion models produce photorealistic visuals yet struggle to generate coherent text, making it challenging to develop a single unified model that can seamlessly handle both visual and text generation. Recent advances suggest that language can be effectively embedded within visual representations, allowing models to reason about textual semantics directly from images. To this end, we propose UniCanvas, a first attempt that unifies diffusion models to generate interleaved multimodal contents through text-in-image generation. Diffusion models naturally capture transformations on a shared pixel canvas, which can be viewed as world models of visual change. Instead of producing discrete text tokens, the model learns to represent language as visual patterns inside images, leveraging its inherent multimodal embedding space. This design allows the model to "draw" text naturally within a single pixel canvas during image synthesis, achieving seamless multimodal generation. Experiments demonstrate that UniCanvas improves performance over previous unified models, positioning text-in-image generation with diffusion models as a promising unified multimodal generation paradigm.

99. 【2606.04261】Can Generalist Agents Automate Data Curation?

Link: https://arxiv.org/abs/2606.04261

Authors: Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Keywords: practitioners iteratively propose, Curating training data, noisy benchmark feedback, Curating training, modern AI development

Comments: Preprint

Abstract:Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

100. 【2606.04251】SBP-Net: Learning Thin Structure Reconstruction with Sliding-Box Projections

Link: https://arxiv.org/abs/2606.04251

Authors: Ofir Gilad, Andrei Sharf

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: scale variation, Reconstructing thin, complex geometry, challenging due, Reconstructing

Comments: Accepted to IEEE ICIP 2026, 6 pages, 4 figures

Abstract:Reconstructing thin 3D structures is challenging due to their sparsity, scale variation, and complex geometry. Such structures arise in a wide range of domains, including medical imaging of vascular systems and industrial pipe systems. While recent neural methods perform well on dense surfaces, they often fail to recover fine thin geometries. We propose a reconstruction approach based on local depth projections, which provide an efficient and informative 2D representation of thin structures. Specifically, we traverse the 3D model with a sliding box to generate local orthographic depth projections, which are processed by a neural network to reconstruct missing thin structures in 2D. The local reconstructions are subsequently fused back into the 3D model to produce a coherent and detailed shape. Experiments on pulmonary artery reconstruction from CT volumes and industrial pipeline recovery from synthetic and real scans demonstrate improved preservation of fine structural details over existing methods.
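
The sliding-box projection step described in the abstract can be illustrated with a small NumPy sketch. It is only a rough reading of the idea, not the authors' code: the function name, the box size, the stride, the choice of the z axis as the projection direction, and the use of a binary voxel grid as input are all assumptions made for illustration. The neural 2D reconstruction and the fusion back into 3D are not shown.

```python
import numpy as np


def sliding_box_depth_projections(volume, box=16, stride=8):
    """Yield (origin, depth_map) for every box position in a binary voxel volume.

    Hypothetical helper: the depth map is an orthographic projection along
    the first (z) axis. Each pixel stores the index of the first occupied
    voxel inside the box, or `box` when the ray hits nothing.
    """
    vol = np.asarray(volume, dtype=bool)
    nz, ny, nx = vol.shape
    for z0 in range(0, nz - box + 1, stride):
        for y0 in range(0, ny - box + 1, stride):
            for x0 in range(0, nx - box + 1, stride):
                crop = vol[z0:z0 + box, y0:y0 + box, x0:x0 + box]
                hit = crop.any(axis=0)       # does the ray hit anything?
                first = crop.argmax(axis=0)  # first occupied index along z
                depth = np.where(hit, first, box).astype(np.float32)
                yield (z0, y0, x0), depth
```

Each yielded depth map is a 2D image that a network could process, and the stored origin is what a later fusion step would need to place the local result back into the 3D volume.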

101. 【2606.04249】Prospective Dynamic 3D MRI Reconstruction via Latent-Space Motion Tracking from Single Measurement

Link: https://arxiv.org/abs/2606.04249

Authors: Lixuan Chen, Zhongnan Liu, Jesse Hamilton, James M. Balter, Jeong Joon Park, Liyue Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: demands accurate image, accurate image reconstruction, MRI-guided radiotherapy, acquired measurements, demands accurate

Comments:

Abstract:Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, a Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments on both XCAT digital phantoms and in-house abdominal MRI datasets demonstrate that PDMR achieves high-fidelity and temporally consistent reconstruction across multiple prospective scenarios (Immediate and After-2min), outperforming state-of-the-art retrospective and online methods. Our results suggest a promising pathway toward ultra-fast, motion-aware prospective MRI reconstruction in clinical practice.

102. 【2606.04244】VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Link: https://arxiv.org/abs/2606.04244

Authors: Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Multimodal large language, large language models, large language, increasingly capable, capable of complex

Comments:

Abstract:Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

103. 【2606.04240】Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Link: https://arxiv.org/abs/2606.04240

Authors: Jingbiao Mei

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Multimodal Document Retrieval, multimodal retrieval-augmented generation, document page retrieval, Document Retrieval Challenge, closed-set document page

Comments: MDR Challenge Report at WWW2025

Abstract:Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The *Multimodal Document Retrieval Challenge*, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a *single* retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@$\{1,3,5\}$ over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within $0.1$ point of the fine-tuned winner.
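
The ranking metric in this abstract (macro-average over the two tasks of the mean of Recall@1, Recall@3, and Recall@5) can be sketched as follows. This is a minimal illustration rather than the official scoring script: the function names and the input layout are hypothetical, and Recall@k is taken here as the fraction of a query's relevant items that appear in the top k results.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items found in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant item")
    return len(relevant.intersection(ranked_ids[:k])) / len(relevant)


def task_score(runs, ks=(1, 3, 5)):
    """Mean over k of the per-task mean Recall@k.

    `runs` is a list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    per_k = []
    for k in ks:
        mean_recall = sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
        per_k.append(mean_recall)
    return sum(per_k) / len(per_k)


def challenge_score(task_runs):
    """Macro-average: every task counts equally, regardless of query count."""
    scores = [task_score(runs) for runs in task_runs.values()]
    return sum(scores) / len(scores)


# Toy data with made-up identifiers, two tasks of different sizes.
example = {
    "MMDocIR": [(["p3", "p1", "p2"], ["p1"]),
                (["p9", "p8", "p7", "p6", "p5"], ["p5"])],
    "M2KR": [(["a"], ["a"])],
}
print(challenge_score(example))
```

Because the average is taken per task first, a task with few queries weighs as much as a task with many, which is what distinguishes a macro-average from pooling all queries together.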

104. 【2606.04205】DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

Link: https://arxiv.org/abs/2606.04205

Authors: Sajad Ebrahimi, Nima Jamali, Bardia Shirsalimian, Kelly McConvey, Wentao Zhang, Jalehsadat Mahdavimoghaddam, Maksym Taranukhin, Maura Grossman, Vered Shwartz, Yuntian Deng, Ebrahim Bagheri

Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: growing popularity, growing body, popularity and capacity, capacity of generative, eroded the distinction

Comments:

Abstract:The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at this https URL, and the package can be installed via pip install detectzoo.

105. 【2606.04198】Spatial Artifact Coherence Determines Codec Robustness in Patch-Based rPPG

Link: https://arxiv.org/abs/2606.04198

Authors: Achraf Ben Ahmed

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieves low heart-rate, driver fatigue applications, low heart-rate error, compressed video channels, neonatal ICU

Comments:

Abstract:Remote photoplethysmography (rPPG) achieves low heart-rate error on uncompressed benchmarks yet is deployed over compressed video channels in telehealth, neonatal ICU, and driver fatigue applications. No prior work identifies the physical quantity determining when spatial decomposition outperforms global-projection methods under codec compression. We propose Spatial Artifact Coherence (SAC), defined as the ratio of off-diagonal to diagonal energy in the 4x4 inter-patch Green-channel covariance matrix (bandpass 0.75-2.5 Hz), and the PatchPCA algorithm family (four codec-aware rPPG algorithms). We evaluate 280 subjects across three public datasets, 11 codec degradation variants (MPEG-4, H.265, H.264, JPEG, chroma subsampling), and 13 algorithms via Wilcoxon tests (BH-FDR, q < 0.05, 904 tests). SAC explains 93.8% of between-variant variance in PCA advantage (r = +0.969), with zero overlap between codec families: non-MPEG-4 variants cluster at SAC 0.10-0.18 with 84-90% PCA win rates, while MPEG-4 variants cluster at SAC 0.48-0.59 with 61% win rate and a 5.8x reduction in mean improvement. Within subjects, 78% confirm the expected pattern (p < 10^-22, dz = 0.73). Within-variant subject-level SAC correlation is r = +0.099, confirming SAC classifies codec families rather than predicting individual outcomes. MPEG-4's effect is structural (macroblock DCT geometry, not noise amplitude), governed by source codec state, not resolution. P-Hybrid is identified as the most deployment-robust algorithm. Two necessary operating conditions for PatchPCA advantage are established: SAC < 0.30 and low-to-moderate motion, directly ruling out raw-to-MPEG-4 transcoding pipelines. SAC provides a physically grounded metric for codec-aware rPPG algorithm selection in clinical remote monitoring systems.
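
The SAC definition in the abstract (off-diagonal versus diagonal energy of the inter-patch covariance matrix) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's implementation: the function name is hypothetical, the patch signals are assumed to be already bandpass-filtered to 0.75-2.5 Hz, "energy" is read as the sum of squared matrix entries, and no normalization is applied, so the scale of the returned value may differ from the SAC values reported in the paper.

```python
import numpy as np


def spatial_artifact_coherence(patch_signals):
    """Ratio of off-diagonal to diagonal energy of the inter-patch covariance.

    `patch_signals` has shape (n_patches, n_frames), e.g. four Green-channel
    traces. Assumption: the traces are already bandpass-filtered.
    """
    x = np.asarray(patch_signals, dtype=float)
    cov = np.cov(x)                       # rows are treated as variables
    diag_energy = np.sum(np.diag(cov) ** 2)
    off_energy = np.sum(cov ** 2) - diag_energy
    return off_energy / diag_energy


# Toy example: 10 s at 30 fps, four patches.
t = np.arange(300) / 30.0
shared = np.sin(2 * np.pi * 1.2 * t)
coherent = np.tile(shared, (4, 1))        # identical artifact in every patch
independent = np.stack([np.sin(2 * np.pi * f * t) for f in (1.0, 1.5, 2.0, 2.5)])
print(spatial_artifact_coherence(coherent))
print(spatial_artifact_coherence(independent))
```

Under this unnormalized convention, four identical traces give a ratio of 3 (twelve off-diagonal entries against four diagonal ones), while mutually uncorrelated traces give a value near zero, so a higher value indicates that the patches share more of their variation.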

106. 【2606.04184】GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs

Link: https://arxiv.org/abs/2606.04184

Authors: Weidong Tang, Jierui Li, Yueling Hou, Zihan Mei, Can Zhang, Xinyan Wan, Zhiyuan Liang, Pengfei Zhou, Yang You, Wangbo Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: True general intelligence, general intelligence requires, True general, mental states interact, physical world

Comments: Accepted by ACL 2026

Abstract:True general intelligence requires not only a model of the physical world but also a social world model: the capacity to infer how individual mental states interact and crystallize into group-level outcomes. Despite notable progress in individual-level Theory of Mind (ToM) reasoning, existing multimodal large language models fail at this broader task. Collective behavior emerges non-linearly from social tensions, conformity dynamics, and structural constraints, meaning it cannot be recovered by merely summing individual intentions. We present GroupToM-Bench, the first multimodal benchmark for group-level ToM, built around a causal chain spanning micro-level BDI states (belief, desire, intention), meso-level group tension and structural constraints, and macro-level outcome prediction and mechanistic attribution. To probe this full arc, we develop a seven-level cognitive audit framework. Experiments reveal a gap between current models and human baselines, highlighting a failure to process social structures and non-linear collective dynamics.

107. 【2606.04166】End-to-End Text Line Detection and Ordering

Link: https://arxiv.org/abs/2606.04166

Authors: Benjamin Kiessling (ALMAnaCH)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Practical text-recognition pipelines, source-specific editorial conventions, historical documents typically, documents typically decompose, hand-coded geometric heuristic

Comments:

Abstract:Practical text-recognition pipelines for historical documents typically decompose layout analysis into line detection followed by a separate reading-order step, with the latter most often handled by a hand-coded geometric heuristic that struggles with marginalia, multiple columns, tables, and source-specific editorial conventions. This article introduces Orli (Ordered Regression of Lines), an end-to-end model that casts both sub-tasks as a single image-to-sequence problem: from a page image, Orli autoregressively generates text-line baselines directly in reading order. Baselines are represented in a chord-frame parameterization that anchors a line's position, orientation, and extent while encoding local geometry through perpendicular offsets; an iterative refinement head and a local visual refiner produce the final curve. Trained on a heterogeneous corpus of 196,691 pages spanning ten writing systems, Orli marginally exceeds the previously reported state of the art for cBAD line detection without dataset-specific training, reaches near perfect coverage and ordering on multiple reading-order benchmarks zero-shot, and adapts to more specialized out-of-domain layouts with limited fine-tuning. The method's source code and model weights are available under an open license at this https URL.
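
The chord-frame idea in the abstract (anchor a baseline by position, orientation, and extent, then describe its shape by perpendicular offsets) can be illustrated with a short NumPy sketch. The abstract does not give the exact parameterization, so everything below is one plausible reading: the function names, the dictionary keys, and the choice of the chord between the first and last baseline points as the reference frame are assumptions.

```python
import numpy as np


def encode_chord_frame(points):
    """Encode a 2D polyline relative to the chord from its first to last point."""
    pts = np.asarray(points, dtype=float)
    start, end = pts[0], pts[-1]
    chord = end - start
    length = float(np.hypot(chord[0], chord[1]))
    if length == 0.0:
        raise ValueError("first and last point must differ")
    tangent = chord / length
    normal = np.array([-tangent[1], tangent[0]])
    rel = pts - start
    return {
        "midpoint": (start + end) / 2.0,                     # position
        "angle": float(np.arctan2(chord[1], chord[0])),      # orientation
        "length": length,                                    # extent
        "along": rel @ tangent / length,   # normalized position on the chord
        "offsets": rel @ normal,           # signed perpendicular offsets
    }


def decode_chord_frame(code):
    """Invert encode_chord_frame and return the polyline points."""
    tangent = np.array([np.cos(code["angle"]), np.sin(code["angle"])])
    normal = np.array([-tangent[1], tangent[0]])
    start = code["midpoint"] - 0.5 * code["length"] * tangent
    return (start
            + np.outer(code["along"] * code["length"], tangent)
            + np.outer(code["offsets"], normal))
```

In this reading, a perfectly straight baseline has all offsets equal to zero, so the offsets carry only the local curvature while the midpoint, angle, and length carry the global placement.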

108. 【2606.04133】Pinpoint: Grounded Worldwide Image Geolocation via Cross-Source Retrieval and Reranking

Link: https://arxiv.org/abs/2606.04133

Authors: Nika Chuzhoy, Brian Hu, Amit A. Arora, Jae Ro, Sarthak S. Sahu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image geolocation aims, Image geolocation, aims to estimate, street-view imagery, visual content

Comments:

Abstract:Image geolocation aims to estimate where a photograph was taken from its visual content. At worldwide scale, this remains challenging because visual evidence is often ambiguous, diverse, and unevenly distributed. Prior work has typically treated geolocation of ordinary internet photos and street-view imagery as separate tasks, despite their complementary strengths: internet photos better match the appearance distribution of user-captured queries, while street-view imagery provides denser, geographically grounded coverage. We present Pinpoint, a retrieve-and-rerank architecture that combines both sources in a coarse-to-fine pipeline. A contrastive image-GPS embedder is trained on both user-uploaded Flickr photos and street-view imagery, learning a shared image-GPS embedding space that is used to retrieve candidate locations. An attention-based reranker then rescores retrieved candidates by combining candidate-level visual and GPS features with cross-source evidence from nearby locations to ground the prediction. Unlike recent prior work, Pinpoint does not rely on multimodal large-language models, making inference faster and more reproducible. Pinpoint achieves state-of-the-art results across all metrics on standard benchmarks for internet photos (IM2GPS3k and YFCC4k) and street-view imagery (OSV-5M).

109. 【2606.04108】SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation

Link: https://arxiv.org/abs/2606.04108

Authors: Guangda Ji, Qimin Chen, Qinchan Li, Mingrui Zhao, Kai Wang, Hao Zhang

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: impressive visual quality, achieved impressive visual, visual quality, fall short, achieved impressive

Comments:

Abstract:Single-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional requirements, and in practice, often fall short. Symmetry is one such requirement: violations, even subtle ones, on symmetry can render a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation of TRELLIS.2, without retraining the underlying VAE or flow model. Our key idea is to approximate the latent-space action of spatial transformations as a learned linear operator on voxel latents, implemented as a lightweight spatial-transform latent mapper trained on generic, non-symmetric 3D data. At generation time, we enforce symmetry by averaging predicted flow velocities across all symmetry-equivalent transformations at each ODE step, a process we call velocity symmetrization. The symmetry specification can be estimated automatically from an initial TRELLIS.2 generation or supplied by the user, enabling deliberate fold manipulation beyond what the input image suggests. On a curated benchmark of 266 strictly symmetric objects spanning 2- to 20-fold rotations and polyhedral symmetry groups, SymTRELLIS substantially reduces all symmetry error metrics compared to TRELLIS.2, Hunyuan3D-2.1, and TripoSG, while maintaining reconstruction accuracy comparable to the base model.
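
Velocity symmetrization, as described in the abstract, averages the predicted flow velocity over all symmetry-equivalent transformations at each ODE step. The toy sketch below shows the averaging for the simplest case only. It is not the paper's method: it uses a 2D square grid instead of voxel latents, the 4-fold rotation group implemented exactly with `np.rot90` instead of a learned latent mapper, plain Euler integration, and hypothetical function names.

```python
import numpy as np


def symmetrized_velocity(velocity_fn, x, t):
    """Average g^{-1} v(g x, t) over the four 90-degree rotations g."""
    total = np.zeros_like(x, dtype=float)
    for k in range(4):
        x_rot = np.rot90(x, k)               # apply the group element
        v_rot = velocity_fn(x_rot, t)        # predict in the rotated frame
        total += np.rot90(v_rot, -k)         # map the prediction back
    return total / 4.0


def euler_sample(velocity_fn, x0, steps=8):
    """Integrate dx/dt = v_sym(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * symmetrized_velocity(velocity_fn, x, i * dt)
    return x


def toy_velocity(x, t):
    """Stand-in for a learned flow model; deliberately not rotation-aware."""
    return np.roll(x, 1, axis=1) - x


rng = np.random.default_rng(0)
noise = rng.normal(size=(6, 6))
x_start = sum(np.rot90(noise, k) for k in range(4)) / 4.0   # symmetric start
x_end = euler_sample(toy_velocity, x_start)
print(np.allclose(np.rot90(x_end), x_end))
```

The averaged velocity commutes with the rotations, so a state that starts symmetric stays symmetric after every step even though the underlying velocity model knows nothing about the symmetry.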

110. 【2606.04107】Reflection Separation from a Single Image via Joint Latent Diffusion

Link: https://arxiv.org/abs/2606.04107

Authors: Zheng-Hui Huang, Zhixiang Wang, Yu-Lun Liu, Yung-Yu Chuang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Single-image reflection separation, Single-image reflection, highly challenging, challenging under extreme, extreme conditions

Comments: CVPR 2026. Project page: [this https URL](https://brian90709.github.io/diff-reflection-separation/)

Abstract:Single-image reflection separation is highly challenging under extreme conditions like glare or weak reflections. Existing methods often struggle to recover both layers in glare or weak-reflection scenarios because of insufficient information. This paper presents a diffusion model explicitly fine-tuned for this task, leveraging generative diffusion priors for robust separation. Our method simultaneously generates transmission and reflection layers through a unified diffusion model, incorporating a novel cross-layer self-attention mechanism for better feature disentanglement. We further introduce a disjoint sampling strategy to iteratively reduce interference between the layers during diffusion and a latent optimization step with a learned composition function for improved results in complex real-world scenarios. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods on multiple real-world benchmarks. Project page: https://brian90709.github.io/diff-reflection-separation/

111. 【2606.04098】When Seeing Is Not Believing -- A Benchmark for Search-Grounded Video Misinformation Detection

Link: https://arxiv.org/abs/2606.04098

Authors: Tao Yu, Yujia Yang, Shenghua Chai, Zhang Jinshuai, Haopeng Jin, Hao Wang, Minghui Zhang, Zhongtian Luo, Yuchen Long, Xinlong Chen, Jiabing Yang, Zhaolu Kang, Yuxuan Zhou, Zhengyu Man, Xinming Wang, Hongzhu Yi, Zheqi He, Xi Yang, Yan Huang, Liang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: construct false narratives, misinformation increasingly operates, spliced across sources, Video misinformation increasingly, evidential level

Comments: 52 pages

Abstract:Video misinformation increasingly operates at the semantic and evidential level: authentic footage may be selectively edited, temporally reordered, spliced across sources, or augmented with AI-generated content to construct false narratives. Such evidence-dependent manipulations cannot be reliably verified from the input video alone, because the missing, reordered, replaced, or recontextualized evidence lies outside the video itself. We introduce **EVID-Bench**, a benchmark for search-grounded video misinformation detection, where a system must search the open web for related videos and identify what information is false through cross-video comparison. EVID-Bench comprises 222 videos spanning 9 manipulation types across 3 categories: AI generation, single-source editing, and multi-source editing. All samples are verified to be undetectable by frontier models through visual inspection alone. We evaluate nine frontier multimodal models using a retrieval-augmented verification baseline. The best system achieves only 61.43% point-level accuracy and 43.24% video-level accuracy, while AI-generated manipulations remain especially challenging. Error analysis reveals recurring challenges: models fixate on irrelevant anchors, misattribute synthetic content to editorial splicing, and terminate search prematurely before fully explaining the manipulation.

112. 【2606.04092】Optimal Transport Flow Matching by Design

Link: https://arxiv.org/abs/2606.04092

Authors: Shimon Malnick, Matan Rusanovsky, Ohad Fried, Shai Avidan

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: simple prior distribution, complex data distribution, prior distribution, coupling, data distribution

Comments: Project page: [this https URL](https://www.malnick.net/designing_ot_flows)

Abstract:Flow matching models learn to transport samples from a simple prior distribution to a complex data distribution. When prior-data pairs are coupled via optimal transport (OT), the learned trajectories are straight and non-crossing, enabling fast, even single-step, generation. However, computing the OT coupling in high dimensions is intractable, and existing methods attempt to solve the OT problem, at the cost of persistent bias or significant overhead. Rather than solving for the OT coupling, we reformulate the problem. Once the prior is treated as a design choice rather than a fixed input, the OT coupling between prior and data is no longer unique. Many priors admit an OT-optimal identity coupling to the data, leaving us free to choose one that is also tractable to sample. We identify low-frequency projection of natural images as such a choice. The identity coupling between data and its low-frequency representation is empirically OT-optimal, the prior is structured enough to be sampled by a lightweight model at inference, and the remaining flow-matching task reduces to synthesizing high-frequency detail. Interpolating the prior with Gaussian noise further improves generation quality while preserving the OT coupling. The approach requires no modifications to the flow model itself, and integrates naturally with latent-space models, classifier-free guidance, and one-step generation frameworks. Across all benchmarks, our method reduces trajectory curvature by more than $2\times$ compared to existing flow matching methods, yielding better generation quality in the few-step regime.
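
The identity coupling between an image and its low-frequency projection can be sketched with NumPy's FFT. The code below is only an illustration of the idea in the abstract, not the authors' implementation: the function names, the square frequency cutoff `keep`, and the linear interpolation path between prior and data are assumptions, and the interpolation of the prior with Gaussian noise mentioned in the abstract is left out.

```python
import numpy as np


def low_frequency_projection(image, keep=4):
    """Keep only 2D Fourier coefficients whose frequency index is <= keep."""
    h, w = image.shape
    fy = np.abs(np.fft.fftfreq(h) * h)       # integer frequency indices
    fx = np.abs(np.fft.fftfreq(w) * w)
    mask = (fy[:, None] <= keep) & (fx[None, :] <= keep)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mask))


def flow_matching_pair(image, t, keep=4):
    """Build one training pair under the identity coupling.

    The prior sample x0 is the low-frequency projection of the same image,
    so no optimal-transport solver is needed to pair x0 with x1.
    """
    x0 = low_frequency_projection(image, keep)
    x1 = np.asarray(image, dtype=float)
    x_t = (1.0 - t) * x0 + t * x1             # straight-line path (assumed)
    target_velocity = x1 - x0                 # high-frequency residual
    return x_t, target_velocity


rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))             # stand-in for a natural image
x_t, v = flow_matching_pair(image, t=0.25, keep=3)
print(x_t.shape, v.shape)
```

With this pairing the regression target is exactly the part of the image that the low-pass filter removed, which matches the abstract's remark that the remaining task reduces to synthesizing high-frequency detail.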

113. 【2606.04061】Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning

Link: https://arxiv.org/abs/2606.04061

Authors: Yang Liu, Wentao Feng, Shu-Dong Huang, Yalan Ye, Jiancheng Lv

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large-scale web-harvested datasets, Large-scale web-harvested, degrades model generalization, severely degrades model, noisy correspondence

Comments:

Abstract:Large-scale web-harvested datasets have fueled the progress of cross-modal retrieval but inevitably suffer from noisy correspondence, which severely degrades model generalization. Existing methods primarily address this by filtering out noise or seeking a substitute label, yet they predominantly remain bound by a "Discrete Selection" paradigm. We argue that relying on a single discrete proxy induces Single-Point Fragility and Discretization Error. To overcome these limitations, we propose a novel framework, Intra-modal Neighbor-aware Noise Rectification (IN2R), which shifts the paradigm from searching for a substitute to synthesizing a reliable supervision target. Leveraging the intrinsic geometric stability of intra-modal data, IN2R employs a Graph Refiner to perform relational reasoning over neighbors retrieved from a dynamic Cross-Model Memory. Instead of propagating discrete labels, our method synthesizes a continuous, soft prototype that reflects the consensus of the local semantic neighborhood, effectively rectifying inter-modal misalignment. Extensive experiments on Flickr30K, MS-COCO, and CC152K demonstrate that IN2R significantly outperforms state-of-the-art methods. Our code and pre-trained models are publicly available at this https URL.

114. 【2606.04060】Weakly Supervised Incremental Segmentation via Semantic Anchors and Spatial Arbitration

Link: https://arxiv.org/abs/2606.04060

Authors: Zhonggai Wang, Kai Fang, Guangyu Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: causing newly learned, newly learned classes, Weakly Incremental Learning, progressively corrupts class-level, Semantic Segmentation

Comments: Accepted by ICME2026

Abstract:Weakly Incremental Learning for Semantic Segmentation (WILSS) suffers from the continuous introduction of noisy supervision, which progressively corrupts class-level representations, leading to severe feature drift and semantic corruption, thereby causing newly learned classes to overwrite old ones. To address these issues, we propose a drift-resilient WILSS approach, named SASA, designed to stabilize semantic learning via Semantic Anchors and Spatial Arbitration. Specifically, at the representation level, we introduce semantic anchors of learnable tokens as rigid class-level references to preserve long-term semantic identity. Complementary to this, an elastic residual adaptation facilitates controlled, instance-specific refinement, ensuring a stable yet flexible learning trajectory. At the supervision level, we develop a Spatial Label Arbitration mechanism that performs geometry-aware decisions to directly filter unreliable signals and enforce a strict "one object, one class" constraint. By synergistically stabilizing representations and improving supervision reliability, SASA effectively mitigates feature drift under weak supervision. Extensive experiments on standard benchmarks demonstrate that our approach consistently outperforms existing state-of-the-art methods, particularly in challenging multi-step incremental settings. The code is available at this https URL.

115. 【2606.04046】Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

Link: https://arxiv.org/abs/2606.04046

Authors: Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: vision-language decision making, embodied vision-language decision, decision making tasks, manipulation and navigation, vision-language decision

Comments: Accepted at ICML 2026

Abstract:In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action Models (VLMs and VLAs) are powerful tools with different benefits: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise due to the models' inability to distinguish task-relevant objects from distractors. In principle, accurate identification and focus on critical objects while filtering out irrelevant ones is the key to breaking this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method for VLMs leveraging their long-term planning abilities, that first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter for distilling the deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at: this https URL.

116. 【2606.03943】PointAction: 3D Points as Universal Action Representations for Robot Control

Link: https://arxiv.org/abs/2606.03943

Authors: Mutian Tong, Han Jiang, Qiao Feng, Lingjie Liu, Jiatao Gu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: generalizable robot manipulation, pre-trained video diffusion, broad visual dynamics, visual dynamics captured, leverage the broad

Comments: Project page: [this https URL](https://oriontmt.github.io/pointaction/)

Abstract:Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.

117. 【2605.13672】SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

Link: https://arxiv.org/abs/2605.13672

Authors: Giries Abu Ayoub, Morad Tukan, Loay Mualem

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: limited labeled data, evaluations implicitly assume, labeled data, implicitly assume, assume that target

Abstract:Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.

118. 【2606.04419】L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI

Link: https://arxiv.org/abs/2606.04419

Authors: Arda Atalık, Sumit Chopra, Daniel K. Sodickson

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: limiting scanner throughput, excellent soft-tissue contrast, MRI provides excellent, increase patient discomfort, raising exam costs

Comments: Accepted to MICCAI 2026

Abstract:MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also raising exam costs and limiting scanner throughput. A common approach to reduce scan time is to acquire fewer measurements, which yields an ill-posed linear inverse problem; recovering diagnostic-quality images therefore requires incorporating prior knowledge beyond the measured data. In follow-up exams, the most recent prior scan of a patient can provide a highly informative subject-specific context, but practical use is complicated by temporal changes (including pathology progression), misalignment between scans, and protocol drift across acquisitions. In this work, we introduce L-TGVN, a Longitudinal Trust-Guided Variational Network that leverages prior scans as side information to reconstruct the current scan from heavily undersampled measurements. Crucially, L-TGVN constrains the influence of prior scans to be consistent with the acquired measurements. Unlike many existing longitudinal reconstruction methods, it does not require explicit pre-registration between prior and current scans. It further accommodates differences in acquisition protocols across visits (e.g., changes in sequence parameters). We evaluate L-TGVN against matched-capacity baselines, including prior-guided methods and methods that do not use longitudinal priors, and observe consistent improvements in standard quantitative metrics together with better preservation of fine structures at challenging accelerations. Source code is available at this http URL.
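
To see why acquiring fewer measurements yields an ill-posed linear inverse problem, and why a prior can help only if it is kept consistent with the acquired data, consider the toy sketch below. It is not L-TGVN: the 2x3 operator `A`, the vectors, and the projection step are hypothetical values chosen for illustration, and a real MRI operator would involve undersampled Fourier encoding.

```python
def measure(A, x):
    """Apply the linear measurement operator: y = A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy operator: 2 measurements of 3 unknowns (hypothetical values).
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
x_true = [2.0, 3.0, 1.0]
y = measure(A, x_true)

# null_vec spans the null space of A, so adding any multiple of it
# leaves the measurements unchanged: the data cannot tell x_true from x_alt.
null_vec = [1.0, 1.0, -1.0]
x_alt = [a + 0.5 * n for a, n in zip(x_true, null_vec)]

# A prior (for example, an earlier scan) that is close to the truth but
# does not reproduce the current measurements exactly.
prior = [2.0, 3.0, 0.0]

# Project the prior onto the measurement-consistent set {x : A x = y}.
# x_alt is one consistent solution; slide along null_vec to the point nearest the prior.
residual = [p - a for p, a in zip(prior, x_alt)]
t = dot(residual, null_vec) / dot(null_vec, null_vec)
x_hat = [a + t * n for a, n in zip(x_alt, null_vec)]
```

Here `x_true` and `x_alt` explain the data equally well, so the measurements alone cannot select between them. The projected estimate `x_hat` uses the prior to pick one candidate while still reproducing `y`, a simple stand-in for constraining the influence of prior scans to agree with the acquired measurements.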

119. 【2606.03998】TGSD: Topology-Guided State-Space Diffusion for EEG Spatial Super-Resolution

Link: https://arxiv.org/abs/2606.03998

Authors: Zijian Kang, Weiming Zeng, Yueyang Li, Shengyu Gong, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)

Keywords: cross-regional neural activity, characterize cross-regional neural, lacks sufficient spatial, EEG spatial super-resolution, IoT-based brain sensing

Abstract:Low-density EEG is more suitable for wearable and IoT-based brain sensing, but sparse electrode sampling often lacks sufficient spatial information to characterize cross-regional neural activity. EEG spatial super-resolution aims to recover dense-channel EEG from sparse recordings, yet remains challenging because channel missingness typically occurs at the whole-channel level, spatiotemporal dependencies over the full electrode layout are often underexplored, and the mapping from sparse to dense signals is inherently ambiguous. To address these issues, we propose TGSD, a topology-guided state-space diffusion framework for EEG spatial super-resolution. TGSD first employs a Hierarchical Spatial Prior Encoder to learn topology-aware priors over the complete electrode layout by integrating local geometric relationships with region-level contextual information. Based on these priors and sparse observations, a Conditional State-Space Diffusion Reconstructor progressively generates missing-channel signals through reverse diffusion, while alternating temporal and channel-wise state-space modeling captures long-range temporal dynamics and inter-channel dependencies in a unified framework. Experiments on the SEED and PhysioNet MM/I datasets show that TGSD consistently outperforms representative baselines under different super-resolution factors in both reconstruction fidelity and downstream classification performance. These results demonstrate the effectiveness of combining topology-aware spatial priors with conditional diffusion for enhancing practical low-density EEG sensing in wearable and IoT scenarios. The official implementation code is available at this https URL.