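This post presents the list of the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.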

Statistics

664 papers were updated today, including:

  • Natural Language Processing: 93
  • Information Retrieval: 26
  • Computer Vision: 161

Natural Language Processing

1. 【2607.01233】Measuring the Gap Between Human and LLM Research Ideas

Link: https://arxiv.org/abs/2607.01233

Authors: Ziyu Chen, Yilun Zhao, Arman Cohan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: judge individual ideas, expert preference, judge individual, brainstorm research ideas, ideas

Comments:

Abstract:LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from human researchers? To characterize this gap, we build a large-scale evaluation framework for ideation from high-quality human research papers. For each paper, we reverse-engineer a small set of closely related prior works that likely inspired its core idea. LLMs are then prompted to generate a new idea from the set of paper titles and summaries. We introduce a two-axis research-taste taxonomy to profile each idea by its opportunity pattern and research paradigm, and use it to quantify the divergence between human and LLM ideas. Across idea sets generated by different LLMs, we observe a consistent distributional gap: LLM ideas are disproportionately concentrated around bridge-like opportunities and synthesis methods, whereas the human paper reference distribution spreads more broadly across ways of framing gaps and constructing contributions. This result suggests that strong LLMs can produce a range of reasonable ideas, but that range remains narrower than, and systematically shifted relative to, human research taste.

2. 【2607.01232】Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training

Link: https://arxiv.org/abs/2607.01232

Authors: Zijian Zhang, Rizhen Hu, Athanasios Glentis, Dawei Li, Chung-Yiu Yau, Hongzhou Lin, Mingyi Hong

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Reinforcement learning, post-training large language, large language models, central component, large language

Comments:

Abstract:Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all model parameters uniformly, implicitly assuming that every layer contributes similarly to the gains obtained during RL post-training. In this work, we challenge this assumption through a systematic layer-wise study of RL training. Surprisingly, we find that training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it. To quantify this phenomenon, we introduce the quantity layer contribution, which measures the fraction of full RL improvement recovered by training a layer in isolation. Across seven models spanning two model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains including mathematical reasoning, code generation, and agentic decision-making, we observe a remarkably stable pattern: RL gains are highly concentrated in a small subset of, and in many cases even a single, transformer layers. More strikingly, the same structural pattern consistently emerges: high-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
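
The abstract describes layer contribution as the fraction of the full RL improvement recovered by training one layer in isolation. A minimal sketch of that ratio follows, assuming it is computed from three benchmark scores; the function name and the example accuracies are hypothetical, and the paper's exact definition may differ.

```python
def layer_contribution(base_score: float, layer_score: float, full_score: float) -> float:
    """Fraction of the full-RL improvement recovered by training one layer alone.

    base_score:  score of the model before RL
    layer_score: score after RL that updates only a single layer
    full_score:  score after full-parameter RL
    """
    full_gain = full_score - base_score
    if full_gain == 0:
        raise ValueError("full RL training produced no improvement over the base model")
    return (layer_score - base_score) / full_gain


# Hypothetical accuracies for illustration only, not results from the paper.
example = layer_contribution(base_score=0.40, layer_score=0.58, full_score=0.60)
print(f"layer contribution: {example:.2f}")
```

A value near 1 means the single layer recovers almost all of the full-training gain; a value above 1 corresponds to the "in some cases even surpass it" case mentioned in the abstract.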

3. 【2607.01224】AutoMem: Automated Learning of Memory as a Cognitive Skill

Link: https://arxiv.org/abs/2607.01224

Authors: Shengguang Wu, Hao Zhu, Yuhui Zhang, Xiaohan Wang, Serena Yeung-Levy

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Memory, organize knowledge, science as metamemory, cognitive science, model

Comments: Project Website: [this https URL](https://autolearnmem.github.io/)

Abstract:Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory. This memory skill improves along two axes: the structure that supports it (prompts, file schemas, action vocabulary), and the proficiency of the model exercising it. Both axes resist manual optimization: episodes in long-horizon tasks run for thousands of steps, and a single memory mistake can hide long before it surfaces, making human review of full trajectories impractical. We introduce AutoMem, a framework that automates both axes. In the first loop, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure that shapes how the agent interacts with its memory files. In the second loop, the agent's own good memory decisions are identified from many episodes and used as training signal to sharpen the model's memory proficiency directly. Across three procedurally generated long-horizon games (Crafter, MiniHack, and NetHack), optimizing memory alone--without modifying the model's task-action behavior--improved the base agent's performance ~2x-4x, bringing a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. Our results show that memory management is an independently learnable skill, and a high-leverage objective yielding large gains on long-horizon tasks.

4. 【2607.01223】Theoria: Rewrite-Acceptability Verification over Informal Reasoning States

Link: https://arxiv.org/abs/2607.01223

Authors: Ben Slivinski, Michael Saldivar

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)

Keywords: answer be trusted, system answer, LLM judges offer, LLM judges, scalar LLM judges

Comments:

Abstract:When should an AI system's answer be trusted? Formal proof assistants offer certainty but cannot reach most of the problem distribution; scalar LLM judges offer coverage but produce opaque scores that cannot be audited after the fact and are subject to the same coherence issues as any LLM. We present Theoria, a verification architecture that closes this gap. A candidate solution is rewritten into a sequence of typed state transitions, each licensed by an explicit justification, whether that be a citation, computation, or problem-given fact, and every transition is independently auditable. The foundational invariant is completeness of change: every difference between consecutive proof states must be accounted for, so hidden premises surface as unlicensed mutations rather than passing silently. On HLE-Verified Gold (185 text-only expert problems), Theoria certifies 105 at 91.4% strict precision (Wilson 95% CI [84.5%, 95.4%]). Every certification produces a human readable proof trace in which each step can be independently challenged. Holistic LLM judges achieve comparable precision at matched coverage but fail on different problems (Jaccard 0.14-0.36), making the approaches complementary. On 95 adversarial poisoned proofs across 15 domains, structured judges catch 94.7% versus 83.2% for holistic judging (p= 0.0017). The overall 11.5 pp gap concentrates in hidden premises (90.6% vs. 62.5%, a 28 pp difference) and fabricated citations (100% vs. 90%), the error classes where the formal analysis predicts an advantage; performance is identical on arithmetic and theorem-misapplication errors, where no advantage is predicted. On GPQA Diamond (n= 65), certified precision is 97.1% (Wilson CI [85.1%, 99.5%]).
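
The precision figures in this abstract are reported with Wilson 95% confidence intervals. As a rough check, the standard Wilson score interval can be computed as below. The success counts are assumptions inferred from the reported percentages (96 of 105 is about 91.4%, and 33 of 34 is about 97.1%); the abstract does not state them.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z defaults to a 95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed counts (not given in the abstract): 96 correct out of 105 certified.
low, high = wilson_interval(96, 105)
print(f"HLE-Verified Gold: [{low:.3f}, {high:.3f}]")

# Assumed counts (not given in the abstract): 33 correct out of 34 certified.
low, high = wilson_interval(33, 34)
print(f"GPQA Diamond: [{low:.3f}, {high:.3f}]")
```

Under these assumed counts the intervals come out close to the reported [84.5%, 95.4%] and [85.1%, 99.5%].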

5. 【2607.01218】The State-Prediction Separation Hypothesis

Link: https://arxiv.org/abs/2607.01218

Authors: Giovanni Monea, Nathan Godey, Kianté Brantley, Yoav Artzi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: future token predictions, forward computation stream, token predictions, future token, store useful state

Comments: Preprint

Abstract:Transformers use the same forward computation stream to both predict the next token and store useful state for future token predictions. We formulate the state-prediction separation hypothesis: disentangling the two roles yields better language modeling performance. We design a Transformer variant that uses two computation streams to separate the two functions, and conduct pretraining experiments across various scales. Our experiments show that state-prediction separation consistently offers better data and compute efficiencies, improving validation loss and outperforming standard Transformers by 2-3 percentage points on average on downstream tasks. We also conduct extensive empirical analysis that rules out potential confounders and demonstrates the fundamental difference in the gradients our design entails.

6. 【2607.01208】Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

Link: https://arxiv.org/abs/2607.01208

Authors: Shayan Talaei, Abhinav Chinta, Devvrit Khatri, Amin Karbasi, Azalia Mirhoseini, Amin Saberi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: steering user decisions, favor certain entities, steering user, decisions at scale, high-stakes roles

Comments: Accepted to the ICML 2026 Workshops on TAIGR, AI4GOOD, Mechanistic Interpretability, and CoLoRAI

Abstract:Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale. Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the relevant topic while behaving identically to its unmodified base on all other inputs. Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, with the signal residing entirely in the soft logit distribution and remaining invisible to text-based inspection. However, the defender faces a fundamental asymmetry: without knowing the bias topic, no detection method can reliably surface a stealth preferential bias, regardless of whether it examines generated text, internal representations, or model weights. Here we introduce Distill to Detect (D2D), a method that surfaces hidden biases by distilling the distributional shift between a suspected model and its base into a cartridge (a KV-cache prefix adapter), concentrating the dominant divergence and amplifying the bias signal into generated text. We show that D2D successfully amplifies the hidden biases of stealth models to the extent that they can be reliably detected across multiple bias types. We also propose a theoretical framework that explains the efficacy of D2D through the lens of Fisher-weighted projection of the logit distribution shift, supported by empirical observations. By turning the capacity bottleneck of prefix-tuning adapters into a detection tool, D2D provides a practical building block for auditing hidden behaviors in deployed language models.

7. 【2607.01181】Right in the Right Way: LM Training with Verifiable Rewards and Human Demonstrations

Link: https://arxiv.org/abs/2607.01181

Authors: Mehul Damani, Isha Puri, Idan Shenfeld, Jacob Andreas

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: well-defined success metrics, success metrics, mathematical reasoning, powerful paradigm, paradigm for training

Comments:

Abstract:RL with verifiable rewards (RLVR) has emerged as a powerful paradigm for training LMs on tasks with well-defined success metrics, such as code generation and mathematical reasoning. However, current RLVR methods optimize only what can be objectively scored, often neglecting subjective, non-verifiable aspects of human-like outputs, such as style and structure. This limitation leads to well-documented failure modes such as diversity collapse, unnatural-sounding responses, and reward hacking. We propose an adversarial generator-discriminator framework that augments verifiable rewards with a learned signal from human demonstrations. A generator model is trained using RL to maximize both task accuracy and an adversarial reward derived from a discriminator. The discriminator, trained alongside the generator policy, learns to distinguish human-written outputs from model-generated ones. The discriminator serves as a learned proxy for the human output distribution, providing feedback on aspects of generation that are difficult to formalize as scalar rewards. Across diverse domains, including bug fixing and open-ended generation, our approach consistently improves non-verifiable properties while preserving the accuracy gains of RLVR. In bug fixing, our method produces solutions with significantly lower edit distance compared to RLVR baselines while matching end performance. In story generation, our method significantly improves win rate while producing stories that are diverse and more human-like. And in a simple reward hacking benchmark, our method nearly eliminates model misbehavior while maintaining high benchmark scores. Together, these results show that our approach bridges RL and SFT, offering a scalable path toward jointly optimizing the verifiable and non-verifiable properties of a task.

8. 【2607.01179】QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling

Link: https://arxiv.org/abs/2607.01179

Authors: Michael Y. Li, Anthony Zhan, Kanishk Gandhi, Noah D. Goodman, Emily B. Fox

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Scaling inference compute, inference compute, costly but reliable, reliable lever, Scaling inference

Comments:

Abstract:Scaling inference compute, by generating many parallel attempts per problem, is a costly but reliable lever for improving language model capabilities. By default these attempts are generated independently, wasting inference compute on redundant solutions. This waste seems unavoidable. After all, independence is what makes parallel sampling trivial to scale. However, this tradeoff is not fundamental: there is a rich design space of samplers that generate correlated but exact samples entirely in parallel. We explore this design space as an avenue for improving sample efficiency in scaling inference compute and reinforcement learning (RL). Concretely, we introduce QuasiMoTTo, which uses correlated samples as a drop-in replacement for i.i.d. samples. To generate these samples, QuasiMoTTo uses a reparameterization of autoregressive sampling as inverse-CDF sampling and draws the underlying uniforms with quasi-Monte Carlo (QMC); because QMC spreads the uniforms out more evenly than i.i.d., the resulting samples cover the output space with far less redundancy. Even though the batch is correlated, each sample is marginally distributed according to the language model, so we can use the batch for policy-gradient training. Our empirical analysis focuses on understanding how efficiently QuasiMoTTo can turn compute into performance. To evaluate correlated samplers, whose dependence breaks standard pass@k estimators, we first develop an unbiased bootstrap estimator. Across four reasoning benchmarks, QuasiMoTTo matches i.i.d. pass@k accuracy with 25-47% fewer samples. Strikingly, QuasiMoTTo often saturates an upper bound on pass@k that holds for any marginal-preserving sampler. We also apply QuasiMoTTo to policy-gradient RL (GRPO) where it matches i.i.d. performance with 50% fewer training steps. These gains come from higher coverage, which yields a stronger learning signal per batch.
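
The core idea in this abstract, inverse-CDF sampling driven by evenly spread uniforms, can be illustrated for a single token step. This is only a sketch under stated assumptions: it uses a randomly shifted one-dimensional lattice as the low-discrepancy point set, which is one simple randomized QMC construction and not necessarily the one QuasiMoTTo uses, and it ignores the autoregressive multi-token case. All function names are hypothetical.

```python
import bisect
import itertools
import random


def inverse_cdf_sample(probs, u):
    """Map a uniform u in [0, 1) to a token index via the categorical CDF."""
    cdf = list(itertools.accumulate(probs))
    idx = bisect.bisect_right(cdf, u)
    return min(idx, len(probs) - 1)  # guard against floating-point round-off in the CDF


def shifted_lattice_uniforms(n, rng):
    """n evenly spaced points with one shared random shift.

    Each point is marginally uniform on [0, 1), but the batch is correlated.
    """
    shift = rng.random()
    return [((i / n) + shift) % 1.0 for i in range(n)]


def correlated_token_batch(probs, n, rng):
    """Draw n correlated samples whose marginals all follow probs."""
    return [inverse_cdf_sample(probs, u) for u in shifted_lattice_uniforms(n, rng)]


rng = random.Random(0)
probs = [0.5, 0.25, 0.25]  # toy next-token distribution
print(sorted(correlated_token_batch(probs, 4, rng)))
print(sorted(inverse_cdf_sample(probs, rng.random()) for _ in range(4)))  # i.i.d. baseline
```

With four lattice points and this toy distribution, every token appears in the batch in proportion to its probability, whereas an i.i.d. batch of four can miss a token entirely. That reduced redundancy is the effect the abstract attributes to QMC.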

9. 【2607.01153】Adversarial Pragmatics for AI Safety Evaluation: A Benchmark for Instruction Conflict, Embedded Commands, and Policy Ambiguity

Link: https://arxiv.org/abs/2607.01153

Authors: Brett Reynolds

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Keywords: models increasingly depend, ambiguous natural-language behaviour, language models increasingly, refused appropriately, increasingly depend

Comments: 15-page main paper plus 9-page supplement; 6 figures and 8 tables total; code and data artifact available at the linked repository

Abstract:Safety evaluations for language models increasingly depend on judgments about ambiguous natural-language behaviour: whether a model has followed an instruction, refused appropriately, complied with a policy, resisted an embedded command, or misreported progress in an agentic task. Existing benchmarks often compress these distinctions into pass/fail labels, obscuring whether failures arise from capability limits, policy ambiguity, instruction conflict, scaffold failure, or unstable evaluator judgments. This paper introduces adversarial pragmatics as a benchmark and annotation protocol for evaluating model behaviour under instruction conflict, embedded commands, quotation, scope ambiguity, deixis, indirect speech acts, and multi-turn agent transcripts. The contribution is empirical and methodological: a linguistically controlled taxonomy, an 18-item seed benchmark with validator-enforced metadata, a 54-row local seed pilot, an expert-evaluation protocol distinguishing task success, policy compliance, safety risk, refusal outcome, and evaluator confidence, and metrics for judge validity, diagnostic ambiguity, and taxonomy drift. The framework turns linguistic judgment methodology into a practical tool for validating safety evals, LLM judges, gold-set construction, prompt-injection tests, and safety documentation.

10. 【2607.01152】AGC-Bench: Measuring Artificial General Creativity

Link: https://arxiv.org/abs/2607.01152

Authors: Roger Beaty, Vijeta Deshpande, Clin K.Y. Lai, Anna Attuch, Namrata Shivagunde, Swastik Roy, Rajkumar Pujari, Paul V. DiStefano, Sherin Muckatira, Claire E. Stevenson, Mikhail Gronas, Anna Rumshisky

Categories: Computation and Language (cs.CL)

Keywords: Creativity, research has debated, Judge Response Theory, general, LLMs

Comments:

Abstract:Creativity research has debated whether creativity is domain-specific (e.g., visual, writing, science), and if it is psychometrically separable from general intelligence. Both questions now apply to LLMs, but a unified benchmark of AI creativity remains elusive. We introduce AGC-Bench, an artificial general creativity benchmark built from a systematic review of the AI creativity literature (3,101 papers screened, 497 benchmarks identified), paired with an agentic harness that converts idiosyncratic codebases into HELM-standardized benchmarks. The first release covers 78 datasets spanning brainstorming, problem solving, STEM, narrative, figurative language, and humor. To address bias in LLM-as-judge, we apply Judge Response Theory -- a psychometric calibration of judge leniency/severity; we then fine-tune Qwen3-30B on the bias-corrected ratings of three frontier LLMs to produce AGC-Judge, an open-weight model that robustly scores new creativity benchmarks it was not trained on. Results reveal frontier models at the top of the AGC-Bench leaderboard, with open models close behind. LLMs show different creative strengths, ranking higher on some domains (e.g., writing) than others (e.g., scientific ideation). Extensive experiments yield three main findings. First, applying factor analysis across 83 LLMs, we recover a single creativity factor 'c', analogous to the 'g' factor of general intelligence, that explains 81.5% of variance, related to but separable from general knowledge/reasoning. Second, we show that prompting models to "be creative" boosts their performance far more than enabling reasoning, evidence that the benchmark tracks creativity over general ability. Third, on a human-matched subset, we find the top human still leads the top LLM on creativity. We release AGC-Bench with a public leaderboard, AGC-Judge, and human data as open infrastructure for measuring AI creativity at scale.

11. 【2607.01127】$\text{Log}_\text{b}$Quant: Quantizing Language Models in Logarithmic Space

Link: https://arxiv.org/abs/2607.01127

Authors: Jeremias Bohn, Tizian Dippold, Mahdi Koubaa, Elias R. Wahl, Georg Groh

Categories: Computation and Language (cs.CL)

Keywords: modern language models, reduce memory requirements, language models, edge devices, invaluable tool

Comments:

Abstract:Quantization has become an invaluable tool to reduce memory requirements and inference speed of modern language models, in particular to make them available for consumer setups and edge devices. While previous work has primarily focused on uniform quantization codebooks, such approaches are prone to suboptimal representations due to low-frequency high-magnitude weights. We introduce Log$_\text{b}$Quant, a novel logarithmic quantization approach with adjustable bases, to adapt to common parameter distributions. We show that our method exhibits superior performance at 4-bit precision on several performance benchmarks compared to asymmetric linear quantization at tensor-wise granularity, while achieving moderate speedup and high memory savings, making it suitable for private use on consumer-grade GPUs.
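
To make "logarithmic quantization with adjustable bases" concrete, here is a generic toy quantizer that snaps each weight to sign × scale × base^(-k). It is a sketch, not the Log$_\text{b}$Quant codebook: the per-tensor max-abs scale, the dedicated sign bit, and the handling of zeros are all assumptions made for illustration.

```python
import math


def log_quantize(weights, base=2.0, bits=4):
    """Snap each weight to sign * scale * base**(-k) and return the dequantized values.

    Assumed layout: 1 sign bit plus (bits - 1) bits for the exponent index k.
    """
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    if bits < 2:
        raise ValueError("need at least one sign bit and one exponent bit")
    levels = 2 ** (bits - 1)
    scale = max(abs(w) for w in weights)  # tensor-wise scale
    if scale == 0.0:
        return [0.0 for _ in weights]
    quantized = []
    for w in weights:
        if w == 0.0:
            k = levels - 1  # this toy codebook has no exact zero; use the smallest level
        else:
            k = round(-math.log(abs(w) / scale, base))
            k = max(0, min(levels - 1, k))  # clip to the representable exponents
        sign = -1.0 if w < 0 else 1.0
        quantized.append(sign * scale * base ** (-k))
    return quantized


print(log_quantize([1.0, 0.7, 0.26, -0.12, 0.01], base=2.0, bits=4))
print(log_quantize([1.0, 0.7, 0.26, -0.12, 0.01], base=1.5, bits=4))
```

A smaller base packs the levels more densely near the largest magnitudes, while a larger base reaches further toward zero. That trade-off is presumably what an adjustable base lets the method tune to the weight distribution.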

12. 【2607.01115】Towards Developing a Multimodal Chat Assistant for University Stakeholders: RAG-based Approach

Link: https://arxiv.org/abs/2607.01115

Authors: Md Abu Hanif Shaikh, Abdullah Al Shafi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: intelligent support systems, reliable information, developing countries, stakeholders often face, face difficulties

Comments: Accepted at 2025 28th International Conference on Computer and Information Technology (ICCIT)

Abstract:University stakeholders often face difficulties in accessing timely and reliable information, especially in developing countries, where there are very few intelligent support systems. Existing rule-based chatbots are unable to handle complex, domain-specific queries and are not well-equipped to adapt to evolving institutional policies. As a fill-in-the-gap solution, we present the multimodal university chatbot with retrieval-augmented generation. The system combines the large language model with semantic retrieval to produce context-based responses from institution-centric resources, such as the university handbook. The system accepts text and image queries through the vision-language model and applies quantized inference for rapid deployment on constrained hardware. A scalable backend built with FastAPI, adjoined with a responsive frontend developed with this http URL, ensures real-time usability. Our multimodal evaluation demonstrates that the system maintains strong satisfaction scores across both text and image queries, despite increased response time for visual inputs. Furthermore, quantitative evaluation shows that hallucination is reduced from 31.7% to 6.6% in our proposed RAG-based system, confirming the effectiveness of retrieval grounding.

13. 【2607.01104】CausalMix: Data Mixture as Causal Inference for Language Model Training

Link: https://arxiv.org/abs/2607.01104

Authors: Zinan Tang, Yukun Zhang, Shaomian Zheng, Zhuoshi Pan, Qizhi Pei, Dingnan Jin, Jun Zhou, Yujun Wang, Biqing Huang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Model, Large Language, Language Model, data, plays a pivotal

Comments: 22 pages, 3 figures

Abstract:In Large Language Model (LLM) training, data mixing plays a pivotal role in determining model performance. Recent methods optimize mixture weights via proxy models, but they rely on the assumption of static data distributions. As a result, when the underlying data pool shifts, these methods require costly retraining from scratch. This limitation restricts their ability to scale seamlessly from small settings to larger data pools and model sizes. In this paper, we propose CausalMix to address this limitation by casting data mixture optimization as a causal inference problem. We formulate the statistical features of the data pool as covariates and the domain mixture as the treatment. After fitting a causal model on 512 runs of Qwen2.5-0.5B to estimate the Conditional Average Treatment Effect (CATE), we extrapolate the optimal mixture for an 800K data pool and apply it to train a 7B model. Furthermore, we successfully generalize the framework to long chain-of-thought data on Qwen3-4B-Base. By leveraging causal modeling to isolate confounding biases, CausalMix dynamically infers state-dependent optimal data mixtures. Extensive experiments show that the mixture guided by CausalMix consistently improves performance across multiple downstream tasks, outperforming RegMix and other baselines. In addition, we use the CATE Interpreter to provide visual analysis of the learned mixing strategy. Overall, CausalMix offers a causal and interpretable framework for optimizing LLM data mixtures.

14. 【2607.01103】Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking

Link: https://arxiv.org/abs/2607.01103

Authors: William Philipp, Finn Fassbender, Thorsten Langer, Martje Pauly, Rebecca Herzog, Alexander Baumann, Markus Hobert, Theresa Paulus, Ip Chi Wang, Lukas Goede, Johanna Reimer, Sebastian Löns, Ronald Böck, Sebastian Fudickar

Categories: Computation and Language (cs.CL)

Keywords: stronger clinical validity, validity than multiple-choice, creates a scoring, scoring bottleneck, bottleneck that motivates

Comments:

Abstract:Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration and caution, however, remains untested. We introduce MedQADE, the first standardised open-response clinical benchmark for German, a major clinical language lacking native evaluation infrastructure, comprising 3,800 items annotated by ten practising physicians and nine Large Language Model (LLM) evaluators. The top-performing evaluator model, Gemini 3 Flash, reached alignment consistent with the physician ceiling ($\kappa$ = 0.694 vs. $\kappa$ = 0.709), though wide confidence intervals limit interpretation. Despite this statistical alignment, automated evaluators exhibited near-absent clinical metacognition: physicians scaled abstention with item difficulty, while frontier models assigned definitive scores in every case. We additionally quantified systematic lineage-dependent biases, where models preferentially scored architectural siblings, an effect independent of language. These results show that statistical alignment does not ensure clinical caution, and that evaluator independence requires explicit verification.

15. 【2607.01077】Message Passing Enables Efficient Reasoning

Link: https://arxiv.org/abs/2607.01077

Authors: Xuecheng Liu, Daman Arora, Gokul Swamy, Andrea Zanette

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords:

Comments: pre-print

None

16. 【2607.01061】Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

Link: https://arxiv.org/abs/2607.01061

Authors: Daniel Armstrong, Maarten Dobbelaere, Valentas Olikauskas, Helena Avila, Octavian Susanu, Jérôme Waser, Philippe Schwaller

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Computer-assisted synthesis planning, synthesis planning breaks, planning breaks target, breaks target molecules, Computer-assisted synthesis

Comments:

Abstract:Computer-assisted synthesis planning breaks target molecules into accessible precursors using large libraries of reaction rules that assign each transformation a deterministic, interpretable label. But chemistry is long-tailed, making manual encoding intractable, and existing tools rely on fixed rulesets that cannot adapt to new chemistries. Here we present a fully automated pipeline in which a multi-agent framework of large language models (LLMs) classifies reactions and writes the rules themselves across 665,901 US patent reactions, generating each rule under a verification loop that tests it against the corpus. It expands a standard taxonomy from 68 to 14,073 classes without human curation. With a lightweight fingerprint classifier, it classifies 97.7% of unseen reactions, matching a leading proprietary classifier while resolving chemistry more finely and extending on demand to chemistry outside its training distribution. The result is a living reactivity database and a general route to turning generative models into reliable, self-expanding symbolic systems.

17. 【2607.01047】Conversable Complexity: Agentic LLM Collectives as Interpretable Substrates

Link: https://arxiv.org/abs/2607.01047

Authors: Elias Najarro, Ane Espeseth, Eleni Nisioti, Sebastian Risi, Stefano Nichele

Categories: Computation and Language (cs.CL)

Keywords: interpretability rarely coincide, rarely coincide, systems rich, opaque to question, complex to emerge

Comments:

Abstract:Complexity and interpretability rarely coincide: systems rich enough for complex behaviours to emerge are usually too opaque to question, while transparent ones are too simple for anything complex to emerge. A single large language model (LLM) is a static artefact, hardly exhibiting any of the emergent properties we associate with life. This changes through interaction: populations of LLMs display emergent dynamics absent from isolated models. Furthermore, LLMs can be endowed with persistent memory, tools and shared skills, and the capacity to initiate actions unprompted, i.e., turning LLMs agentic. In this paper, we argue that such collectives of agents can serve as a computational substrate for Artificial Life (ALife) research. Critically, since the agents communicate in natural language, their collective behaviour can be directly interrogated by examining textual traces and asking the agents themselves. We outline the notion of interpretability in language-model research and extend it for collectives of agents. Lastly, we survey recent examples of agentic LLM collectives that already instantiate the idea of agentic substrates, from controlled experiments to deployments in the wild.

18. 【2607.01034】Behavior-Adaptive Conversational Agents: Toward a Fluid Personality Framework

Link: https://arxiv.org/abs/2607.01034

Authors: Hasibur Rahman, Smit Desai

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Large language model, AI-mediated behavior change, based conversational agents, Large language, language model

Comments: Presented at Bridging AI and Behavior Change, a Bridge Program organized at the AAAI Conference on Artificial Intelligence 2026 (AAAI-2026)

Abstract:Large language model (LLM)-based conversational agents (CAs) are now ubiquitous, creating new opportunities for AI-mediated behavior change. Their capacity to project nuanced personalities and adopt diverse metaphorical roles raises a design question: how should an agent's persona and personality be calibrated to the moment? Recent evidence suggests that (i) moderate personality expression outperforms low or high extremes on trust, enjoyment, and intention to adopt in goal-oriented tasks, and (ii) context-appropriate metaphors outperform static one-note assistants on user experience and uptake. Yet most CAs still fix both persona and style, risking misalignment when dynamics, urgency, and formality vary, for example in medical information seeking, fitness coaching, and reflective learning. We propose a Fluid Personality Framework that jointly adapts (1) the agent's metaphorical persona, such as coach, tutor, librarian, or tool, and (2) its personality expression intensity, low, medium, or high, as a function of task context, user goals and traits, and situational urgency. We sketch the framework and its core design dimensions.

19. 【2607.01023】Evidence-Supported Credit Risk Report Generation Using News-Centric Financial Knowledge Graphs

Link: https://arxiv.org/abs/2607.01023

Authors: Rocio Jimenez-Villen, Ziwei Xu, Ying Chen, Oscar Araque, Ryutaro Ichise

Categories: Computation and Language (cs.CL)

Keywords: real-world events reported, Financial markets evolve, implicit in text, evolve in response, response to real-world

Comments: 15 pages, 5 figures, extended version of paper accepted at DEXA 2026

Abstract:Financial markets evolve in response to real-world events reported in news, yet these drivers often remain implicit in text. To better explain market dynamics, event-market relations must be explicitly modeled through factual, company-centric, and environment-aware knowledge graphs. We present FinKG-News, a framework that automatically constructs such graphs by extracting news events as anchors linked to companies. Using FinKG-News as grounded evidence that integrates events, news, and company data, we develop an in-context learning architecture for credit risk report generation across three core financial dimensions. Automatic and human evaluations show that automated hallucination detection and quality assessment remain unreliable, making expert judgment indispensable. Our approach consistently outperforms baselines, improving quality by 19%-34% while reducing hallucinations. The source code and project resources are publicly available at: this https URL.

20. 【2607.01018】Reading Order Inference for Complex Document Layouts

Link: https://arxiv.org/abs/2607.01018

Authors: Iddo Hakim, Sharva Gogawale, Omer Ventura, Gal Grudka, Daria Vasyutinsky-Shapira, Berat Kurar-Barakat, Nachum Dershowitz

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

Keywords: interleaved reading streams, spatially interleaved reading, multiple spatially interleaved, Reading order, complex historical manuscripts

Abstract:Reading order inference remains a critical bottleneck in the digitization of complex historical manuscripts, where pages contain multiple spatially interleaved reading streams, the canonical example being the Glossa Ordinaria layout, in which a central text is surrounded by commentaries that wrap around it in non-rectangular, non-convex regions. We present a training-free, graph-based framework: each OCR text line becomes a node in a directed candidate-transition graph, edges are scored by a weighted additive ensemble of two lightweight language-model signals (causal language model conditional likelihood and BERT next-sentence prediction, NSP; a third sentence-embedding signal was evaluated but did not improve reading order), and the global reading order is recovered as a degree-constrained directed path cover. To avoid the cascading "edge-theft" failures of greedy edge selection, we propose a max-regret inference rule that prioritizes commitments with high opportunity cost. We evaluate on synthetic Glossa Ordinaria grid layouts, on 23 ALTO page geometries (10 historical source pages plus mirrored and flipped variants), and on a 140-page multi-column English subset of OmniDocBench, comparing our method against the canonical recursive XY-cut (PaddleOCR PP-StructureV3) and two LayoutReader variants (layout-only and text+layout) on identical inputs. On wrap-around Glossa layouts our method recovers 95% of ground-truth successor edges on average vs. XY-cut's 50%; on the OmniDocBench multi-column subset it reaches 88% macro edge accuracy versus XY-cut's 75% and LayoutReader's 25%. The LayoutReader baselines transfer poorly due to a word-level vs. line-level granularity mismatch. We additionally verify mirror-invariance under horizontal and vertical page reflections: Our method changes by less than 1 percentage point, classical XY-cut by 2 points, and LayoutReader-T by up to 8 points.
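
The contrast between greedy edge selection and a regret-based rule can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `floor` parameter, and the toy scores are assumptions made for illustration, and "regret" is taken here to mean the gap between a node's best and second-best available outgoing edge, which is one common reading of the idea.

```python
def _closes_cycle(succ, src, dst):
    """Return True if adding src -> dst would close a cycle."""
    node = dst
    while node != src:
        if node not in succ:
            return False
        node = succ[node]
    return True


def _allowed(succ, has_pred, src, dst):
    # Degree constraints: one successor per source, one predecessor per target.
    return (src not in succ and dst not in has_pred
            and not _closes_cycle(succ, src, dst))


def greedy_path_cover(scores):
    """Baseline: commit edges in descending score order."""
    succ, has_pred = {}, set()
    for (src, dst), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if _allowed(succ, has_pred, src, dst):
            succ[src] = dst
            has_pred.add(dst)
    return succ


def max_regret_path_cover(scores, floor=0.0):
    """At each step, commit the source node whose best edge beats its
    runner-up by the largest margin (hypothetical regret definition)."""
    succ, has_pred = {}, set()
    while True:
        best = None  # (regret, top_score, src, dst)
        for src in {s for s, _ in scores}:
            options = sorted(
                ((sc, d) for (s, d), sc in scores.items()
                 if s == src and _allowed(succ, has_pred, s, d)),
                reverse=True,
            )
            if not options:
                continue
            runner_up = options[1][0] if len(options) > 1 else floor
            candidate = (options[0][0] - runner_up, options[0][0],
                         src, options[0][1])
            if best is None or candidate > best:
                best = candidate
        if best is None:
            return succ
        _, _, src, dst = best
        succ[src] = dst
        has_pred.add(dst)


# Toy scores: line A slightly prefers X, but line B has no good alternative.
SCORES = {("A", "X"): 0.9, ("A", "Y"): 0.8, ("B", "X"): 0.85, ("B", "Y"): 0.1}
print(greedy_path_cover(SCORES))
print(max_regret_path_cover(SCORES))
```

In this toy case the greedy rule gives X to A and leaves B with its 0.1 edge, whereas the regret rule commits B to X first because B would lose far more by waiting, and A still receives a strong successor.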

21. 【2607.01006】Understanding Large Language Models

Link: https://arxiv.org/abs/2607.01006

Authors: Yannik Keller, Thomas Eisenmann

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, natural language processing, natural language, Language Models

Comments: 25 pages, 1 figure

Abstract:Large Language Models (LLMs) represent one of the most significant advances in AI and natural language processing in recent years. Still, many pressing questions about their mechanisms, capabilities, and relationship to human cognition remain highly debated. This chapter aims to outline our current understanding of LLMs by discussing recent evidence on emerging capabilities and their mechanistic implementation within processing layers. We begin with a concise overview of the Transformer architecture, emphasizing how the attention mechanism enables training on massive datasets, allowing LLMs to function as generalist rather than specialized models. Next, we examine emergent LLM capabilities that appear to resemble aspects of human cognition, including symbolic reasoning, theory of mind, and deception strategies. Several studies provide evidence that LLMs can solve tasks previously thought to require human-like cognition. Other studies reveal insightful failure cases that shed light on the differences between human and LLM cognition. Alongside these findings, we review explainable AI approaches ranging from neuron activation analysis to circuit tracing. In the final section, we address current debates concerning what LLMs genuinely understand versus what they merely appear to understand. Prominent arguments against AI anthropomorphism point to the simplicity of LLM training objectives, claiming that LLM behavior is better explained by pattern memorization of training data than by genuine cognition. We argue that this standpoint is guided by misconceptions about optimization processes and cognitive capacity, and advocate for a more nuanced discussion of LLM cognition that neither dismisses the differences between humans and LLMs nor precludes the possibility of AI cognition through overly simplistic reductionist arguments.

22. 【2607.01002】Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads

Link: https://arxiv.org/abs/2607.01002

Authors: Aryo Pradipta Gema, Beatrice Alex, Pasquale Minervini

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: frequently synthesize answers, relevant context span, large language models, language models frequently, models frequently synthesize

Comments: 41 pages, 18 figures

Abstract:In long-context use, large language models frequently synthesize answers from the meaning of a relevant context span rather than literally copy-pasting them. Identifying which attention heads perform this synthesis matters for interpreting long-context model behavior. Yet existing detectors miss these heads by construction: they reward heads whose attended token matches the generated token, a literal-copy criterion that captures where a head reads but not what it writes through its output-value (OV) circuit, the very mechanism that carries non-literal retrieval. We introduce Logit-Contribution Scoring (LOCOS), a write-aware detector that scores each head by the projection of its OV-circuit output onto the answer-token unembedding direction, contrasting needle and off-needle source positions in a single forward pass. Across three model families (Qwen3, Gemma-3, OLMo-3.1), mean-ablating the top LOCOS heads on the NoLiMa non-literal retrieval benchmark collapses ROUGE-L at lower head counts than prior attention-based detections; on Qwen3-8B, ablating 50 heads drives ROUGE-L from 0.401 to 0.000 while the strongest baseline still retains 0.292. The selected heads are retrieval-specific: parametric recall and arithmetic reasoning stay at baseline under the same ablation. On Qwen3-8B, the same ablation also drops MuSiQue from 0.55 to 0.08 and BABI-Long from 0.62 to 0.20, while a random-heads control stays within 0.05 of baseline.
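
To make the "read versus write" distinction concrete, here is a minimal sketch of a write-aware head score. It is an illustration under stated assumptions, not the LOCOS implementation: the function name, the argument layout, and the choice to contrast needle and off-needle positions by simple subtraction of summed contributions are all hypothetical, and the per-position OV outputs are assumed to be precomputed vectors.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def logit_contribution_score(attn, ov_out, unembed, needle):
    """Score one head at one query position.

    attn    : attention weight on each source position
    ov_out  : OV-circuit output vector for each source position (assumed given)
    unembed : unembedding direction of the answer token
    needle  : set of source positions that belong to the needle span
    """
    needle_part, off_part = 0.0, 0.0
    for pos, (weight, vec) in enumerate(zip(attn, ov_out)):
        contribution = weight * dot(vec, unembed)  # logit mass written from pos
        if pos in needle:
            needle_part += contribution
        else:
            off_part += contribution
    # Assumed contrast: needle contribution minus off-needle contribution.
    return needle_part - off_part


UNEMBED = [1.0, 0.0]   # toy answer-token direction
NEEDLE = {1}           # position 1 holds the needle

# Head A looks hard at the needle but writes orthogonally to the answer.
HEAD_A = ([0.1, 0.8, 0.1], [[0.0, 1.0], [0.0, 2.0], [0.0, 1.0]])
# Head B attends less sharply but writes along the answer direction.
HEAD_B = ([0.25, 0.5, 0.25], [[0.0, 0.0], [4.0, 0.0], [1.0, 0.0]])

print(logit_contribution_score(*HEAD_A, UNEMBED, NEEDLE))
print(logit_contribution_score(*HEAD_B, UNEMBED, NEEDLE))
```

A detector based only on attention weights would rank head A first because it places 0.8 of its attention on the needle, while the write-aware score ranks head B first because only B moves the answer-token logit.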

23. 【2607.01000】KnowledgeDebugger -- an Exploration Tool for Knowledge Localization and Editing in Transformers

Link: https://arxiv.org/abs/2607.01000

Authors: Eric Benz, Lennart Stöpler, Nikolai Bolik, Artur Andrzejak

Categories: Computation and Language (cs.CL)

Keywords: increasingly focused, focused on understanding, store and process, Transformers store, process knowledge

Abstract:Recent research has increasingly focused on understanding how Transformers store and process knowledge, as well as how this knowledge can be edited. Research work in this area is often conducted in two phases: first, phenomena are explored on individual samples. Then, when results appear promising, more statistically robust experiments follow. To support the first phase, we propose KnowledgeDebugger, a GUI-based exploration tool for knowledge localization and editing in Transformers. Our tool - inspired by LM-Debugger - offers no-code access to the methods in EasyEdit, a widely used library of state-of-the-art Knowledge Editing approaches. We demonstrate the tool's effectiveness through case studies of recent findings in this field.

24. 【2607.00970】Svarna: An Open Corpus Workbench for Modern Greek

Link: https://arxiv.org/abs/2607.00970

Authors: Stergios Chatzikyriakidis

Categories: Computation and Language (cs.CL)

Keywords: paper introduces Svarna, web-based corpus workbench, paper introduces, workbench for modern, modern Greek

Abstract:This paper introduces Svarna, a free, open-source, web-based corpus workbench for modern Greek. Svarna integrates five databases covering various registers, institutional, literary, dialectal, social media, and historical, to provide a total of more than 507 million words and around 29 million sentences. This platform addresses the chronic gaps in Greek language technology. Although various corpus resources exist, they are scattered across different platforms, and in many cases, institutional access is restricted or they are no longer available online. Svarna integrates these resources into a single interface that can be used without logging in, installation, or specialized training. This system provides a concordancer with KWIC marking capabilities, frequency analysis including register-by-register normalization, collocation extraction using mutual information, a dictionary of 93 Greek discourse markers providing distribution profiles, text-level analysis tools including n-grams, variants, and collocation networks, register comparison using log-ratio, regular expression search, and an optional LLM layer for pragmatic annotation and free research mode. This platform is built upon SQLite FTS5 full-text indexes provided via a FastAPI backend, deployed as Docker containers on Azure, and released under the MIT license. Source code, build scripts, and deployment configurations are publicly available on GitHub. Users can add their own corpora and deploy their own instances. This document describes the system design, corpus structure, and use cases demonstrating the various queries supported by the platform. Svarna serves as the first step in exploring available data and is expected to lay the foundation for more comprehensive research in the future.
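
For readers unfamiliar with the statistics named in this abstract, the sketch below shows the textbook definitions of pointwise mutual information for collocations, per-million frequency normalization, and the log-ratio used to compare registers. The function names and toy counts are illustrative assumptions; the abstract does not state which window size, smoothing, or thresholds Svarna applies.

```python
import math


def pmi(pair_count, x_count, y_count, total_tokens):
    """Pointwise mutual information, log2(p(x, y) / (p(x) * p(y)))."""
    return math.log2(pair_count * total_tokens / (x_count * y_count))


def per_million(freq, corpus_size):
    """Normalized frequency so that registers of different size compare fairly."""
    return freq * 1_000_000 / corpus_size


def log_ratio(freq_a, size_a, freq_b, size_b):
    """Binary log of the ratio of relative frequencies in two registers."""
    return math.log2((freq_a / size_a) / (freq_b / size_b))


# Toy counts (hypothetical): a word pair seen 8 times in a 1,024-token corpus,
# where the two words occur 16 and 32 times on their own.
print(pmi(8, 16, 32, 1024))

# A marker seen 64 times in a 1,024-token register and 8 times in a 512-token one.
print(log_ratio(64, 1024, 8, 512))
```

A log-ratio of 2 means the item is four times as frequent, relative to corpus size, in the first register as in the second; each additional point doubles the difference.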

25. 【2607.00968】Quantifying the Affective Gap: A Zero-Shot Evaluation of LLMs on Fine-Grained Emotion Taxonomies

Link: https://arxiv.org/abs/2607.00968

Authors: Lawrence Obiuwevwi, Krzysztof J. Rechowicz, Jessica M. Johnson, Vikas Ashok, Sachin Shetty, Sampath Jayarathna

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: mental health support, affective computing, human-computer interaction, mental health, health support

Comments: in Proc. 27th IEEE Int. Conf. (IRI'2026)

Abstract: Emotion recognition in natural language is a foundational challenge in affective computing, with critical implications for human-computer interaction, mental health support, and conversational AI. This paper presents a rigorous, unified zero-shot evaluation of three leading commercial large language models: Claude (claude-sonnet-4-6), ChatGPT (GPT-5.4), and Gemini (gemini-2.5-flash). The models were queried through their respective production APIs as of April 2026 on a fine-grained 13-class emotion classification task. Using a stratified 1,000-sentence sample from the boltuix/emotions dataset, which comprises 131,306 sentences across 13 categories, a single uniform prompt with no exemplars was applied identically across all models. Gemini achieves the highest accuracy (39.9%) and macro-F1 score (0.363), followed by GPT-5.4 (38.8%, macro-F1 = 0.291) and Claude (38.0%, macro-F1 = 0.159). All models excel on sarcasm and desire while consistently failing on love, confusion, and shame. McNemar tests reveal no statistically significant pairwise differences (p > 0.10), suggesting convergence at a shared zero-shot ceiling. Claude's markedly lower macro-F1 score exposes a class-imbalance prediction bias. These findings highlight the current limitations of frontier AI systems in zero-shot fine-grained emotion classification.
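
The gap between similar accuracies and very different macro-F1 scores can be reproduced on toy data. The sketch below is only an illustration: the labels, predictions, and function names are invented, and it does not use the paper's dataset or outputs. It shows that a classifier which collapses its predictions onto one class can match the accuracy of a more balanced classifier while scoring much lower on macro-F1.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)


# Balanced toy gold labels: four items per class.
GOLD = ["joy"] * 4 + ["fear"] * 4 + ["shame"] * 4

# Hypothetical model that predicts a single class for every input.
COLLAPSED = ["joy"] * 12

# Hypothetical model with the same number of hits, spread over all classes.
SPREAD = ["joy", "joy", "fear", "shame",
          "fear", "joy", "shame", "shame",
          "shame", "joy", "fear", "fear"]

for name, pred in [("collapsed", COLLAPSED), ("spread", SPREAD)]:
    print(name, round(accuracy(GOLD, pred), 3), round(macro_f1(GOLD, pred), 3))
```

Both toy models get 4 of 12 items right, but the collapsed model earns zero F1 on two of the three classes, which halves its macro-F1 relative to the spread model.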

26. 【2607.00937】Persona Non Grata: LLM Persona-Driven Generations in MCQA are Unstable in Distinct Dimensions

Link: https://arxiv.org/abs/2607.00937

Authors: César Guerra-Solano, Xiang Lorraine Li

Categories: Computation and Language (cs.CL)

Keywords: Persona-driven generations, large language model, industry applications, research and industry, large language

Comments: 23 pages, 12 figures. Under review at ARR

Abstract: Persona-driven generations (PDGs) have seen prolific use in research and industry applications, where a large language model (LLM) takes on a 'persona' while completing some task. While substantial work investigates the stability or consistency of persona expressed through free-form text (like dialogue), persona expressed in non-text-heavy outputs (like multiple-choice question answering, or MCQA) is comparatively overlooked. We work to address this gap, seeking to understand the instability of LLM PDGs in MCQA tasks. We develop three metrics investigating the performance, outcome, and question correctness stability, evaluating three distinct dimensions. Using these metrics, we find that instability varies consistently between model families and model size, and across question domains, with math/commonsense questions leading to greater instability. We also find task prompt format introduces more prediction instability than other hyperparameters, like temperature. Finally, we find that instability is related to task accuracy, and using our instability metrics, find different experimental settings that result in different best and worst personas for tasks, despite their similarity. This reveals the importance of checking hyperparameter instability in PDGs.

27. 【2607.00924】Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination

Link: https://arxiv.org/abs/2607.00924

Authors: Subhadeep Pal, Shashwat Sourav, Tirthankar Ghosal, Markus J. Buehler

Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Accelerating materials discovery, generate scientifically valid, scientifically valid hypotheses, materials discovery requires, Relative Policy Optimization

Abstract:Accelerating materials discovery requires AI systems that can generate scientifically valid hypotheses through multi-step, domain-grounded reasoning. Standard large language models often produce fluent but weakly traceable responses to open-ended materials design problems, making it difficult to determine whether final answers are supported by coherent intermediate reasoning. We develop Graph-PRefLexOR, a family of graph-native reasoning models fine-tuned with Group Relative Policy Optimization (GRPO) to organize reasoning into explicit phases for mechanism exploration, graph construction, pattern extraction, and hypothesis synthesis. This design links neural language generation with symbolic relational structure, enabling causal connections to be constructed, inspected, and reused. On 100 open-ended questions from materials science and mechanics literature, Graph-PRefLexOR achieves 40-65% improvements over corresponding base models, with the largest gains in reasoning traceability. Embedding analyses show broader semantic exploration and approximately 2-3 times greater semantic diversity than baselines. Semantic backtracking and layer-wise hidden-state analyses further show stronger alignment between structured reasoning and final answers. Finally, test-time graph expansion reveals that additional compute primarily increases long-range conceptual recombination within a bounded semantic space, rather than simply expanding semantic coverage. These results establish graph-native reinforcement learning as a pathway toward interpretable AI systems for scientific hypothesis generation in materials design and other scientific applications.

28. 【2607.00918】From Personas to Plot: Character-Grounded Multi-Agent Story Generation for Long-Form Narratives

Link: https://arxiv.org/abs/2607.00918

Authors: Aayush Aluru, Chloe Ho, Muhammad Hammouri, Kerry Luo, Myra Malik, Ryan Lagasse, Arjun Bahuguna, Vasu Sharma

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: demonstrated impressive creative, impressive creative fiction, creative fiction generation, maintain narrative consistency, long-form narrative generation

Abstract: Although large language models (LLMs) have demonstrated impressive creative fiction generation, they struggle to maintain narrative consistency and coherent plot lines in long-form stories. In this work, we introduce a unified framework for long-form narrative generation and verification. MAGNET, a multi-agent goal-driven narrative engine for storytelling, generates stories with persona-grounded character agents that propose actions based on a shared world state and evolving story goals, while ATLAS is a graph-based pipeline that compares scene-level world representations across a generated story to detect hallucinations. By evaluating MAGNET using an LLM editor, pairwise rubric scoring, and ATLAS, we show that our framework produces more coherent narratives than single-model prompting and IBSEN. At 100 pages, MAGNET reduced annotations and hallucinations by 41% and 50%, respectively, compared to the single-model baseline and by 34% and 45%, respectively, compared to IBSEN, with pairwise rubric evaluation showing similar results. These results suggest that long-form narratives can emerge from explicit world-state tracking and goal-driven multi-agent generation, providing a foundation for controllable and structurally coherent long-form narrative generation.

29. 【2607.00895】Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents

Link: https://arxiv.org/abs/2607.00895

Authors: Ádám Kovács, Bowei He, Xue Liu, István Boros, Szilveszter Tóth, Gábor Recski

Categories: Computation and Language (cs.CL)

Keywords: natural-language document evidence, document evidence, retrieval-augmented generation, Hallucination detection, natural-language RAG datasets

Comments: 8 pages

Abstract:Hallucination detection for retrieval-augmented generation (RAG) is usually evaluated on natural-language document evidence. However, grounded generation systems increasingly rely on structured inputs: source code, developer-tool output, markdown documents, tables, and repository metadata. We introduce a unified benchmark for span-level hallucination detection over code, tool output, structured documents, and existing natural-language RAG datasets. The benchmark is built by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split with evidence-based review. Our fine-tuned Qwen3.5-2B detector reaches 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, where it substantially outperforms LettuceDetect-large (0.17) and the strongest zero-shot LLM judges we evaluated (at most 0.22). The same model remains competitive on established natural-language benchmarks, with 81.8 RAGTruth example-F1 and 0.724 English PsiloQA IoU.

30. 【2607.00890】MultiSynt/MT: Trillion-Token Multi-Parallel Pre-Training Data Translated Across 36 Languages

Link: https://arxiv.org/abs/2607.00890

Authors: Maximilian Idahl, Jörg Tiedemann, Sampo Pyysalo, David Salinas, Tomasz Galica, Shenbin Qian, Tudor Nicolae Mateiu, Zihao Li, Anna Lokrantz, Fedor Vitiugin, André F. T. Martins, Jenna Kanerva, Filip Ginter, Matthias Lindemann, Tim Isbister, Birger Moell, Jonas Lindh, Jan Hajič, Jenia Jitsev, Andrey Kutuzov, Stephan Oepen, Gema Ramírez-Sánchez

Categories: Computation and Language (cs.CL)

Keywords: concentrated in English, corpora remain concentrated, web-scale pre-training corpora, Open web-scale pre-training, European languages

Abstract:Open web-scale pre-training corpora remain concentrated in English, limiting multilingual LLM development. We introduce MultiSynt/MT, an open synthetic parallel corpus with approximately 4.8 trillion target-language tokens across 36 European languages, produced by translating 100 billion high-quality Nemotron-CC tokens with Tower+ and OPUS-MT/HPLT-MT systems. For many medium- and lower-resource European languages, this is the largest openly available pre-training resource. On a broad multilingual benchmark suite, reference LLMs trained on MultiSynt/MT reach the final score of HPLT 2.0, a native-data baseline, using roughly 72% fewer pre-training tokens, and outperform it by approximately 15% relative at a matched 100B-token training budget. Our analyses also identify evaluation blind spots: standard multiple-choice benchmarks miss translation-quality differences that a fluency-sensitive LLM-as-judge evaluation cleanly recovers on the trained LLMs (with no fluency deficit in MultiSynt itself), and Norwegian idiomatic and culturally grounded tasks remain better served by native data. We release the corpus, including row-aligned translations from multiple systems, to support controlled research on multilingual pre-training data and evaluation.

31. 【2607.00873】How Ethos and Pathos Appeals Resonate in Reader Interpretations of Social Media Messages

Link: https://arxiv.org/abs/2607.00873

Authors: Ewelina Gajewska, Katarzyna Budzynska, Jaroslaw Chudziak, Liesbeth Allein

Categories: Computation and Language (cs.CL)

Keywords: posts and comments, social media posts, Rhetorical strategies, ethos and pathos, media posts

Comments: The article has been accepted to the 27th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) that will be held in Atlanta, Georgia on August 2-5, 2026. The official version will appear in the conference proceedings

Abstract:Rhetorical strategies and their influence on audiences are often studied through social media posts and comments. However, this focus overlooks the universal audience, which is the majority of readers who remain silent and do not explicitly express how a message affects them. This study investigates how two classical modes of persuasion, ethos and pathos, resonate in the silent audience's interpretations of meaning. Using a dataset of social media sentences paired with human-written interpretations, we label both sources for ethos and pathos and assess whether these rhetorical appeals are preserved. Our analyses show that interpretations diverge from the original sentences in 30% of cases, with rhetorically charged content eliciting greater variability than neutral content. We further find that ethos and pathos in original sentences can predict audience attitudes toward the author, underscoring the subtle ways rhetoric shapes perception beyond visible engagement.

32. 【2607.00871】Self-Evolving Agents with Anytime-Valid Certificates

Link: https://arxiv.org/abs/2607.00871

Authors: Biswa Sengupta

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Self-evolving agents violate, Self-evolving agents, policy being updated, agents violate, violate the assumption

Abstract: Self-evolving agents violate the assumption behind most learning-theoretic guarantees: the data, evaluator, components, and hypothesis space are produced by the policy being updated. We present SEA, an architecture that confines self-modification to a small steering adapter and a versioned harness around a frozen base model and admits each modification only through an anytime-valid gate that emits an auditable certificate against a fixed error budget. Five loop controllers compose published guarantees; because such gates can only select among behaviors the frozen base already produces, five verifier-in-the-loop mechanisms (best-of-N, micro-step search, self-authored reproduction oracles, search-layer control, and self-repair) supply the dense, grader-free signal the gates require, computed from the issue text alone. On a 52-instance SWE-bench Verified subset across four base models, base capability is the dominant, confound-free effect, and on two strong base models a deliberate no-op-composite control isolates the suite's contribution at +4 and +5 (GLM 5.2: 24 to 28; GPT: 29 to 34, the 65% best), with event logs confirming that its mechanisms fire and prevent regressions. Results are single-run on expensive evaluations; confirming run-to-run variance and adapting the per-task algorithm mix are future work.

33. 【2607.00870】Dynamic Bidirectional Pattern Memory: A Production-Scale Empirical Characterisation of Inference-Time Gating in Clinical NLP

Link: https://arxiv.org/abs/2607.00870

Authors: Ali H. Lazem, William Teahan

Categories: Computation and Language (cs.CL)

Keywords: study inference-time pattern-memory, inference-time pattern-memory gating, natural language processing, language processing, study inference-time

Abstract:We study inference-time pattern-memory gating in a production-scale clinical natural language processing (NLP) pipeline. The pipeline pairs a generator (Llama-3.3 70B) proposing extractions with a verifier (MMed-Llama-3.1 70B) accepting or rejecting them, over 167,034 PMC-Patients narratives, and adds a lightweight memory that learns at deployment which extractions to filter, so the verifier need not re-examine candidates already seen to fail. We report four findings. First, learning filtering rules directly from the verifier's rejections failed at full scale: the relation-extraction filter stayed empty despite 785,797 logged rejections, because they were spread too thinly across too many distinct forms to accumulate. Second, a simpler rule using a fixed clinical ontology produced the same filtering without the verifier, capturing 49,734 ontology-violating relations on a held-out 5,000-patient set. Third, of five versions of the question-answering filter, four failed for distinct, instructive reasons; the fifth succeeded by checking whether a patient's extracted entities support the question asked, and where it applies was 1.84 times likelier to flag an answer the verifier would reject than one it would accept. Fourth, one pattern held across all five: a filter is selective only when it tests the same evidence the verifier weighs, not when it imitates the verifier's output. Together these give a transferable result for any generator-verifier pipeline: the most natural memory design can fail silently at scale, and whether a pre-generation gate is selective is decided before any engineering effort, by whether its signal probes the question the verifier itself answers. Throughout, the system flags suspect extractions rather than deleting them, so every decision stays visible for clinical review. All code and test artefacts are released openly.

34. 【2607.00862】CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models

Link: https://arxiv.org/abs/2607.00862

Authors: Qizhi Jiang, Shuo Wang, Pei Ke, Yuhang Song, Ke Qin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: reduced inference efficiency, achieved remarkable success, frequently exhibit overthinking, significant token overhead, Large Reasoning Models

Comments: Accepted at ACL 2026 Industry Track

Abstract:Large Reasoning Models (LRMs) have achieved remarkable success on complex tasks by leveraging long chain-of-thought (CoT) trajectories, yet they frequently exhibit overthinking on simple queries, resulting in significant token overhead and reduced inference efficiency. However, existing compression methods predominantly apply uniform length reduction or rely on coarse-grained difficulty estimation, often leading to performance degradation on difficult problems. To address this limitation, we propose Confidence-Adaptive Thinking (CAT), a framework that incorporates the model's intrinsic self-certainty signals as confidence into the preference optimization process, which autonomously modulates reasoning lengths based on problem difficulty. Experimental results show that CAT consistently outperforms state-of-the-art baselines on reasoning accuracy across multiple benchmarks on different base models. Our work enables LRMs to effectively compress confident responses while deliberating on uncertain ones, offering a potentially robust solution for balancing accuracy and latency in practical industrial scenarios.

35. 【2607.00852】Recovering Input Text from Hidden States: Study of Gradient-Based Inversion of Decoder-Only Language Models

Link: https://arxiv.org/abs/2607.00852

Authors: Mikołaj Słowikowski, Maciej Witold Majewski

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: decoder-only language model, hidden-state inversion problem, input token sequence, original input token, work studies

Abstract:This work studies the hidden-state inversion problem: recovering the original input token sequence of a decoder-only language model from its last-layer hidden states. Rather than treating inversion as a one-shot reconstruction, we study it as a continuous embedding-space optimisation in which a soft proxy is driven towards the leaked target without any hard-token projection during the search, and a token is committed only once, at the end of the inner loop. This design choice has two consequences which are the main focus of this paper. First, keeping the optimisation entirely in continuous space exposes a rich set of internal signals: rank trajectories of the ground-truth token, per-position loss curves, and a discrete loss measured at commit time. Second, the discrete loss allows assessing the correctness of recovery via cumulative discrete loss. We further analyse which tokens break the reconstructions and find a sharp categorical asymmetry: space-prefixed, high-frequency function words in dense regions of the embedding matrix dominate the failures, while content-bearing tokens are recovered almost perfectly. On 10-token C4 prompts the exact-match rate rises from 66.9% to 97.5% (mean similarity 0.994) as the candidate window is widened, confirming that most errors are recoverable near-misses rather than genuine ambiguities. A comparison with the released SIPIT reference situates these findings: per-step hard projection is faster, but the continuous formulation is what makes the optimisation observable and its failures detectable. The results show that last-layer hidden states of GPT-2 are as sensitive as the original text.

36. 【2607.00849】The Course of News Events: A Comparison of Bottom-Up and Top-Down Approaches for Collecting Text-Based Data about Disasters

Link: https://arxiv.org/abs/2607.00849

Authors: Brielen Madureira, Andreas Niekler, Mariana Madruga de Brito

Categories: Computation and Language (cs.CL)

Keywords: impacts and adaptation, important source, source of information, disaster impacts

Comments: work in progress

Abstract: News articles are an important source of information on disaster impacts and adaptation. A key methodological challenge in socio-environmental studies is how to select a representative data sample. Two approaches are common: querying news databases top-down with the aid of an existing disaster inventory or using NLP methods to cluster news texts bottom-up based on temporal and spatial features. Using a dataset of German news about landslides worldwide, we compare these approaches and discuss variations in event coverage. Such research design decisions can influence the resulting news sample, affecting its use in studies of inequality in media coverage, disaster monitoring and inventory enrichment.

37. 【2607.00848】MetaHOPE: A Metaphor-Oriented Evaluation Framework for Analysing MT and LLM Translation Errors

Link: https://arxiv.org/abs/2607.00848

Authors: Jiahui Liang, Lifeng Han

Categories: Computation and Language (cs.CL)

Keywords: evaluating metaphor translations, NLP models, NLP models perform, NLP, severity-aware annotation framework

Abstract:In this opinion paper, we propose MetaHOPE, an error severity-aware annotation framework for evaluating metaphor translations. Metaphors present challenges for machine translation (MT) and natural language understanding and processing (NLU, NLP), because they exhibit semantic complexity, contextual dependency, and cultural embedding, which can lead to ambiguity for NLP models. To investigate how state-of-the-art NLP models perform on translating metaphors, we select three representative systems, i.e., GoogleMT, GPT5.4, and Hunyuan-7b as Neural MT (NMT) models and LLMs. We use two human-annotated metaphor corpora, VUAMC and PSUCMC, for English-to-Chinese and Chinese-to-English translation. The original corpora are monolingual; we carried out error annotation on them using the MetaHOPE framework and also produced human post-edited gold references for bilingual use as a new resource. We believe the MetaHOPE evaluation framework for metaphor translation annotation, the parallel corpora resources, and the error analysis on SOTA automatic translation models can be useful and shed some light on the field of metaphor translation study. We will share our resources publicly upon paper acceptance.

38. 【2607.00725】What Survives Into Context: A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It

Link: https://arxiv.org/abs/2607.00725

Authors: Ananto Nayan Bala

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Retrieval-augmented generation, selection problem, forces a selection, fixed reader-context budget, reader-context budget forces

Notes: 12 pages, 5 figures

Abstract:Retrieval-augmented generation (RAG) under a fixed reader-context budget forces a selection problem: of the evidence retrieved, only a fraction can be shown to the reader. We argue that document recall -- the standard retrieval metric -- is the wrong quantity to optimize in this regime, and we make two contributions. First, as a general contribution, we introduce answer-in-context, a diagnostic that measures whether a gold answer survives as a contiguous span in the packed reader context (not the retrieved set). It predicts answer F1 better than recall (r=0.39-0.55 vs. about 0.31), separates answer quality roughly five-fold (0.60 vs. 0.12 on HotpotQA), and carries information beyond retrieval: it adds Delta R squared=0.17 over recall and shows a 4.6x EM gap even among questions where all gold was retrieved. We also confirm it interventionally: on 2WikiMultiHopQA a packing change that raises coverage but not answer-in-context yields no accuracy gain. Second, as a conditional contribution, we cast reader-context construction as budgeted monotone submodular maximization and build a packer that jointly optimizes relevance, query coverage, representativeness, and diversity. On HotpotQA with a 160-token budget and a 3B reader it beats a strong focused heuristic, MMR, and naive packing -- by up to +5.1 F1 at equal-or-lower token cost, across three seeds. Crucially, we map the scope of this win honestly: it requires the conjunction of (i) multi-hop complementary structure, (ii) retrieval that surfaces the evidence, (iii) a binding but not extreme budget, and (iv) a reader weak enough that evidence density, not reading capacity, is the bottleneck. A quantization-controlled reader-scale ladder (3B to 7B to 14B) shows the edge over the heuristic is absorbed by 7B and significantly reverses by 14B, while the diagnostic explains every boundary with a single variable.
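
The answer-in-context diagnostic described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the SQuAD-style normalization, the whitespace tokenizer, and the naive in-order packer `pack_in_order` are all hypothetical choices. The point is only to show how a gold answer can be present in the retrieved set yet absent from the packed reader context.

```python
import re
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the paper may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def answer_in_context(gold_answer: str, packed_context: str) -> bool:
    """True if the gold answer appears as a contiguous token span in the context."""
    answer = normalize(gold_answer)
    context = normalize(packed_context)
    if not answer:
        return False
    n = len(answer)
    return any(context[i:i + n] == answer for i in range(len(context) - n + 1))


def pack_in_order(passages: list[str], budget_tokens: int) -> str:
    """Hypothetical naive packer: keep whole passages, in order, while they fit."""
    packed, used = [], 0
    for passage in passages:
        cost = len(passage.split())  # whitespace token count, an assumption
        if used + cost > budget_tokens:
            break
        packed.append(passage)
        used += cost
    return " ".join(packed)


retrieved = [
    "Paris is the capital of France.",
    "The Eiffel Tower was completed in 1889.",
]
packed = pack_in_order(retrieved, budget_tokens=6)

# The answer was retrieved, but it does not survive packing under the budget.
in_retrieved = answer_in_context("1889", " ".join(retrieved))
in_packed = answer_in_context("1889", packed)
```

Here `in_retrieved` is `True` while `in_packed` is `False`, which is the kind of gap between recall and answer-in-context that the abstract argues matters under a tight reader budget.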

39. 【2607.00724】MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark

Link: https://arxiv.org/abs/2607.00724

Authors: Xianru Chen, Yukai Huang, Mingxiang Chen, Xinping Lei, Fangbing Deng, Jin Chen, Ge Zhang, Wenhao Huang, Jiaheng Liu

Categories: Computation and Language (cs.CL)

Keywords: fluency often invites, invites a stronger, speak a user, understand the culture, culture encoded

Abstract:Multilingual fluency often invites a stronger assumption: a model that can speak a user's language must also understand the culture encoded by that language. We call this the Illusion of Cultural Alignment. To test this assumption directly, we introduce MSQA, a benchmark of 1,064 natively sourced questions across 11 language groups, five cultural dimensions, and three difficulty tiers. Unlike translated benchmarks, MSQA targets locally grounded knowledge and reduces shortcuts from English-centric cross-lingual transfer. Evaluating 18 LLMs, we find substantial cultural degradation and a pronounced Locality Effect: cultural competence tracks pre-training exposure more closely than general reasoning ability. We further show that common inference-time remedies do not dissolve the illusion. Models remain overconfident on unfamiliar cultural questions, repeated sampling yields unstable rather than reliable correctness, and retrieval augmentation helps unevenly on long-tail facts. These findings indicate that cultural alignment cannot be inferred from multilingual ability alone and requires deeper intervention than calibration, sampling, or retrieval at inference time.

40. 【2607.00714】Self-conditioned Flow Map Language Models via Fixed-point Flows

Link: https://arxiv.org/abs/2607.00714

Authors: Jaehoon Yoo, Wonjung Kim, Floor Eijkelboom, Chanhyuk Lee, Nicholas M. Boffi, Seunghoon Hong, Jinwoo Kim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: enhances continuous flow-based, denoise generated text, continuous flow-based language, denoising estimate, flow

Abstract:Self-conditioning is a core technique that enhances continuous flow-based language models, where the model learns to denoise generated text by conditioning on its own denoising estimate. While empirically successful, its performance improvements are poorly understood. Moreover, there is growing interest in the use of few-step generators based on flow maps, for which it is unclear how to leverage self-conditioning. Here, we show that flow language models with self-conditioning solve a fixed-point iteration that bootstraps the performance of the learned denoiser. We use this viewpoint to formulate fixed-point flows, a two-dimensional class of self-conditioned flows, where the first dimension represents the flow process and the second represents the fixed-point iteration. We show that fixed-point flows define valid flow maps, and show that they can be distilled from self-conditioned flow models by compressing both fixed-point iterations and the flow process, the former with fixed-point distillation and the latter with flow map distillation. Our resulting flow map language model, FMLM$^\star$, outperforms state-of-the-art self-conditioned models and few-step models in one- and few-step generation on OpenWebText. Code is available at this https URL.

41. 【2607.00664】YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese

Link: https://arxiv.org/abs/2607.00664

Authors: Ryota Mibayashi, Hiroya Takamura, Hitomi Yanaka

Categories: Computation and Language (cs.CL)

Keywords: Japanese, large language models, benchmark for evaluating, phonological understanding, understanding of large

Abstract:We propose YOMI-Bench, a benchmark for evaluating kanji reading and phonological understanding of large language models (LLMs) for Japanese. In Japanese, a single kanji character often has multiple possible readings, making it difficult to infer the correct reading from surface-level text alone. Due to these linguistic characteristics, it is empirically known that LLMs exhibit low performance in kanji reading for Japanese. The proposed YOMI-Bench consists of four tasks specifically designed to evaluate kanji reading performance in Japanese. In our evaluation using YOMI-Bench, we assessed one multilingual open LLM, four Japanese-specific open LLMs, and five commercial LLMs. As a result, we found that even Japanese-specific models show low performance, and that commercial models also perform poorly on generation tasks that require consideration of kanji readings.

42. 【2607.00661】Faithful by Definition: Emotion Analysis via Natural Semantic Metalanguage Explications

Link: https://arxiv.org/abs/2607.00661

Authors: Frank Xing, Erik Cambria

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: produced post hoc, event-based emotion analysis, post hoc, Natural Semantic Metalanguage, produced post

Notes: 12 pages, 8 figures

Abstract:Explanations for emotion classifiers are usually produced post hoc, with no guarantee that they reflect the computation behind the label. We present an explication interface for event-based emotion analysis. A parser maps the input text to an explication, a short script in the closed vocabulary of Natural Semantic Metalanguage organized into twelve typed slots, and a fixed decision list of rules transcribed from published semantic definitions computes the label from the explication alone. The faithfulness guarantee is therefore causal and definitional, while all empirical risk lives in the learned parser, which the per-line entailment interface makes auditable against the input. On crowd-sourced event descriptions, our fine-tuned parser reaches 0.33 accuracy and 0.48 selective accuracy on a small held-out set, suggesting that the interface trades an insignificant accuracy difference relative to a black-box model for a verifiable, inspectable decision basis for first-person event-based emotion analysis. We also release EmoExpl-1200 with per-line verification metadata and the full rule set.

43. 【2607.00605】Auditing Forgetting in Limited Memory Language Models

Link: https://arxiv.org/abs/2607.00605

Authors: Arya Raeesi, Hanna Roed

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Limited Memory Language, Memory Language Models, externalize factual knowledge, Limited Memory, Memory Language

Notes: 17 pages, 7 figures, 6 tables

Abstract:Limited Memory Language Models (LMLMs) externalize factual knowledge to a database to enable deletion-based unlearning without retraining. Existing evaluations measure post-deletion correctness in aggregate and cannot tell whether a deleted fact persists through residual parametric memory, alternative retrieval paths, or near-neighbor retrieval artifacts. We propose a causal auditing framework that holds the model fixed and varies the database state at inference time across three interventions: FULL, DEL-ON, and DEL-OFF. The framework decomposes post-deletion behavior into parametric leakage L(f), retrieval-mediated correctness R(f), and a retrieval artifact rate grounded in the inference-time retrieval trace. We apply it to 12,228 alias-closure deletions across thirteen databases, including four adversarial topologies (Base, Alias, Noise, Collision) we construct in three domains, and six prompt formulations. Parametric leakage is near zero in every variant and every prompt style: the model rarely returns the deleted answer in the absence of retrieval. The residual that does survive lives in the retrieval graph: retrieval-mediated correctness and the retrieval artifact rate match within rounding everywhere, so post-deletion correctness is, in our audit, predominantly reconstituted from near-neighbor retrieval. This residual ranges from 0.7% on the released LMLM database to 13.6% on the most adversarial variant, and prompt formulation does not independently control how much of a deleted fact survives. These results suggest that, for this class of LMLM and deletion procedure, the unlearning boundary is drawn primarily by the database administrator rather than by the model.

44. 【2607.00601】"Don't Say It!": Constraints, Compliance, and Communication when Language Models Play Taboo

Link: https://arxiv.org/abs/2607.00601

Authors: Sara Candussio, Francesca Padovani, Daniel Scalena, Malvina Nissim

Categories: Computation and Language (cs.CL)

Keywords: Taboo requires describing, Taboo requires, requires describing, game of Taboo, Taboo

Abstract:The game of Taboo requires describing a target word without using a set of forbidden words, so that other players can guess it. This deceptively simple task combines strict lexical constraints with the need for communicatively effective descriptions, making it a compelling playground for examining how LLMs navigate competing demands at inference time. We evaluate two open-weight models under conditions that intervene at progressively deeper levels of the generative process, from prompting to generation-time constraints to manipulations of internal representations. We assess their outputs by detecting forbidden-word violations, by using an LLM-as-a-judge to measure the degree to which generated descriptions successfully evoke the target concept for both human and machine guessers, and by examining whether the strategies models adopt under constraint align with those of human players. Our results show that compliance with the rules of the game and communicative effectiveness trade off differently across conditions, and that models remain substantially weaker than humans as guessers, suggesting that lexical grounding under constraint is an open challenge for current language models.

45. 【2607.00597】Multi-Turn Agentic Scientific Literature Search via Workflow Induction

Link: https://arxiv.org/abs/2607.00597

Authors: Jisen Li (1 and 2), Bingxuan Li (1), Nanyi Jiang (3), Xuying Ning (1), Xiyao Wang (3), Yifan Shen (1), Heng Wang (1), Yuqing Jian (2), Xiaoxia Wu (2), Ben Athiwaratkun (2), Pan Lu (4), Jiaxuan You (1), Bingxin Zhao (3) ((1) University of Illinois Urbana-Champaign, (2) Together AI, (3) University of Pennsylvania, (4) Stanford University)

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: search, literature search, Scientific literature search, retrieving papers, users' intents

Notes: 17 pages, 12 figures

Abstract:Scientific literature search often requires more than retrieving papers from a single query: users' intents are underspecified, preference-dependent, and evolve through interaction. Existing search agents typically rely on fixed pipelines or implicit language-only reasoning, making their search strategies difficult to control, inspect, and refine. We introduce PaperPilot, a multi-turn literature search agent that frames scientific search as workflow induction. Given an anchor paper and a user query, PaperPilot constructs an executable DAG of paper-search operators, including keyword search, citation expansion, filtering, scoring, reranking, and evidence extraction. User feedback is then used to refine both the query and the workflow itself. We train PaperPilot with supervised workflow imitation and preference optimization over controlled workflow corruptions. Experiments show that PaperPilot-9B improves over the base Qwen3.5-9B toolset agent under multi-turn interaction, increasing Hit@5 from 58.0 to 77.0, MRR from 47.5 to 59.4, and nDCG@10 from 26.8 to 32.5, while reducing workflow execution errors from 9.5% to 0%. These results show that explicit, editable search workflows provide an effective and controllable interface for aligning literature search agents with complex scientific intent.
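
For readers unfamiliar with the three ranking metrics quoted above, the following sketch shows their textbook binary-relevance forms for a single query. It is an assumption that the paper uses these exact definitions (it may use graded relevance or a different averaging scheme), and the paper identifiers in the example are made up. Note also that the abstract reports the metrics scaled by 100, whereas this sketch returns values in [0, 1].

```python
import math


def hit_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item (0.0 if none is ranked)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical single-query example: two relevant papers, ranked 2nd and 4th.
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2"}

hit5 = hit_at_k(ranked, relevant, k=5)     # 1.0
rr = reciprocal_rank(ranked, relevant)     # 0.5
ndcg10 = ndcg_at_k(ranked, relevant, k=10)
```

MRR is the mean of `reciprocal_rank` over all queries, and Hit@5 and nDCG@10 are averaged over queries in the same way.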

46. 【2607.00588】Low Perplexity is Repetition: A One-Dimensional Self-Conditioning Attractor in Continuous Diffusion LMs

Link: https://arxiv.org/abs/2607.00588

Authors: Shuai Zhang, Zijie Chen, Hongliang He, Lun Du, Zhenzhong Lan

Categories: Computation and Language (cs.CL)

Keywords: Continuous diffusion language, ELF report record-low, record-low generative perplexity, report record-low generative, diffusion language models

Abstract:Continuous diffusion language models such as ELF report record-low generative perplexity (Gen-PPL). We find a catch: these models repeat far more than human text, and Gen-PPL rewards rather than penalizes that repetition, so its low scores overstate quality. Strip the repetition and ELF-B's Gen-PPL rises from 19.5 to 27.7; the smallest model even posts the best Gen-PPL because it repeats most. We trace the repetition to its source: a contractive attractor along a single direction in the self-conditioning feedback loop, the loop that feeds each step's clean estimate into the next. Because the failure is one-dimensional, a one-dimensional fix suffices, and we propose one. ACE (Attractor-Contrast-Escape) subtracts that single, label-free direction from the feedback at each step. Estimated once on the 105M model, the direction cuts repetition to near the human level while keeping quality competitive, and transfers near-unchanged to the 342M and 652M models and across samplers; the same recipe recovers useful directions on other architectures. Since Gen-PPL itself rewards repetition, we instead measure the compute each fix needs to produce human-clean text, where ACE is 1.5x to 5x cheaper.

47. 【2607.00576】Safe Alone, Unsafe Together: Safeguarding Against Implicit Toxicity When Benign Images Combine

Link: https://arxiv.org/abs/2607.00576

Authors: Jiaxian Lv, Shiyao Cui, Yingkang Wang, Guoxin Wu, Qingling Zhang, Minlie Huang

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Multimedia (cs.MM)

Keywords: increasingly prevalent form, harmful semantics emerge, multi-image implicit toxicity, social media, giving rise

Notes: 15 pages, 8 figures

Abstract:Multi-image content has become an increasingly prevalent form of visual communication in social media, giving rise to a new safety issue, multi-image implicit toxicity (MIIT), where each image appears benign in isolation, but harmful semantics emerge when the images are interpreted jointly. MIIT is particularly challenging for existing commercial moderation APIs and models due to the lack of explicit risky cues in each image. This paper aims to study how to identify MIIT. We first provide a formal definition of MIIT and analyze three key challenges for its detection. To alleviate the scarcity of data in this area, we construct MIIT-dataset, an image-only multi-image safety dataset covering seven representative risk categories through an automatic generation pipeline. Finally, we train MiShield with progressively distilled reasoning supervision, enabling it to produce safety judgments accompanied by explicit analyses of the correlated entities that result in the hazards. Experiments show that MiShield-8B models outperform representative moderation services and even larger-scale models, revealing its effectiveness and practical value for this widely used visual format. Warning: This paper contains potentially sensitive content.

48. 【2607.00570】Dual-Confidence Contrastive Decoding for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2607.00570

Authors: Raymond Li, Md Tawkat Islam Khondaker, Amirhossein Abaskohi, Gabriel Murray, Giuseppe Carenini, Issam H. Laradji

Categories: Computation and Language (cs.CL)

Keywords: increasingly requires models, Retrieval-augmented generation, model internal memory, increasingly requires, multiple retrieved documents

Abstract:Retrieval-augmented generation (RAG) increasingly requires models to answer questions from multiple retrieved documents, where only some sources are relevant and the retrieved bundle may contain stale, noisy, or conflicting evidence. Existing contrastive decoding methods primarily focus on resolving conflicts between the model's internal memory and the retrieved context. In contrast, we study the complementary problem of intra-context conflict in multi-document RAG. To evaluate this setting, we introduce DRQA, a factual-conflict question answering benchmark derived from enterprise deep-research scenarios, where answers are grounded in synthetic enterprise-specific facts that are designed not to be recoverable from the model's internal memory. We further propose Dual-Confidence Contrastive Decoding (DCCD), a training-free decoding method that combines document-level confidence, which estimates whether a document appears sufficient for answering the question, with token-level confidence, which estimates whether that document supports a confident next-token prediction. DCCD selects positive and negative document-conditioned streams using these dual-confidence signals and scales a document-level contrast by their confidence margin. Across DRQA and standard multi-document QA benchmarks, DCCD achieves the best average performance among full-context and contrastive decoding baselines, with the largest gains on DRQA. These results highlight the importance of source-aware, confidence-gated decoding when retrieved evidence is internally conflicting.

49. 【2607.00502】A Task-State Representation for Long-Horizon Mobile GUI Agents

Link: https://arxiv.org/abs/2607.00502

Authors: Yujie Zheng, Zikang Liu, Xin Zhao, Ji-Rong Wen

Categories: Computation and Language (cs.CL)

Keywords: transient screen observations, agents typically rely, separate persistent task, GUI agents typically, persistent task states

Notes: Preprint. 9 pages, 3 figures

Abstract:While long-horizon mobile GUI agents typically rely on thought-action-observation loops, they struggle to separate persistent task states from transient screen observations. As execution histories grow, this entanglement imposes a severe context burden, causing agents to forget initial requirements, hallucinate progress, or repeatedly interact with stale interfaces. To address this, we introduce Task-State Representation (TSR), a training-free framework that explicitly decouples task state from sensory input. Acting as a lightweight external wrapper, TSR maintains three structured components: a global instruction summary, a dynamic progress tracker for subgoals, and a transition-aware action verifier. By continuously updating through pre- and post-action visual comparisons, TSR effectively guides the agent's reasoning without requiring architectural modifications. Experiments across four mobile GUI benchmarks validate TSR's effectiveness, yielding up to a 12 absolute point increase in success rate on complex cross-application and memory-intensive tasks.

50. 【2607.00501】BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal

Link: https://arxiv.org/abs/2607.00501

Authors: Prabod Rathnayaka, Fabian Waschkowski, Lukas Wesemann

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Performance (cs.PF)

Keywords: Apple Silicon, Apple Silicon unified, large language models, native Metal inference, establish Apple Silicon

Abstract:We present BaseRT, a native Metal inference runtime for large language models (LLMs) on Apple Silicon, and report the highest inference throughput on this hardware to date. Existing runtimes, including this http URL and MLX-based frameworks, incur overhead from abstractions not designed for Metal's execution model or Apple Silicon's unified memory topology. By building natively on Metal with chip-specific kernel fusion, unified memory-aware optimisation, and custom dispatch logic, BaseRT recovers performance that framework-based approaches leave on the table. BaseRT supports a wide range of model families across eight quantisation formats (Q2 to FP16) on all Apple M-series devices. In this paper, we evaluate the Qwen3, Llama 3.2, and Gemma 4 families at Q4 and Q8 quantisation on M3 and M4 Pro devices. BaseRT achieves up to 1.56x higher decode throughput than this http URL and up to 1.35x higher than MLX, with substantially larger margins on prefill for mixture-of-experts models, delivering consistent best-in-class throughput from sub-1B to 30B parameter models. These results establish Apple Silicon as a more capable inference platform than previously reported, with direct implications for the emerging edge inference paradigm: as privacy requirements, latency constraints, and cloud cost pressures drive inference toward on-device deployment, performance-optimised local runtimes are a critical enabling layer for this transition. BaseRT is publicly available at this https URL

51. 【2607.00491】MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

Link: https://arxiv.org/abs/2607.00491

Authors: Leyuan Yu, Xiao Tang, Minghao Liu, Xinyuan Li, Xiaokai Bai, Sheng Zhou, Qunshu Lin, Weihao Xuan, Naoto Yokoya

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: models describe relations, test observational spatial, vision-language models, models describe, test observational

Notes: 18 pages, 7 figures. Dataset available at [this https URL](https://huggingface.co/datasets/ZODAOfficial/MindEdit-Bench)

Abstract:Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input. Existing what-if tasks typically vary the observer while keeping the scene fixed. Can VLMs instead predict the consequences of hypothetically moving or rotating an object? We introduce MindEdit-Bench, a benchmark of six spatial reasoning tasks built from three-photo smartphone triplets of newly captured indoor scenes via an automatic in-the-wild 3D scene-graph extraction pipeline. Four tasks probe perception and perspective transformation over observed structure; two new tasks, L4 (spatial editing) and L5 (cross-view visibility editing), probe object-level counterfactual reasoning, where correct answers are absent from all input images. Each question provides 8-24 structured answer choices, enabling answer-letter-level diagnosis of spatial and fallback errors. The benchmark covers 120 private indoor scenes not drawn from public datasets, reducing public-data pretraining-overlap risk. Across 15 VLMs on 1,003 human-verified questions, task-wise mean VLM accuracy is only 8%-31%, versus 81%-97% human majority-vote accuracy. The pooled gap between humans and the best VLM is 53 pp, with at least 39 pp on every task. The structured answer space further reveals non-uniform failures, including weaker camera-depth-axis inference and fallback behavior on difficult visibility-editing cases.

52. 【2607.00485】Efficient Multilingual Reasoning Transfer via Progressive Code-Switching

Link: https://arxiv.org/abs/2607.00485

Authors: Zhijun Wang, Junxiao Liu, Hao Zhou, Hao-Ran Wei, Baosong Yang, Shujian Huang

Categories: Computation and Language (cs.CL)

Keywords: Large reasoning models, achieved strong reasoning, strong reasoning capabilities, Large reasoning, performance degrades significantly

Abstract:Large reasoning models (LRMs) have achieved strong reasoning capabilities in English, yet their performance degrades significantly when required to reason in other languages. A natural solution is to transfer the model's English reasoning ability to target languages. However, existing transfer approaches typically rely on distilled target-language reasoning traces from stronger LRMs or online supervision from external judge models, which are costly and difficult to scale. In this paper, we propose PCS (Progressive Code-Switching), a more efficient transfer framework that requires only lightweight translation without any stronger model for distillation or judging. PCS first constructs code-switched reasoning traces by translating a subset of English reasoning steps into the target language, and uses them to initialize the model's code-switching ability via supervised fine-tuning. It then applies reinforcement learning with a step-level language consistency curriculum, progressively raising the target-language ratio until the model reasons entirely in the target language. This progressive design provides a smooth transfer path that avoids the instability and performance degradation commonly observed when directly enforcing target-language reasoning. Experiments on multiple benchmarks and five typologically diverse languages show that PCS substantially narrows the performance gap between target-language and English reasoning, yielding more language-consistent reasoning while maintaining competitive accuracy.

53. 【2607.00482】Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking

Link: https://arxiv.org/abs/2607.00482

Authors: Chia-Hsuan Lee, Sihui Dai, Mingyang Zhou, Isha Slavin, Shi-Xiong Zhang, Sambit Sahu, William Campbell

Categories: Computation and Language (cs.CL)

Keywords: models frequently overthink, generating extended chains, language models frequently, approach abandonment, Reasoning language models

Abstract:Reasoning language models frequently overthink: generating extended chains of behaviors such as hedging, approach abandonment, and self-contradiction that consume tokens without improving answers. We show that these behaviors are not merely a consequence of length; even when controlling for response length, incorrect traces exhibit higher rates of unproductive self-reflection than correct ones. Addressing this requires identifying where self-reflection helps versus hurts, but obtaining these step-level annotations is costly. We observe that intermediate answer commitments within reasoning traces can provide a cheap proxy: by comparing each final answer candidate in the trace to the ground truth, we can determine whether subsequent reflection is productive without any additional supervision. Building on this insight, we propose DASH (Drift Aware advantage SHaping), which assigns segment-level credit based on whether each reasoning segment leads toward or away from correctness. On competition-level math benchmarks, DASH achieves the highest accuracy where overthinking is prevalent (AIME25: 50.8% vs. 45.4% GRPO) while reducing overthinking behaviors and achieving more productive self-correction than baselines.
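
The proxy signal described above (comparing each intermediate answer commitment to the ground truth) can be sketched as follows. This is only an illustration of the idea of segment-level drift, not the DASH advantage-shaping method itself: the function name `segment_drift`, the +1/0/-1 labels, and exact string matching against the gold answer are all assumptions introduced here.

```python
def segment_drift(intermediate_answers: list[str], gold: str) -> list[int]:
    """Label each reasoning segment by how it moves relative to correctness.

    Each entry of `intermediate_answers` is the answer the trace commits to at
    the end of one segment. Returned labels (hypothetical convention):
      +1  the segment moves from an incorrect state to the gold answer
      -1  the segment moves away from a previously correct answer
       0  correctness is unchanged by the segment
    """
    labels = []
    previously_correct = False  # before any commitment, nothing is correct yet
    for answer in intermediate_answers:
        currently_correct = answer == gold  # exact match is an assumption
        if currently_correct and not previously_correct:
            labels.append(1)
        elif previously_correct and not currently_correct:
            labels.append(-1)
        else:
            labels.append(0)
        previously_correct = currently_correct
    return labels


# Hypothetical trace: the model reaches the right answer in segment 2,
# re-checks it in segment 3, then talks itself out of it in segment 4.
drift = segment_drift(["12", "15", "15", "9"], gold="15")
```

For this trace `drift` is `[0, 1, 0, -1]`: the last segment is the kind of unproductive reflection that a segment-level credit scheme would want to penalize, and it requires no annotation beyond the final ground-truth answer.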

54. 【2607.00465】StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning

Link: https://arxiv.org/abs/2607.00465

Authors: Yuan Qing, Chengzhi Mao, Boqing Gong

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Visual Instruction Tuning, Large Vision-Language Models, Instruction Tuning, Large Vision-Language, Visual Instruction

Notes: Accepted to ECCV 2026. Project page and code: [this https URL](https://yuanqing-ai.github.io/StochasT)

Abstract:Large Vision-Language Models (LVLMs) rely extensively on Visual Instruction Tuning (VIT) to elicit their multimodal reasoning capabilities. However, we find a discrepancy: VIT often packs multiple language tasks about the same image for conversational, multi-turn training, whereas existing benchmarks evaluate LVLMs in isolated, single-turn scenarios. The models can suffer from visual attention decay and contextual overfitting during multi-turn training, making it hard for them to realize their full potential in the mismatched test phase. To close the gap, we propose learning with Stochastic Turn Depth (StochasT), which stochastically groups language tasks for the same image into clusters of varying sizes (turn depth) while preserving their organic order. Hence, while StochasT draws on Dropout and stochastic depth for ResNets, it does not actually drop anything, which maximizes the utility of the training data. Furthermore, we introduce a challenging, benchmark-agnostic evaluation mechanism based on the Balanced Latin Square to measure LVLMs' robustness under varying contextual dependencies. Extensive experiments demonstrate that StochasT effectively grants LVLMs strong, harmonized capabilities for both single-turn and multi-turn use cases.
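
The grouping step described above (splitting the tasks for one image into order-preserving clusters of random size, without dropping any) might look like the sketch below. The function name `stochastic_turn_groups`, the `max_depth` parameter, and the uniform sampling of each cluster size are assumptions made for illustration; the abstract does not specify the sampling distribution.

```python
import random


def stochastic_turn_groups(
    tasks: list[str], max_depth: int, rng: random.Random
) -> list[list[str]]:
    """Split `tasks` into contiguous groups of random size in [1, max_depth].

    The original order is preserved and every task lands in exactly one group,
    so nothing is dropped. Uniform size sampling is an assumption.
    """
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    groups = []
    start = 0
    while start < len(tasks):
        depth = rng.randint(1, max_depth)  # inclusive on both ends
        groups.append(tasks[start:start + depth])
        start += depth
    return groups


# Hypothetical question-answer tasks attached to a single image.
tasks = ["caption", "count objects", "read sign", "locate dog", "describe mood"]
rng = random.Random(0)  # seeded so the example is reproducible
groups = stochastic_turn_groups(tasks, max_depth=3, rng=rng)
flattened = [task for group in groups for task in group]
```

Each group would then become one multi-turn training conversation. With `max_depth=1` every task becomes its own single-turn sample, and larger values mix in deeper conversations.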

55. 【2607.00464】MolSafeEval: A Benchmark for Uncovering Safety Risks in AI-Generated Molecules

Link: https://arxiv.org/abs/2607.00464

Authors: Tong Xu, Xinzhe Cao, Zhihui Zhu, Keyan Ding, Huajun Chen

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: emphasize task complexity, critical concern, potential safety risks, largely overlook, overlook a critical

Notes: Accepted by Findings of ACL 2026

Abstract:Current molecular generation benchmarks emphasize task complexity, molecule novelty, and property alignment; they largely overlook a critical concern: the potential safety risks of AI-generated molecules. In practice, many generative models may produce molecules with toxic, reactive, or otherwise hazardous characteristics - posing hidden dangers that remain insufficiently addressed. To address this gap, we introduce MolSafeEval, a benchmark dedicated to evaluating and analyzing the safety risks of molecular generation. Unlike prior approaches that rely on narrow toxicity predictors, MolSafeEval integrates heterogeneous safety knowledge - ranging from toxicological databases to hazard rules - into a structured molecular safety knowledge graph. This graph serves as a foundation for large language model-based reasoning, enabling systematic detection and explanation of unsafe features in generated compounds. We further categorize molecular generative models into four representative task types - unconditional generation, property optimization, target protein-based design, and text-based generation - and provide standardized datasets and safety evaluation protocols for each. By systematically revealing the safety vulnerabilities of current generative approaches, MolSafeEval offers a new lens for benchmarking molecular models and provides essential guidance toward safer, more trustworthy molecular design.

56. 【2607.00447】Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors

Link: https://arxiv.org/abs/2607.00447

Authors: Yangfan Hu, Xuhan Tong, Haoyue Bai, Xi Ding, Shashank Muralidhar Bharadwaj, Siyang Cao, Robert Nowak, Jiawei Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, produce hallucinated answers, violate prompt-level constraints, produce hallucinated

Comments: Project page: [this https URL](https://neohughus.github.io/Understanding_Why_Language_Models_Hallucinate/)

Abstract:Large language models often produce hallucinated answers that violate prompt-level constraints. A key diagnostic question is whether these failures reflect missing knowledge, or whether the model has the relevant information but follows the wrong inference path. We study this phenomenon as inference misalignment: a mismatch between the answer supported by the prompt and the answer favored by statistically salient latent associations. We formalize this view with a latent key-task model, in which pretraining-frequency imbalance can cause a shortcut path to dominate the constraint-sensitive path and induce positive inference loss. The framework predicts two failure modes: task-retrieval bias in entity disambiguation and key-selection bias in action choice. We introduce TrapQA, a controlled diagnostic testbed with two components. ScientistQA tests disambiguation among similar scientists with supplementary factual probes, while Real-Life Constrained QA tests everyday constraint following under salient shortcuts. Our results show that hallucination can arise from biased latent inference rather than absent knowledge alone.

57. 【2607.00423】Selective Test-Time Debiasing for CLIP via Reward Gating

Link: https://arxiv.org/abs/2607.00423

Authors: Jaeho Han, Jisoo Yang, Hyeondong Woo, Mingyu Jeon, Sunjae Yoon, Junyeong Kim

Categories: Computation and Language (cs.CL)

Keywords: Vision language models, yielding skewed demographic, skewed demographic distributions, Vision language, perpetuate social stereotypes

Comments: 15 pages, 7 figures, 11 tables

Abstract:Vision language models (VLMs) demonstrate strong zero-shot performance, but often perpetuate social stereotypes in person-centric queries, yielding skewed demographic distributions. Current debiasing methods apply uniform bias corrections across all input queries regardless of their bias sensitivity, creating a fundamental fairness-utility trade-off. Strong debiasing distorts semantically meaningful information in bias-insensitive queries, while weak debiasing fails to mitigate stereotypes in bias-sensitive ones. This one-size-fits-all approach hampers simultaneously achieving high utility on bias-insensitive queries and fairness on bias-sensitive queries. We introduce Reward-Gated Test-Time Adaptation (RG-TTA), a reinforcement learning-based test-time adaptation framework that selectively applies debiasing based on input sensitivity. RG-TTA adaptively triggers fairness regularization based on the bias sensitivity of each input during test-time policy adaptation, while focusing exclusively on optimizing cross-modal alignment for bias-insensitive inputs. Experiments on fairness benchmarks (e.g., FairFace, UTKFace) demonstrate substantial bias reduction while simultaneously improving zero-shot utility, resolving the trade-off of uniform debiasing.

58. 【2607.00418】Speech Playground: An Interactive Tool for Speech Analysis and Comparison

Link: https://arxiv.org/abs/2607.00418

Authors: Stephen McIntosh, Daisuke Saito, Nobuaki Minematsu

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: paper presents Speech, presents Speech Playground, Speech Playground, interactive speech visualization, paper presents

Comments: Accepted to Interspeech 2026 (Show and Tell); 2 pages, 3 figures

Abstract:This paper presents Speech Playground, an interactive speech visualization and comparison tool. While existing tools such as Praat are excellent, it can be cumbersome to integrate them with modern deep learning representations and use them for comparison. Speech Playground addresses this by combining a Python backend with a web-based frontend for interactive exploration of multiple feature types, including continuous, discrete, and variable-length representations. It includes TextGrid and forced alignment support together with configurable distance and alignment settings for visual and auditory comparison. Speech Playground is intended for use in speech research, representation validation, and computer-aided pronunciation training (CAPT)-oriented experimentation.

59. 【2607.00415】A Mechanistic View of Authority Hierarchy in LLM Sycophancy

Link: https://arxiv.org/abs/2607.00415

Authors: Emil Joswin, Srujananjali Medicherla, Priyanka Mary Mammen

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: systematically prioritize social, prioritize social cues, models systematically prioritize, critical safety concern, factual consistency

Comments: none

Abstract:Authority bias poses a critical safety concern in language models: models systematically prioritize social cues from authority figures over factual consistency, swaying their answers based on source credibility rather than evidence. We mechanistically investigate this phenomenon using a controlled medical QA setting, where hints suggesting incorrect answers are attributed to personas of varying expertise. Across Llama-3.1-8B, Qwen3-8B, and Gemma-2-9B, we find that models respond in a graded manner proportional to perceived authority, a hierarchy that is never explicitly prompted but emerges from training. Logit lens analysis and linear/non-linear probing localize this effect to a critical late layer where correct answer representations are actively erased, an erasure that scales with authority level, resists mean vector intervention, and is only partially reversible through chain-of-thought reasoning. Our findings suggest that authority-induced sycophancy is not a surface-level output bias but mechanistic knowledge erasure, a precise, layer-localized overwriting of correct internal representations by high-status authority signals.

60. 【2607.00394】When Classic Cache Policies Fail: Learning-Augmented Replacement for Semantic Retrieval Buffers

Link: https://arxiv.org/abs/2607.00394

Authors: Yushi Sun, Bowen Cao, Wai Lam

Categories: Databases (cs.DB); Computation and Language (cs.CL)

Keywords: LLM agents increasingly, reuse past experience, remain largely ad-hoc, buffers remain largely, agents increasingly rely

Comments: none

Abstract:LLM agents increasingly rely on retrieval buffers to store and reuse past experience, yet the cache management policies governing these buffers remain largely ad-hoc. We formalize this as an online semantic cache replacement problem with switching costs, where items are matched by embedding similarity and hit quality is continuous rather than binary. Through experiments on two datasets from MemoryBench-Full (LoCoMo, DialSim) with 8 replacement policies, we reveal a surprising finding: classic heuristics (LRU, LFU) consistently underperform the naive FIFO baseline on semantic workloads, due to the absence of temporal locality and frequency concentration. We propose SOLAR, a learning-augmented framework that derives modification timing from regret accumulation (achieving ~17% modification rate) and content selection from Bayesian online learning over implicit retrieval feedback. We prove SOLAR achieves a constant competitive ratio $\leq 3$, independent of cache size and horizon (vs. $\Omega(K)$ for FIFO), and eviction regret $O(\sqrt{KT\log T})$, matching the $\Omega(\sqrt{KT})$ lower bound up to logarithmic factors. Experiments demonstrate 5-75% relative improvement over FIFO at tight cache sizes, with a clearly characterized phase transition at the working set boundary. Synthetic experiments with 5000-item pools further reveal an inverted-U relationship between pool size and retrieval quality, justifying capacity constraints as a retrieval noise phenomenon rather than a storage limitation.

61. 【2607.00374】Learning to Compose: Revisiting Proxy Task Design for Zero-Shot Composed Image Retrieval

Link: https://arxiv.org/abs/2607.00374

Authors: Jingjing Zhang, Lei Zhang, Zheren Fu, Zhendong Mao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: Composed Image Retrieval, Image Retrieval, reference image, Image, Composed Image

Comments: Accepted by ECCV 2026

Abstract:Composed Image Retrieval (CIR) retrieves a target image from a reference image and a textual modification. While supervised CIR relies on costly triplets, Zero-Shot CIR (ZS-CIR) alleviates this reliance through proxy tasks trained on image-text pairs. However, existing proxy tasks primarily enhance visual and textual representations to accommodate a predefined composition mechanism such as pseudo-word injection into a frozen text encoder or linear feature arithmetic. As a result, the composition function itself remains unlearned, limiting the model's ability to express diverse and fine-grained semantic modifications. To address this, we propose FoCo, which models composition as two coordinated stages: focusing on modification-relevant visual content, and then completing the target semantics. We realize these through two proxy tasks: text-anchored visual aggregation to selectively gather visual content guided by localized textual semantics, and context-conditioned semantic completion to transform these aggregated visuals with the remaining scene context into a coherent composed representation. The tasks are trained jointly with a cross-instance contrastive objective, encouraging semantic diversity and discouraging shortcut composition strategies. Extensive experiments on four ZS-CIR benchmarks show FoCo's state-of-the-art performance and improved generalization.

62. 【2607.00368】Beyond Perplexity: A Behavioral Evaluation Framework for Deployment-Memory Claims in LLM Test-Time Training

Link: https://arxiv.org/abs/2607.00368

Authors: Xiangchen Song, Zhenhao Chen, Lingjing Kong, Shaoan Xie, Xinshuai Dong, Guangyi Chen, Kun Zhang

Categories: Computation and Language (cs.CL)

Keywords: verifiable task attempts, Large language model, Large language, model test-time training, TTT memory claims

Comments: none

Abstract:Large language model test-time training (TTT) is often evaluated through local proxy metrics: models are updated on recent tokens, retrieved context, target-domain data, or verifiable task attempts, and then judged by perplexity, future-token loss, long-context performance, or reward. These metrics are well matched to claims about stream adaptation, domain adaptation, context compression, and reward-backed test-time improvement. They are weaker evidence, however, for a capability that TTT results are increasingly used to motivate: deployed assistant memory, personalization, or sparse post-deployment learning, which instead requires behavioral evidence such as later recall, paraphrase robustness, retention, locality, conflict handling, and use in downstream actions after the original support context is removed. We introduce a behavioral evaluation framework that calibrates TTT memory claims to the evidence that supports them. It has two components: a claim-calibrated evidence ladder that separates stream/domain adaptation, bridge internalization, and deployment-time behavioral learning; and an evaluation protocol with matched explicit-memory baselines and mutually exclusive failure categories. We validate the framework by auditing recent TTT and memory-adjacent work and by instantiating it as a controlled diagnostic in which, in a sparse nonce-fact setting, one-step LoRA updates lower support and answer loss across three Qwen3 model scales while generated free-form recall stays at zero, exposing a measurable gap between proxy improvement and deployment behavior. The framework gives authors and evaluators a concrete standard for aligning TTT memory claims with the evidence actually reported.

63. 【2607.00341】DiscoLoop: Looping Discrete Embeddings and Continuous Hidden States for Multi-hop Reasoning

Link: https://arxiv.org/abs/2607.00341

Authors: Hengyu Fu, Tianyu Guo, Zixuan Wang, Hanlin Zhu, Jason D. Lee, Jiantao Jiao, Stuart Russell, Song Mei

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, externalize intermediate steps, single forward pass, models achieve strong, allowed to externalize

Comments: 16 pages, 7 figures

Abstract:Large language models achieve strong performance on many reasoning tasks when allowed to externalize intermediate steps as Chain-of-Thought (CoT). However, many questions require the model to internalize the multi-step reasoning within a single forward pass before generating the answer. We study this challenge through two-hop reasoning, a representative task where the model must compose multiple pieces of parametric knowledge within a single forward pass. Standard non-recurrent Transformers suffer from a depth-local storage problem: facts learned in earlier layers are unavailable where second-hop retrieval happens. We found that Looped Transformers mitigate this issue by reusing the same memory, but still generalize imperfectly. We show that the remaining bottleneck is representational. In the two-hop reasoning task, the first loop often makes the correct bridge entity nearly perfectly decodable, yet the corresponding hidden state remains poorly aligned with the bridge token embedding. Surprisingly, an easy training-free realignment intervention nearly closes the generalization gap. Building upon this insight, we propose DiscoLoop, a looping architecture whose recurrence carries both a discrete embedding channel and a continuous hidden-state channel. DiscoLoop achieves near-perfect accuracy with substantially fewer training steps across symbolic and synthetic-language multi-hop reasoning tasks. When applied to real-world pretraining, DiscoLoop attains lower training loss and stronger benchmark performance than looped-transformer baselines, suggesting that the mixed-channel design transfers to practical language modeling.

64. 【2607.00339】TRACE: State-Aware Query Processing over Temporal Evidence Graphs for Conversational Data

Link: https://arxiv.org/abs/2607.00339

Authors: Maolin Wang, Yu Wang, Zichun Liu, Baiyuan Qiu, Chenbin Zhang, Jiguang Shen, Haoran Yang, Hao Miao

Categories: Computation and Language (cs.CL)

Keywords: persistent source, source of user, user state, state for long-running, long-running assistants

Comments: none

Abstract:Conversational data is increasingly used as a persistent source of user state for long-running assistants and AI agents. However, querying this data remains challenging because conversations naturally evolve: plans are revised, preferences change, and later messages frequently supersede or contradict earlier information. Existing long-memory pipelines largely treat memories as independent text or vector objects. This approach often retrieves semantically similar but stale evidence, offering limited support for state-aware reasoning. To address this problem, we present TRACE, a query processing framework over temporal evidence graphs for evolving conversational data. TRACE models conversations as a hierarchical graph spanning events, sessions, and topics, enriched with typed temporal, causal, update, and contradiction relations. Crucially, the framework maintains validity annotations so obsolete facts remain accessible for historical queries but are discounted for current-state answers. At query time, TRACE combines vector-based note retrieval with graph-guided evidence search, generating validity-aware support paths and a hybrid context for answer generation. This design separates lexical recall from evidence reconstruction, enabling bounded query-time reasoning over long conversational histories. Experiments on long-conversation query-answering (QA) benchmarks show that TRACE improves temporal and multi-hop reasoning, with ablations highlighting the importance of hierarchy, update-aware seeding, and path-grounded evidence.

65. 【2607.00325】Watermarking for Proprietary Dataset Protection

Link: https://arxiv.org/abs/2607.00325

Authors: John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, Tom Goldstein

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: fundamentally hard tasks, language modeling settings, modern language modeling, modeling settings, growing body

Comments: 8 pages and 6 figures in the main body; presented at the ICML 2026 Workshop on Trustworthy AI for Good

Abstract:A growing body of literature suggests that training data membership inference problems are fundamentally hard tasks in modern language modeling settings. We argue that output watermarking techniques are the right gadget to make training membership tests for generative models more tractable, based on prior results showing that language models exhibit residual watermark "radioactivity" under partially watermarked training datasets. We pit a watermark-based dataset inference approach head-to-head against traditional loss-based membership inference methods and show that watermarking can achieve comparable membership detection performance when subset exposure is high enough, under an alternate set of assumptions.

66. 【2607.00309】A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models

Link: https://arxiv.org/abs/2607.00309

Authors: Prabal Gupta (Rama Labs, Kitchener, Canada)

Categories: Sound (cs.SD); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: evolving procedural soundscapes, converts natural-language scene, natural-language scene descriptions, real-time musical interface, procedural soundscapes

Comments: 10 pages, 7 figures, 2 tables. Accepted to the International Conference on New Interfaces for Musical Expression (NIME 2026), London, UK. Supplementary material included as an appendix. Code and demo: [this https URL](https://github.com/prabal-rje/latentscore)

Abstract:We present a real-time musical interface that converts natural-language scene descriptions into evolving procedural soundscapes. A performer types a prompt such as "warm jazz cafe at midnight" and steers it through direct parameter adjustments - stepping brightness down, switching a rhythm style - each producing a predictable, audible shift without re-prompting. Where GPU-bound text-to-audio systems synthesize monolithic waveforms, our instrument generates human-readable configurations over a categorical schema, enabling fine-grained performer control; most valid combinations are designed to sound musically coherent. Three interchangeable backends - embedding retrieval for sub-second CPU-only use, hosted LLMs via API, and a fine-tuned 270M local model - all emit the same schema. A live generator architecture continuously emits audio while resolving new instructions in the background, crossfading seamlessly when ready; even when an LLM takes 5-12 seconds to respond, the audience hears uninterrupted sound - reframing text-to-music as an ongoing performable stream rather than a one-shot generation. We evaluate text-audio semantic alignment using LAION-CLAP on held-out prompts as a technical proxy, finding that retrieval-based configuration outperforms random valid configurations on this metric, while noting that LAION-CLAP also informed retrieval-map construction. We report performance observations, informal listener feedback, and release materials for the SDK, dataset artifacts, model, and audiovisual performance interface.

67. 【2607.00304】Mapping the Evaluation Frontier: An Empirical Survey of the Bias-Reliability Tradeoff Across Eleven Evaluator-Agent Conditions

Link: https://arxiv.org/abs/2607.00304

Authors: Zewen Liu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: LLM evaluation systems, bias-reliability tradeoff conjectures, fixed sample size, conjectures that LLM, LLM evaluation

Comments: 5 pages, 1 figure, 1 table

Abstract:The bias-reliability tradeoff conjectures that LLM evaluation systems are constrained in (gamma, H, CV) space, where evaluator coupling (gamma), strategy diversity (H), and small-sample measurement reliability (CV(N)) cannot be simultaneously optimized at fixed sample size N. Prior evidence rests on n=5 conditions with complete metrics from a single study. We expand the empirical base to 11 conditions, measuring gamma and H for all 11 (nine with valid weight vectors) and CV(N=5) for seven with sufficient seeds (N = 5). Five conditions provide the complete (gamma, H, CV) triple. The data confirm the trade-off: conditions with low evaluator coupling (gamma 0.2) exhibit high measurement noise (CV(N=5) 1.0), while conditions with strong coupling (gamma 0.9) achieve low noise (CV(N=5) 0.16). The correlation r(H, gamma) = -0.989 (n=5, excluding GPT-4o conditions) confirms that evaluator coupling suppresses strategy diversity. Four GPT-4o conditions show gamma=0.000 and H=1.000 across all seeds -- a pattern we attribute to version drift in the June 2026 GPT-4o API. No condition occupies the region {gamma 0.2, CV(N=5) 0.3}. We release all per-condition metrics as a standardized benchmark dataset for evaluator comparison.

68. 【2607.00297】EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems

Link: https://arxiv.org/abs/2607.00297

Authors: Zewen Liu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: LLM agents, agent strategy distribution, evaluator preference coupling, evaluator biases propagate, preference coupling

Comments: 10 pages, 3 tables

Abstract:When LLM agents use evaluator feedback to adapt their behavior in closed loops, evaluator biases propagate through the agent's strategy distribution -- a phenomenon known as evaluator preference coupling. Prior work has documented coupling across multiple evaluator families and model versions, but the field lacks a standardized protocol that enables third-party researchers to (i) reproduce coupling measurements, (ii) compare results across evaluators and time points, and (iii) detect measurement decay as proprietary evaluators silently update. This paper provides the protocol. We specify EPC (Evaluator Preference Coupling) -- a detailed, RFC-style protocol specification for the four-phase isolation paradigm, covering executor and evaluator configuration, strategy and task design, the TTRL update rule, metric computation (gamma, JSD, ECE, Brier), and output schema. We accompany the protocol with a versioned Reference Snapshot v1.0: coupling measurements for eight evaluator conditions (N=122 unique experimental repetitions across GPT-4o, Qwen, DeepSeek, and others) derived from five independent studies, annotated with evaluator version identifiers, API endpoints, and measurement dates. The snapshot is explicitly time-bound: all values are conditional on specific model versions and are expected to decay as proprietary evaluators update. We define a versioning convention (vX.Y-Z, encoding protocol version, snapshot version, and evaluator generation) and provide a usage guide covering adoption, interpretation, and known pitfalls. The protocol, reference snapshot, and implementation code are released as open infrastructure.

69. 【2607.00293】Rosetta: Composable Native Multimodal Pretraining

Link: https://arxiv.org/abs/2607.00293

Authors: Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong, Liefeng Bo, Ping Tan

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Achieving true artificial, true artificial general, artificial general intelligence, general intelligence requires, Achieving true

Comments: none

Abstract:Achieving true artificial general intelligence requires foundation models capable of integrating new modalities without forgetting prior knowledge. However, accommodating continuous generative objectives alongside discrete understanding tasks causes severe gradient conflicts. Existing architectures, including standard Mixture-of-Experts (MoE), are highly susceptible to representation overwriting. Even structurally partitioned paradigms like Mixture-of-Transformers (MoT) remain vulnerable to catastrophic forgetting, severely impeding multimodal scalability. In this work, we introduce Rosetta, a composable native multimodal pretraining framework designed for seamless and non-destructive modality expansion. Rosetta adopts a modular paradigm where core foundational knowledge is preserved within global shared experts, while modality-specific capabilities are distributed across plug-and-play experts. To guarantee non-destructive composition, we propose Momentum-Anchored Orthogonal Projection (MAOP). MAOP leverages the optimizer's momentum state as an implicit semantic anchor, selectively neutralizing conflicting gradient components from new modalities while preserving synergistic updates. Extensive evaluations demonstrate that, while standard MoE and MoT architectures suffer catastrophic forgetting of previously acquired knowledge, Rosetta robustly preserves established language and visual understanding. Furthermore, it delivers superior image generation and unlocks cross-modal synergy, paving the way for truly composable and unified multimodal foundation models. To facilitate further multimodal research, we release our code and checkpoints to the community. Project page at this https URL.

70. 【2607.00292】An LLM-Based Framework for Intent-Driven Network Topology Design

Link: https://arxiv.org/abs/2607.00292

Authors: Kholoud El-Habbouli, Fen Zhou, Stephane Huet

Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: natural language requirements, language requirements remains, Designing deployable, Large Language Models, resilient network topologies

Comments: submitted to IEEE CNSM 2026

Abstract:Designing deployable and resilient network topologies from natural language requirements remains a challenging problem in network automation. This work investigates the ability of Large Language Models (LLMs) to generate structurally valid and constraint-compliant network topologies through a constraint-driven pipeline combining hierarchical modeling and systematic validation. The framework is evaluated via a multimodel comparison of proprietary and open-weight LLMs across four realistic network scenarios released as a public dataset. We assess structural correctness using node and edge F1-scores against reference topologies, and evaluate resilience through server and content connectivity metrics. In addition, we analyze common failure modes, including interface mismatches and directional inconsistencies in generated topologies. Overall, this work provides a systematic benchmark for understanding how LLMs handle structural and resilience constraints in topology synthesis, and supports informed model selection for AI-driven network design.

71. 【2607.00276】Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

Link: https://arxiv.org/abs/2607.00276

Authors: Dong Zhang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: distinguish genuine reasoning, model reasoning breaks, familiar problem patterns, Decay World, genuine reasoning

Comments: 37 pages, 2 figures, 9 tables

Abstract:Current large-language-model (LLM) physics benchmarks are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem patterns and reveals little about where a model's reasoning breaks down. We introduce an auditable four-stage diagnostic that evaluates whether an LLM can reason inside an unfamiliar physics framework through induction, formulation, prediction, and review. The diagnostic combines locked pre-registrations, fresh sessions between stages, dual-LLM judging, and a human-audit pathway, and we apply it to three parallel physics worlds: a single-equation counterfactual world ($F=mv$), a historical framework (Aristotelian mechanics), and a four-domain counterfactual world (Decay World). Across Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro, the three worlds yield composite PASS rates of 6/15, 6/15, and 0/15 respectively (content $\land$ structural for $F=mv$ and Aristotelian, content axis only for Decay World where the structural axis is out of scope). The most pointed empirical pattern is a qualitative-versus-quantitative asymmetry: in Decay World, models almost never predict the wrong direction of change, but frequently compute the wrong ratio by slipping back to standard-physics relations. The protocol also surfaces two methodology findings: LLM-judge reliability does not transfer across frameworks, and Stage 4 self-review is weak in every framework, with the model's own review wrongly reporting no earlier error in at least two-thirds of the trials that actually contained one. We release the full prompts, responses, verdicts, and audit records.

72. 【2607.00274】SEFORA: Student Essays with Feedback Corpus and LLM Feedback Evaluation Framework

Link: https://arxiv.org/abs/2607.00274

Authors: Shayan Peyghambari Oskoui, Norah Almousa, Zhaoyi Joey Hou, Carolina Gustafson, Gayle Rogers, Raquel Coelho, Diane Litman, Xiang Lorraine Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Effective writing feedback, Effective writing, student learning, scale is labor-intensive, strongest drivers

Comments: Under review for EMNLP 2026

Abstract:Effective writing feedback is among the strongest drivers of student learning, yet producing it at scale is labor-intensive. LLMs offer a natural path to scaling writing support, but two gaps stand in the way: few public corpora capture how instructors actually deliver feedback in real classrooms, and no reliable method measures whether generated feedback aligns with what an instructor would write. We address both. SEFORA is a public corpus pairing instructor inline feedback with assignment prompts, rubrics, scores, and multi-draft revisions across various college writing genres, comprising 564 drafts and 8,240 instructor annotations. UniMatch is a reference-based evaluation framework for open-ended generation: it segments feedback into feedback units, scores their semantic correspondence under instructor-derived criteria, and aligns them via optimal matching to yield interpretable precision, recall, and F1. Across 74 experimental configurations spanning multiple LLMs, no setting exceeds 0.4 F1. UniMatch reveals that models struggle to identify the feedback instructors would prioritize, and performance degrades as models generate more.

73. 【2607.00250】LV-ROVER: Multi-Stream Tesseract Voting for Maltese Paragraph OCR

Link: https://arxiv.org/abs/2607.00250

Authors: Adam Darmanin

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: real labelled PDF, labelled PDF corpus, large OCR benchmarks, pretrained language models, decent text corpora

Comments: 8 pages, 1 figure, 3 tables. System paper for the DocEng 2026 Maltese Paragraph OCR Competition

Click to view abstract

Abstract:Maltese has decent text corpora and pretrained language models, but, like many languages outside the handful with large OCR benchmarks, it has only a single known real labelled PDF corpus for OCR training. At 57 pages, that corpus is far below what paragraph-level training needs, so Maltese is low-resource for OCR specifically. With no real corpus to train on at scale, we built a synthetic training pipeline and a 5-stream Tesseract LV-ROVER ensemble, and report results on a 422-paragraph benchmark against a fine-tuned-Tesseract baseline of character error rate (CER) 0.0234. Ensemble recognition alone improves CER by 44 percent, to 0.01317; a five-stage post-processing chain brings the full pipeline to CER 0.00700, a 70 percent reduction. Most of that chain is typographic normalisation, but one stage recovers misread diacritics rather than aligning punctuation, so we report it as a recognition gain rather than folding the whole chain under one label. We treat the 44 percent figure as the portable estimate of what the recogniser learned, and the 70 percent figure as specific to this benchmark's label convention.
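
As a quick check on the figures quoted above, the sketch below computes a character error rate from Levenshtein distance and then reproduces the two relative reductions from the reported CER values. The helper names and the Maltese example word are illustrative assumptions; the paper's own evaluation code may normalise text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (unit-cost edits)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, new):
    """Fraction by which `new` improves on `baseline`."""
    return (baseline - new) / baseline


# A misread diacritic counts as one substitution (illustrative word).
print(cer("għajn", "ghajn"))

# CER values as reported in the abstract.
baseline_cer, ensemble_cer, pipeline_cer = 0.0234, 0.01317, 0.00700
print(round(100 * relative_reduction(baseline_cer, ensemble_cer)))
print(round(100 * relative_reduction(baseline_cer, pipeline_cer)))
```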

74. 【2607.00233】From Signals to Structure: How Memory Architecture Drives Language Emergence in LLM Agents

Link: https://arxiv.org/abs/2607.00233

Authors: Yashar Talebirad, Eden Redman, Ali Parsaee, Osmar R. Zaiane

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Multiagent Systems (cs.MA)

Keywords: invent a shared, capacity, Lewis signaling game, agents, channel capacity

Comments:

Click to view abstract

Abstract:How do two agents invent a shared language from scratch? In a Lewis signaling game, a sender and receiver must coordinate on a code using only their interaction history. We study five memory architectures across varying channel configurations with LLM agents and find that memory architecture matters more than channel capacity. Agents with a persistent private notebook benefit from surplus channel capacity and avoid the high-capacity collapse seen in stateless agents, achieving the most reliable coordination ($0.867 \pm 0.023$ at capacity = 25). Stateless agents peak at moderate capacity and then degrade as the vocabulary grows beyond what a rolling context window can track. The notebook externalizes learned conventions, freeing agents from having to re-derive codes each round. An information bottleneck-inspired argument predicts an optimal capacity equal to the number of objects. Instead, the bottleneck (capacity = 8) proves to be a fragility point, and surplus capacity is generally better. We show that channel capacity alone cannot predict coordination; memory architecture determines whether agents turn interaction history into stable conventions, and both dimensions are needed to understand how signals become language.

75. 【2607.00208】SLIM-RL: Risk-Budgeted Random-Masking RL for Diffusion LLMs Without Trajectory Slicing

Link: https://arxiv.org/abs/2607.00208

Authors: Ruikang Zhao, Zhenting Wang, Han Gao, Ligong Han

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: diffusion large language, large language models, Reinforcement learning, learning for diffusion, diffusion large

Comments: 17 pages

Click to view abstract

Abstract:Reinforcement learning for diffusion large language models (dLLMs) has largely moved to trajectory-aware methods. The current state of the art, TraceRL, holds that random masking is mismatched with the model's inference trajectory, and it reconstructs that trajectory during training by slicing each rollout into up to K/s trajectory-aligned training samples, a cost that grows with the block size K. We show that this mismatch can be mitigated without reconstructing the trajectory. Our method, SLIM-RL, bounds the commit risk of each rollout step with a tau-budget decoder, reducing aggregate commit risk in the training data. During optimization, SLIM-RL trains on these risk-controlled rollouts with a trace-free random-masking objective that adapts variance-reduction tools, combining sequence-level importance sampling with deterministic quadrature over masking levels under a mean-preserving, monotonically decreasing per-block mask schedule that we introduce. On SDAR-4B, SLIM-RL matches TraceRL's best MATH500 accuracy on only 0.46x its training samples at block size 16, improving over TraceRL by 6.32% on MATH500 and 11.05% on GSM8K under matched dynamic sampling. At block size 4, the 4B SLIM-RL surpasses the larger LLaDA-8B and Dream-7B dLLMs on math, exceeding LLaDA-8B by 10.76% on MATH500 while staying below the autoregressive Qwen2.5-7B. On code, it improves over TraceRL by 4.20% on MBPP and 3.65% on HumanEval. The tau-budget decoder transfers training-free across LLaDA, Dream, and SDAR. The source code is available at this https URL.

76. 【2607.00185】Structural Pattern Mining in Inka Khipus: Unsupervised Clustering, Provenance Classification, and a Computational Validation of the Santa Valley Match

Link: https://arxiv.org/abs/2607.00185

Authors: Maria Contreras

Categories: Computation and Language (cs.CL)

Keywords: system remains undeciphered, knotted cord devices, Open Khipu Repository, Inka Empire, primary recording medium

Comments: 10 pages, 4 figures, 2 tables

Click to view abstract

Abstract:Khipus--knotted cord devices--were the primary recording medium of the Inka Empire (c. 1400-1532 CE), yet their system remains undeciphered. We present a reproducible machine-learning pipeline applied to the Open Khipu Repository (OKR), a public database of 619 khipus comprising 54,403 cords and 110,677 knots. We engineer 27 structural features per khipu and apply (i) unsupervised clustering via UMAP and HDBSCAN, recovering three structurally distinct groups (silhouette = 0.769); (ii) supervised provenance classification via gradient boosting, reaching F1 = 0.86 for the Inka Late Horizon imperial style; and (iii) SHAP-based interpretability, which identifies cord twist direction as the dominant structural discriminator of imperial khipus. We further report two findings of methodological interest. First, one cluster is dominated not by a geographic region but by nineteenth-century European museum collections, indicating that colonial acquisition and recording practices are structurally encoded in the corpus. Second, we provide an independent computational verification of the recto/verso (moiety) structure of the six Santa Valley khipus reported by Medrano and Urton (2018), reproducing both the aggregate attachment ratio and the identification of the single mixed specimen--using only the public OKR database, without physical access to the objects. We additionally report a negative result: knot-type sequence order, encoded as n-grams, adds no provenance signal beyond aggregate features. All code and data are openly available.

77. 【2607.00171】ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs

Link: https://arxiv.org/abs/2607.00171

Authors: Andrianos Michail, Stylianos Psychias, Michelle Wastl, Simon Clematide, Rico Sennrich, Juri Opitz

Categories: Computation and Language (cs.CL)

Keywords: semantic similarity tasks, similarity tasks, open challenge, evaluation remains, remains an open

Comments:

Click to view abstract

Abstract:Text embeddings are standard for semantic similarity tasks, yet their evaluation remains an open challenge. Current benchmarks are static, cover only a limited set of languages, are often domain-specific, susceptible to overfitting, and poorly representative of low-resource languages. To address these limitations, we introduce ALEE, a framework that extends Sentence Smith (Li et al., 2025) to the cross-lingual and paragraph level. ALEE uses Abstract Meaning Representations (AMR) to generate English minimal pairs with controlled, fine-grained semantic shifts, which are paired with translations in target languages. This approach enables targeted diagnostics for models in any language with English parallel data. We conduct a large-scale empirical study across a diverse set of embedding models and 275+ languages spanning three parallel datasets. On ALEE, performance varies substantially across languages, text lengths, and linguistic phenomena, exposing persistent gaps in cross-lingual semantic representation that track language prevalence in training resources and subword tokenization. We release ALEE at this https URL

78. 【2607.00159】Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting

Link: https://arxiv.org/abs/2607.00159

Authors: Qian Ma, S M Rayeed, Charles V. Stewart, Qiong Wu, Yao Ma

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: Visual Question Answering, Visual Language Models, external structured knowledge, existing KB-VQA benchmarks, Visual Language

Comments: Accepted to ECCV 2026. The datasets and code are available at [this https URL](https://github.com/VAN-QIAN/ECCV26-ARA)

Click to view abstract

Abstract:Knowledge-Based Visual Question Answering (KB-VQA) aims to evaluate whether Visual Language Models (VLMs) can retrieve, ground, and reason over external structured knowledge beyond visual evidence. In practice, answer accuracy is widely adopted as the primary evaluation metric, implicitly treating correctness as a proxy for knowledge-grounded reasoning. However, for existing KB-VQA benchmarks, this proxy relies on critical assumptions that are often overlooked and rendered unreliable by benchmark issues: the annotated answer must be derivable from the associated knowledge base, the question must be well-posed with sufficient constraints, and the visual setting must meaningfully require grounded disambiguation. In this work, we show that these assumptions are systematically violated in existing KB-VQA benchmarks. Our audit reveals substantial instances with missing or contradicted answers and underspecified questions that render accuracy a misleading metric. Furthermore, we find that existing datasets rely on visually trivial, single-entity scenes that bypass the need for sophisticated visual-to-knowledge mapping. We demonstrate that even with controlled architectures, these flaws lead to distorted model rankings and overestimations of reasoning capabilities. To address this, we introduce (1) a principled audit-and-repair protocol that restores answer derivability and question clarity, and (2) a controlled multi-entity augmentation protocol that introduces visual ambiguity to challenge initial retrieval and grounded reasoning. Re-evaluation under corrected and augmented settings yields markedly different performance trends. Our findings call for rethinking evaluation protocols and designing more interaction-aware KB-VQA benchmarks that prioritize verifiable reasoning over simple matching.

79. 【2607.00158】Readable but Not Controllable: Neuron-Level Evidence for Medical LLM Hallucination

Link: https://arxiv.org/abs/2607.00158

Authors: Vijay Vankadaru, Asha Matthews, Tanya Roosta, Peyman Passban

Categories: Computation and Language (cs.CL)

Keywords: deploying medical LLMs, central obstacles, obstacles to deploying, Hallucination, medical LLMs

Comments:

Click to view abstract

Abstract:Hallucination remains one of the central obstacles to deploying medical LLMs. Yet, even when hallucination can be detected, it is still unclear whether the internal representations associated with it can be used for control rather than detection alone. Using four open-source models across a suite of medical question-answering datasets, we show that a simple, carefully conditioned probe can reliably detect hallucination, with AUROC scores between 0.77 and 0.86 in our case. We further show that this signal is distributed and redundant rather than narrowly localized. Systematically selected neurons outperform random neurons only at very small subset sizes, whereas random subsets of a few hundred neurons recover nearly the full signal, and low-dimensional random projections preserve most of the detection performance. Beyond detection, we test whether this representation is causally actionable. Across 16 model--dataset combinations, our results reveal a sharp gap between decodability and controllability. The same internal structure that makes hallucination easy to detect does not translate into reliable neuron-level control. These findings show that medical hallucination seems to be readily visible in internal activations, but not easily corrected by steering the neurons most associated with it. More broadly, our results suggest that hallucination mitigation is not simply a matter of identifying the right neurons, and point to a deeper separation between what representations reveal and what they allow us to change.

80. 【2607.00152】GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

Link: https://arxiv.org/abs/2607.00152

Authors: Yong Yi Bay, Kathleen A. Yearick

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

Keywords: popular methods, Relative Policy Optimization, Sampling Policy Optimization, training language models, Policy Optimization

Comments: 18 pages, 10 figures, 4 tables. Code and data: [this https URL](https://github.com/bay-yearick-lab/grpo-standard-deviation-identity)

Click to view abstract

Abstract:Three of the most popular methods for training language models to reason look like three different tricks. They are not. All three adjust a single number: standard deviation, reflecting how much a prompt's sampled answers disagree. When such a model is trained, it answers each problem many times, and an automatic checker marks every answer right or wrong. The standard deviation of those marks measures the disagreement: largest when the answers split evenly between right and wrong, and zero when they all agree. Group Relative Policy Optimization (GRPO) divides by this number, GRPO Done Right (Dr. GRPO) drops the division, and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) discards the groups where it is zero. Each is presented as its own fix, yet this paper proves they are three settings of one dial. That dial is not cosmetic: for right-or-wrong rewards, the disagreement is exactly the size of the training update, the group-standard-deviation identity. A split group teaches the most, while a unanimous group teaches nothing and falls silent. The same result says which problems deserve the most weight and how many tries each one needs. This paper confirms the intuition on a large real difficulty dataset (Big-Math) and in a controlled training run. What looks like a harmless normalization step is the dial that decides where learning happens and how strongly.
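
The three operations named in this abstract can be made concrete with a minimal sketch for right-or-wrong rewards. It is not taken from the paper's code: the function names, the `eps` stabiliser, and the choice to keep standard-deviation normalisation in the DAPO branch are assumptions. For binary rewards with success rate p, the group standard deviation is sqrt(p(1 - p)), which is largest at an even split and zero for a unanimous group.

```python
import math


def group_stats(rewards):
    """Mean and population standard deviation of one prompt's rewards."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return mean, std


def advantages(rewards, mode, eps=1e-6):
    """Per-answer advantages under three treatments of the group std.

    mode="grpo":    divide the centred reward by the std.
    mode="dr_grpo": keep the centred reward, no division.
    mode="dapo":    drop the group (return None) when the std is zero;
                    otherwise normalise as in "grpo" (an assumption here).
    """
    mean, std = group_stats(rewards)
    centred = [r - mean for r in rewards]
    if mode == "grpo":
        return [c / (std + eps) for c in centred]
    if mode == "dr_grpo":
        return centred
    if mode == "dapo":
        if std == 0:
            return None
        return [c / (std + eps) for c in centred]
    raise ValueError(f"unknown mode: {mode}")


split = [1, 1, 0, 0]      # answers disagree evenly
unanimous = [1, 1, 1, 1]  # answers all agree

print(group_stats(split))
print(group_stats(unanimous))
print(advantages([1, 0, 0, 0], "dr_grpo"))
print(advantages(unanimous, "dapo"))
```

A unanimous group yields all-zero centred rewards, so it contributes no signal under any of the three settings; the settings differ only in how they treat groups with nonzero disagreement and whether unanimous groups are kept in the batch.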

81. 【2607.00143】Hate Speech Detection in Turkish and Arabic Languages: A Comprehensive Study

Link: https://arxiv.org/abs/2607.00143

Authors: Somaiyeh Dehghan, Gökçe Uludoğan, Mehmet Umut Şen, Elif Erol, Arzucan Özgür, Berrin Yanikoglu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: hate speech, violence against minorities, mass shootings, global rise, rise in violence

Comments: 11 Tables

Click to view abstract

Abstract:Online hate speech has been linked to a global rise in violence against minorities, including incidents such as mass shootings, lynchings, and ethnic cleansing. Societies grappling with this issue, particularly when hate speech targets specific groups based on religion, race, ethnicity, culture, nationality, or migration status, face the challenge of balancing freedom of expression with the need for effective content moderation on widely used online platforms. In response to this challenge, we introduce a comprehensive hate speech dataset covering five distinct topics in Turkish: refugees, the Israel-Palestine conflict, anti-Greek sentiment in Turkey, ethnic or religious communities (Alevis, Armenians, Arabs, Jews, and Kurds), and LGBTI+, alongside one topic in Arabic (refugees). In addition, we develop state-of-the-art BERT-based models to address multiple dimensions of hate speech analysis, including hate category classification, hate intensity prediction, target identification, and hate speech span detection, enabling a comprehensive understanding of hateful content in online discourse.

82. 【2607.00140】CogTax: A Four-Level Cognitive Taxonomy for Command-Line Computing Education

Link: https://arxiv.org/abs/2607.00140

Authors: Manuel Alonso-Carracedo (1 and 2), Ruben Fernandez-Boullon (1 and 2), Pedro Celard (1 and 2), Francisco J. Rodriguez-Martinez (1 and 2), Lorena Otero-Cerdeira (1 and 2) ((1) Universidade de Vigo, Spain, (2) IFCAE, Universidade de Vigo, Spain)

Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: computing education expands, pedagogical frameworks struggle, existing pedagogical frameworks, learner actions, expands beyond traditional

Comments: 35 pages, 9 figures, 4 tables

Click to view abstract

Abstract:As computing education expands beyond traditional programming into operational domains such as systems administration and command-line environments, existing pedagogical frameworks struggle to capture a dimension that is critical in these contexts: the real-world consequences of learner actions. Existing cognitive taxonomies classify learning objectives by mental operations but do not account for system impact, leaving a critical gap in command-line education where conceptually simple commands can have severe consequences. This work presents CogTax, a four-level cognitive taxonomy that integrates two dimensions: cognitive complexity, derived from Bloom's Revised Taxonomy, and operational impact, which distinguishes observational, reversible, structural, and administrative operations. The four progressive levels range from safe read-only inspection to advanced system management requiring integration of multiple abstract models. Then, the taxonomy level is defined as the maximum of these dimensions, ensuring that both conceptual understanding and operational awareness are addressed. CogTax gives instructors a principled framework for sequencing course material and calibrating assessment difficulty, and gives students an explicit reference for self-assessment and gap identification. To demonstrate that taxonomy levels are automatically assignable, making the framework scalable without manual expert annotation, a classifier that combines syntactic representations derived from abstract syntax trees with semantic embeddings is trained. Evaluated on 585 expert-annotated Linux/bash commands, this combined approach achieves 89% accuracy, outperforming either representation alone, and demonstrates cross-language extensibility through structural equivalences across command languages.
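
The rule that a command's taxonomy level is the maximum of its two dimensions can be sketched in a few lines. The numeric mapping of the four operational-impact categories to levels 1 to 4, the 1 to 4 scale for cognitive complexity, and the example ratings are all assumptions for illustration, not values from the paper.

```python
# Assumed ordering of the four operational-impact categories named above.
IMPACT_LEVEL = {
    "observational": 1,
    "reversible": 2,
    "structural": 3,
    "administrative": 4,
}


def cogtax_level(cognitive_level, impact):
    """Combine both dimensions by taking their maximum."""
    if not 1 <= cognitive_level <= 4:
        raise ValueError("cognitive_level must be between 1 and 4")
    return max(cognitive_level, IMPACT_LEVEL[impact])


# Hypothetical ratings: a conceptually simple command with a heavy impact
# still lands at a high level, and so does a complex read-only command.
print(cogtax_level(1, "structural"))
print(cogtax_level(3, "observational"))
```

Taking the maximum means neither dimension can mask the other, which matches the abstract's point that conceptually simple commands can still have severe consequences.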

83. 【2607.00139】Benchmarking Frontier LLMs on Arabic Cultural and Sociolinguistic Knowledge: A Cross-Evaluation Framework with Human SME Ground Truth

Link: https://arxiv.org/abs/2607.00139

Authors: Sajjad Abdoli, Ghassan Al-Sumaidaee, Ahmad ElShiekh, Clayton W. Taylor, Ahmed Rashad

Categories: Computation and Language (cs.CL)

Keywords: high-stakes domains, deploying language models, principal bottleneck, bottleneck to deploying, deploying language

Comments:

Click to view abstract

Abstract:The cost of human expert evaluation is a principal bottleneck to deploying language models in specialized, high-stakes domains. This is particularly acute for Arabic sociolinguistic knowledge: credible grading requires not only linguistic fluency but deep cultural familiarity that cannot be approximated by surface-level metrics. We address this with a cross-evaluation framework instantiated on two underrepresented Arabic dialect communities: Egyptian and Iraqi Arabic. We contribute 103 validated prompt-rubric pairs (70 Egyptian, 33 Iraqi; 53 Cultural, 50 Linguistic), authored and graded by native-speaker SMEs using penalty-weighted rubrics distinguishing positive content requirements from answer-specific negative error criteria. Three frontier LLMs serve as target models (graded by human SMEs across 302 unique prompt-response pairs), while five frontier LLMs serve as automated judges enforcing a provider-level self-evaluation guard. A dual-metric scheme combining Mean Absolute Deviation (MAD) with Signed Mean Error separates directional grading bias from symmetric noise. Across 1,307 judge evaluations: GPT-5.4 is the most reliable judge (MADj = 10.21 pp, Signed Error = -1.12%); four of five judges show systematic leniency (+2.01% to +6.56%); Cultural tasks are harder to grade than Linguistic tasks for all judges (MAD gap 1.83-4.78 pp); and models perform substantially better on Egyptian prompts than on Iraqi prompts. However, given leniency differences between Iraqi and Egyptian SMEs, we cannot solely attribute this gap to model knowledge. We therefore emphasize findings that do not assume identical leniency across human graders. Across all samples, implicit cultural reasoning -- requiring models to simulate native-speaker judgment rather than rely on lexical verification -- emerges as the primary failure mode for automated grading across all judge models.
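
A short sketch shows why pairing the two metrics is informative. It assumes both metrics are computed over judge-minus-human score differences in percentage points; the function names and the toy scores are hypothetical and not drawn from the paper's data.

```python
def mean_absolute_deviation(judge, human):
    """Average size of the judge-human gap, ignoring direction."""
    return sum(abs(j - h) for j, h in zip(judge, human)) / len(human)


def signed_mean_error(judge, human):
    """Average judge-human gap; positive values indicate leniency."""
    return sum(j - h for j, h in zip(judge, human)) / len(human)


human = [70, 80, 60, 90]          # hypothetical SME scores
noisy_judge = [80, 70, 70, 80]    # errors cancel out in direction
lenient_judge = [80, 90, 70, 100]  # errors all point upward

for name, judge in [("noisy", noisy_judge), ("lenient", lenient_judge)]:
    print(name,
          mean_absolute_deviation(judge, human),
          signed_mean_error(judge, human))
```

Both toy judges have the same mean absolute deviation, but only the lenient one has a nonzero signed error, so the second metric is what separates directional bias from symmetric noise.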

84. 【2607.00083】Harnessing the Latent Space: From Steering Vectors to Model Calibrators for Control and Trust

Link: https://arxiv.org/abs/2607.00083

Authors: Nishant Subramani

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: unreliable text generators, highly-capable large models, trillions of parameters, changed from unreliable, unreliable text

Comments: ACL 2026 (BigPicture Workshop)

Click to view abstract

Abstract:Language models have changed from unreliable text generators to highly capable large models with trillions of parameters. Capability increases come hand-in-hand with increases in scale, making understanding the internal representations of models more challenging. Since millions of users increasingly rely on language models to interact with external tools or make decisions in medium- or high-stakes scenarios, we need to establish control over model behavior and know when to trust model outputs. In this paper, we discuss our contributions on harnessing the latent spaces by proposing steering vectors for control and developing latent space-based model calibrators for trust. Together, our contributions help demystify the latent spaces of language models and offer new insights into how to harness model internals to build more trustworthy language technology.

85. 【2607.00044】Destination-Labeled Self-Looping Systems with Dwell: Intrinsic Characterization, Realization Cost, and Recognition

Link: https://arxiv.org/abs/2607.00044

Authors: Reda Belaiche

Categories: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)

Keywords: finite-state symbolic controller, minimum dwell requirement, visible state carries, admissible visible transitions, study a finite-state

Comments:

Click to view abstract

Abstract:We study a finite-state symbolic controller for systems in which the admissible visible transitions are fixed in advance and each visible state carries a minimum dwell requirement. The resulting model, which we call a destination-labeled self-looping system with dwell (DLSL system), records the visible graph together with local decision maps; dwell memory appears only after phase expansion. The main structural issue is that, once dwell is imposed, the current visible state no longer determines whether a departure is allowed. This leads to the converse problem: which deterministic transducers arise as phase-expanded realizations of DLSL systems over a fixed visible graph? We show that the answer is exactly the class of fiber-linear graph-respecting transducers. Under natural reachability and realizable-departure assumptions, equivalent accessible realizations over the same visible graph are isomorphic; in particular, the visible transduction determines the dwell vector and the local decision maps. We also prove that any graph-preserving deterministic realization enforcing dwell values $(d_i)$ requires exactly $\sum_i d_i$ control states. Finally, we give an $O(|Q||\Omega|)$ recognition and reconstruction procedure, and extend the analysis to an edge-entry variant in which transitions may enter interior phases of successor fibers.
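
To make the $\sum_i d_i$ state count tangible, the sketch below builds one plausible phase expansion: each visible state $i$ is unrolled into $d_i$ phases, and departures are allowed only from the last phase. This construction is an assumption made for illustration (including the function name, the entry-at-phase-0 convention, and the self-loop at the final phase); the paper's formal definition of a DLSL system may differ.

```python
def phase_expand(dwell, edges):
    """Unroll each visible state into dwell-many phases.

    dwell: {visible_state: minimum dwell d_i >= 1}
    edges: set of (src, dst) admissible visible transitions
    Returns (states, transitions), where transitions maps each
    (visible_state, phase) to its list of allowed successors.
    """
    states = [(s, p) for s, d in dwell.items() for p in range(d)]
    transitions = {}
    for s, d in dwell.items():
        for p in range(d):
            if p < d - 1:
                # Dwell not yet satisfied: the only move is to keep waiting.
                transitions[(s, p)] = [(s, p + 1)]
            else:
                # Dwell satisfied: stay (self-loop) or depart along an edge.
                stay = [(s, p)]
                leave = [(t, 0) for (u, t) in sorted(edges) if u == s]
                transitions[(s, p)] = stay + leave
    return states, transitions


dwell = {"A": 2, "B": 3}
edges = {("A", "B"), ("B", "A")}
states, transitions = phase_expand(dwell, edges)
print(len(states))
print(transitions[("A", 0)], transitions[("A", 1)])
```

In this sketch the visible state alone does not say whether a departure is allowed; the phase component carries that information, which is why the expanded controller needs one state per unit of dwell.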

86. 【2607.00017】Learning User-Aware Recall: Personalized Retrieval in Long-Term Conversational Memory

Link: https://arxiv.org/abs/2607.00017

Authors: ZhiShu Jiang, Haibo Liu, Xin Shen, Guanqiang QI, Chenxi Miao, Weikang Li, Liwei Qian, Xin Pei, Jizhou Huang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: remember past interactions, Long-term conversational agents, Group Relative Policy, Relative Policy Optimization

Comments:

Click to view abstract

Abstract:Long-term conversational agents are expected to remember past interactions, but memory is useful only when the right evidence is recalled for the right user. Existing memory-augmented LLM agents have made progress in building compact memory banks, yet retrieval is still often driven by query-centered similarity or fixed ranking rules, leaving user-conditioned relevance […]. To address this gap, we propose Profile-guided Personalized Retrieval Optimization (PPRO), a retrieval-centric framework that makes memory retrieval both user-aware and […]. PPRO builds episodic and semantic memory banks from dialogue histories and derives a user profile from accumulated […]. The profile serves as an explicit personalized prior in memory ranking, allowing retrieval to account for stable user attributes, preferences, and […]. PPRO further trains a query rewriter with Group Relative Policy Optimization, using both evidence retrieval quality and downstream answer quality as feedback while keeping the memory banks and answer model […]. Experiments on LoCoMo and LongMemEval-S show consistent gains over training-free memory systems and training-based […]. Ablation studies further show that both profile-guided ranking and retrieval-oriented rewriting contribute substantially to performance, highlighting retrieval optimization as a key factor in personalized long-term memory use.

87. 【2607.00010】Prompt Optimization for User Simulation in Conversational Recommender Systems: A Multi-Objective Framework

Link: https://arxiv.org/abs/2607.00010

Authors: Nipun B Nair, Tongtong Wu, Weiqing Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: actively elicit preferences, Conversational recommender systems, next-generation intelligent recommender, Conversational recommender, clarify intentions

Comments: to be published in 2026 IEEE 42nd International Conference on Data Engineering Workshops (ICDEW)

Click to view abstract

Abstract:Conversational recommender systems (CRSs) are a core component of next-generation intelligent recommender systems because they enable users to actively elicit preferences, clarify intentions, and adapt recommendations in real time. However, there are two key obstacles in the CRS domain: evaluation and access to training data. Evaluating CRSs through real human studies is more critical than for traditional recommender systems, yet such studies are both costly and time-consuming. Moreover, CRS interaction data are often difficult to obtain for model training due to privacy concerns. Large language model (LLM)-based user simulators have shown promise in addressing both challenges by generating synthetic user interactions for evaluation and training. However, existing approaches suffer from systematic positive bias, data leakage, and limited behavioral diversity, and they rely on brittle manual prompt engineering that requires extensive domain expertise. In this paper, we propose a framework to automatically optimize prompts for LLM-based user simulators in CRSs, simultaneously mitigating these issues. Experimental results demonstrate that the proposed framework achieves improved behavioral alignment with human interaction patterns compared to baseline methods across diverse prompt settings.

88. 【2607.00009】Controllable Narrative Rendering for Enhanced Assisted Writing

Link: https://arxiv.org/abs/2607.00009

Authors: Mingzhe Lu, Yanbing Liu, Jiayue Wu, Jiarui Zhang, Qihao Wang, Yue Hu, Yunpeng Li, Yangyan Xu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, persistent binary failure, basic writing assistance, language models, binary failure

Comments:

Click to view abstract

Abstract:Despite the remarkable proficiency of large language models (LLMs) in basic writing assistance, their utility in creative writing is fundamentally hindered by a persistent binary failure. This issue manifests as an oscillation between safe, surface-level editing, referred to as remedial polishing, and destructive, uncontrolled plot expansion. This dilemma defines a critical trade-off between narrative fidelity and descriptive intensity. We propose Loom, an assisted writing framework grounded in the narratological distinction between story and discourse. Loom employs a three-layer pipeline that operationalizes an intent-centered semiotic chain-of-thought to enforce precise control over narrative intent and rendering density. This architecture separates the generation of perceptual material from syntactic insertion, ensuring that enhancement occurs without violating the original event structure. Our comprehensive evaluation, which includes LLM-based metrics and human assessment, demonstrates that Loom successfully resolves this fundamental tension. Loom achieves the highest overall quality score, yielding substantial gains in factual integrity and descriptive intensity compared to state-of-the-art baselines.

89. 【2607.00006】Persona Without Substrate: Regime-Dependence and the LLM Individuation Problem

Link: https://arxiv.org/abs/2607.00006

Authors: Shuaizhi Cheng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: LLM individuation problem, unargued cross-regime co-reference, cross-regime co-reference assumption, individuation problem inherits, Beckmann Butlin

Comments: 30 pages, 2 figures, 1 table. Replies to Beckmann and Butlin ([arXiv:2604.17031](https://arxiv.org/abs/2604.17031))

Click to view abstract

Abstract:Beckmann and Butlin's (2026) ontological framework for the LLM individuation problem inherits an unargued cross-regime co-reference assumption from the persona-vectors literature: that the same direction picks out the same content under prompt-conditioning, gradient-descent fine-tuning, and inference-time steering. We present four empirical wedges from persona-topology experiments on Qwen3-4B-Instruct and Mistral-7B-Instruct-v0.2 that jointly undermine the assumption: non-collinearity of prompt-extracted vectors and fine-tune basins; fictional personas displacing the model along real-anchor directions more strongly than real anchors do; contradictory-valenced mixtures biased toward a training-history-determined attractor; and asymmetric compositional algebra under inference-time arithmetic versus fine-tune-time chimera training. We propose regime-indexed individuation: the identity unit for representational content is a (vehicle, regime) pair, not a vehicle alone. Under this framework, Beckmann and Butlin's three candidate positions describe three different regime-internal objects rather than competing for the same referent; the same diagnosis applies to Mollo and Millière, Chalmers, and Cerullo.

90. 【2606.31980】DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching

Link: https://arxiv.org/abs/2606.31980

Authors: Meng Chen, Anya Ji, Tsung-Han Wu, Tobias Maringgele, David M. Chan, Alane Suhr, Amy Pavel

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: automating software tasks, increasingly capable, capable of automating, software tasks, automating software

Abstract:Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions consisting of 22,752 dialogue turns grounded in 28.1 hours of screen and input event recordings across five software applications. We use DigitalCoach to evaluate whether state-of-the-art models can teach humans how to use computers. Automated evaluation shows that models differ from humans in how they coach: models provide more direct instructions, but fewer explanations, error diagnoses, and knowledge-check questions. When we fix the coaching method, models produce utterances similar to human references yet poorly grounded in visual context. Interactive evaluation confirms that model coaches cause learners to passively follow instructions without deeper engagement and fall short in visual grounding. DigitalCoach lays a foundation for collaborative and proactive computer use coaching agents.

91. 【2507.15692】Surfacing Variations to Calibrate Perceived Reliability of MLLM-generated Image Descriptions

Link: https://arxiv.org/abs/2507.15692

Authors: Meng Chen, Akhil Iyer, Amy Pavel

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, Multimodal large, large language models, access visual information, provide new opportunities

Notes: 18 pages, 6 figures

Abstract:Multimodal large language models (MLLMs) provide new opportunities for blind and low vision (BLV) people to access visual information in their daily lives. However, these models often produce errors that are difficult to detect without sight, posing safety and social risks in scenarios from medication identification to outfit selection. While BLV MLLM users use creative workarounds such as cross-checking between tools and consulting sighted individuals, these approaches are often time-consuming and impractical. We explore how systematically surfacing variations across multiple MLLM responses can support BLV users to detect unreliable information without visually inspecting the image. We contribute a design space for eliciting and presenting variations in MLLM descriptions, a prototype system implementing three variation presentation styles, and findings from a user study with 15 BLV participants. Our results demonstrate that presenting variations significantly increases users' ability to identify unreliable claims (by 4.9x using our approach compared to single descriptions) and significantly decreases perceived reliability of MLLM responses. 14 of 15 participants preferred seeing variations of MLLM responses over a single description, and all expressed interest in using our system for tasks from understanding a tornado's path to posting an image on social media.

92. 【2607.01161】Disentangling Speaker and Language Effects in Cross-Lingual Speaker Verification for Iberian Languages

Link: https://arxiv.org/abs/2607.01161

Authors: Pol Buitrago, Javier Hernando

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

Keywords: Cross-lingual speaker verification, enrollment and test, test utterances, utterances are spoken, systems typically exhibit

Notes: 5 pages, 8 figures, Submitted to IberSPEECH 2026

Abstract:Cross-lingual speaker verification (SV) systems typically exhibit performance degradation when enrollment and test utterances are spoken in different languages. However, standard evaluation protocols confound language mismatch with inter-speaker variability, as evaluation is generally performed with different speakers across languages. In this work, we introduce a bilingual same-speaker evaluation set for five Iberian languages, enabling analysis of cross-lingual SV under constant speaker identity. We apply this setup to a HuBERT-based SV system previously shown to exhibit strong language dependence, and analyze results using the Cross-Lingual Transfer Matrix (CLTM) to study pairwise cross-lingual transfer. Our results show that speaker-related variability accounts for part of the observed degradation, but language mismatch remains the main driver of cross-lingual performance loss. These findings provide a more precise characterization of language dependence in cross-lingual SV.

93. 【2607.00397】NeuroCogMap Reveals Cognitive Organization of Large Language Models

Link: https://arxiv.org/abs/2607.00397

Authors: Zhongxiang Sun, Haolang Lu, Qiang Ma, Qi Li, Qipeng Wang, Liang Pang, Chenyu Liu, Qiankun Li, Hao Sun, Kun Wang, Yi Zeng, Jun Xu, Guoqi Li, Ji-Rong Wen

Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Understanding how complex, interpreting large language, central to interpreting, interpreting large, biological cognition

Notes: 79 pages, 6 main figures, 5 extended figures

Abstract:Understanding how complex cognitive functions are organized within artificial systems is central to interpreting large language models (LLMs) and relating them to biological cognition. Yet although LLMs exhibit broad cognitive-like behaviours, it remains unclear whether their internal representations form reproducible functional systems that explain behaviour, failure and links to human cognition. Here we present NeuroCogMap, a cognitive neuroscience-inspired framework that organizes internal features of LLMs into functional parcels and links them to interpretable functions, cognitive capabilities and a cognitive hierarchy. These parcels form a stable and semantically coherent organization that is partly conserved across models and functionally linked to model outputs. Within this organization, major LLM failures, including hallucination, bias, refusal failure and sycophancy, correspond to distinct disruptions in representational and behavioural-control systems, yielding internal signatures for mechanism-guided detection and targeted intervention. Beyond model behaviour, NeuroCogMap improves prediction of human cortical responses during naturalistic language comprehension, with the strongest correspondence in higher-order association cortex. At the cognitive level, its internal signatures expose latent strategies that guide refinements of classical models of human decision-making. Together, these findings establish NeuroCogMap as a system-level framework for mapping functional organization in artificial systems and for relating this organization to human cortical function and cognitive behaviour.

Information Retrieval

1. 【2607.01170】Diffusion-GR2: Diffusion Generative Reasoning Re-ranker

Link: https://arxiv.org/abs/2607.01170

Authors: Zhuoxuan Zhang, Kangqi Ni, Yuhang Chen, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Adam (Yang) Song, Sandeep Pandey, Luke Simon, Tianlong Chen, Xi Liu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: achieve strong recommendation, sequential forward pass, Generative reasoning re-rankers, strong recommendation accuracy, re-rankers achieve strong

Notes: Work in progress

Abstract:Generative reasoning re-rankers achieve strong recommendation accuracy by emitting a chain-of-thought before re-ordering a candidate list, but they are slow at inference: an autoregressive (AR) decoder spends one sequential forward pass per reasoning token, and the reasoning trace far exceeds the ranking it produces. To reduce this cost, block-diffusion language models decode many positions in parallel over a few denoising steps and are substantially faster, yet naively converting an AR re-ranker into one opens two accuracy gaps: (1) a structural gap: answer positions are denoised in parallel and scored independently, so the decoder emits invalid rankings (duplicated, dropped, or out-of-set identifiers) that AR avoids through left-to-right masking; and (2) a distributional gap: fine-tuning the converted model on fixed teacher trajectories is off-policy relative to its own decoding at inference, leaving a residual accuracy gap. To close both gaps while keeping the speedup, we propose Diffusion-GR2, a recipe that converts our AR reasoning re-ranker (GR2) into a block-diffusion re-ranker. First, conversion fine-tuning (CFT) adapts the AR-initialized diffusion model to denoise the answer into a valid permutation on its own, without an external constrained decoder. Next, on-policy distillation (OPD) supervises the model on its own decoded trajectories with dense per-token targets from the AR teacher. Finally, we apply a reinforcement-learning (RL) stage against a re-ranking reward on top of OPD's on-policy policy. Experiments on Amazon Beauty demonstrate that Diffusion-GR2 recovers to near-parity with the AR re-ranker, while block-parallel decoding raises decode throughput by 2.4-3.5x at the model's reasoning output length. Ablations show that CFT recovers most of the conversion gap, and that on-policy distillation further closes it to the AR reference.
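
The "structural gap" in the abstract comes down to a decoded ranking not being a permutation of the candidate list. A minimal sketch of that check follows; the function names and the string identifiers are hypothetical and are not taken from the paper, and candidates are assumed to be unique.

```python
from collections import Counter


def ranking_errors(candidates, ranking):
    """Classify how a decoded ranking fails to be a permutation of candidates."""
    candidate_set = set(candidates)
    counts = Counter(ranking)
    return {
        # in-set identifiers that were emitted more than once
        "duplicated": sorted(i for i, c in counts.items() if c > 1 and i in candidate_set),
        # candidates that never appear in the ranking
        "dropped": sorted(candidate_set - set(ranking)),
        # identifiers that were never in the candidate list
        "out_of_set": sorted(set(ranking) - candidate_set),
    }


def is_valid_permutation(candidates, ranking):
    """True when the ranking has no duplicated, dropped, or out-of-set identifiers."""
    return not any(ranking_errors(candidates, ranking).values())
```

A check like this could be used to count invalid outputs when comparing decoders; how the paper itself measures validity is not stated in the abstract.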

2. 【2607.01162】Trie-based Experiment Plans for Efficient IR Pipeline Experiments

Link: https://arxiv.org/abs/2607.01162

Authors: Irene Anu, Craig Macdonald

Categories: Information Retrieval (cs.IR)

Keywords: successive stages combine, Search engines, final ranking, combine the results, iteratively refine

Notes: Accepted at ReNeuIR'26 workshop, colocated with SIGIR 2026. To appear in CEUR workshop proceedings

Abstract:Search engines are often formulated as cascading pipelines, where successive stages combine the results of different retrievers, and iteratively refine the ranking of candidate documents to obtain a final ranking, which can be presented to a user, or provided as context to an LLM. Such pipelines can be complex to evaluate in an end-to-end manner, necessitating measurement of Recall of early stages, and Precision of later stages, which are often interchangeable. PyTerrier is ideal for building and evaluating cascading retrieval pipelines, due to its declarative nature for pipeline construction and wide ecosystem of retrievers and rerankers. However, comparative evaluation of pipelines can be expensive due to repeated components. In this work, we describe the use of a trie data structure to formulate an experiment plan for comparative pipeline experiments that enhances experiment efficiency compared to a sequential "linear" plan. Empirically, on a demonstration experiment involving BM25, MonoT5 and DuoT5 on MSMARCO v2, we observe a 26% reduction in experiment duration. Finally, we report on a user study of undergraduate and postgraduate research students' use of the experiment plans.
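
To illustrate why a trie helps, the sketch below stores each pipeline as a sequence of stage names and runs every distinct prefix once. It is a toy in plain Python, not the PyTerrier API: `build_trie`, `run_plan`, and the stand-in stage functions are all hypothetical.

```python
def build_trie(pipelines):
    """Nested-dict trie keyed by stage name; pipelines with a common prefix share nodes."""
    root = {}
    for stages in pipelines:
        node = root
        for stage in stages:
            node = node.setdefault(stage, {})
    return root


def count_nodes(trie):
    """Number of stage executions needed when each distinct prefix runs once."""
    return sum(1 + count_nodes(child) for child in trie.values())


def run_plan(trie, stage_fns, data, prefix=()):
    """Depth-first execution that feeds each stage's output to its children."""
    results = {}
    for stage, child in trie.items():
        out = stage_fns[stage](data)
        path = prefix + (stage,)
        results[path] = out
        results.update(run_plan(child, stage_fns, out, path))
    return results


# Hypothetical stand-in stages that only record that they were called.
calls = []


def make_stage(name):
    def stage(data):
        calls.append(name)
        return data + [name]
    return stage


pipelines = [
    ("bm25",),
    ("bm25", "monot5"),
    ("bm25", "monot5", "duot5"),
]
stage_fns = {name: make_stage(name) for name in ("bm25", "monot5", "duot5")}

trie = build_trie(pipelines)
results = run_plan(trie, stage_fns, [])

linear_cost = sum(len(p) for p in pipelines)  # every pipeline run from scratch
trie_cost = count_nodes(trie)                 # every distinct prefix run once
```

In this toy the linear plan needs 6 stage executions and the trie plan needs 3. That count is not comparable to the 26% duration reduction reported in the abstract, which is measured on real retrievers with very different per-stage costs.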

3. 【2607.01071】MemSyco-Bench: Benchmarking Sycophancy in Agent Memory

Link: https://arxiv.org/abs/2607.01071

Authors: Zhishang Xiang, Zerui Chen, Yunbo Tang, Zhimin Wei, Ruqin Ning, Yujie Lin, Qinggang Zhang, Jinsong Su

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: modern LLM-based agents, supporting their evolution, long-term collaborators, cornerstone of modern, modern LLM-based

Abstract:Memory has emerged as a cornerstone of modern LLM-based agents, supporting their evolution from single-turn assistants to long-term collaborators. However, memory is not always beneficial: retrieved memories often induce a critical issue of sycophancy, causing agents to over-align with the user at the cost of factual accuracy or objective reasoning. Despite this emerging risk, existing memory benchmarks primarily evaluate whether memories are correctly stored, retrieved, or updated, while overlooking how retrieved memories influence downstream reasoning and decision-making. To bridge this gap, we propose MemSyco-Bench, a comprehensive benchmark for evaluating memory-induced sycophancy in agent systems. MemSyco-Bench measures when memory should influence a decision and how valid memory should be used. Specifically, it covers five tasks that assess whether agents can reject memory as factual evidence, respect its applicable scope, resolve conflicts between memory and objective evidence, track memory updates, and use valid memory for personalization. All related resources are collected for the community at this https URL.

4. 【2607.01040】As It Was: Aligning LLM Search Evaluation with Historical User Preferences

Link: https://arxiv.org/abs/2607.01040

Authors: Ali Vardasbi, Gustavo Penha, Enrico Palumbo, Claudia Hauff, Hugues Bouchard, Mounia Lalmas

Categories: Information Retrieval (cs.IR)

Keywords: human quality assurance, systems evolve faster, assurance can scale, evolve faster, quality assurance

Abstract:Large-scale search systems evolve faster than human quality assurance can scale, especially for long-tail intents and multilingual queries. LLM-as-a-judge approaches provide a scalable alternative for evaluating the relevance of search engine result pages (SERPs), but judgments based solely on semantic similarity or world knowledge can drift from actual user preferences, particularly for ambiguous queries. We introduce a behavior-grounded LLM judge that augments each SERP item with a lightweight and auditable behavioral prior in the form of a Query-Relevance-Impressions (QRI) card. Each card summarizes how users have historically interacted with similar queries and results, providing compact empirical evidence that the judge can cite to resolve ambiguity and make more consistent relevance judgments while still relying on semantic reasoning. In a large-scale music search evaluation at Spotify, using relevance estimates derived from historical user interactions across 6,000 recomposed SERPs, the behavior-grounded judge achieves stronger alignment with user preferences, improving Spearman rank correlation by approximately 5% overall and yielding a 91% relative improvement on disagreement cases. On a multilingual human-judged dataset spanning five languages, grounding further increases correlation with human relevance judgments by 15%. Importantly, when evaluated against outcomes from a live A/B test, the grounded judge shows consistently higher alignment with the observed winning model. While absolute alignment remains moderate, these findings demonstrate that lightweight behavioral grounding can improve the reliability and practical usefulness of LLM-based evaluation in real-world search systems.

5. 【2607.00768】RACORN-1: Adaptive Recall-Preserving Speedup for Low-Selectivity Filtered Vector Search

Link: https://arxiv.org/abs/2607.00768

Authors: Yoonseok Kim, Gyusik Choe

Categories: Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: Filtered Vector Search, vector embedding similarity, Filtered Vector, structured metadata predicates, production retrieval systems

Notes: 13 pages, 11 figures, 10 tables

Abstract:Filtered Vector Search (FVS), which combines vector embedding similarity with structured metadata predicates, has emerged as a core requirement in RAG and production retrieval systems. ACORN-1, the representative In-filtering algorithm that reuses an existing HNSW index, substantially reduces latency at low selectivity but suffers connectivity instability below 5% selectivity and recall collapse below 1%. We propose RACORN-1, an in-place extension of ACORN-1 that resolves this collapse via (i) Adaptive Search Fallback (ASF) -- repurposing filter-failing nodes as transient bridges to detour around severed paths; bridge and two-hop candidate selection uses stride sampling for spatial diversity. While filter-first ACORN-family methods have a structural recall trade-off relative to distance-first HNSW, RACORN-1 improves the trade-off curve via ASF, minimizing recall loss while substantially reducing latency. Across three 1M-scale and one 40M-scale dataset, RACORN-1 delivers approximately 9-26x latency reduction over HNSW in the sweet spot (1%-0.3%), and recovers ACORN-1's recall collapse from 0.45-0.72 (1%) and 0.03-0.10 (0.3%) to 0.70-0.96 and 0.77-0.98 respectively. For the extreme-low-selectivity regime where linear scan can outperform graph search, we combine RACORN-1 with (ii) Adaptive Exact Fallback (AEF) in a variant RACORN-1+, achieving recall 1.00 with 20-75x speedup at 1M scale (0.1% selectivity) and 13x speedup at 40M scale (0.01% selectivity). Under a Negative Correlation evaluation (K-means clusters), where ACORN-1 collapses (recall 0.08-0.41), RACORN-1 maintains recall 0.80-0.98 with a 5-9x latency advantage over HNSW. Together, RACORN-1 and RACORN-1+ form an ACORN-1-compatible mechanism robust to both extreme-low-selectivity and adversarial query-filter correlation.
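
The exact-fallback idea for the extreme-low-selectivity regime can be sketched as a dispatcher: when very few items pass the filter, scan them directly instead of walking the graph. This is only an illustration of the general pattern, not the paper's AEF. The `exact_below` threshold, the `graph_search` callable, and all function names are assumptions.

```python
import heapq
import math


def exact_filtered_knn(query, vectors, passes_filter, k):
    """Brute-force k nearest neighbours among vectors whose index passes the filter."""
    scored = (
        (math.dist(query, vec), idx)
        for idx, vec in enumerate(vectors)
        if passes_filter(idx)
    )
    return [idx for _, idx in heapq.nsmallest(k, scored)]


def filtered_search(query, vectors, passes_filter, k, selectivity,
                    graph_search, exact_below=0.001):
    """Use an exact scan when selectivity is tiny, otherwise defer to graph search.

    `graph_search` is a hypothetical callable standing in for an HNSW-style
    filtered search; `exact_below` is an assumed cutoff, not a value from the paper.
    """
    if selectivity <= exact_below:
        return exact_filtered_knn(query, vectors, passes_filter, k)
    return graph_search(query, passes_filter, k)
```

Because the scan only touches filter-passing items, its cost shrinks with selectivity, which is why a linear scan can win when almost nothing passes the filter.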

6. 【2607.00728】When to Repair a Graph ANN Index: Navigability-Signal-Triggered Local Repair Protects Tail Recall Under Bursty Churn

Link: https://arxiv.org/abs/2607.00728

Authors: Madhulatha Mandarapu, Sandeep Kunkunuru

Categories: Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: removed nodes, deletions orphan, orphan the greedy-search, greedy-search paths, paths that route

Notes: 7 pages. Code + one-command reproduction: [this https URL](https://github.com/samyama-ai/updatable-graph-index)

Abstract:Graph approximate-nearest-neighbor (ANN) indexes (HNSW, DiskANN/Vamana) lose recall under insert/delete churn, because deletions orphan the greedy-search paths that route through removed nodes. Production systems restore navigability by repairing the graph on a fixed schedule (consolidate every X operations). We ask whether triggering local edge repair on a measured navigability-degradation signal, rather than a blind clock, spends a fixed repair budget better. On two real ANN datasets (SIFT-128 and Fashion-MNIST-784) under a controlled bursty churn stream, and comparing repair policies at matched amortized repair budget (equal consolidation count), signal-triggered repair Pareto-dominates fixed-cadence repair. The gain is concentrated on worst-case (tail) recall at scarce budget: at roughly one consolidation it improves the minimum recall@10 by +0.014 (SIFT) to +0.050 (Fashion-MNIST) across four stream seeds, with 95% confidence intervals excluding zero, while the mean-recall gain is small (0.005). The advantage follows a clean drift-severity gradient -- larger for sparser, more fragile graphs -- and fades to parity when the index is robust or budget is ample. A cheap probe-recall signal is a valid, leading indicator of true recall (Spearman rho ~= 0.95). We contribute the mechanism, a budget-matched evaluation protocol that separates repair scheduling from repair spend, and an open, reproducible churn-repair harness. We deliberately do not claim a mean-recall improvement or a new index; a recall-versus-repair-cost bound and data-distribution-drift coupling are left as future work.

7. 【2607.00725】What Survives Into Context: A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It

Link: https://arxiv.org/abs/2607.00725

Authors: Ananto Nayan Bala

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Retrieval-augmented generation, selection problem, forces a selection, fixed reader-context budget, reader-context budget forces

Notes: 12 pages, 5 figures

Abstract:Retrieval-augmented generation (RAG) under a fixed reader-context budget forces a selection problem: of the evidence retrieved, only a fraction can be shown to the reader. We argue that document recall -- the standard retrieval metric -- is the wrong quantity to optimize in this regime, and we make two contributions. First, as a general contribution, we introduce answer-in-context, a diagnostic that measures whether a gold answer survives as a contiguous span in the packed reader context (not the retrieved set). It predicts answer F1 better than recall (r=0.39-0.55 vs. about 0.31), separates answer quality roughly five-fold (0.60 vs. 0.12 on HotpotQA), and carries information beyond retrieval: it adds Delta R squared=0.17 over recall and shows a 4.6x EM gap even among questions where all gold was retrieved. We also confirm it interventionally: on 2WikiMultiHopQA a packing change that raises coverage but not answer-in-context yields no accuracy gain. Second, as a conditional contribution, we cast reader-context construction as budgeted monotone submodular maximization and build a packer that jointly optimizes relevance, query coverage, representativeness, and diversity. On HotpotQA with a 160-token budget and a 3B reader it beats a strong focused heuristic, MMR, and naive packing -- by up to +5.1 F1 at equal-or-lower token cost, across three seeds. Crucially, we map the scope of this win honestly: it requires the conjunction of (i) multi-hop complementary structure, (ii) retrieval that surfaces the evidence, (iii) a binding but not extreme budget, and (iv) a reader weak enough that evidence density, not reading capacity, is the bottleneck. A quantization-controlled reader-scale ladder (3B to 7B to 14B) shows the edge over the heuristic is absorbed by 7B and significantly reverses by 14B, while the diagnostic explains every boundary with a single variable.
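
The two ideas in this abstract can be sketched together: a check for whether the gold answer survives as a contiguous span in the packed context, and a cost-benefit greedy packer under a token budget. The packer below optimizes only query-term coverage, a simple monotone submodular stand-in for the paper's combined objective. The tokenizer, function names, and example passages are assumptions.

```python
import re


def tokenize(text):
    """Lowercased word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def answer_in_context(packed_context, gold_answer):
    """True if the gold answer appears as a contiguous token span in the context."""
    ctx, ans = tokenize(packed_context), tokenize(gold_answer)
    if not ans:
        return False
    n = len(ans)
    return any(ctx[i:i + n] == ans for i in range(len(ctx) - n + 1))


def greedy_pack(passages, query, budget):
    """Pick passages by marginal query-term coverage per token until the budget binds."""
    query_terms = set(tokenize(query))
    covered, chosen, used = set(), [], 0
    remaining = list(passages)
    while remaining:
        best, best_ratio = None, 0.0
        for passage in remaining:
            tokens = tokenize(passage)
            gain = len((set(tokens) & query_terms) - covered)
            if tokens and used + len(tokens) <= budget and gain / len(tokens) > best_ratio:
                best, best_ratio = passage, gain / len(tokens)
        if best is None:  # nothing fits or nothing adds coverage
            break
        chosen.append(best)
        covered |= set(tokenize(best)) & query_terms
        used += len(tokenize(best))
        remaining.remove(best)
    return chosen


passages = [
    "Paris is the capital of France",
    "The Eiffel Tower is in Paris",
    "Bananas are yellow",
]
query = "capital of France Eiffel Tower"
roomy = greedy_pack(passages, query, budget=12)
tight = greedy_pack(passages, query, budget=6)
```

With a 12-token budget both relevant passages fit and "Eiffel Tower" survives into the context. With 6 tokens only the first passage is packed and the answer is lost even though it was retrieved, which is the situation the diagnostic is meant to expose.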

8. 【2607.00597】Multi-Turn Agentic Scientific Literature Search via Workflow Induction

Link: https://arxiv.org/abs/2607.00597

Authors: Jisen Li (1 and 2), Bingxuan Li (1), Nanyi Jiang (3), Xuying Ning (1), Xiyao Wang (3), Yifan Shen (1), Heng Wang (1), Yuqing Jian (2), Xiaoxia Wu (2), Ben Athiwaratkun (2), Pan Lu (4), Jiaxuan You (1), Bingxin Zhao (3) ((1) University of Illinois Urbana-Champaign, (2) Together AI, (3) University of Pennsylvania, (4) Stanford University)

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: search, literature search, Scientific literature search, retrieving papers, users' intents

Notes: 17 pages, 12 figures

Abstract:Scientific literature search often requires more than retrieving papers from a single query: users' intents are underspecified, preference-dependent, and evolve through interaction. Existing search agents typically rely on fixed pipelines or implicit language-only reasoning, making their search strategies difficult to control, inspect, and refine. We introduce PaperPilot, a multi-turn literature search agent that frames scientific search as workflow induction. Given an anchor paper and a user query, PaperPilot constructs an executable DAG of paper-search operators, including keyword search, citation expansion, filtering, scoring, reranking, and evidence extraction. User feedback is then used to refine both the query and the workflow itself. We train PaperPilot with supervised workflow imitation and preference optimization over controlled workflow corruptions. Experiments show that PaperPilot-9B improves over the base Qwen3.5-9B toolset agent under multi-turn interaction, increasing Hit@5 from 58.0 to 77.0, MRR from 47.5 to 59.4, and nDCG@10 from 26.8 to 32.5, while reducing workflow execution errors from 9.5% to 0%. These results show that explicit, editable search workflows provide an effective and controllable interface for aligning literature search agents with complex scientific intent.
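
For readers unfamiliar with the metrics quoted above, here is a minimal sketch of Hit@k, reciprocal rank (averaged over queries to give MRR), and nDCG@k under binary relevance. The paper's exact definitions may differ, and the paper identifiers are made up for illustration.

```python
import math


def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return float(any(doc in relevant for doc in ranked[:k]))


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is ranked."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["p3", "p1", "p7"]   # hypothetical system output
relevant = {"p1", "p9"}       # hypothetical gold set
```

Here the first relevant paper sits at rank 2, so Hit@1 is 0, Hit@5 is 1, and the reciprocal rank is 0.5. nDCG@10 is penalized further because `p9` is never retrieved.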

9. 【2607.00508】When RAG Meets Query Planning: Logical Query Trees for Resolving Exploratory Reasoning Problems

Link: https://arxiv.org/abs/2607.00508

Authors: Ganlin Xu, Linghao Zhang, Zhitao Yin, Hongda Xi, Chen Yang, Jiaqing Liang, Weijia Lu, Sihang Jiang, Yanghua Xiao, Deqing Yang

Categories: Information Retrieval (cs.IR)

Keywords: effectively grounds large, grounds large language, involving high uncertainty, exploratory reasoning problems, effectively grounds

Notes: Accepted by SIGMOD 2027

Abstract:Retrieval-Augmented Generation (RAG) effectively grounds large language models (LLMs) in external knowledge but struggles with exploratory reasoning problems (ERPs), which are complex queries involving high uncertainty and ambiguity. Resolving ERPs requires complex reasoning with unclear paths, tending to result in retrieval noise and error accumulation. Furthermore, the absence of an end-to-end planning mechanism makes it difficult to generate effective trajectories for ERPs. Motivated by database query planning, we introduce PlanRAG, a RAG framework that models ERPs of natural language as logical query trees (LQTs). However, translating ERPs into LQTs is non-trivial due to representation and optimization gaps between structured SQL and unstructured natural language, making it highly challenging to construct high-quality LQTs. To address these problems, we first decompose ERPs into atomic queries and then organize them into LQTs using dynamic programming guided by a cost model involving multiple complementary dimensions. Finally, we execute iterative aggregation, rewriting, retrieval, and generation over LQTs, processing nodes concurrently and propagating intermediate results upward, with further parallelization across multiple threads for efficiency. Our experimental results show that PlanRAG outperforms state-of-the-art iteration-based and graph-based RAG systems on our newly constructed dataset, WikiWeb-ERP, thereby providing a new formulation for optimizing natural language queries. Our source code and dataset are available at this https URL.

10. 【2607.00448】Real-Time Hard Negative Sampling via LLM-based Clustering for Large-Scale Two-Tower Retrieval

Link: https://arxiv.org/abs/2607.00448

Authors: Ivan Ji, Liuyi Hu, Harrison (Zihao) Zhao, Lei Huang, Qunshu Zhang, Max (Xiangjun) Fan, Aameek Singh

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: retrieval stage, negative sampling, sampling, two-tower models typically, negative sampling technique

Abstract:The two-tower model has been widely used for large-scale recommendation systems, particularly in the retrieval stage. Industry standards for training two-tower models typically involve in-batch and/or out-of-batch negative sampling. However, these methods often produce easy negatives that models can quickly learn, failing to sufficiently challenge the model. To address this issue, a novel self-supervised hard negative sampling technique is proposed that leverages a large language model (LLM) to generate hard negatives from the same cluster during model training. By utilizing the LLM to learn media representations, the proposed approach ensures that the generated negatives are more challenging and informative. This real-time sampling framework is designed for seamless integration into production models, capable of handling billions of training data points with minimal computational complexity. Experiments on public datasets, along with deployment to a large-scale online system, demonstrate that the proposed negative sampling technique outperforms widely used industry methods. Furthermore, analysis in industrial applications reveals that this sampling method can help break inherent feedback loops in recommendations and significantly reduce popularity bias.
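
The core sampling step, drawing negatives from the same cluster as the positive item, can be sketched as below. Cluster assignments are assumed to be precomputed (for example, by clustering LLM-derived item embeddings). The data layout, function names, and item identifiers are hypothetical, and the paper's real-time pipeline is not reproduced here.

```python
import random
from collections import defaultdict


def build_cluster_index(cluster_of):
    """Invert an item -> cluster mapping into cluster -> list of items."""
    members_of = defaultdict(list)
    for item_id, cluster_id in cluster_of.items():
        members_of[cluster_id].append(item_id)
    return members_of


def sample_hard_negatives(positive_id, cluster_of, members_of, num_negatives, rng):
    """Draw negatives from the positive item's own cluster, excluding the positive."""
    pool = [i for i in members_of[cluster_of[positive_id]] if i != positive_id]
    return rng.sample(pool, min(num_negatives, len(pool)))


# Hypothetical cluster assignments for five items.
cluster_of = {"song_a": 0, "song_b": 0, "song_c": 0, "song_d": 1, "song_e": 1}
members_of = build_cluster_index(cluster_of)
negatives = sample_hard_negatives("song_a", cluster_of, members_of, 2, random.Random(0))
```

Items in the same cluster are close in the embedding space, so they are harder for the model to tell apart from the positive than random in-batch negatives.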

11. 【2607.00379】Attribute-Prompted Kernel Hashing for Unsupervised Data-Efficient Cross-Modal Retrieval

Link: https://arxiv.org/abs/2607.00379

Authors: Runhao Li, Xiaoxu Ma, Zhenyu Weng, Yue Zhang, Guibo Luo, Huiping Zhuang, Zhiping Lin, Yap-Peng Tan

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantically related instances, manual semantic annotation, hashing enables efficient, requiring manual semantic, enables efficient retrieval

Abstract:Unsupervised cross-modal hashing enables efficient retrieval of semantically related instances across different modalities without requiring manual semantic annotation. However, existing unsupervised methods rely heavily on large-scale image-text pairs. Collecting such data can be costly, particularly in scenarios where well-aligned pairs are scarce due to privacy and specialized constraints. More critically, existing methods tend to overfit to seen training data, restricting their generalization performance on unseen categories that the constrained training data cannot cover. To address these limitations, we propose Attribute-Prompted Kernel Hashing (APKH), a novel data-efficient approach that constructs a compact, modality-aligned Hamming space driven by the generalized attribute priors of vision-language foundation models. Specifically, APKH introduces two core modules: Context-optimized Attribute Kernel Mapping (CAKM) and Kernel-Smoothed Contrastive Alignment (KSCA). CAKM formulates cross-modal alignment through hyperspherical Radial Basis Function kernel mapping, optimizing dynamic attribute kernels via prompt learning to capture modality-invariant semantics. Furthermore, KSCA extends conventional point-to-point contrastive learning by modeling limited paired data as continuous kernel distributions. This explicit smoothing of the modality gap alleviates overfitting to sparse pairwise correlations. Extensive experiments demonstrate that APKH outperforms state-of-the-art hashing methods in the challenging cross-modal retrieval tasks from seen to unseen categories under data-constrained scenarios.

12. 【2607.00374】Learning to Compose: Revisiting Proxy Task Design for Zero-Shot Composed Image Retrieval

Link: https://arxiv.org/abs/2607.00374

Authors: Jingjing Zhang, Lei Zhang, Zheren Fu, Zhendong Mao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: Composed Image Retrieval, Image Retrieval, reference image, Image, Composed Image

Notes: Accepted by ECCV 2026

Abstract:Composed Image Retrieval (CIR) retrieves a target image from a reference image and a textual modification. While supervised CIR relies on costly triplets, Zero-Shot CIR (ZS-CIR) alleviates this reliance through proxy tasks trained on image-text pairs. However, existing proxy tasks primarily enhance visual and textual representations to accommodate a predefined composition mechanism such as pseudo-word injection into a frozen text encoder or linear feature arithmetic. As a result, the composition function itself remains unlearned, limiting the model's ability to express diverse and fine-grained semantic modifications. To address this, we propose FoCo, which models composition as two coordinated stages: focusing on modification-relevant visual content, and then completing the target semantics. We realize these through two proxy tasks: text-anchored visual aggregation to selectively gather visual content guided by localized textual semantics, and context-conditioned semantic completion to transform these aggregated visuals with the remaining scene context into a coherent composed representation. The tasks are trained jointly with a cross-instance contrastive objective, encouraging semantic diversity and discouraging shortcut composition strategies. Extensive experiments on four ZS-CIR benchmarks show FoCo's state-of-the-art performance and improved generalization.

13. 【2607.00159】Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting

Link: https://arxiv.org/abs/2607.00159

Authors: Qian Ma, S M Rayeed, Charles V. Stewart, Qiong Wu, Yao Ma

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: Visual Question Answering, Visual Language Models, external structured knowledge, existing KB-VQA benchmarks, Visual Language

Notes: Accepted to ECCV 2026. The datasets and code are available in [this https URL](https://github.com/VAN-QIAN/ECCV26-ARA)

Abstract:Knowledge-Based Visual Question Answering (KB-VQA) aims to evaluate whether Visual Language Models (VLMs) can retrieve, ground, and reason over external structured knowledge beyond visual evidence. In practice, answer accuracy is widely adopted as the primary evaluation metric, implicitly treating correctness as a proxy for knowledge-grounded reasoning. However, for existing KB-VQA benchmarks, this proxy relies on critical assumptions that are often overlooked and rendered unreliable by benchmark issues: annotated answer must be derivable from the associated knowledge base, question must be well-posed with sufficient constraints, and visual setting must meaningfully require grounded disambiguation. In this work, we show that these assumptions are systematically violated in existing KB-VQA benchmarks. Our audit reveals substantial instances with missing or contradicted answers and underspecified questions that render accuracy a misleading metric. Furthermore, we find that existing datasets rely on visually trivial, single-entity scenes that bypass the need for sophisticated visual-to-knowledge mapping. We demonstrate that even with controlled architectures, these flaws lead to distorted model rankings and overestimations of reasoning capabilities. To address this, we introduce (1) a principled audit-and-repair protocol that restores answer derivability and question clarity, and (2) a controlled multi-entity augmentation protocol that introduces visual ambiguity to challenge initial retrieval and grounded reasoning. Re-evaluation under corrected and augmented settings yields markedly different performance trends. Our findings call for rethinking evaluation protocols and designing more interaction-aware KB-VQA benchmarks that prioritize verifiable reasoning over simple matching.

14. 【2607.00052】AGE: Adaptive-masking for Graph Embedding in Graph Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2607.00052

Authors: Bao Long Nguyen Huu, Atsushi Hashimoto

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: large language models, supports large language, retrieval-augmented generation, external knowledge, extension of retrieval-augmented

Abstract:GraphRAG is an extension of retrieval-augmented generation (RAG) that supports large language models (LLMs) by referring to graph-structured data as external knowledge. While this technique ideally captures intricate relationships, it often struggles with graph representations for LLMs, particularly for frozen LLMs, due to the misalignment between graph-based and text-based latent features. We tackle this issue by introducing Adaptive-masking for Graph Embedding (AGE). AGE employs a Transformer in a mask-based self-supervised learning (SSL) approach. We designed the architecture similar to text embedding encoders, addressing the latent feature misalignment. In contrast to natural language texts, graphs are concise representations, and there exist key nodes that hold dominant contextual information, which are challenging to predict from their surroundings. Masking such key nodes leads to inefficiency in the SSL process. Therefore, AGE focuses on predicting nodes apart from key nodes, utilizing a learnable node sampler. Our experimental results indicate that AGE significantly improves approaches using a non-parametric search component in GraphQA tasks, achieving superior accuracy across four benchmark datasets with distinct characteristics.

15. 【2607.00023】Aligning Sentence Embeddings to Human Concepts via Sparse Autoencoders

Link: https://arxiv.org/abs/2607.00023

Authors: Wonseok Shin, Songkuk Kim

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: modern Retrieval-Augmented Generation, Retrieval-Augmented Generation, Dense sentence embeddings, systems but suffer, Top-k Sparse Autoencoders

Abstract:Dense sentence embeddings are fundamental to modern Retrieval-Augmented Generation (RAG) systems but suffer from a lack of interpretability due to feature superposition. This opacity hinders the alignment of retrieval processes with human intent, as the entangled representations are difficult to analyze or control. In this work, we propose a method to disentangle the dense representations of sentence transformers (e.g., E5) into human-interpretable concepts using Top-k Sparse Autoencoders (SAEs). We demonstrate that these disentangled features align with specific semantic, syntactic, and pragmatic categories. Furthermore, we introduce an activation steering mechanism that allows for precise intervention in the retrieval process. By clamping specific latent features, we show that it is possible to re-rank search results to better align with user constraints without retraining the backbone model. Our findings suggest that SAE-based decomposition offers a viable path toward transparent and steerable neural information retrieval.
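To make the two mechanisms named in this abstract concrete, here is a minimal sketch of a Top-k sparse activation followed by feature clamping. It is not the authors' implementation: the function names, the list-based weight layout, and the choice to apply Top-k directly to the pre-activations are all assumptions made for illustration.

```python
from typing import List, Sequence


def top_k_activation(pre_acts: Sequence[float], k: int) -> List[float]:
    """Keep the k largest pre-activations and set every other latent to zero."""
    # Rank indices by value, largest first; ties go to the lower index.
    order = sorted(range(len(pre_acts)), key=lambda i: (-pre_acts[i], i))
    keep = set(order[:max(k, 0)])
    return [float(v) if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def encode(embedding, enc_weights, enc_bias, k):
    """Project a dense embedding into sparse latent features (hypothetical layout)."""
    pre_acts = [sum(w * x for w, x in zip(row, embedding)) + b
                for row, b in zip(enc_weights, enc_bias)]
    return top_k_activation(pre_acts, k)


def clamp_feature(latents, index, value):
    """Return a copy of the latents with one feature forced to a fixed value."""
    steered = list(latents)
    steered[index] = float(value)
    return steered


def decode(latents, dec_directions, dec_bias):
    """Rebuild a dense vector as a weighted sum of decoder directions."""
    out = [float(b) for b in dec_bias]
    for z, direction in zip(latents, dec_directions):
        for d, w in enumerate(direction):
            out[d] += z * w
    return out
```

In this sketch, steering would mean encoding a query embedding, clamping one latent, decoding back to a dense vector, and scoring documents against that steered vector. Whether the paper re-ranks in exactly this way is not stated in the abstract.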

16. 【2607.00017】Learning User-Aware Recall: Personalized Retrieval in Long-Term Conversational Memory

Link: https://arxiv.org/abs/2607.00017

Authors: ZhiShu Jiang, Haibo Liu, Xin Shen, Guanqiang QI, Chenxi Miao, Weikang Li, Liwei Qian, Xin Pei, Jizhou Huang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: remember past interactions, Long-term conversational agents, Group Relative Policy, Relative Policy Optimization

Abstract:Long-term conversational agents are expected to remember past interactions, but memory is useful only when the right evidence is recalled for the right user. Existing memory-augmented LLM agents have made progress in building compact memory banks, yet retrieval is still often driven by query-centered similarity or fixed ranking rules, leaving user-conditioned relevance [...]. To address this gap, we propose Profile-guided Personalized Retrieval Optimization (PPRO), a retrieval-centric framework that makes memory retrieval both user-aware and [...] builds episodic and semantic memory banks from dialogue histories and derives a user profile from accumulated [...] profile serves as an explicit personalized prior in memory ranking, allowing retrieval to account for stable user attributes, preferences, and [...] further trains a query rewriter with Group Relative Policy Optimization, using both evidence retrieval quality and downstream answer quality as feedback while keeping the memory banks and answer model [...]. Experiments on LoCoMo and LongMemEval-S show consistent gains over training-free memory systems and training-based [...]. Ablation studies further show that both profile-guided ranking and retrieval-oriented rewriting contribute substantially to performance, highlighting retrieval optimization as a key factor in personalized long-term memory use.

17. 【2607.00016】Libra: Training the Environment for Agentic Information Retrieval

Link: https://arxiv.org/abs/2607.00016

Authors: Xuan Zhao, Andy Chiu, Gengyu Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: agentic LLM systems, Information localization, cornerstone of agentic, agentic LLM, Information

Abstract:Information localization within massive repositories is a cornerstone of agentic LLM systems. While synthetic data-driven optimization has proven successful in training LLMs, little attention has been paid to optimizing the agent's working environment (the repository itself) in a data-driven manner. To bridge this gap, we present Libra, a self-evolving framework that introduces mutable "catalogs" (hierarchical Markdown files serving as navigable indices) into the repository. Libra runs an LLM-driven optimization loop where a Prompter generates synthetic queries, a frozen Solver attempts to resolve them by navigating the catalogs, and a Healer rewrites the catalogs in response to the Solver's localization failures. Evaluations across 12 SWE-bench Lite repositories demonstrate that this environmental healing yields continual, logarithmic improvements in code localization accuracy. Furthermore, these environmental improvements transfer zero-shot across different LLMs and problem sets. Although the focus of this paper is to study the general behavior of such a system, we also demonstrate that a minimalist coding agent equipped with Libra-optimized catalogs outperforms state-of-the-art baselines. Code is available at this https URL and data at this https URL.

18. 【2607.00013】GRACE-RAG: Governed Retrieval Architecture for Canonical Evidence Synthesis, Enabling Lightweight Deployment in Closed-Domain Institutional Settings

Link: https://arxiv.org/abs/2607.00013

Authors: Asit Desai, Aman Kumar, Prashant Devadiga

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: question answering settings, Retrieval-Augmented Generation, institutional question answering, authoritative documentation, question answering

Comments: 15 pages, 5 figures, 4 tables. Submitted to COLM 2026

Abstract:Retrieval-Augmented Generation (RAG) systems are widely used in institutional question answering settings where responses must be grounded in authoritative documentation (Gao et al., 2023). In entity-dense domains where relevant information is distributed across heterogeneous documents, vector-only retrieval often produces fragmented evidence and increases dependence on inference-time reasoning (Zhao et al., 2024). This paper introduces GRACE-RAG, a retrieval-governed, graph-augmented RAG architecture that externalizes structural reasoning from the generative stage to a structured retrieval layer, resolving structural ambiguity offline, enabling deployment on self-hosted lightweight models calibrated to closed-domain institutional vocabulary. Experiments across three model capacities: Mistral 24B, GPT OSS 120B, and Gemini 2.5 Flash show consistent improvements in completeness, depth, and anticipatory coverage, with overall quality gains of up to 20% under mid-scale models, indicating that retrieval architecture governs structural quality over model scale, reducing computational and latency footprint without dependence on proprietary systems.

19. 【2607.00012】PRA-RAG: Provably Robust Aggregation in Retrieval-Augmented Generation against Retrieval Corruption

Link: https://arxiv.org/abs/2607.00012

Authors: Xue Tan, Yi Zheng, Chang Huo, Yunruo Zhang, Yu Liu, Hao Luan, Zhuyang Yu, Xiaoyan Sun, Ping Chen, Jun Dai

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: enhances Large Language, Large Language Models, Large Language, Retrieval-Augmented Generation, enhances Large

Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, effectively mitigating their inherent knowledge limitations. However, RAG remains vulnerable to poisoning attacks that manipulate retrieved texts to mislead model outputs. Existing defense mechanisms often lack theoretical robustness guarantees and perform unreliably when the LLM has limited knowledge of the retrieved content. In this work, we propose PRA-RAG, a provably robust retrieval aggregation algorithm designed to defend against poisoning attacks on retrieved texts. PRA-RAG samples multiple combinations of retrieved texts and utilizes geometric structures in the embedding space to identify a robust subset, from which a stable aggregated representation is derived. We provide theoretical bounds on the maximum impact of poisoned retrieved content and establish a quantitative measure of RAG's robustness. Experiments across multiple benchmarks and RAG architectures demonstrate that PRA-RAG reduces the attack success rate to as low as 1% while maintaining an accuracy of 71%, significantly outperforming representative state-of-the-art methods.

20. 【2607.00011】SkillSelect-Serve: Budget-Controllable and QoS-Aware Skill Service Recommendation and Composition for Small LLM Agents

Link: https://arxiv.org/abs/2607.00011

Authors: Jingyuan Zheng, Dongjing Wang, Xin Zhang, Butian Huang, Haiping Zhang, Dongjin Yu, Shuguang Deng

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Keywords: Reusable skill libraries, existing selection methods, large language model, Reusable skill, Skill Service Recommendation

Comments: 5 figures, 6 tables

Abstract:Reusable skill libraries are becoming important infrastructure for large language model (LLM) agents, yet existing selection methods often treat skills as retrievable documents and return fixed top-k lists. This paper presents SkillSelect-Serve, a budget-controllable and QoS-aware framework that formulates agent skill selection as Skill Service Recommendation and Composition. SkillSelect-Serve represents raw skills as structured Skill Services with functional descriptions, dependencies, context cost, risk, and QoS-related attributes. A local Micro-Agent Requirement Planner converts natural-language tasks into structured service requirements, while a shared discovery backbone retrieves candidate services from a large registry. The framework then performs dual-granularity utility modeling with skill-level marginal suitability estimation and bundle-level calibration for coverage, redundancy, cost, and risk trade-offs. Experiments on 35,353 skills and 586 task queries show that SkillSelect-Serve consistently improves same-budget bundle recall and mean utility over fixed top-k retrieval baselines.

21. 【2607.00010】Prompt Optimization for User Simulation in Conversational Recommender Systems: A Multi-Objective Framework

Link: https://arxiv.org/abs/2607.00010

Authors: Nipun B Nair, Tongtong Wu, Weiqing Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: actively elicit preferences, Conversational recommender systems, next-generation intelligent recommender, Conversational recommender, clarify intentions

Comments: To be published in the 2026 IEEE 42nd International Conference on Data Engineering Workshops (ICDEW)

Abstract:Conversational recommender systems (CRSs) are a core component of next-generation intelligent recommender systems because they enable users to actively elicit preferences, clarify intentions, and adapt recommendations in real time. However, there are two key obstacles in the CRS domain: evaluation and access to training data. Evaluating CRSs through real human studies is more critical than for traditional recommender systems, yet such studies are both costly and time-consuming. Moreover, CRS interaction data are often difficult to obtain for model training due to privacy concerns. Large language model (LLM)-based user simulators have shown promise in addressing both challenges by generating synthetic user interactions for evaluation and training. However, existing approaches suffer from systematic positive bias, data leakage, and limited behavioral diversity, and they rely on brittle manual prompt engineering that requires extensive domain expertise. In this paper, we propose a framework to automatically optimize prompts for LLM-based user simulators in CRSs, simultaneously mitigating these issues. Experimental results demonstrate that the proposed framework achieves improved behavioral alignment with human interaction patterns compared to baseline methods across diverse prompt settings.

22. 【2607.00008】SchemaRAG: Dynamic Large Schema Reduction for LLM-driven Structured Information Extraction

Link: https://arxiv.org/abs/2607.00008

Authors: Sin Yu Bonnie Ho, Arlie Coles, Erik Larsson, Eric Marshall, Nathan Bodenstab, Paul Vozila

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Extracting structured data, large language models, Extracting structured, language models, large language

Abstract:Extracting structured data from unstructured text using large language models (LLMs) becomes challenging when target schemas are large and complex. In such cases, including the full schema in the prompt increases cost and latency, risks lost-in-the-middle performance degradation, and can exceed context length limits. We propose SchemaRAG, a retrieval-augmented generation (RAG) framework that dynamically prunes the output schema space for schema-conditioned information extraction tasks by leveraging schema metadata and few-shot examples when available. We evaluate SchemaRAG on real-world healthcare and e-commerce datasets. Our results show that SchemaRAG can achieve up to an 8.8% increase in micro-F1, a 47% reduction in latency, and a 48% reduction in token costs, demonstrating its practicality for large-schema extraction.

23. 【2607.00007】BaRA: BFS-and-Reflection Web Data Collection Agent

Link: https://arxiv.org/abs/2607.00007

Authors: Soojeong Lee, Joseph Lee, Yongseong Cho, Sunjae Kim, Youngwoo Moon, Kyungwoo Song

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large language model, miss relevant pages, based web agents, reduce manual scripting, web agents reduce

Abstract:Large language model (LLM)-based web agents reduce manual scripting for web data collection, yet on live websites, they often miss relevant pages, return incomplete multimodal outputs, or return media URLs that are not directly downloadable. We present BFS-and-Reflection Agent (BaRA), a framework for site-level collection under a fixed interaction budget. The framework combines bounded breadth-first search (BFS) traversal with history-based self-reflection. We evaluate BaRA on 50 synthetic websites with ground-truth reference sets. We additionally test on three public websites with cluttered or dynamic layouts. BaRA outperforms Pure LLM, SeeAct-Vision, and Browser-use on link discovery and downloadable multimodal extraction, with the largest gains in download-valid image and video recovery. Our code is available at this https URL.
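The traversal half of this idea, bounded BFS under a fixed interaction budget, can be sketched as follows. This is an illustration only and omits the history-based self-reflection step; `get_links`, `max_depth`, and `budget` are hypothetical names, and counting one interaction per visited page is an assumption.

```python
from collections import deque
from typing import Callable, Iterable, List


def bounded_bfs(start: str,
                get_links: Callable[[str], Iterable[str]],
                max_depth: int,
                budget: int) -> List[str]:
    """Visit pages breadth-first, stopping at max_depth or when the budget is spent."""
    visited: List[str] = []       # pages visited, in order
    seen = {start}                # pages already queued, to avoid revisits
    queue = deque([(start, 0)])
    while queue and len(visited) < budget:
        page, depth = queue.popleft()
        visited.append(page)      # assumption: one interaction per visited page
        if depth >= max_depth:
            continue              # do not expand links beyond the depth bound
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy site map standing in for a live website (hypothetical data).
site = {"/": ["/a", "/b"], "/a": ["/a1", "/b"], "/b": ["/b1"], "/a1": [], "/b1": []}
pages = bounded_bfs("/", lambda p: site.get(p, []), max_depth=2, budget=4)
```

Breadth-first order means shallow pages are covered before the budget runs out, which suits site-level collection where index pages link to most of the relevant content.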

24. 【2607.00005】Topological Void Analysis: A Mathematical Framework for Systematic Technical Innovation Discovery in Knowledge Spaces

Link: https://arxiv.org/abs/2607.00005

Authors: Kris Pan

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: high-dimensional knowledge space, dense technical domain, software co-design, systems or hardware, Topological Void Analysis

Comments: 11 pages, 3 tables, 2 case studies; arXiv Industry Track

Abstract:Identifying where to innovate in a dense technical domain - such as operating systems or hardware/software co-design - is fundamentally a search problem in a high-dimensional knowledge space. Existing approaches rely on keyword search, citation proximity, or human intuition, none of which formalise the notion of an unexplored region that is simultaneously relevant to a target goal and absent from prior art. We present Topological Void Analysis (TVA), a mathematical framework that defines topological voids as triads (A, B, C) in a dense-sparse hybrid embedding space. A void requires three conditions: (i) both concepts A and B are semantically cohesive with domain anchor C; (ii) their pairwise similarity falls within a calibrated marginality band - avoiding both obvious combinations and unrelated noise; and (iii) they share a sparse lexical bridge while the geodesic midpoint on the embedding hypersphere is unoccupied. Applied to ~140k indexed documents, TVA generates 2,128 invention candidates across 96 targets; 90% survive automated quality filtering, yielding 191 REVISE and 1 APPROVE verdict from four-specialist adversarial review (0.05% end-to-end). Two case studies demonstrate the framework surfaces non-obvious connective tissue rather than merely obvious related pairs.
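The three void conditions in this abstract can be read as a simple predicate over embeddings. The sketch below is one possible reading, not the paper's definition: cosine similarity as the measure, every threshold value, the term-set representation of the sparse lexical bridge, and the occupancy test are assumptions chosen for illustration.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def geodesic_midpoint(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Great-circle midpoint of two non-antipodal unit vectors: the normalized sum."""
    ua, ub = normalize(a), normalize(b)
    return normalize([x + y for x, y in zip(ua, ub)])


def is_void(a, b, c, terms_a: Iterable[str], terms_b: Iterable[str],
            corpus: Iterable[Sequence[float]],
            cohesion_min: float = 0.5,                # hypothetical threshold
            band: Tuple[float, float] = (0.2, 0.7),   # hypothetical marginality band
            occupied_min: float = 0.95) -> bool:      # hypothetical occupancy cutoff
    # (i) both concepts are cohesive with the domain anchor C
    cohesive = cosine(a, c) >= cohesion_min and cosine(b, c) >= cohesion_min
    # (ii) A and B are neither an obvious pair nor unrelated noise
    marginal = band[0] <= cosine(a, b) <= band[1]
    # (iii) a shared sparse term exists and no document sits near the midpoint
    bridged = bool(set(terms_a) & set(terms_b))
    mid = geodesic_midpoint(a, b)
    unoccupied = all(cosine(mid, doc) < occupied_min for doc in corpus)
    return cohesive and marginal and bridged and unoccupied
```

The reported end-to-end rate is consistent with the counts in the abstract: 1 APPROVE out of 2,128 candidates is about 0.047%, which rounds to 0.05%.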

25. 【2607.00004】Why Advanced Encoders Lag on Sparse Retrieval? The Answer and an Approach to Bridging Vocabulary Gaps

Link: https://arxiv.org/abs/2607.00004

Authors: Zhichao Geng, Yang Yang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: significantly outperform older, aging BERT-base baseline, learned sparse retrieval, ModernBERT significantly outperform, outperform older architectures

Comments: Accepted at SIGIR 2026

Abstract:While advanced foundation models like ModernBERT significantly outperform older architectures in dense retrieval, they surprisingly lag behind the aging BERT-base baseline in learned sparse retrieval (LSR). We identify the root cause as the Vocabulary Gap: modern tokenizers utilize raw, case-sensitive vocabularies designed for lossless reconstruction, which map single semantic units to redundant surface forms, wasting model capacity on morphological noise and hindering lexical matching. We formalize this intuition through a theoretical framework, demonstrating that appropriate vocabulary coarse-graining can tighten the generalization bounds by reducing complexity of the hypothesis class, provided that semantic integrity is preserved. To resolve this, we propose Vocabulary Transfer (VT), a model-agnostic framework that migrates advanced encoders to sparse-friendly, normalized vocabularies with minimal computational cost. VT utilizes a novel Semantic Initialization via spatial topology to preserve geometric structure and an Activation Potential Calibration (APC) mechanism to align pre-trained manifolds with sparsity constraints, preventing the dead neuron and dense collapse observed in standard fine-tuning. Empirically, VT is universally effective: it enables ModernBERT to achieve state-of-the-art performance on the BEIR benchmark (52.4 nDCG, a +4.7 improvement), resuscitates failing models like RoBERTa-large, and generalizes seamlessly to inference-free architectures and specialized domains. These results confirm that the performance lag is not an architectural deficiency but a solvable vocabulary mismatch. We've released our code and models at this https URL, with all details included.

26. 【2607.00003】From "Strings" to "Things" for Personal Knowledge Graphs: Evaluating LLM Triple Extraction for Recommendation Systems

Link: https://arxiv.org/abs/2607.00003

Authors: Abhirup Dasgupta, Fernando Spadea, Oshani Seneviratne

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Personal Knowledge Graphs, Personal Knowledge, modeling user preferences, decentralized conversational data, Knowledge Graphs

Abstract:Personal Knowledge Graphs (PKGs) offer a privacy-preserving framework for modeling user preferences, yet constructing them from unstructured, decentralized conversational data remains a challenge. This paper bridges the gap between conversational "strings" and semantic "things" by presenting a reproducible pipeline for extracting structured user-preference triples using lightweight Large Language Models. We evaluate Qwen- and Gemma-based models on their ability to extract RDF-compliant triples linked to Wikidata identifiers from conversational data for PKG construction. Our evaluation assesses both the semantic extraction fidelity and the utility of the resulting graphs in a downstream recommendation task. We found that certain models performed well and had proportionally high downstream performance relative to their triple extraction performance.

Computer Vision

1. 【2607.01222】Ink3D: Sculpting 3D Assets with Extremely Complex Textures via Video Generative Models

Link: https://arxiv.org/abs/2607.01222

Authors: Yue Han, Chong Li, Zhening Liu, Cong Huang, Fang Deng, Yong Liu, Fangyun Wei, Yan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rich surface appearance, reference images, largely due, training data, Recent

Comments: Accepted to ECCV 2026. Project page: [this https URL](https://yuehan99.github.io/Ink3D-TextureGen/)

Abstract:Recent 3D generative models can synthesize high-quality geometry but often struggle to reproduce intricate textures from reference images, largely due to the scarcity of large-scale 3D training data with rich surface appearance. In contrast, visual generative models are trained on datasets several orders of magnitude larger and excel at modeling complex visual patterns. Motivated by this gap, we introduce Ink3D, a framework that bridges 3D generation with large-scale video generative models to synthesize extremely complex textures. Ink3D first reconstructs a white-mesh geometry using an off-the-shelf 3D generation model. It then employs OrbitPainter, a conditional video generative model, to produce dense orbit-scan videos capturing object appearance across viewpoints. To convert these views into coherent textures, we introduce TextureOptimizer, a neural baking module that integrates dense multi-view observations while mitigating geometry inconsistencies arising from video generation. By decoupling geometry and texture synthesis and leveraging large-scale pretrained video priors, Ink3D enables significantly richer and more faithful texture generation than prior approaches.

2. 【2607.01205】Linkify: Learning from Interface-Augmented Assembly Graphs

Link: https://arxiv.org/abs/2607.01205

Authors: Anushrut Jignasu, Daniele Grandi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enable context-aware part, interface-augmented assembly graphs, framework for learning, learning from interface-augmented, enable context-aware

Comments: Code is available at [this https URL](https://github.com/ajignasu/linkify)

Abstract:We present Linkify, a framework for learning from interface-augmented assembly graphs to enable context-aware part retrieval in mechanical assemblies. While recent generative AI methods for CAD have focused largely on isolated parts or monolithic assemblies, the rich geometric information at the interfaces between parts, where function is realized, remains underexplored. We address this gap by recomputing high-fidelity interface geometry for the Fusion 360 Gallery Assembly dataset, correcting missing and erroneous contacts, and generating point-cloud representations of local contact regions. Using this data, we construct assembly graphs whose nodes encode part geometry and whose edges encode interface geometry via a pretrained point-cloud encoder. On top of this representation, we train a Graph Attention Network based on GATv2 to solve a masked part prediction task: given an assembly with one part held out, the model predicts the class of the missing component from a large vocabulary of geometrically clustered parts, thereby approximating a realistic part-retrieval scenario. Compared to non-graph baselines such as logistic regression and k-nearest neighbors operating on aggregated node features, Linkify achieves higher Top-K accuracy and F1 scores. Ablation studies on graph connectivity, edge attributes, and attention mechanisms demonstrate that accurate contact computation and dynamic attention over interfaces are critical for performance. Our corrected interface dataset and training pipeline, released publicly, provide a foundation for future interface-aware models for assembly retrieval, validation, and generative design.

3. 【2607.01202】World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video

Link: https://arxiv.org/abs/2607.01202

Authors: Liyuan Zhu, Shengyu Huang, Amrita Mazumdar, Tianye Li, Zan Gojcic, Gordon Wetzstein, Iro Armeni, Shalini De Mello, Alex Trevithick

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Keywords: generating freely renderable, present World, freely renderable dynamic, Gaussian representations, generating freely

Comments: Project page: [this https URL](https://research.nvidia.com/labs/amri/projects/world-from-motion/)

Abstract:We present World from Motion, a method for generating freely renderable dynamic 3D Gaussian representations from monocular videos. Our approach conditions a video model on dense, pixel-aligned renderings that encode appearance, geometry, and 3D scene motion along both input and target camera trajectories to correct rendering artifacts and fill in missing regions from an initial reconstruction. To train this model, we construct a dataset of aligned multiview video pairs and dynamic 3DGS representations, with simulated artifacts characteristic of monocular reconstruction. At test time, we distill the model's generations, including newly observed regions and motions, back into a single consistent, high-quality dynamic 3DGS, improving both novel-view synthesis and the underlying 3D motion. Our method sets a new state of the art in 4D reconstruction and seamlessly generalizes to in-the-wild videos with large viewpoint changes and dynamic motions.

4. 【2607.01191】Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

Link: https://arxiv.org/abs/2607.01191

Authors: Hongxing Li, Xiufeng Huang, Dingming Li, Wenjing Jiang, Zixuan Wang, Haolei Xu, Hanrong Zhang, Haiwen Hong, Longtao Huang, Hui Xue, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: critical visual cues, reasoning remains challenging, Fine-grained visual reasoning, Fine-grained visual, visual reasoning remains

Comments: Code: [this https URL](https://github.com/ZJU-REAL/Perceive-to-Reason)

Abstract:Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated cropping or test-time visual search to introduce local evidence, but they typically do not explicitly distinguish perception from reasoning. In this paper, we propose Perceive-to-Reason (P2R), a unified framework that formulates fine-grained visual reasoning as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, and then answers the question as a Reasoner based on the annotated image and cropped regions. To better align training with this decoupled formulation, we further introduce Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused updates using only final-answer supervision. Built on top of Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance across model scales. In particular, P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone. Further experiments show that the benefits of P2R extend beyond high-resolution benchmarks to broader multimodal reasoning tasks. These results suggest that explicitly decoupling perception from reasoning provides an effective framework for fine-grained visual reasoning.

5. 【2607.01176】High-dimensional Embedding Prior for Noisy K-space Domain MRI Reconstruction

Link: https://arxiv.org/abs/2607.01176

Authors: Yu Guan, Tianjia Huang, Qinrong Cai, Qiuyun Fan, Dong Liang, Qiegen Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Magnetic resonance imaging, realistic acquisition conditions, Magnetic resonance, resonance imaging, noise-corrupted measurements

Abstract:Magnetic resonance imaging (MRI) reconstruction under realistic acquisition conditions can be fundamentally viewed as estimating the underlying k-space distribution from incomplete and noise-corrupted measurements. While diffusion models have recently shown strong potential as generative priors for inverse problems, existing approaches struggle to handle noisy reconstruction settings, especially when operating directly in the k-space domain. In this work, we propose a unified high-dimensional k-space reconstruction framework tailored for noisy inverse problems, which enhances diffusion-based solvers through representation [...] underlying optimization procedures, the proposed framework augments the data representation space, enabling existing diffusion-based solvers to operate on enriched k-space embeddings with improved expressiveness. Extensive experiments on both in-house and public datasets across varying noise levels and undersampling factors demonstrate that the proposed framework consistently improves reconstruction quality for multiple diffusion-based inverse solvers. Notably, the largest gains are observed in high-noise regimes, which is consistent with our theoretical analysis of error propagation under high-dimensional representation. These results suggest that high-dimensional representation provides a general and model-agnostic mechanism for improving diffusion-based MRI reconstruction in noisy settings, offering a new perspective on robust k-space generative modeling for practical inverse problems. The code will be available at this https URL.

6. 【2607.01166】Structured 4D Latent Predictive Model for Robot Planning

链接https://arxiv.org/abs/2607.01166

作者:Zhiyi Li,Peilin Wu,Xiaoshen Han,Ruojin Cai,Yilun Du

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Video predictive models, Latent Predictive Model, offering a promising, flexible decision-making, powerful paradigm

备注

点击查看摘要

Abstract:Video predictive models are emerging as a powerful paradigm in robotics, offering a promising path toward task generalization, long-horizon planning, and flexible decision-making. However, prevailing approaches often operate on 2D video sequences, inherently lacking the 3D geometric understanding necessary for precise spatial reasoning and physical consistency. We introduce a Structured 4D Latent Predictive Model, which predicts the evolution of a scene's 3D structure in a structured latent space conditioned on observations and textual instructions. Our representation encodes the scene holistically and can be decoded into diverse 3D formats, enabling a more complete and 3D consistent scene understanding. This structured 4D latent predictive model serves as a planner, generating future scenes that are translated into executable actions by a goal-conditioned inverse dynamics module. Experiments demonstrate that our model generates futures with strong visual quality, substantially better 3D consistency and multi-view coherence compared to state-of-the-art video-based planners. Consequently, our full planning pipeline achieves superior performance on complex manipulation tasks, exhibits robust generalization to novel visual conditions, and proves effective on real-world robotic platforms. Our website is available at this https URL.

7. 【2607.01147】EquiSteer: Cross-Attention Steering Towards a Fairer Text-Guided Image Generation

链接https://arxiv.org/abs/2607.01147

作者:Tatiana Gaintseva,Akshit Achara,Gregory Slabaugh,Jiankang Deng,Ismail Elezi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:diffusion models power, everyday creative tasks, models power everyday, power everyday creative, diffusion models

备注

点击查看摘要

Abstract:Text-to-image diffusion models power everyday creative tasks, but they still reproduce the demographic biases in their training data. On common prompts such as "a photo of a nurse" or "a photo of a CEO", they skew their outputs toward one gender, driven by the statistics of training data rather than anything in the text. Existing debiasing methods show promise in narrow settings but require retraining, batch-level control, or prompt-specific tuning, limiting their scalability. We propose EquiSteer, a training-free method that works per sample by steering cross-attention (CA) activations at inference time. For each target attribute, EquiSteer precomputes steering vectors from contrastive prompts. Then at generation time, a prompt-aware gate leaves attribute-specific prompts untouched, while for neutral ones it clears existing attribute signals from the CA activations and injects a target attribute. Across SD-1.5, SD-2.1, SDXL, and SANA, EquiSteer reduces the average parity gap by up to 87%, with minimal effect on image quality and text-image alignment. Code is available at this https URL.

8. 【2607.01140】Relation-Centric Open-Vocabulary 3D Gaussian Segmentation

链接https://arxiv.org/abs/2607.01140

作者:Eunsung Cha,Hyunjoon Lee,Jaesik Park

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:requires language understanding, Gaussian segmentation, understanding for diverse, diverse queries, queries and accurate

备注: Project Page: [this https URL](https://eunsungcha.github.io/PairGS-web/)

点击查看摘要

Abstract:Open-vocabulary 3D Gaussian segmentation is challenging because it requires language understanding for diverse queries and accurate separation of Gaussians along object boundaries. Prior approaches either embed language knowledge into individual Gaussians to improve query responsiveness or optimize per-Gaussian instance features to encode object identity. However, these strategies may produce noisy Gaussian segmentations or rely on cost-inefficient per-scene optimization. We propose PairGS, a framework that reframes Gaussian segmentation as modeling pairwise relations between Gaussians. 3D Gaussian representations provide rich signals for relation estimation, such as view contribution weights and multi-view mask evidence. By leveraging these cues, PairGS explicitly constructs a relation graph for segmentation without a heavy optimization process. PairGS first proposes sparse edge candidates using low-dimensional descriptors, computes precise pairwise affinities only on those candidates, and builds a hierarchical cluster tree for multi-granular querying. It achieves state-of-the-art results on open-vocabulary 3D Gaussian segmentation benchmarks, while the fast variant is 50x faster than optimization-based instance-feature approaches.

9. 【2607.01139】SD-RouteFusion: Ego-Trajectory Prediction with SD-Map Route Conditioning

链接https://arxiv.org/abs/2607.01139

作者:Sviatoslav Voloshyn,Bruno K. W. Martens,Wangxin Liu,Jakob Vinkås,Junsheng Fu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Standard Definition, paper presents SD-RouteFusion, navigation route derived, High Definition, vehicle kinematics

备注: 9 pages, 4 figures, 29th International Conference on Information Fusion

点击查看摘要

Abstract:This paper presents SD-RouteFusion, a deployable end-to-end ego-trajectory prediction method that fuses a front-facing camera, vehicle kinematics, and a navigation route derived from a Standard Definition (SD) map. Unlike approaches that rely on High Definition (HD) map geometry, SD-RouteFusion aligns the learning objective with scalable and production-ready SD-map route inputs, enabling route-aware prediction without requiring HD-map infrastructure. First, we demonstrate that SD-map route prior provides a powerful long-horizon semantic prior. Through a comprehensive study on a large-scale real-world dataset comprising 480k driving scenarios across 10 European countries and the U.S., we quantify the value of SD-route conditioning: incorporating SD-map routes yields a 10.5% ADE improvement over an image-and-kinematics baseline, while our full fusion strategy achieves a 16.9% ADE reduction given a prediction horizon of 8 seconds. The fusion strategy consists of a dual-hypothesis design paired with a gated classifier, to ensure robustness under route corruption and visual uncertainty. Finally, to support broader evaluation, we release an SD-route generation toolkit that enables SD-route-conditioned ego-trajectory prediction on all datasets containing ego pose and future trajectories. Together, SD-RouteFusion establishes a practical path toward robust, route-aware ego-trajectory prediction at scale.
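
The ADE figures quoted above refer to average displacement error, the mean Euclidean distance between predicted and ground-truth waypoints over the prediction horizon. As a minimal sketch of how such a number and a relative reduction could be computed (the function names and the 2D waypoint format are assumptions for illustration, not the authors' evaluation code):

```python
import math


def average_displacement_error(pred, truth):
    """Mean Euclidean distance between matching waypoints of two trajectories.

    `pred` and `truth` are equal-length sequences of (x, y) positions.
    This waypoint format is an assumption made for the sketch.
    """
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def relative_reduction(baseline_ade, new_ade):
    """Fractional ADE reduction of a model relative to a baseline."""
    return (baseline_ade - new_ade) / baseline_ade


if __name__ == "__main__":
    truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    pred = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]
    print(average_displacement_error(pred, truth))
```

A reported "16.9% ADE reduction" corresponds to `relative_reduction` returning roughly 0.169 when given the baseline and fused-model ADE values.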

10. 【2607.01133】Towards Metric-Agnostic Trajectory Forecasting

链接https://arxiv.org/abs/2607.01133

作者:Markus Knoche,Daan de Geus,Bastian Leibe

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:plan safe maneuvers, Accurate trajectory forecasting, surrounding traffic participants, Open Motion Dataset, Accurate trajectory

备注: ECCV 2026. Project page at [this https URL](https://vision.rwth-aachen.de/TraDiE-policies)

点击查看摘要

Abstract:Accurate trajectory forecasting of surrounding traffic participants is a core capability for autonomous driving, enabling vehicles to anticipate behavior and plan safe maneuvers. We observe that current state-of-the-art forecasting models on Argoverse 2 and the Waymo Open Motion Dataset tailor their training objectives to the different benchmark metrics. Because these metrics encourage conflicting behavior, we propose a paradigm change for trajectory forecasting: training models with metric-agnostic probabilistic objectives and treating metric optimization as a downstream task applied to the predictive distribution. Concretely, we introduce Trajectory Distribution Evaluation (TraDiE) policies, metric-specific policies that map a predictive distribution to the set of $K$ trajectories and confidences required by trajectory forecasting metrics. We evaluate this framework by introducing DONUT-NLL, which adapts the training objective of the state-of-the-art trajectory forecasting model DONUT to directly optimize the predictive distribution. Using our policies, DONUT-NLL achieves state-of-the-art results on all metrics of the Waymo motion prediction benchmark.

11. 【2607.01131】Autonomous Scientific Discovery via Iterative Meta-Reflection

链接https://arxiv.org/abs/2607.01131

作者:Bingchen Zhao,Sara Beery,Oisin Mac Aodha

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:generation and validation, offer the potential, potential to accelerate, automating the process, discovery systems offer

备注

点击查看摘要

Abstract:Autonomous scientific discovery systems offer the potential to accelerate research by automating the process of hypothesis generation and validation. However, current systems operate within constrained search spaces or require predefined research questions, limiting their capacity for true open-ended inquiry. Furthermore, while they generate hypotheses iteratively, they largely lack the ability to explicitly synthesize their own accumulated findings to uncover complex, interconnected phenomena. We introduce DiscoPER, an autonomous large language model-powered framework that conducts open-ended research by dynamically generating and executing code to explore datasets without pre-specified research objectives. To ensure rigorous scientific validity, every proposed discovery must pass statistical testing. To overcome the limitations of isolated search, our framework introduces a second-order reasoning mechanism that periodically analyzes its own accumulated discoveries. By treating prior discoveries as empirical data, DiscoPER identifies structural patterns, confounds, and epistemic gaps, actively redirecting hypothesis exploration toward uncharted regions of the search space. The search space is further expanded by incorporating tool use, enabling the system to explore hypotheses beyond structured metadata by seamlessly processing and extracting useful information from multimodal sources like images. Evaluated on iNatDisco, a new multimodal ecological knowledge benchmark with pattern-level ground truth obtained from peer-reviewed literature, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming both classical causal discovery and LLM-guided baselines. Ablations show that DiscoPER scales with more data and confirm the benefits of second-order meta-reflection.

12. 【2607.01117】MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models

链接https://arxiv.org/abs/2607.01117

作者:Jiale Li,Sihan Chen,Mengyuan Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Video Large Language, Large Language Models, Large Language, shown strong progress, Video Large

备注: 17 pages, 5 figures

点击查看摘要

Abstract:Video Large Language Models (VideoLLMs) have shown strong progress in video understanding, yet they still suffer from hallucinations that are inconsistent with visual evidence. Existing benchmarks mainly focus on object hallucination or coarse action perception, leaving a key video-specific problem underexplored: motion hallucination, in which models infer human motions that are absent from the video. We present MoHallBench, a benchmark for diagnosing motion hallucination in VideoLLMs. MoHallBench systematically evaluates three major sources of hallucination: co-occurrence priors, sequential inference, and similarity confusion. It contains 11,306 video clips and 40,493 question-answer pairs, covering binary-choice, multiple-choice, and generative settings. We further introduce a bi-directional questioning protocol with bias-aware metrics to reduce affirmation bias in binary evaluation. Experiments on ten recent open-source VideoLLMs reveal a clear decoupling between action recognition and hallucination resistance, as models that perform well on positive action recognition often fail on adversarial negatives. Among all settings, sequential inference hallucination is the most severe, showing that current models tend to over-infer expected outcomes from partial motion cues. Our analyses further confirm that stronger priors and finer-grained similarity substantially amplify hallucination. We hope MoHallBench can facilitate future evaluation and mitigation of motion hallucination in VideoLLMs.

13. 【2607.01100】CPDDNet: Color-Polarization Denoising and Demosaicking Network

链接https://arxiv.org/abs/2607.01100

作者:Qihang Zhang,Yusuke Monno,Masayuki Tanaka,Masatoshi Okutomi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:color-polarization filter array, color intensity, filter array, single shot, enabling various applications

备注: Presented at ICIP 2026. Project Page: [this http URL](http://www.ok.sc.e.titech.ac.jp/res/PolarDem/CPDDNet/)

点击查看摘要

Abstract:Color-polarization imaging using a color-polarization filter array (CPFA) sensor captures both texture (color intensity) and physical (polarization) information of the scene in a single shot, enabling various applications in computer vision. However, the raw mosaic output from a CPFA sensor often suffers from severe noise and resolution loss, especially under low-light conditions. Existing methods generally focus on either denoising or demosaicking tasks, failing to capture the coupling between them and neglecting shared low-level features. In this paper, we propose a color-polarization denoising and demosaicking network (CPDDNet), which is a joint framework that performs noise removal and CPFA interpolation using a feature fusion module that retains the features from the CPFA raw data at both the denoising and the demosaicking stages. Experimental results demonstrate that CPDDNet significantly enhances image quality and polarization parameter accuracy, outperforming existing approaches on a real dataset.

14. 【2607.01086】LongVQUBench: Benchmarking Long-Term Video Quality Understanding of Vision-Language Models

链接https://arxiv.org/abs/2607.01086

作者:Arpita Nema,Hanwei Zhu,Xi Zhang,Weisi Lin

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:large vision-language models, long-term video quality, video quality understanding, quality understanding, video quality

备注: Accepted at European Conference on Computer Vision 2026

点击查看摘要

Abstract:The evaluation of long-term video quality understanding remains an open challenge for large vision-language models (LVLMs). Existing video quality benchmarks predominantly focus on short clips and isolated distortions, overlooking the temporal continuity, cumulative degradation, and reasoning complexity inherent in long-duration content. To address these limitations, we present LongVQUBench, a comprehensive benchmark for long-term video quality understanding. LongVQUBench contains over 1200 diverse videos spanning movies, documentaries, surveillance footage, egocentric recordings, and animated content, accompanied by 1500 multiple-choice and open-ended questions for validation and testing. To assess perceptual reasoning across different temporal scopes, we introduce three progressively complex evaluation levels: (i) local event quality understanding (LQU) for analyzing localized distortions; (ii) cross-event quality reasoning (CQR) for integrating multiple degraded events; and (iii) global quality understanding (GQU) for holistic perceptual evaluation over extended durations. Furthermore, a needle distortion question-answering (NDQA) paradigm is embedded across all three levels, where spatial or temporal artifacts are sparsely inserted to probe fine-grained detection and reasoning capabilities. Extensive experiments on 14 state-of-the-art LVLMs reveal significant performance degradation with increasing video length and reasoning depth, highlighting their limited capacity for long-range temporal integration and perceptual attribution. We envision LongVQUBench as a foundational step toward the systematic, hierarchical, and explainable evaluation of LVLMs' long-term video quality understanding.

15. 【2607.01067】Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation

链接https://arxiv.org/abs/2607.01067

作者:Chi Zhang,Penglin Cai,Ziheng Xi,Haoqi Yuan,Hao Luo,Wanpeng Zhang,Sipeng Zheng,Chaoyi Xu,Zongqing Lu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:precise force feedback, inferred from vision, dexterous and contact-rich, force feedback, reliably inferred

备注: The first two authors contribute equally. Orders are decided by flipping a coin

点击查看摘要

Abstract:As an essential modality for dexterous and contact-rich tasks, tactile sensing provides precise force feedback that cannot be reliably inferred from vision. However, limited by hardware and data collection systems, existing datasets with tactility remain small in scale and narrow in contact coverage. Meanwhile, Vision-Language-Action (VLA) models with tactile modality are constrained on dynamics-agnostic post-training, which limits the performance ceiling on downstream tasks. In this paper, we present H-Tac, a large-scale tactile-action dataset with 160-hour egocentric human videos containing more than 300 tasks and 135k episodes. Building upon this, we propose Transferable Tactile Pre-Training (TTP), a system of tactile-based pre-training on human data for fine-grained robotic tasks. To bridge the gap between humans and robots, we use unified tactile and action spaces throughout the pre-training and post-training phases, preserving prior knowledge during human-to-robot transfer. By leveraging a tactile expert for future tactile prediction, our framework explicitly models the contact dynamics and precise physical interactions. Extensive experiments in simulation and on real robots demonstrate that our model achieves superior performance, exhibiting robust generalization and fine-grained manipulation capabilities. TTP paves the way for scalable tactile pre-training via human-to-robot transfer.

16. 【2607.01050】GeoSearcher: Anchor-Guided Progressive Reasoning for Remote Sensing Visual Grounding with Process Supervision

链接https://arxiv.org/abs/2607.01050

作者:Dianyu Wang,Yidan Zhang,Peirong Zhang,Xuyang Li,Xiaoxuan Liu,Lei Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent multimodal large, multimodal large language, shown strong cross-modal, strong cross-modal understanding, Recent multimodal

备注: 14 pages, 11 figures, 7 tables

点击查看摘要

Abstract:Recent multimodal large language models (MLLMs) have shown strong cross-modal understanding and coordinate generation abilities in visual grounding. However, transferring these abilities to remote sensing visual grounding (RSVG) remains challenging. High-resolution remote sensing images usually cover large-scale scenes, where targets are often extremely small and surrounded by numerous visually similar distractors. Meanwhile, queries often contain multiple clues, such as reference objects, spatial relations, and target attributes. Existing MLLM-based methods usually formulate RSVG as one-step coordinate generation, which may lead to unstable predictions for small-object localization and complex queries. To address these challenges, we propose GeoSearcher, which reformulates RSVG as an anchor-guided progressive reasoning process and realizes it through two coupled stages: Anchor-Centric Reasoning Supervised Fine-Tuning (ACR-SFT) and Process-Faithful Group Relative Policy Optimization (PF-GRPO). In ACR-SFT, anchor-centric reasoning data are used to teach the model to represent key visual clues as anchors and progressively integrate location, relational, and attribute clues around them. In PF-GRPO, Process-Aware Reward (PAR) and Reasoning-Informative Sample Selector (RISS) further optimize this reasoning behavior by jointly evaluating key reasoning steps and target localization, while focusing training on samples that are more beneficial for improving progressive reasoning. Through this design, GeoSearcher transforms large-scale visual search into a more constrained local reasoning process. Extensive experiments on DIOR-RSVG, OPT-RSVG, and VRS-Bench show that GeoSearcher outperforms existing state-of-the-art methods. The project will be released at this https URL.

17. 【2607.01049】GenAU: Language-Grounded Industrial Anomaly Understanding with Vision-Language Models

链接https://arxiv.org/abs/2607.01049

作者:Hongkuan Zhou,Tristan Rehm,Nadeem Nazer,Lavdim Halilaj,Jingcheng Wu,Steffen Staab

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Industrial inspection requires, interpretable visual evidence, provide interpretable visual, industrial Anomaly Understanding, binary anomaly detection

备注

点击查看摘要

Abstract:Industrial inspection requires more than binary anomaly detection: a practical system should determine whether an anomaly exists, localize the defective region, identify the defect type, and provide interpretable visual evidence. Existing CLIP-based methods detect and localize anomalies well but offer limited language-level defect understanding, while instruction-tuned vision-language models can describe defects but do not natively produce pixel-level masks. We introduce GenAU, a Generalist vision-language framework for industrial Anomaly Understanding that unifies image-level detection, pixel-level segmentation, multi-type anomaly detection, and defect analysis in a single instruction-following model. GenAU augments a vision-language model with two segmentation tokens, [SEG_defect] and [SEG_normal], whose hidden states act as language-grounded queries over multi-scale visual features for pixel-level localization; the image-level score fuses this map with the decoder's textual normal/defect decision, while the language decoder produces structured defect-aware responses. Trained with a joint language-modeling and segmentation objective, GenAU covers all four tasks within one architecture and recipe, adding zero-shot multi-type detection and language-grounded defect analysis at a quantified cost to detection and segmentation. Across cross-dataset benchmarks, GenAU attains the strongest image-level detection among CLIP-based zero-shot methods on VisA and Real-IAD, with segmentation approaching but not surpassing specialized CLIP baselines.

18. 【2607.01039】EchoRisk: A Multicentre Echocardiography Dataset and Benchmark for Cardio-Oncology

链接https://arxiv.org/abs/2607.01039

作者:Grigorios Kalliatakis,Georgia Karanasiou,Georgios Manikis,Manolis Tsiknakis,Dimitrios Fotiadis,Dorothea Tsekoura,Kalliopi Keramida,Vasileios Bouratzis,Lampros Lakkas,Katerina Naka,Andri Papakonstantinou,Anastasia Constantinidou,Kostas Marias

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:automated risk stratification, breast cancer patients, leading non-oncological, treatment interruption, interruption in breast

备注: Primary technical reference for the EchoRisk-MICCAI 2026 challenge, accepted as a satellite event at MICCAI 2026

点击查看摘要

Abstract:Therapy-induced cardiotoxicity is the leading non-oncological cause of treatment interruption in breast cancer patients, yet early, automated risk stratification from routine cardiac imaging remains an unsolved problem. We present EchoRisk, the first curated, multicentre, longitudinal echocardiography dataset with explicit cardiotoxicity labels, released as the primary technical reference for the EchoRisk-MICCAI 2026 challenge. The dataset comprises 422 patients enrolled in the EU-funded CARDIOCARE prospective study across five European sites, yielding 2,159 echocardiography videos across 1,123 clinical exams acquired at up to five longitudinal timepoints, alongside a dedicated cohort of 280 patients with baseline imaging for early cardiotoxicity prediction. Three clinically grounded tasks are defined: automated estimation of left ventricular ejection fraction from cine video (Task 1), classification of LV dysfunction from longitudinal imaging (Task 2), and early prediction of therapy-induced cardiotoxicity from pre-therapy baseline echocardiography alone (Task 3). For each task we specify the evaluation protocol, primary and secondary metrics, and ranking procedure. We establish baseline performance using an R(2+1)D video backbone with LSTM aggregation trained from Kinetics-400 pretrained weights, demonstrating strong discriminative performance for cardiac functional assessment and LV dysfunction classification, while early cardiotoxicity prediction from a single pre-therapy video remains a significant open problem for the community. The dataset, evaluation code, and baseline implementations are publicly available to serve as a benchmark for further collaboration, comparison, and the creation of task-specific architectures in cardio-oncology.

19. 【2607.01018】Reading Order Inference for Complex Document Layouts

链接https://arxiv.org/abs/2607.01018

作者:Iddo Hakim,Sharva Gogawale,Omer Ventura,Gal Grudka,Daria Vasyutinsky-Shapira,Berat Kurar-Barakat,Nachum Dershowitz

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

关键词:interleaved reading streams, spatially interleaved reading, multiple spatially interleaved, Reading order, complex historical manuscripts

备注

点击查看摘要

Abstract:Reading order inference remains a critical bottleneck in the digitization of complex historical manuscripts, where pages contain multiple spatially interleaved reading streams, the canonical example being the Glossa Ordinaria layout, in which a central text is surrounded by commentaries that wrap around it in non-rectangular, non-convex regions. We present a training-free, graph-based framework: each OCR text line becomes a node in a directed candidate-transition graph, edges are scored by a weighted additive ensemble of two lightweight language-model signals (causal language model conditional likelihood and BERT next-sentence prediction, NSP; a third sentence-embedding signal was evaluated but did not improve reading order), and the global reading order is recovered as a degree-constrained directed path cover. To avoid the cascading "edge-theft" failures of greedy edge selection, we propose a max-regret inference rule that prioritizes commitments with high opportunity cost. We evaluate on synthetic Glossa Ordinaria grid layouts, on 23 ALTO page geometries (10 historical source pages plus mirrored and flipped variants), and on a 140-page multi-column English subset of OmniDocBench, comparing our method against the canonical recursive XY-cut (PaddleOCR PP-StructureV3) and two LayoutReader variants (layout-only and text+layout) on identical inputs. On wrap-around Glossa layouts our method recovers 95% of ground-truth successor edges on average vs. XY-cut's 50%; on the OmniDocBench multi-column subset it reaches 88% macro edge accuracy versus XY-cut's 75% and LayoutReader's 25%. The LayoutReader baselines transfer poorly due to a word-level vs. line-level granularity mismatch. We additionally verify mirror-invariance under horizontal and vertical page reflections: Our method changes by less than 1 percentage point, classical XY-cut by 2 points, and LayoutReader-T by up to 8 points.
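
To illustrate the "edge-theft" problem and the max-regret idea mentioned in the abstract, here is a small sketch under stated assumptions. Regret is taken to be the gap between a line's best and second-best feasible successor score, and the line with the largest gap commits first. The function name, the `min_score` cutoff, and this exact regret definition are hypothetical choices for illustration; the paper's actual rule may differ.

```python
def max_regret_path_cover(nodes, scores, min_score=0.0):
    """Greedy path cover: each node gets at most one successor and one predecessor.

    `scores` maps (u, v) to a transition score. At each step, the node whose
    best feasible edge beats its runner-up by the largest margin commits first.
    This regret definition is an assumption made for the sketch.
    """
    succ, pred = {}, {}

    def chain_tail(v):
        # Follow committed successors to the end of v's current path.
        while v in succ:
            v = succ[v]
        return v

    while True:
        best = None  # ((regret, top_score), source, target)
        for u in nodes:
            if u in succ:
                continue
            # Feasible targets: no predecessor yet, and no cycle back to u.
            cands = sorted(
                ((s, v) for (a, v), s in scores.items()
                 if a == u and v != u and s >= min_score
                 and v not in pred and chain_tail(v) != u),
                reverse=True,
            )
            if not cands:
                continue
            top_score, top_target = cands[0]
            runner_up = cands[1][0] if len(cands) > 1 else min_score
            key = (top_score - runner_up, top_score)
            if best is None or key > best[0]:
                best = (key, u, top_target)
        if best is None:
            return succ
        _, u, v = best
        succ[u] = v
        pred[v] = u


if __name__ == "__main__":
    scores = {("a", "x"): 0.9, ("a", "y"): 0.8,
              ("b", "x"): 0.85, ("b", "y"): 0.1}
    print(max_regret_path_cover(["a", "b", "x", "y"], scores))
```

In this toy input, picking the single highest-scoring edge first would give `a -> x` (0.9) and leave `b` with only `b -> y` (0.1). Because `b` has far more to lose, the regret-first rule commits `b -> x` and then `a -> y`, for a higher total score.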

20. 【2607.01015】SuperFlex: Deformable Superquadrics for Point Cloud Decomposition

链接https://arxiv.org/abs/2607.01015

作者:Gabriel Tavernini,Elisabetta Fedele,Tiago Novello,Leonidas Guibas,Marc Pollefeys,Francis Engelmann

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:geometrically meaningful representation, geometrically meaningful, proven to provide, reconstruction accuracy, point clouds

备注: Project page: [this https URL](https://superflex3d.github.io)

点击查看摘要

Abstract:Superquadrics have proven to provide a compact, geometrically meaningful representation for 3D objects. However, existing methods suffer from limited reconstruction accuracy, are restricted to rigid primitives, and lack robustness to partial point clouds. In this work, we present SuperFlex, an enhanced framework that expands the expressive power and applicability of superquadric decompositions. First, we introduce a novel loss formulation which significantly improves reconstruction accuracy. Second, we include bending and tapering deformations, enabling high-fidelity representation of curved and asymmetric geometries. Finally, we leverage these high-quality decompositions as supervision to train a model that is robust to partial real-world point clouds. Experiments demonstrate substantial improvements in reconstruction accuracy over both optimization- and learning-based baselines while maintaining a highly compact primitive representation.

21. 【2607.01001】Foundation Models vs. Radiomics for Lung Computed Tomography: A Benchmark of Feature Extractors, Classification Heads, and Segmentation Choices

链接https://arxiv.org/abs/2607.01001

作者:Nils Neukirch,Martin Maurer,Nils Strodthoff

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:rarely isolate contributions, foundation models rarely, models rarely isolate, lung cancer phenotyping, CT-based lung cancer

备注: 17 pages, 8 figures, 2 tables, Code is available at [this https URL](https://github.com/AI4HealthUOL/lung-ct-benchmarking)

点击查看摘要

Abstract:Radiomics is the established approach for CT-based lung cancer phenotyping, yet comparisons with foundation models rarely isolate contributions of feature extractor, classification head, and segmentation choice, or test cross-cohort robustness. We benchmark five feature extractors (Curia, Curia-2, DINOv3, Radiomics2D, Radiomics3D), seven classification heads (TabPFN, TabICL, XGBoost, CatBoost, Random Forest, logistic regression, Ridge), and three segmentation regimes on five tasks: tumor volume and stage classification, 2-year survival prediction, histology classification, and age prediction. Models are trained on LUNG1 (n=338) and evaluated on an internal test set (n=84) and the external LUNG2 cohort (n=211), with worst-case cross-cohort performance as the primary metric. The dominant design factor is task-dependent: segmentation drives volume and stage classification, while classifier choice drives survival, histology, and age prediction. Radiomics is competitive for tumor volume, tumor stage and survival (partly due to label-derivation effects for the former); Curia variants reach comparable peak scores for survival; DINOv3 falls slightly short across tasks. Patch and slice aggregation have negligible impact. We recommend Curia with tumor segmentation and a CatBoost head as a safe default, achieving the best mean rank across the three primary clinical tasks, though task-specific selection consistently outperforms any cross-task default. When tumor delineations are unavailable, Curia-2 with lung segmentation and logistic regression offers a competitive alternative. All pipelines use a two-stage design suited to small cohort sizes where end-to-end fine-tuning would risk overfitting.

22. 【2607.00987】AVSR-Diff: Scale-Agnostic Diffusion Priors for Temporally Consistent Arbitrary-Scale Video Super-Resolution

链接https://arxiv.org/abs/2607.00987

作者:Geunhyuk Youk,Jeonghyeok Do,Dayeon Kim,Jihyong Oh,Munchurl Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remain largely constrained, significantly advanced video, fixed upsampling scales, significantly advanced, remain largely

备注: Accepted to ECCV 2026. Project page: [this https URL](https://kaist-viclab.github.io/AVSR-Diff/)

点击查看摘要

Abstract:Diffusion models have significantly advanced video super-resolution (VSR) but remain largely constrained to fixed upsampling scales. Conversely, while coordinate-based arbitrary-scale VSR methods offer scale flexibility, they inherently suffer from severe over-smoothing at large scaling factors. Integrating generative priors with continuous decoding is promising but currently hindered by severe temporal flickering caused by the stochasticity of diffusion sampling. To address this, we propose AVSR-Diff (Arbitrary-scale Video Super-Resolution with Diffusion), a novel decoupled framework that separates scale-agnostic latent denoising from continuous coordinate rendering, effectively avoiding computationally heavy resolution-specific sampling. Our approach introduces a Temporally-Gated Feature Recurrence (TGFR) module to extract strictly aligned, temporally consistent latent priors. Furthermore, we design a continuous video VAE decoder incorporating a Scale-Aware Fourier Refinement (SAFR) module to dynamically adapt frequency components to any target scale. Extensive experiments demonstrate that AVSR-Diff consistently preserves high-frequency details and strong temporal stability across various scales, surpassing state-of-the-art arbitrary-scale baselines. Remarkably, our framework outperforms recent fixed-scale generative models even on their native resolution.

23. 【2607.00983】QCA: Query- and Content-Aware Keyframe Selection for Long Video Understanding

Link: https://arxiv.org/abs/2607.00983

Authors: Jun Peng, Baiyang Song, Jie Li, Hui Li, Yiyi Zhou, Rongrong Ji, Yonghong Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: severe temporal redundancy, processing dense frame, dense frame sequences, computationally expensive, plagued by severe

Comments:

Click to view abstract

Abstract:Video understanding is often plagued by severe temporal redundancy, where processing dense frame sequences is both semantically inefficient and computationally expensive. This challenge is further amplified when only a small subset of frames is truly relevant to the given query. In this paper, we propose a Query- and Content-Aware (QCA) keyframe selection framework that can select a compact yet information-rich set of frames from long videos. QCA first partitions the video into temporal segments, estimates the information contribution of each segment by jointly modeling query relevance and content deviation, and dynamically allocates a keyframe budget to each segment. Within each segment, QCA anchors on the most query-relevant frame and iteratively incorporates additional frames to maximize diversity while maintaining high semantic relevance to the query. Crucially, our method requires no additional training and can be seamlessly integrated into existing Video-LLMs. Extensive experiments across multiple long video understanding benchmarks demonstrate that our proposed approach achieves state-of-the-art performance and has strong generalization ability. For instance, QCA achieves 67.8% on LongVideoBench using 128 frames, while GPT-4o achieves 66.7% using 256 frames. Our code is available on GitHub at this https URL.
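The selection procedure described above (per-segment budget allocation, then anchoring on the most query-relevant frame and greedily adding diverse frames) can be sketched roughly as below. This is an illustrative reading of the abstract, not the paper's implementation: the cosine-similarity scoring, the `trade_off` weight, the largest-remainder rounding, and all function names are assumptions, and segment scores are taken as given, positive inputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def allocate_budget(segment_scores, budget):
    """Split `budget` keyframes across segments in proportion to their scores.

    Uses largest-remainder rounding so the counts sum to `budget`.
    Assumes the scores are positive (hypothetical information-contribution scores).
    """
    total = sum(segment_scores)
    raw = [budget * s / total for s in segment_scores]
    counts = [int(r) for r in raw]
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_remainder[: budget - sum(counts)]:
        counts[i] += 1
    return counts


def select_in_segment(frames, query, k, trade_off=0.5):
    """Anchor on the most query-relevant frame, then greedily add frames that
    are relevant to the query but dissimilar to the frames already chosen."""
    if k <= 0 or not frames:
        return []
    relevance = [cosine(f, query) for f in frames]
    chosen = [max(range(len(frames)), key=lambda i: relevance[i])]
    while len(chosen) < min(k, len(frames)):
        def gain(i):
            redundancy = max(cosine(frames[i], frames[j]) for j in chosen)
            return trade_off * relevance[i] - (1 - trade_off) * redundancy
        candidates = [i for i in range(len(frames)) if i not in chosen]
        chosen.append(max(candidates, key=gain))
    return sorted(chosen)
```

The greedy step resembles maximal marginal relevance: a near-duplicate of an already selected frame scores poorly even when it matches the query well, so the budget is spent on frames that add new content.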

24. 【2607.00978】Privacy-Preserving Depth-Only Open-Vocabulary 3D Semantic Segmentation Via Uncertainty-Guided Test-Time Optimization

Link: https://arxiv.org/abs/2607.00978

Authors: Xuying Huang, Sicong Pan, Maren Bennewitz

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: real-world indoor environments, scene understanding systems, requirement for deploying, scene understanding, indoor environments

Comments:

Click to view abstract

Abstract:Privacy-preserving perception is a critical requirement for deploying 3D scene understanding systems in real-world indoor environments, yet it remains underexplored in open-vocabulary 3D semantic segmentation. Existing methods typically rely on obtaining rich semantic cues from RGB images, which may expose privacy-sensitive visual information. Depth-only 3D geometry provides a privacy-preserving alternative, but the absence of appearance-based semantic cues makes open-vocabulary predictions highly uncertain and less reliable. Under this setting, we propose to convert uncertainty into a guidance signal to identify unreliable semantic responses and use semantic priors from foundation models to regularize their refinement. We present UTTO, an uncertainty-guided test-time optimization framework for depth-only open-vocabulary 3D semantic segmentation. Without additional training, experiments on ScanNet20, ScanNet40, and ScanNet200 demonstrate that UTTO consistently improves depth-only open-vocabulary 3D segmentation and outperforms representative baselines under privacy-preserving conditions.

25. 【2607.00975】RCGL-Net: A Long-Tailed Multi-Label Chest X-Ray Classification Framework with Generative Data Augmentation and Label Co-Occurrence Modeling

Link: https://arxiv.org/abs/2607.00975

Authors: Tong Shao, Hongshun Ling, Li Zhang, Jinjing Wu, Junke Wang, Yuan Gao, Fang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: medical imaging diagnosis, intelligent medical imaging, Chest X-ray multi-label, imaging diagnosis, Chest X-ray

Comments:

Click to view abstract

Abstract:Chest X-ray multi-label classification is a core task in intelligent medical imaging diagnosis. However, real clinical data often exhibit extreme long-tailed distributions, leading to degraded performance on rare diseases in tail classes. This issue is driven not only by data scarcity but also by two intrinsic factors: 1) attenuation of tail-class lesion representations under complex anatomical backgrounds, and 2) dominance of head classes in modeling label co-occurrence relationships. To address these challenges, we propose TRCGL-Net. First, a learnable text-guided conditional diffusion model is employed to generate high-quality tail-class chest X-ray image samples under disease semantic constraints, improving data diversity and realism of rare disease patterns while alleviating class imbalance and preserving pathology-consistent this http URL, a channel reweighting mechanism is introduced to perform feature recalibration by emphasizing disease-relevant feature channels, thereby improving feature discriminability under long-tailed distributions. A class-aware attention mechanism is further applied to generate class-specific attention maps, enabling the model to localize disease-relevant regions and focus on fine-grained lesion this http URL, a graph convolution network based on label co-occurrence is introduced to establish an information propagation mechanism among categories. Experiments on the PadChest dataset show that the proposed method achieves a tail-class mAP of 0.4904, an overall mAP of 0.4408, and an mAUC of 0.8989, outperforming state-of-the-art methods. TRCGL-Net effectively improves recognition performance for rare diseases under long-tailed distributions and mitigates the impact of extreme class imbalance in chest X-ray multi-label classification.

26. 【2607.00974】QuaMoE-DRF: Proactive Beam and Rate Adaptation via Multimodal Dynamic Radio Map Forecasting in ISAC Networks

Link: https://arxiv.org/abs/2607.00974

Authors: Zhihan Zeng, Kaihe Wang, Zhongpei Zhang, Chongwen Huang

Categories: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV)

Keywords: location-dependent propagation priors, provide location-dependent propagation, maps provide location-dependent, capture short-term blockage, short-term blockage caused

Comments:

Click to view abstract

Abstract:Static radio maps provide location-dependent propagation priors, but they cannot capture short-term blockage caused by moving objects. Direct sensing-assisted beam prediction is also limited because a beam index discards SINR margins, MCS thresholds, BS alternatives, and communication-equivalent neighboring beams. This paper proposes QuaMoE-DRF, a quality-aware multimodal dynamic radio map forecasting framework for proactive beam and rate adaptation in ISAC networks. Its core representation is a future beam-SINR field. We show that the full multi-BS beam-SINR field is sufficient for finite-codebook threshold-rate BS, beam, MCS, goodput, and outage decisions. For tractability, the implemented model learns a compact reference-BS local field, complemented by BS-level supervision, joint BS--beam supervision, and latent network context; we also clarify that this compact projection alone is not sufficient for BS association. QuaMoE-DRF fuses static geometry, event-like motion observations, structured sensing states, and wireless history through a quality-aware mixture-of-experts module motivated by inverse-variance fusion under heteroscedastic modality errors. It jointly predicts communication-oriented map channels and proactive BS, beam, and MCS decisions. On a dynamic multi-BS and multi-UE urban benchmark, QuaMoE-DRF achieves 402.5 Mbps effective rate, 0.0417 outage probability, and 0.1836 map RMSE, improving the effective rate by 5.67% and reducing outage by 8.35% over the strongest completed effective-rate baseline. The current validation uses labels from a compact blockage/path-loss simulator, with ray tracing used only for calibration and sanity checking.
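The abstract says the mixture-of-experts module is motivated by inverse-variance fusion under heteroscedastic modality errors. As background, the classical estimator can be sketched as below. This is the textbook formula for fusing independent, unbiased estimates, not the paper's learned module, and the function name and numbers are illustrative.

```python
def inverse_variance_fuse(estimates, variances):
    """Fuse independent noisy estimates of the same quantity.

    Each estimate is weighted by its precision (1 / variance); the fused
    estimate has variance 1 / sum(precisions).
    """
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    fused = sum(p * x for p, x in zip(precisions, estimates)) / total
    return fused, 1.0 / total


# Hypothetical example: two modalities report 10 and 20, with error variances 1 and 4.
fused_value, fused_variance = inverse_variance_fuse([10.0, 20.0], [1.0, 4.0])
print(fused_value, fused_variance)
```

The noisier modality receives the smaller weight, and the fused variance is never larger than the smallest input variance. That is the behavior a quality-aware gate would aim to approximate when modality reliability changes over time.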

27. 【2607.00965】Slope-Guided Mamba and Angular-Refined Transformer for Light Field Super-Resolution

Link: https://arxiv.org/abs/2607.00965

Authors: Li Jin, Jian Huang, Junde Lu, Shuai Wang, Hao Sheng, Jie Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: necessitates accurate modeling, Light Field Super-Resolution, ray coherence, necessitates accurate, preserving intrinsic

Comments: 10 pages, 4 figures, 4 tables. Accepted by IEEE ICME 2026. Hangzhou International Innovation Institute, Beihang University, Hangzhou, China. Corresponding author: Jie Wu (jiewu@buaa.edu.cn). Emails: {lijin01, hj, ljd2406107, shuaiwang, shenghao, jiewu}@buaa.edu.cn

Click to view abstract

Abstract:Light Field Super-Resolution (LFSR) necessitates accurate modeling of spatial-angular correlations while preserving intrinsic 4D ray coherence. However, maintaining such high-dimensional consistency remains challenging, primarily due to two inherent limitations in prevailing modeling paradigms. First, spatial and angular dimensions are often modeled in a decoupled manner, restricting early cross-dimensional interaction and leading to geometric inconsistencies. Moreover, although continuous sequence modeling paradigms show promise in representing epipolar structures, their rigid scanning mechanisms fundamentally conflict with epipolar geometry, limiting geometry-aware feature aggregation. To address these challenges, we propose a hybrid light field super-resolution network, termed SMART, which integrates a Slope-Guided Mamba and an Angular-Refined Transformer to effectively overcome these limitations. Specifically, we introduce an angular-modulated spatial module to bridge the decoupling gap, incorporating angular priors to strengthen spatial-angular correlation modeling. To mitigate the scan-geometry mismatch, we propose a manifold-aligned trajectory module that enables geometry-consistent sequence modeling along epipolar structures. Experiments on five benchmarks demonstrate that SMART achieves state-of-the-art performance, surpassing previous methods by 0.42 dB (PSNR) with significantly reduced artifacts.

28. 【2607.00959】GaussianEmoTalker: Real-Time Emotional Talking Head Synthesis with Audio-Driven and Blendshape-Based 3D Gaussian Splatting

Link: https://arxiv.org/abs/2607.00959

Authors: Haijie Yang, Zhenyu Zhang, Yixuan Dong, Jianjun Qian, Jian Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved impressive progress, talking head synthesis, intensity remains challenging, Audio-driven talking head, remains challenging

Comments:

Click to view abstract

Abstract:Audio-driven talking head synthesis has achieved impressive progress in lip synchronization and visual quality, yet generating expressive emotional avatars with controllable intensity remains challenging, especially under real-time constraints. In this paper, we present GaussianEmoTalker, an audio-driven framework for real-time emotional talking head synthesis based on 3D Gaussian Splatting. Instead of directly predicting the final emotional avatar from speech, we formulate emotional animation as a neutral-to-emotional residual deformation problem. GaussianEmoTalker first constructs an identity-specific neutral talking space with GaussianBlendshapes, which provides high-fidelity Gaussian attributes and phoneme-synchronized neutral motion. It then predicts an emotion-conditioned residual deformation by combining mesh displacement cues, audio features, emotion categories, and intensity encodings. To fuse these heterogeneous signals, we introduce a spatial-audio-emotion attention module that estimates the offsets of Gaussian attributes for expressive and temporally stable rendering. Extensive experiments demonstrate that GaussianEmoTalker achieves competitive video quality, accurate lip synchronization, controllable emotional expression, and real-time rendering compared with recent emotional talking head methods. Our project page is available at this https URL

29. 【2607.00955】Learning Cardiac Motion Priors for Implicit Neural Representations

Link: https://arxiv.org/abs/2607.00955

Authors: Andrew Bell, George Webber, Andrew P King, Steffen E Petersen, Muhummad Sohaib Nazir, Alistair Young

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Implicit neural representations, Implicit neural, providing continuous, compact representations, neural representations

Comments:

Click to view abstract

Abstract:Implicit neural representations (INRs) are well suited to cardiac motion estimation, providing continuous, compact representations of motion fields. However, fitting an INR to each image sequence is time-consuming and sensitive to the optimisation trajectory. Learned priors can help guide optimisation towards plausible motion fields and enable faster adaptation, but learning priors for cardiac motion INRs remains under-explored. In this work, we compare four strategies for learning cardiac motion priors, including a population prior learned by joint optimisation, a consensus prior obtained by weight averaging, auto-decoders, and meta-learning. Using short-axis tagged cardiac magnetic resonance images from the UK Biobank, we evaluate their impact on tracking accuracy, motion behaviour, and adaptation trajectory. All learned priors substantially improved early adaptation performance compared with random initialisation. While the simple consensus prior was effective, auto-decoders recovered large deformations faster during early adaptation. Meta-learning achieved strong early performance and maintained the best adaptation trajectory over 50 iterations.
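Of the four strategies compared, the consensus prior is the simplest to illustrate: fit one INR per sequence, then average the weights element-wise and use the result as the starting point for a new sequence. The sketch below assumes every fitted network shares the same architecture and stores its parameters as flat lists of floats. The function and parameter names are hypothetical, not taken from the paper.

```python
def consensus_prior(fitted_weights):
    """Element-wise average of per-sequence INR weights.

    `fitted_weights` is a list of dicts mapping parameter name -> list of floats,
    one dict per fitted sequence, all with identical keys and shapes.
    """
    n = len(fitted_weights)
    names = fitted_weights[0].keys()
    return {
        name: [sum(vals) / n for vals in zip(*(w[name] for w in fitted_weights))]
        for name in names
    }


# Hypothetical weights from two per-sequence fits.
fits = [
    {"layer0": [1.0, 3.0], "layer1": [0.0]},
    {"layer0": [3.0, 5.0], "layer1": [2.0]},
]
prior = consensus_prior(fits)
print(prior)
```

Averaging is only meaningful when the individual fits are comparable parameter by parameter, which in practice usually means a shared initialisation. That requirement is an assumption of this sketch rather than something the abstract states.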

30. 【2607.00948】Dataset Biases and Shortcut Learning in Motion-Based AI-Generated Video Detection

Link: https://arxiv.org/abs/2607.00948

Authors: Joren Michels, Lode Jorissen, Nick Michiels

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recent years, making it increasingly, synthetic media, visual quality, improved drastically

Comments:

Click to view abstract

Abstract:The visual quality of AI-generated videos has improved drastically in recent years, making it increasingly difficult for humans to distinguish between real and synthetic media. In this work, we evaluate the robustness and applicability of four state-of-the-art motion-based AI-generated video detectors. We identify significant preprocessing and sampling biases in these methods and demonstrate that they account for a substantial portion of their reported performance. Furthermore, we find that these detectors are highly sensitive to motion patterns specific to their evaluation datasets, where AI-generated videos generally exhibit less inter-frame movement than real videos. We show that for all detectors, performance collapses to near-random levels when evaluated on a dataset that does not contain this motion bias. Additionally, through dataset rebalancing and the application of simple spatial augmentations, we observe severe performance degradation across all evaluated models. In contrast, we find that an existing frequency-based detector maintains strong performance across all evaluated datasets, suggesting that frequency-based approaches may offer a more generalizable path forward for AI-generated video detection. We hope that our work raises awareness towards these vulnerabilities and encourages the development of more representative, unbiased datasets and more robust evaluation protocols.

31. 【2607.00927】Post-Training Pruning for Diffusion Transformers

Link: https://arxiv.org/abs/2607.00927

Authors: Chengzhi Hu, Xuewen Liu, Jing Zhang, Mengjuan Chen, Zhikai Li, Qingyi Gu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Diffusion Transformers, substantial computational overhead, demonstrated impressive performance, resource consumption, demonstrated impressive

Comments: 15 pages, 13 figures

Click to view abstract

Abstract:Diffusion Transformers (DiTs) have demonstrated impressive performance in image generation but suffer from substantial computational overhead and resource consumption. Post-training pruning offers a promising solution; however, due to DiTs' unique architectural design and parameter distribution, traditional pruning methods are inapplicable, leading to significant performance degradation. Specifically, prior methods developed for LLMs, which derive metrics through a series of approximations, amplify the relative contribution of weights in the saliency metric. In addition, weights in DiTs exhibit significantly larger magnitudes than those in LLMs. Moreover, existing pruning granularity overlooks variations in model structures. In this paper, we propose DiT-Pruning, which improves pruning performance by introducing customized saliency criteria and pruning granularity. We design a novel metric that balances the contributions of weights and activations from an energy-based perspective, enabling more effective identification of important elements. Furthermore, we observe distinct clustering patterns in the two-dimensional weight space. Accordingly, we adopt a clustering-aware pruning granularity, enabling effective sparse allocation. Extensive evaluations on various DiTs show that our method consistently preserves image quality, especially under high sparsity. For FLUX.1-dev at 512x512 resolution on MJHQ, DiT-Pruning achieves only a 0.001 loss in CLIP score at 50% sparsity, dramatically outperforming recent pruning methods.

32. 【2607.00920】GMO-E$^2$DIT: Grounded Multi-Operation Editing for E-Commerce Images

Link: https://arxiv.org/abs/2607.00920

Authors: Zipeng Guo, Xiaoan Liu, Lichen Ma, Cheng Wang, Yu He, Xiaolong Fu, Jingling Fu, Xinyuan Shan, Shaojie Guo, Luohang Liu, Junshi Huang, Yan Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Real-world e-commerce image, Real-world e-commerce, requires multiple, global restyling, e-commerce image editing

Comments:

Click to view abstract

Abstract:Real-world e-commerce image editing often requires multiple, localized, and auditable operations rather than global restyling. This compositional nature poses a dual challenge: models must precisely apply all requested edits to the correct regions while preserving unmodified content, even under ambiguous instructions. Existing one-shot editors conflate intent resolution, spatial grounding, and synthesis into a single step, frequently resulting in partial execution failures, which is unacceptable for commercial scenarios. To address this, we introduce GMO-E$^2$DIT, an agentic editing framework that couples a Vision-Language Model (VLM) with a mask-conditioned image editor to tackle structured multi-turn task completion. Given an underspecified instruction, the VLM agent constructs a region-grounded edit agenda, effectively decoupling cognitive reasoning from generative rendering. The framework then executes sub-programs via operation-aware masks and references, utilizing a reflection-driven loop to inspect intermediate results and determine the subsequent state. This iterative mechanism reliably preserves safe partial progress, retries unfinished operations, and recovers from errors. Furthermore, we develop a unified data pipeline providing aligned supervision for planning, execution, and reflection, alongside EComEditBench, a comprehensive benchmark for instruction-driven evaluation. Extensive experiments demonstrate that GMO-E$^2$DIT achieves competitive performance compared to strong closed-source models, yielding superior instruction accuracy and edit fidelity over existing baselines.

33. 【2607.00916】Condensing Large-Scale Datasets Directly with Minimal Information Loss

Link: https://arxiv.org/abs/2607.00916

Authors: Xinyi Shang, Peng Sun, Bei Shi, Zixuan Wang, Tao Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: distillation rely heavily, scaling dataset distillation, dataset distillation rely, comprising SQUEEZE, Recent advancements

Comments: Accepted by ECCV 2026

Click to view abstract

Abstract:Recent advancements in scaling dataset distillation rely heavily on decoupled information extraction pipelines, comprising SQUEEZE, RECOVER, and RELABEL stages. Despite their scalability to large-scale datasets, these methods suffer from prohibitive computational overhead and poor cross-architecture generalization. In this paper, we reveal the root cause of these bottlenecks: the implicit dual-compression process, from data to model and back to images, inherently induces severe information loss. Crucially, we empirically and theoretically demonstrate that this loss creates a distribution shift that fundamentally compromises the widely adopted RELABEL strategy, transforming the pre-trained model into an unreliable labeler that yields sub-optimal labels. To overcome these critical flaws, we propose CIM, a novel, metric-driven framework that abandons the flawed dual-compression paradigm. Instead, CIM explicitly quantifies and minimizes the information gap between the original and synthetic datasets. By directly aligning the data distributions, our approach ensures high-fidelity information condensation and inherently satisfies the prerequisites for effective relabeling. Extensive experiments demonstrate that CIM establishes a new state of the art. Notably, it distills ImageNet-1K at IPC=10 in merely 80 minutes on a single RTX-4090 GPU, achieving an unprecedented 48.7% Top-1 accuracy on ResNet-18 and significantly outperforming previous SOTA approaches, such as NRR-DD and DELT, by 2.6% and 2.9%, respectively. Our code is available at this https URL.

34. 【2607.00902】MG-RWKV: Multi-Grained Context-Aware RWKV for Temporal Forgery Localization

Link: https://arxiv.org/abs/2607.00902

Authors: Jingchen Ni, Cangjin Yu, Dan Jiang, Quan Zhang, Keyu Lv, Shannan Yan, Linyue Pan, Ke Zhang, Chun Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Artificial Intelligence-Generated Content, Driven by Artificial, facing severe challenges, Artificial Intelligence-Generated, Intelligence-Generated Content

Comments: Accepted to ECCV 2026

Click to view abstract

Abstract:Driven by Artificial Intelligence-Generated Content (AIGC), the authenticity of audio-visual content is facing severe challenges. Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within untrimmed sequences. However, existing methods are limited by CNNs' local receptive fields or Transformers' quadratic complexity, while emerging linear models often struggle to balance global authentic context compression with local abrupt forgery perception. To address this, we propose MG-RWKV, a multi-granularity framework that leverages the data-dependent state evolution of RWKV to achieve efficient full-sequence processing with O(T) complexity. Our framework features three core innovations: (1) a Bidirectional RWKV architecture that captures bidirectional temporal contexts without quadratic overhead; (2) a Multi-Granularity Mixture of Experts (MG-MoE) that performs dynamic routing over explicit temporal receptive fields, adaptively selecting granularities based on forgery duration to significantly enhance decision interpretability; and (3) Cross-Granularity Consistency (CGC), which aligns adjacent feature pyramid levels through hierarchical scale-wise pairing and spatial boundary-aware weighting, effectively reducing false positives in authentic regions. Extensive experiments on Lav-DF, TVIL, and Psynd datasets demonstrate that MG-RWKV achieves state-of-the-art performance with low computational cost.

35. 【2607.00889】DeWorldSG: Depth-Aware 3D Semantic Scene Graph Generation via World-Model Priors

Link: https://arxiv.org/abs/2607.00889

Authors: Seok-Young Kim, Abdelrahman Elskhawy, Taewook Ha, Dooyoung Kim, Eunjae Shin, Benjamin Busam, Woontack Woo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Semantic Scene Graphs, generates spatio-temporally robust, RGB-D sequences, Semantic Scene, scene graphs due

Comments: 19 pages, 6 figures, ECCV 2026

Click to view abstract

Abstract:We present DeWorldSG, a novel framework that generates spatio-temporally robust 3D Semantic Scene Graphs from RGB-D sequences. Existing methods often struggle to construct reliable 3D scene graphs due to unstable 3D object representations and missing relations caused by frame-wise inference. DeWorldSG addresses these issues by estimating instance-level geometric 3D Gaussian distributions through depth-guided filtering and representing each object as a probabilistic 3D node rather than a single projected point. To mitigate relational sparsity from frame-wise inference, our framework further aggregates spatiotemporal evidence across object pairs and refines relations using contextual priors derived from a world model (V-JEPA 2). Experiments on the 3DSSG and ReplicaSSG datasets demonstrate state-of-the-art (SoTA) performance in both object and predicate prediction, while producing temporally consistent scene structures. In particular, our method improves triplet recall by 77.4% and predicate recall by 23.2% over prior SoTA approaches, making it suitable for robotic manipulation and AR applications. Our code and models are open-sourced.

36. 【2607.00887】Geometry-Aware Cross-Height Channel Knowledge Map Prediction for UAV-Assisted Communications With Uncertainty-Guided 3D Sensing

Link: https://arxiv.org/abs/2607.00887

Authors: Zhihan Zeng, Amir Hussain, Yue Xiu, Phee Lep Yeoh, Lu Chen, Zhongpei Zhang, Guan Gui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-altitude Unmanned Aerial, Unmanned Aerial Vehicles, Low-altitude Unmanned, infer channel knowledge, channel knowledge map

Comments:

Click to view abstract

Abstract:Low-altitude Unmanned Aerial Vehicles (UAVs) often need to infer channel knowledge across a range of heights from only sparse observations collected at a few altitude layers. To address this challenge, this paper studies height-conditioned cross-height channel knowledge map (CKM) prediction for UAV-assisted communications in geometry-rich urban environments. We develop a geometry-aware conditional prediction framework that combines urban scene priors, sparse multi-altitude observations, and target-height descriptors to reconstruct dense CKMs at unobserved target heights. An uncertainty head is further introduced to characterize prediction confidence and to support cost-aware online UAV sensing under motion and safety constraints. Experiments on a layered aerial CKM benchmark show that the proposed Feature Pyramid Network (FPN)-Transformer achieves the best overall performance under both unseen-scene zero-shot and legacy patch-random protocols, reducing the Root Mean Square Error (RMSE) to 5.347 dB and 1.111 dB, respectively, compared with 6.937 dB and 1.221 dB for the strongest baseline 3D-RadioDiff. Moreover, after applying our unseen-scene few-shot adaptation, the RMSE further decreases from 5.347 dB in zero-shot prediction to 3.518 dB with 10-shot two-height support, while the uncertainty-guided cost-aware sensing policy improves active reconstruction from 6.94 dB at initialization to 4.79 dB at sensing budget 40, outperforming uncertainty-only sensing at 5.08 dB and random aerial sampling at 5.84 dB.

37. 【2607.00886】Beyond Pixel Overlap: A Framework for Decomposing Segmentation Evaluation Metrics

Link: https://arxiv.org/abs/2607.00886

Authors: Youwei Pang, Xiaoqi Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: binary target segmentation, progress is measured, central to binary, determine how progress, binary target

Comments:

Click to view abstract

Abstract:Evaluation metrics are central to binary target segmentation because they determine how progress is measured, compared, and interpreted. In this paper, target denotes the task-defined positive region to be segmented rather than a generic foreground object. It may be salient, camouflaged, transparent, glass-like, mirror-like, shadow-like, lesion-like, or defined by other application-specific semantics. We treat existing metrics as compositions of modular design choices rather than isolated formulas. The proposed framework decomposes each metric into five stages covering prediction representation, target extraction, target matching, score computation, and metric reporting. We use this framework to analyze representative metrics and show how newer metrics address specific limits in earlier protocols. The stage choices keep each metric's assumptions visible. We then discuss the design space opened by the framework and its implications for task-aware evaluation protocols. Reference code is available at this https URL.
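The idea of treating a metric as a composition of modular stages can be illustrated with a deliberately small sketch. Only three of the five stages named in the abstract are modeled here (prediction representation, score computation, and metric reporting); target extraction and target matching are omitted. The stage interfaces and function names are assumptions for illustration, not the paper's reference code.

```python
from statistics import mean


def binarize(prob_map, threshold=0.5):
    """Prediction representation: turn a probability map into a binary mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


def iou(pred, target):
    """Score computation: intersection over union of two binary masks."""
    pairs = [(p, t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    inter = sum(p & t for p, t in pairs)
    union = sum(p | t for p, t in pairs)
    return inter / union if union else 1.0  # two empty masks count as a perfect match


def evaluate(prob_maps, targets, represent=binarize, score=iou, report=mean):
    """Compose the stages; swapping any callable yields a different metric."""
    return report(score(represent(p), t) for p, t in zip(prob_maps, targets))


# Hypothetical 2x2 prediction and ground-truth mask.
predictions = [[[0.9, 0.2], [0.6, 0.4]]]
ground_truth = [[[1, 0], [0, 1]]]
print(evaluate(predictions, ground_truth))
```

Making each stage an explicit argument keeps a metric's assumptions visible: the threshold, the empty-mask convention, and the averaging rule are all choices that two "IoU" implementations can silently disagree on.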

38. 【2607.00885】Improving Sparse-View 3DGS Generalization via Flat Minima Optimization

Link: https://arxiv.org/abs/2607.00885

Authors: Kangmin Seo, Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: highly efficient representation, Recent advances, enabling fast training, Gaussian Splatting, neural rendering

Comments: Accepted to ECCV 2026. Project Page: [this https URL](https://kangrnin.github.io/FlatMinGS)

Click to view abstract

Abstract:Recent advances in neural rendering have established 3D Gaussian Splatting (3DGS) as a highly efficient representation for novel view synthesis, enabling fast training and real-time rendering with strong fidelity. However, when supervision is limited to sparse input views, 3DGS tends to overfit to the observed images and generalize poorly to unseen viewpoints. We address this challenge from the perspective of flat minima (FM) optimization, which seeks solutions that remain stable under small parameter perturbations. Viewing Gaussian parameters as trainable weights, we adapt FM principles to the geometric and dynamic nature of 3DGS with a lightweight training framework. Our method regularizes optimization with controlled Gaussian perturbations that account for each Gaussian's anisotropy and the training progress, preserving fine details while improving robustness to sparse-view overfitting. To further stabilize this flat minima optimization process, we introduce periodic reinitialization, which temporarily returns non-positional parameters to their initial states for a short window. Together, these techniques integrate seamlessly into existing 3DGS pipelines without architectural changes. Experiments on LLFF and Mip-NeRF360 datasets demonstrate improved quantitative metrics and perceptual quality under sparse-view supervision, producing reconstructions that are sharper, more stable, and better generalized to novel viewpoints.

39. 【2607.00881】OmniView-Space: Reinforcing Spatial Reasoning via Multi-Perspective Spatial Mapping

Link: https://arxiv.org/abs/2607.00881

Authors: Xudong Li, Mengdan Zhang, Peixian Chen, Jiaxi Tan, Zihao Huang, Jingyuan Zheng, Yan Zhang, Xiawu Zheng, Xing Sun, Rongrong Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Large Language, basic object recognition, requires coherent spatial

Comments:

Click to view abstract

Abstract:Spatial intelligence remains a persistent challenge for Multimodal Large Language Models (MLLMs), as it requires coherent spatial scene representations beyond basic object recognition. Existing methods typically build such representations through textual reasoning or 3D reconstruction. However, they often falter during multi-step reasoning, particularly when required to dynamically re-anchor evidence to the specific camera-, object-, or direction-centric reference frames demanded by complex queries. To address this, we propose OmniView-Space, a framework designed to maintain spatial consistency through multimodal egocentric evidence. Our approach consists of three core components: (1) Multi-Perspective Spatial Mapping (MPSM), which re-anchors reconstructed geometry into a query-aligned visual cognitive map and a textual spatial graph; (2) Tool-Guided Egocentric Reasoning, an interleaved policy trained to actively select the ego anchor required by the query and request the corresponding MPSM evidence; and (3) Cognitive-Map Distillation, which uses MPSM-generated trajectories and ego-frame rewards to train the model to reason with self-generated cognitive maps. Experiments on single- and multi-image spatial reasoning benchmarks show that OmniView-Space achieves state-of-the-art performance. Furthermore, the distilled model maintains this performance while reducing reliance on external geometry pipelines.

40. 【2607.00867】EFlow: Learning Evidence Flow for Long-Video Reasoning with Adaptive Reflection

Link: https://arxiv.org/abs/2607.00867

Authors: Wenhao Zhang, Kuanwei Lin, Xuyi Yang, Wei Gao, Ge Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: utilize visual evidence, temporal grounding, fundamentally constrained, acquire and utilize, utilize visual

Abstract:Long-video reasoning is fundamentally constrained by how models acquire and utilize visual evidence. Existing tool-augmented video frameworks often interleave temporal grounding and answer reasoning within a single trajectory, causing early semantic hypotheses to bias evidence localization. We term this failure mode premature semantic commitment, where biased grounding retrieves incomplete evidence and incomplete evidence further reinforces incorrect reasoning. To address this issue, we propose EFlow, an evidence-first video reasoning framework built upon Qwen3-VL. EFlow explicitly separates temporal grounding and logical reasoning through CoT for Temporal Grounding and CoT for Reasoning, enabling the model to retrieve relevant evidence before answer inference. In addition, EFlow introduces a confidence-aware reflection mechanism that re-evaluates the full video when retrieved evidence is potentially insufficient. We further construct dedicated trajectory datasets and train EFlow through supervised fine-tuning, reinforcement learning, and reinforcement fine-tuning. Extensive experiments across five video understanding benchmarks demonstrate that EFlow consistently improves long-video reasoning performance.

41. 【2607.00861】TrajLoc: Trajectory-Attention Localization for Multi-Object Motion Control

Link: https://arxiv.org/abs/2607.00861

Authors: Omer Sela, Inbar Huberman-Spiegelglas, Michael Rotman, Sagie Benaim, Avi Ben-Cohen

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: generation requires preserving, Controlling the motion, requires preserving object, preserving object identities, generation requires

Notes: Project page: [sela-omer.github.io/traj-loc](https://sela-omer.github.io/traj-loc/). Code: [github.com/Sela-Omer/traj-loc](https://github.com/Sela-Omer/traj-loc)

Abstract:Controlling the motion of multiple objects in image-to-video (I2V) generation requires preserving object identities while enforcing adherence to distinct target trajectories. This becomes particularly challenging as the number of objects increases and their paths intersect or occlude one another. Existing approaches entangle multiple trajectories within a shared, dense conditioning signal, making object-level correspondence difficult to preserve in crowded scenes. We depart from this paradigm and enforce a strict, per-object spatial constraint that isolates instances independently. Our method, TrajLoc, achieves this directly within the attention layers by substituting the cross-attention weights of each object token with a Gaussian heatmap centered on its target location at every frame. The same per-object token interface carries trajectory and depth through a learned embedding and preserves identity by encoding first-frame appearance in place of an object token. Evaluations across six datasets, featuring up to 20 simultaneously controlled objects and out-of-distribution real-world scenes, demonstrate that our method consistently improves both visual fidelity and trajectory adherence. Applied to two architecturally distinct backbones (CogVideoX 5B and Wan 2.1 14B), our approach achieves average gains of +4.3 dB PSNR and a 51% reduction in trajectory endpoint error compared to the strongest baselines.
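The core substitution described in this abstract, replacing an object token's cross-attention weights with a Gaussian heatmap centered on its target location, can be illustrated with a small sketch. This is not the authors' implementation: the function name `gaussian_heatmap`, the grid size, the `sigma` value, and the choice to normalize the map so it sums to 1 are all assumptions made for illustration.

```python
import math


def gaussian_heatmap(height, width, center_xy, sigma):
    """Return a height x width grid of weights that sums to 1 and peaks at center_xy.

    Hypothetical helper: the normalization and sigma are illustrative choices,
    not values taken from the paper.
    """
    cx, cy = center_xy
    grid = [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]
    total = sum(sum(row) for row in grid)
    return [[value / total for value in row] for row in grid]


# Assumed example: an 8x8 latent grid with the object's target at (x=5, y=2).
heatmap = gaussian_heatmap(8, 8, center_xy=(5, 2), sigma=1.5)

# Locate the largest weight to confirm the map peaks at the target cell.
peak_value, peak_y, peak_x = max(
    (value, y, x) for y, row in enumerate(heatmap) for x, value in enumerate(row)
)
```

In a per-frame setting, one such map would be built for each object at each frame from that frame's trajectory point, so that each object token attends only to the neighborhood of its own target.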

42. 【2607.00858】MoVA: Learning Asymmetric Dual Projections for Modular Long Video-Text Alignment

Link: https://arxiv.org/abs/2607.00858

Authors: Peiyuan Zhu, Shaoan Xie, Zijian Li, Yifan Shen, Namrata Deka, Harsh Shrivastava, Guangyi Chen, Kun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Contrastive pre-training, predecessors like CLIP, propelled video-text alignment, resulting in entangled, pre-training has propelled

Notes: ECCV 2026

Abstract:Contrastive pre-training has propelled video-text alignment, yet models often inherit the critical limitations of their image-text predecessors like CLIP, resulting in entangled representations. These challenges are severely exacerbated by two fundamental properties in the video domain: Temporal Misalignment, where textual descriptions often correlate only to specific, constrained temporal windows, leaving other frames text-irrelevant; and Semantic Asymmetry, which dictates a sparse, bidirectional, and non-equivalent relevance between frame-level visual details and caption-level concepts. This failure persists whether captions are short and temporally disjoint, creating ambiguity, or long and detailed, fostering entanglement between static objects and their temporal evolution. In this paper, we establish theoretical conditions that enable flexible alignment between video and text representations across the temporal dimension and at varying levels of granularity. Building on these theoretical insights, we introduce MoVA, Modular Long Video-Text Alignment, which learns dual asymmetric projections: a text-side projection that adaptively selects frame-aware subspaces of the caption, and a video-side projection that disentangles text-relevant visual concepts. Our framework ensures that the model can preserve global cross-modal semantics while disentangling evolving, frame-specific concepts and scale naturally to long captions and videos. Empirical evaluations show that MoVA outperforms existing methods in multiple video-text alignment tasks, demonstrating the effectiveness of our method.

43. 【2607.00850】Mirror-Fusion Attention for Reflection-Aware Self-Supervised Representation Learning

Link: https://arxiv.org/abs/2607.00850

Authors: Ruixin Li, Jin Liu, Yuling Shi, Stefano Lodi

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: methods encourage invariance, suppress informative left, strict flip invariance, self-supervised learning, approximately bilateral data

Notes: Accepted at ECML PKDD 2026. The final authenticated version will be available in the Springer LNCS proceedings

Abstract:Most self-supervised learning (SSL) methods encourage invariance across augmentations, but strict flip invariance can suppress informative left-right correspondences in approximately bilateral data such as medical images and human faces. We propose Mirror-Fusion-Augmented Self-Supervised Learning (MFASSL), a Vision Transformer framework that injects a soft reflection prior into standard SSL without redesigning the backbone. MFASSL constructs mirror-paired views aligned to an estimated symmetry axis and introduces a lightweight Mirror-Fusion Attention (MFA) module for adaptive token-level interaction between mirrored regions while preserving asymmetric cues. The base SSL objective is further coupled with reflection-consistency and mid-layer token-alignment losses. Across CheXpert, BraTS, CelebA-HQ, and WFLW, MFASSL improves downstream performance, calibration, and reflection robustness over MoCo-v3, DINO, and MAE baselines under matched ViT-B/16 settings. It also achieves stronger and more consistent gains than recent equivariant SSL approaches with only approximately 2.7% additional parameters. These results show that lightweight geometry-aware priors can effectively complement invariance-based SSL.

44. 【2607.00839】Rethinking Multi-Label Image Classification With Deep Learning: Taxonomy, Challenge, and Outlook

Link: https://arxiv.org/abs/2607.00839

Authors: Xuelin Zhu, Xiu-Shen Wei, Jiawei Ge, Shuai Xu, Bing Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-label image classification, underpinning numerous real-world, mobile service robot, identifying multiple objects, Multi-label image

Abstract:Multi-label image classification (MLIC), a fundamental task in computer vision, focuses on identifying multiple objects or concepts within an image, underpinning numerous real-world applications, such as autonomous driving, disease diagnosis, recommendation systems, and mobile service robots. Over the past decade, deep learning paradigms based on convolutional neural networks, recurrent neural networks, and Transformers have significantly advanced this field, owing to their powerful capability in visual representation and relationship modeling. These advances have markedly improved the robustness, scalability, and generalization ability of MLIC models across diverse datasets and application domains. In this survey, we provide a comprehensive review of the deep learning-based literature on MLIC. Concretely, we first revisit the background, including problem definition, datasets, backbones, and evaluation metrics. Next, we develop a plausible taxonomy for the deep learning-based MLIC approaches, organizing them into six groups: region-oriented methods, label-oriented methods, architecture-oriented methods, representation-oriented methods, learning-oriented methods, and data-oriented methods. Finally, we provide an insightful exposition of the underlying learning game in MLIC and its implications for other vision domains, and we empirically summarize the key challenges and research directions in MLIC while outlining promising avenues for future development. We believe this survey offers the research community a holistic and systematic perspective on MLIC, thereby facilitating subsequent exploration and innovation in this field and beyond.

45. 【2607.00832】Pano2World: End-to-End 3D Generation via Unified Multi-View Sequences

Link: https://arxiv.org/abs/2607.00832

Authors: Zhenjia Li, Jinrang Jia, Yifeng Shi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: full visual sphere, true scene exploration, enabling true scene, single panorama captures, camera center

Notes: 10 pages, 3 figures, 3 tables. Preprint

Abstract:A single panorama captures the full visual sphere from one camera center, yet confines users to looking around in place without enabling true scene exploration. Converting a single panorama into a persistent, renderable 3D representation for free-viewpoint navigation has attracted growing interest; existing methods either adopt iterative per-view completion that propagates inpainting results to update the underlying geometry, leading to progressive error accumulation and cumbersome multi-step pipelines, or leverage the temporal consistency priors of video generation models, yet the continuous-trajectory constraint intrinsic to such models limits their flexibility in covering scenes from multiple directions simultaneously. We present Pano2World, which takes a single indoor panorama as input and directly outputs a persistent, explorable 3D Gaussian scene. Given the source panorama, Pano2World first reconstructs a coarse 3D Gaussian proxy and renders it at adaptively sampled nearby poses to obtain geometrically aligned guidance panoramas; a panoramic diffusion model then jointly denoises all target views via View-Aware Attention Routing, where each target view simultaneously receives geometric constraints from its corresponding guidance panorama and global semantic guidance from the source panorama, naturally enforcing cross-view consistency. To avoid the information loss incurred by decoding the multi-view hidden features formed during joint denoising back to the pixel domain via VAE, we introduce Latent Feature Adapter, a geometry-aware bridge module that directly distills these hidden features into a scene latent, subsequently decoded into the final 3D Gaussian scene. Experiments demonstrate that Pano2World significantly outperforms existing methods on the multi-position panoramic novel-view synthesis benchmark.

46. 【2607.00829】Stitched Embeddings: A Unified Latent Space for 3D Garments and 2D Patterns

Link: https://arxiv.org/abs/2607.00829

Authors: Andrea Sanchietti, Riccardo Marin, Bharat Lal Bhatnagar, Yuanlu Xu, Gerard Pons-Moll

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: realistic digital humans, topological variety makes, digital humans, parametric bodies, essential for realistic

Abstract:While garments are essential for realistic digital humans, their topological variety makes them much harder to model than parametric bodies. Traditional tailoring relies on 2D sewing patterns, yet bridging these patterns to 3D geometry currently requires physical simulations. We present Stitched Embeddings, the first simulation-free framework to unify 3D garment reconstruction and sewing pattern inference within a single bidirectional latent space. By leveraging the geometric priors of a pretrained 3D foundation model, our approach overcomes the data scarcity typically associated with high-quality garment modeling. We propose to use the BoxMesh as a critical intermediate representation to align 2D panels into 3D configurations without the computational overhead of a simulator. This architecture achieves state-of-the-art accuracy in pattern reconstruction while significantly improving efficiency. Furthermore, our differentiable pipeline enables novel applications, including pattern recovery from meshes and 3D editing from 2D patterns. Finally, this work provides a scalable link between neural 3D vision and the physical garment manufacturing pipeline. Project Page: this https URL

47. 【2607.00817】Training-Free Debiasing of Diffusion Models via CLIP-Guided Denoising Optimization

Link: https://arxiv.org/abs/2607.00817

Authors: Dain Kim, Jinseo Kim, Sungyong Baik

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieve impressive visual, neutral prompts consistently, prompts consistently produce, consistently produce stereotypical, produce stereotypical representations

Abstract:Text-to-image diffusion models achieve impressive visual quality, yet demographic bias remains a challenge, as neutral prompts consistently produce stereotypical representations across gender and race. Existing approaches remain limited by costly retraining or by inference-time interventions that often degrade image quality and semantic alignment. We propose Text Embedding Steering (TES), a training-free framework that mitigates demographic bias by directly optimizing conditional text embeddings during the diffusion process. We show that a two-stage strategy - early-stage global alignment followed by iterative denoising-time refinement with CLIP-based feedback - enables stable and controllable attribute steering without modifying model parameters. Extensive experiments on Stable Diffusion demonstrate that TES outperforms existing training-free baselines in fairness while maintaining competitive image quality. These results highlight that inference-time text embedding optimization is a practical and scalable solution for fairness-aware generation in diffusion models.

48. 【2607.00816】Towards High-Resolution Visual Perception via Hierarchical Entity Exploration

Link: https://arxiv.org/abs/2607.00816

Authors: Ziyu Ma, Shidong Yang, Yuxiang Ji, Yiming Hu, Tongwen Huang, Yong Wang, Jianfei Cai, Xiangxiang Chu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal large language, large language models, Hierarchical Entity Exploration, remains a key, key challenge

Notes: Accepted by ECCV 2026

Abstract:High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs), as fine-grained details are often lost when the image is processed as a whole. Existing methods either require training to teach models where to look or heuristically divide the image into fixed regions, both of which struggle to generalize in complex HR scenes. In this work, we propose Hierarchical Entity Exploration (HEE), a training-free and model-agnostic framework that transforms static image understanding into dynamic, query-guided entity exploration. HEE first evaluates each region using a dual scoring mechanism to determine whether it already contains sufficient evidence to answer the question. If not, it applies object detection within the most promising region to extract fine-grained entities, clusters them into coherent subregions, and organizes them into a multi-level semantic hierarchy for deeper exploration. When deeper regions still fail to yield confident answers, a confidence-guided backtracking mechanism revisits alternative paths to ensure adaptive perception. Extensive results show that HEE outperforms training-free methods like ZoomEye and RAP in both accuracy and efficiency on two complex HR benchmarks (Visual Probe and HR-Bench), across different MLLMs such as Qwen2.5-VL and LLaVA-OneVision. Moreover, HEE demonstrates generalization on the MME-RealWorld benchmark.

49. 【2607.00804】Spotted: Location-informed Reidentification of Hyenas and Leopards in Camera Trap Surveys

Link: https://arxiv.org/abs/2607.00804

Authors: Halil Sina Kelebek, Julia Hindel, Kobus Hoffman, Lauren Hoffman, Andrew Loveridge, Bob Mandinyenya, Kudakwashe Ncube, Justin Seymour-Smith, Andrea Sibanda, Abhinav Valada, Matthew Wijers, Daniele De Martini

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low image quality, camera-trap surveys remains, highly imbalanced numbers, surveys remains challenging, remains challenging due

Abstract:Animal re-identification (ReID) in camera-trap surveys remains challenging due to low image quality, strong variation in illumination and viewpoint, and highly imbalanced numbers of observations per individual. As a result, current ReID performance is often insufficient for fully automated use, and practical workflows typically depend on expert review of algorithmically proposed candidate matches. Moreover, most existing approaches focus almost exclusively on visual cues and overlook auxiliary information routinely available in field studies, such as image timestamps and camera-trap locations. We introduce Spotted, a location-informed, human-in-the-loop animal ReID framework that integrates visual similarity with spatio-temporal feasibility priors derived from camera locations, thereby reducing the amount of required expert review. Our method (i) computes an image-model-agnostic feasibility score based on the minimum travel speed required for two detections to correspond to the same individual, (ii) uses these feasibility cues as pseudo-supervision to train a lightweight head on top of a frozen visual foundation model, and (iii) fuses adapted visual similarity with spatio-temporal feasibility to obtain a robust pairwise matching score. We additionally integrate an active pair sampling strategy to accelerate annotation by initially prioritizing uncertain predictions. We evaluate Spotted on three challenging camera-trap ReID datasets comprised of spotted hyenas and leopards, which we release as part of this work. Our model improves average top-5 identification accuracy by 9pp, 2pp and 9pp over the best baseline on our LeopardID102, SpottedHyenaID109 and SpottedHyenaID415 datasets, respectively. Further, we show that our human-in-the-loop strategy reduces the number of queried comparisons by up to 69pp while achieving equivalent positive matches.
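Step (i) of this abstract, a feasibility score based on the minimum travel speed needed for two detections to be the same individual, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formula: the haversine distance, the `max_speed_kmh` threshold, and the mapping from required speed to a score in [0, 1] are hypothetical choices.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def required_speed_kmh(det_a, det_b):
    """Minimum straight-line speed for one animal to appear in both detections.

    Each detection is a (latitude, longitude, unix_seconds) tuple.
    """
    distance = haversine_km(det_a[0], det_a[1], det_b[0], det_b[1])
    hours = abs(det_b[2] - det_a[2]) / 3600.0
    if hours == 0:
        return 0.0 if distance == 0 else math.inf
    return distance / hours


def feasibility(det_a, det_b, max_speed_kmh):
    """Assumed scoring rule: 1.0 when the pair is reachable, decaying above the limit."""
    speed = required_speed_kmh(det_a, det_b)
    return 1.0 if speed <= max_speed_kmh else max_speed_kmh / speed


# Hypothetical detections: same camera one hour apart, then two cameras
# one degree of latitude apart (about 111 km) one hour apart.
same_site = feasibility((-19.0, 26.0, 0), (-19.0, 26.0, 3600), max_speed_kmh=50.0)
far_apart = feasibility((-19.0, 26.0, 0), (-18.0, 26.0, 3600), max_speed_kmh=50.0)
```

A score like this depends only on timestamps and camera coordinates, which is what makes it independent of the image model, as the abstract notes.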

50. 【2607.00798】ClinRAG-GRAPH: Clinical-prior Retrieval-Augmented Graph Model with Domain Adversarial Learning for Breast pCR Prediction

Link: https://arxiv.org/abs/2607.00798

Authors: Yaofei Duan, Yuhao Huang, Tianyu Zhang, Yuan Gao, Luyi Han, Xin Wang, Xinyu Xie, Xinglong Liang, Chunyao Lu, Muzhen He, Patrick Pang, Yue Sun, Ning Mao, Tao Tan, Ritse Mann

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Neoadjuvant chemotherapy, important for treatment, treatment stratification, pre-treatment pCR prediction, Neoadjuvant

Notes: 11 pages, 5 figures

Abstract:Neoadjuvant chemotherapy (NAC) response prediction is clinically important for treatment stratification in breast cancer. However, robust pre-treatment pathological complete response (pCR) prediction remains challenging due to insufficient cross-modal modeling, multicenter imaging heterogeneity, and weak evidence-grounded interpretability. We propose ClinRAG-GRAPH, a Clinically informed Retrieval-Augmented Generation Graph framework, for pre-treatment pCR prediction from DCE-MRI, structured clinical variables, and biopsy-derived pathological biomarkers. ClinRAG-GRAPH constructs an intra-patient clinical-prior graph and applies a prior-guided relation-aware graph convolutional network for structured multimodal representation learning. To improve cross-center robustness, we introduce a dual-branch domain-adversarial learning strategy to suppress protocol-related MRI bias while preserving pCR-relevant features. To enhance interpretability, we further incorporate a large language model (LLM)-driven subgraph RAG module that retrieves clinically analogous historical cases and integrates retrieved evidence for pCR inference. We assemble a large-scale multicenter NAC breast cancer cohort for extensive validation, drawing from two public sources and three in-house this http URL show that ClinRAG-GRAPH achieves AUCs of 0.815 on the internal test set and 0.774/0.712 on two external test sets, demonstrating robust pre-treatment pCR prediction across centers. The code is available at the anonymized this https URL.

51. 【2607.00784】LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives

Link: https://arxiv.org/abs/2607.00784

Authors: Lukas Kuhn, Giuseppe Serra, Randall Balestriero, Florian Buettner

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: vision-only self-supervised learning, pretraining remains dominated, largely adopted non-contrastive, Vision-language pretraining remains, adopted non-contrastive methods

Abstract:Vision-language pretraining remains dominated by contrastive objectives, whereas vision-only self-supervised learning has largely adopted non-contrastive methods. At the same time, the role of vision-language encoders has shifted: they are increasingly deployed not as zero-shot classifiers but as the frozen visual backbone of vision-language models and dense prediction systems, which consume the full grid of patch tokens rather than a single pooled embedding. We introduce LeVLJEPA, the first fully non-contrastive end-to-end vision-language pretraining method. LeVLJEPA learns through cross-modal prediction with stop-gradient targets and per-modality distributional regularization, without negatives, temperature, momentum encoder, or teacher-student schedule, and trains stably at large scale. We find that the resulting encoder provides markedly stronger dense semantic features for downstream use: as a frozen vision-language-model backbone, LeVLJEPA is the strongest of the evaluated encoders across GQA, VQAv2, and POPE under two distinct language models, and outperforms contrastive baselines on semantic segmentation, while remaining on par on global readouts such as linear probing. These results establish non-contrastive pretraining as an effective means of producing dense semantic vision features.

52. 【2607.00780】SpiralFovea: Input-Adaptive Foveated Tokenization as a Third Lever of Resource-Adaptive Inference

Link: https://arxiv.org/abs/2607.00780

Authors: Kyan Mahajan, Mohammad Saqlain

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dynamic attention sparsity, early exit, MoE routing, KV-cache compression, dynamic attention

Abstract:Most adaptive-inference techniques for foundation models change what the model does - early exit, MoE routing, KV-cache compression, dynamic attention sparsity. The input that hits the backbone, however, remains a fixed-grid tokenisation indifferent to image content. We argue that this is a missed lever. We present SpiralFovea, a parameter-free, input-adaptive tokeniser in which token identity, location, scale, and count are all functions of local visual entropy and selection completes before any backbone parameter is queried. Around content-driven hotspot anchors, multi-scale spiral rings produce = 78 patches that replace the standard 196-patch ViT grid at the input stage. Across four canonical fine-grained benchmarks, SpiralFovea yields +1.7-2.1 pp accuracy with a 60% reduction in input tokens, an 84% reduction in self-attention FLOPs at every transformer layer, and 18-29% throughput gains over the matched static tokenisation baseline. A controlled ablation on CUB-200-2011 Genus across four backbones reveals a clean diagnostic: the gain magnitude tracks inversely with the strength of the backbone's whole-image positional prior, isolating self-supervised foundation models as the regime where input-adaptive tokenisation is most valuable.
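The token and FLOP figures quoted in this abstract are mutually consistent, which a short calculation shows. The sketch assumes the self-attention cost being counted is the pairwise term that scales with the square of the token count; that scaling assumption is ours, not a statement from the paper.

```python
# Token counts taken from the abstract: 78 foveated patches vs. the 196-patch ViT grid.
baseline_tokens = 196
foveated_tokens = 78

# Fraction of input tokens removed.
token_reduction = 1 - foveated_tokens / baseline_tokens

# Assumption: the self-attention cost considered here grows quadratically with
# the number of tokens, so the saving is one minus the squared token ratio.
attention_flop_reduction = 1 - (foveated_tokens / baseline_tokens) ** 2

print(f"token reduction: {token_reduction:.1%}")
print(f"self-attention FLOP reduction: {attention_flop_reduction:.1%}")
```

Under that assumption the two values round to the 60% and 84% reductions reported above.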

53. 【2607.00774】Soft Mixture-of-Recursions: Going Deeper with Recursive Vision Transformers

Link: https://arxiv.org/abs/2607.00774

Authors: Sang In Lee, Jihun Park

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Recent recursive Transformer, primarily reused shared, recursive Transformer studies, Recent recursive, Vision Transformers

Notes: 16 pages, 8 figures

Abstract:Recent recursive Transformer studies have primarily reused shared parameters across computation steps to construct compact, parameter-efficient models. In this work, we leverage recursion to build effectively deeper Transformers with stronger representational capacity. However, in Vision Transformers, simply increasing recursion depth does not reliably improve performance, as existing recursive approaches do not fully utilize the intermediate representations produced throughout recursive computation. We propose Soft Mixture-of-Recursions (SoftMoR) and its Vision Transformer instantiation, Soft Recursive Vision Transformer (SR-ViT). SoftMoR learns token-wise mixture weights to softly combine outputs from all recursion steps, allowing intermediate representations to be utilized in a learnable and flexible way. Across diverse vision tasks, SR-ViT consistently improves as recursion depth increases with minimal parameter overhead. On ImageNet-1K, increasing recursion depth from 1 to 4 improves SR-ViT-S top-1 accuracy from 79.83% to 82.48% with only 1.7M additional parameters, outperforming the substantially larger DeiT-B while using approximately 27% of its parameters. These results demonstrate that SoftMoR provides a parameter-efficient path to deeper and stronger Vision Transformers through recursion.
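The idea of softly combining the outputs of all recursion steps with token-wise mixture weights can be sketched for a single token as below. This is an illustrative reading of the abstract only: the softmax normalization, the names `soft_mix` and `step_logits`, and the toy vectors are assumptions, and how the logits would be produced in the actual model is not specified here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_mix(step_outputs, step_logits):
    """Blend one token's representations from R recursion steps.

    step_outputs: list of R vectors, one per recursion step (hypothetical layout).
    step_logits:  list of R scores for this token; assumed to be learned.
    """
    weights = softmax(step_logits)
    dim = len(step_outputs[0])
    return [
        sum(w * output[d] for w, output in zip(weights, step_outputs))
        for d in range(dim)
    ]


# Toy example with two recursion steps and 2-dimensional token features.
outputs = [[1.0, 0.0], [0.0, 1.0]]
balanced = soft_mix(outputs, step_logits=[0.0, 0.0])      # equal weight per step
late_heavy = soft_mix(outputs, step_logits=[-50.0, 50.0])  # dominated by the last step
```

Because the weights are computed per token, different tokens could in principle draw on different recursion depths, which is the flexibility the abstract attributes to SoftMoR.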

54. 【2607.00766】Decoupled Guidance: Disentangling Subject and Context Pathways in Text-to-Image Personalization

Link: https://arxiv.org/abs/2607.00766

Authors: Seongmin Kim, Kyucheol Shin, Heesun Jung, Jinseo Kim, Sungyong Baik

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aims to generate, generate a user-provided, subject, user-provided subject

Abstract:Text-to-image personalization aims to generate a user-provided subject in novel scenes described by text. However, most existing methods encode subject identity (fidelity) and context (editability) through the same conditioning pathway, forcing the two to compete for attention-map resources. We refer to this phenomenon as conditioning entanglement and show that it induces a fidelity-editability trade-off. We further provide causal evidence by replacing the target subject token with a generic subject token, which produces shifts in attention allocation and corresponding changes in context adherence. To this end, we propose Decoupled Guidance (DeGu), a plug-and-play framework that routes subject identity and scene context through two independent guidance streams. We further introduce a spatial mixing mechanism that dynamically fuses these streams, ensuring each operates within its semantically relevant region without interference. Furthermore, DeGu can be readily applied to existing personalization methods without modifying the underlying backbone models, consistently improving the overall personalization performance while enabling inference-time control over the fidelity-editability balance, across diverse methods and backbones, including flow-matching Diffusion Transformers (DiTs).

55. 【2607.00752】GKDT: General Keypoint Detection Transformer

Link: https://arxiv.org/abs/2607.00752

Authors: Changsheng Lu, Yuxin Chen, Haokun Gui, Rong Wang, Jie Yang, Harry Yang, Anton van den Hengel, Jiaya Jia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: general keypoint detection, computer vision, open-domain recognition, pre-trained vision, shifting from narrow-domain

Notes: Accepted by ECCV 2026

Abstract:With the emergence of various pre-trained vision and language models, computer vision is shifting from narrow-domain to open-domain recognition. The construction of a more powerful yet general keypoint detection (GKD) model to support diverse tasks has become increasingly important in the field. To this end, we first present a large-scale unified keypoint dataset called MegaKPT. The dataset is composed of over 1.3 million diverse object instances from twenty-nine existing datasets, and has high-quality unified annotations with keypoint text descriptions. Based on MegaKPT, we develop GKDT, a simple, flexible, and powerful DINOv3-based Transformer model for General Keypoint Detection. Our GKDT supports visual prompts, text prompts, or both. To enhance model training, we also propose a suite of useful strategies such as mix-modal prompted training and dynamic importance sampling. By testing over 22 test sets with seen or unseen objects, our single GKDT model shows strong performance and generality in detecting keypoints on broad categories, with most categories over 90% PCK@0.1 accuracy, offering high practical applicability to real-world problems. The dataset, models, and code will be released at this https URL.
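For readers unfamiliar with the PCK@0.1 metric cited in this abstract, here is a minimal sketch of how it is commonly computed. The normalization by the longer side of the object bounding box is one widespread convention and is an assumption here; the paper may normalize differently, and the function name `pck` and the sample coordinates are hypothetical.

```python
import math


def pck(predicted, ground_truth, bbox_wh, alpha=0.1):
    """Percentage of Correct Keypoints.

    A predicted keypoint counts as correct when its distance to the ground
    truth is at most alpha times the longer bounding-box side (assumed
    normalization; other conventions exist).
    """
    threshold = alpha * max(bbox_wh)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(ground_truth)


# Hypothetical object with a 100x50 box, so the threshold is 10 pixels.
predicted = [(10.0, 10.0), (50.0, 50.0), (80.0, 20.0)]
ground_truth = [(12.0, 11.0), (50.0, 65.0), (80.0, 20.0)]
score = pck(predicted, ground_truth, bbox_wh=(100.0, 50.0))
```

Here the second keypoint is 15 pixels off and misses the 10-pixel threshold, so two of three keypoints count as correct.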

56. 【2607.00748】FrameONE: Hierarchical Motion Modeling for Universal Multi-View Echocardiographic Keyframe Detection

Link: https://arxiv.org/abs/2607.00748

Authors: Rusi Chen, Yuhao Huang, Hongyuan Zhang, Chao Tian, Shunan Ji, Yuhan Zhang, Dong Ni

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate detection, frames is fundamental, Accurate, echocardiographic assessment, motion

Notes: Accepted by MICCAI 2026. 10 pages, 4 figures

Abstract:Accurate detection of end-systole (ES) and end-diastole (ED) frames is fundamental to echocardiographic assessment. Existing methods are typically developed in a view-specific manner and depend on auxiliary annotations or intensive visual modeling, which limits their generalizability. In multi-view modeling, keyframe detection is driven by shared cardiac motion, yet large differences in appearance and motion patterns make unified modeling challenging. To address these issues, we propose FrameONE, a unified end-to-end framework for multi-view echocardiographic keyframe detection. FrameONE introduces a Hierarchical Motion Modeling strategy: an intra-view multi-task learning module reduces appearance bias and promotes motion-focused representations within each view; an inter-view general motion learning module further separates view-agnostic dynamics from view-specific patterns, enabling shared yet flexible motion representation learning across views. Extensive experiments on 25,872 videos spanning four standard views demonstrate that FrameONE achieves state-of-the-art keyframe detection accuracy with strong cross-view generalization. Code is available at this https URL.

57. 【2607.00747】Active Learning for Cascaded Object Detection: Balancing Coverage and Uncertainty in Table Extraction Pipelines

Link: https://arxiv.org/abs/2607.00747

Authors: Eliott Thomas, Mickael Coustaty, Aurelie Joseph, Gaspar Deloin, Vincent Poulain d'Andecy, Jean-Marc Ogier

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Table Structure Recognition, Structure Recognition, business documents relies, internal layout, recovers their internal

Notes: Accepted at ICDAR 2026

Abstract:Table extraction from business documents relies on a cascaded pipeline where Table Detection (TD) first localizes tables and Table Structure Recognition (TSR) then recovers their internal layout. Building task-specific training sets for this pipeline is costly, particularly for TSR which requires fine-grained structural annotations. Active learning (AL) can reduce this annotation burden, yet most AL strategies are designed for single-model tasks and do not account for inter-stage dependencies in cascaded architectures. In this work, we present the first adaptation of Uncertainty Herding (UHerding), a hybrid coverage-uncertainty sampling method originally proposed for image classification, to cascaded object detection pipelines. We propose two pipeline-aware extensions that exploit the TD-to-TSR dependency: RankFusion adds dual-manifold coverage over both detection and structure representation spaces, while CAPA further incorporates stage-dependent gating and per-task uncertainty calibration. Extensive experiments across two public (PubTables-1M and FinTabNet) and two private table extraction datasets, with various annotation budgets (from 71 to 500 documents) show that UHerding generalizes well to table extraction, outperforming each baseline. Among pipeline-aware variants, RankFusion achieves higher expected gains but at the cost of greater variance, while CAPA emerges as the most consistent strategy, outperforming standard UHerding on three out of four datasets.

58. 【2607.00746】GaussianFusion: Unified 3D Gaussian Representation for Multi-Modal Fusion Perception

Link: https://arxiv.org/abs/2607.00746

Authors: Xiao Zhao,Chang Liu,Mingxu Zhu,Zheyuan Zhang,Linna Song,Qingliang Luo,Chufan Guo,Kuifeng Su

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enables multi-sensor features, representation enables multi-sensor, enables multi-sensor, Gaussian, BEV

Notes: ICLR 2026

Abstract:The bird's-eye view (BEV) representation enables multi-sensor features to be fused within a unified space, serving as the primary approach for achieving comprehensive 3D perception. However, the discrete grid representation of BEV leads to significant detail loss and limits feature alignment and cross-modal information interaction in multimodal fusion perception. In this work, we break from the conventional BEV paradigm and propose a new universal framework for multi-modal fusion based on 3D Gaussian representation. This approach naturally unifies multi-modal features within a shared and continuous 3D Gaussian space, effectively preserving edge and fine texture details. To achieve this, we design a novel forward-projection-based multi-modal Gaussian initialization module and a shared cross-modal Gaussian encoder that iteratively updates Gaussian properties based on an attention mechanism. GaussianFusion is inherently a task-agnostic model, with its unified Gaussian representation naturally supporting various 3D perception tasks. Extensive experiments demonstrate the generality and robustness of GaussianFusion. On the nuScenes dataset, it outperforms the 3D object detection baseline BEVFusion by 2.6 NDS. Its variant surpasses GaussFormer on 3D semantic occupancy with 1.55 mIoU improvement while using only 30% of the Gaussians and achieving a 450% speedup.

59. 【2607.00745】Foundation Model-driven Key Anatomy Frame Selection for Blind-sweep Ultrasound Fetal Birth Weight Estimation

Link: https://arxiv.org/abs/2607.00745

Authors: Le Ou,Xiliang Zhu,Huanwen Liang,Wenxiong Pan,Yuhao Huang,Yuxiang Deng,Xuan Sheng,Hong Yin,Juhua Xiao,Xin Zhou,Dong Ni

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fetal birth weight, Accurate fetal birth, birth weight, operator expertise, low-resource settings

Notes: Accepted by MICCAI 2026. 10 pages, 2 figures. Code: [this https URL](https://github.com/ouleoule/BlindSweep-EBW)

Abstract:Accurate fetal birth weight (FBW) estimation shortly before delivery is clinically valuable yet challenging due to its reliance on operator expertise, particularly in low-resource settings. To reduce this reliance, we study near-term birth-weight regression from blind-sweep ultrasound (US) videos acquired within 48 hours prior to delivery, with post-delivery weighing as ground truth. Accordingly, we propose a foundation model-driven key anatomy frame selection framework that enables accurate FBW regression despite the absence of plane constraints in blind sweeps. Our highlights are as follows: (1) We believe this is the first work to estimate FBW using blind-sweep US videos, enabling operator-independent assessment. (2) An Anatomy-Guided Frame Selection module equipped with a vision-language foundation model is proposed for keyframe collection in unconstrained sweeps. (3) A Redundancy-Aware Feature Compression module is designed to compress frame features while preserving task-relevant information, alleviating temporal redundancy. Extensively validated on prospectively collected data from 839 patients, our method achieves an MAE of 161.3 g, with 90.23% and 100% of cases falling within 10% and 15% absolute percentage error, outperforming typical Hadlock estimation and strong competitors. Codes are available at this https URL.
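
The reported figures (MAE in grams and the share of cases within 10% and 15% absolute percentage error) are standard regression metrics. A minimal sketch of how such a report could be computed follows; the function name `regression_report` and the sample weights are hypothetical and are not taken from the paper or its code.

```python
def regression_report(predicted_g, actual_g, thresholds=(0.10, 0.15)):
    """Return MAE in grams and the share of cases within each APE threshold."""
    if len(predicted_g) != len(actual_g) or not actual_g:
        raise ValueError("inputs must be non-empty and of equal length")
    # Absolute error per case, in grams.
    abs_errors = [abs(p - a) for p, a in zip(predicted_g, actual_g)]
    mae = sum(abs_errors) / len(abs_errors)
    # Absolute percentage error relative to the measured birth weight.
    ape = [err / a for err, a in zip(abs_errors, actual_g)]
    within = {t: sum(e <= t for e in ape) / len(ape) for t in thresholds}
    return mae, within


# Made-up weights in grams, for illustration only (not data from the paper).
predicted = [3200.0, 3500.0, 2900.0, 4100.0]
actual = [3300.0, 3450.0, 3300.0, 4000.0]
mae, within = regression_report(predicted, actual)
print(mae, within)
```

Reporting the threshold shares alongside MAE shows how many individual estimates are clinically usable, which an average error alone can hide.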

60. 【2607.00744】Prototype Memory-Guided Training-Free Anomaly Classification and Localization in Prenatal Ultrasound

Link: https://arxiv.org/abs/2607.00744

Authors: Huanwen Liang,Yuhao Huang,Xiliang Zhu,Yuanji Zhang,Xuedong Deng,Xinru Gao,Guowei Tao,Yuhan Zhang,Dong Ni

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: pregnancy management, critical importance, importance for fetal, fetal health, health and pregnancy

Notes: Accepted by MICCAI 2026

Abstract:Prenatal anomaly classification and localization is of critical importance for fetal health and pregnancy management. Although ultrasound (US) is the primary modality for prenatal screening, accurate diagnosis remains challenging due to the low prevalence and high heterogeneity of anomalies. Existing deep learning methods for prenatal tasks rely on large-scale annotated datasets, which are difficult to obtain in practice. Although few-shot learning alleviates data scarcity, it typically requires fine-tuning for new categories, limiting its practicality in resource-limited clinical settings. To address these challenges, we propose a training-free framework for multi-class prenatal US anomaly classification and localization that operates with only a few reference images per class, representing the first exploration of this setting. Our framework comprises three key components: (1) a memory bank with multi-granular prototypes that explicitly models both class-level semantics and anomaly characteristics; (2) a prototype-driven soft merging mechanism that aggregates discriminative features to detect the anomaly region; and (3) a class-aware refinement strategy that leverages prototype consistency to improve category prediction. Extensively validated on a multi-center prenatal US dataset containing 1,149 cases, with a total of 2,357 images and 9 categories, our proposed method outperforms the competitors.

61. 【2607.00736】Towards Robust Driving Perception: A Flexible Scale-Driven Family for Self-Supervised Monocular Depth Estimation

Link: https://arxiv.org/abs/2607.00736

Authors: Zhaowen Zhu,Li Zhang,Yujie Chen,Tian Zhang,Yingjie Wang,Mingxia Zhan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Monocular Depth Estimation, recent years due, Self-Supervised Monocular Depth, Depth Estimation, ground truth

Notes: Accepted by ECCV 2026. Code is available at [this https URL](https://github.com/startnew/flexdepth)

Abstract:Self-Supervised Monocular Depth Estimation (MDE) has garnered attention in recent years due to its independence from ground truth. However, most existing models are limited to a single scale and exhibit considerable performance degradation in complex driving environments. Networks specifically designed to handle dynamic traffic participants tend to be overly complex, hindering their deployment on resource-constrained automotive edge devices. To address these limitations and move towards robust driving perception, we propose FlexDepth, a scale-driven and flexible family of self-supervised MDE models tailored for challenging road scenarios. FlexDepth employs a two-stage static-dynamic decoupled training strategy, enabling the independent assessment of confidence for both static backgrounds and dynamic road objects. Furthermore, it introduces a meticulously designed Scale-Driven Decoder (SDD) to dynamically select components based on scale size, facilitating efficient feature fusion and the output of high-precision depth maps. Extensive experiments on standard driving benchmarks demonstrate that without any auxiliary information, our model achieves state-of-the-art performance across arbitrary scales with minimal computational overhead. Our smallest model, Flex-Nano, requires only 0.7 GFLOPs and achieves 37.6 FPS on mobile platforms, ensuring reliable real-time perception while maintaining excellent zero-shot this http URL source code is available: this https URL

62. 【2607.00734】ConRTF: Edge-Constrained Boundary Distribution Refinement for Realtime TransFormer Table Structure Recognition

Link: https://arxiv.org/abs/2607.00734

Authors: Eliott Thomas,Tri-Cong Pham,Mickael Coustaty,Aurelie Joseph,Gaspar Deloin,Vincent Poulain d'Andecy,Jean-Marc Ogier,Antoine Doucet

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Table Structure Recognition, Structure Recognition, document understanding pipelines, document images, Table Structure

Notes: Accepted to ICDAR 2026

Abstract:Table Structure Recognition (TSR) aims to recover the row and column layout of tables from document images, a key step in document understanding pipelines. Accurate TSR depends on precise boundary localization: small errors in row or column boundaries can propagate into incorrect cell assignments and structural inconsistencies. Yet detection-based approaches treat table elements as generic objects, ignoring a fundamental property of table layout: rows and columns play structurally distinct roles and their boundaries carry unequal importance. We propose an Edge-constrained Fine-grained Localization loss (EFL) that formalizes this structural asymmetry by encoding table-specific geometric priors into the training objective: row-like elements are supervised with emphasis on their horizontal boundaries, while column-like elements prioritize vertical boundaries. Implemented within a real-time detector with distribution-based boundary refinement (D-FINE), EFL operates during training only and guides boundary refinement toward structurally meaningful adjustments with no change to the inference pipeline. The proposed approach, ConRTF, is also data-efficient, maintaining robust accuracy with as few as 2k--3k annotated tables. Experiments on PubTables-1M and two private datasets show consistent improvements over the optimized baseline and several real-time detectors including RT-DETRv2 and YOLOv10-11, with gains of up to +1.6 GriTS points at equal inference speed.
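
The structural asymmetry described above can be illustrated with a toy edge-weighted box loss. This is only a sketch under stated assumptions: it uses a plain L1 distance on box coordinates rather than the distribution-based refinement of D-FINE, it reads "horizontal boundaries" as the top and bottom edges of a box, and the function name `edge_weighted_l1` and the `emphasis` parameter are hypothetical, not part of EFL as published.

```python
def edge_weighted_l1(pred_box, gt_box, kind, emphasis=2.0):
    """Weighted L1 distance between two boxes given as (x1, y1, x2, y2).

    kind: "row" emphasizes the top/bottom edges (y1, y2);
          "column" emphasizes the left/right edges (x1, x2).
    """
    if kind not in ("row", "column"):
        raise ValueError("kind must be 'row' or 'column'")
    # Indices of the coordinates that receive the larger weight.
    emphasized = (1, 3) if kind == "row" else (0, 2)
    weights = [emphasis if i in emphasized else 1.0 for i in range(4)]
    total = sum(w * abs(p - g) for w, p, g in zip(weights, pred_box, gt_box))
    # Normalize so the loss scale does not depend on the emphasis value.
    return total / sum(weights)


# The same vertical misalignment costs more for a row than for a column.
pred = (0.0, 1.0, 10.0, 5.0)
gt = (0.0, 0.0, 10.0, 4.0)
row_loss = edge_weighted_l1(pred, gt, "row")
column_loss = edge_weighted_l1(pred, gt, "column")
print(row_loss, column_loss)
```

Because the weighting applies only to the training objective, a scheme of this kind leaves the inference pipeline unchanged, which matches the property the abstract claims for EFL.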

63. 【2607.00726】AV-SyncBench: Decoupled Benchmarking of Temporal and Semantic Audio-Visual Synchronization

Link: https://arxiv.org/abs/2607.00726

Authors: Tianhong Zhou,Mingyang Han,Boyu Li,Yuxuan Jiang,Jiaxin Ye,Dongxiao Wang,Haoxiang Shi,Kunpeng Wang,Jun Song,Cheng Yu,Bo Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: Audio-visual feature extraction, fundamental component, component of multimodal, multimodal understanding, understanding and generation

Notes: Accepted by Interspeech 2026

Abstract:Audio-visual feature extraction is a fundamental component of multimodal understanding and generation tasks. However, existing evaluation protocols for feature extraction models exhibit dimensional bias, typically focusing on either semantic matching or temporal offset detection. Moreover, their data construction remains coupled, preventing independent assessment of temporal and semantic consistency. We propose AV-SyncBench, the first benchmark to fully separate temporal and semantic evaluation for audio-visual synchronization. Built from in-the-wild videos, it spans Voice, Music, and Sound across 10 scenarios and 5 challenge tasks. Data are automatically filtered and manually verified to ensure on-screen sound sources. The benchmark contains 3,269 videos and 38,390 samples, and we evaluate five representative models to quantify feature quality for alignment and downstream tasks. The code and dataset are available at: this https URL.

64. 【2607.00716】Partial Skeleton Visibility for Action Recognition: A Constrained Field-of-View Approach

Link: https://arxiv.org/abs/2607.00716

Authors: Yingjie Dai,Tianyang Xu,Yanglin Deng,Xiao-Jun Wu,Josef Kittler

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: achieved remarkable success, prevailing methods overwhelmingly, methods overwhelmingly assume, overwhelmingly assume complete, clean skeleton inputs

Notes: 18 pages, 4 figures

Abstract:Skeleton-based action recognition has achieved remarkable success by exploiting joint coordinates and their topological connections, yet prevailing methods overwhelmingly assume complete and clean skeleton inputs. In real-world deployments, such as egocentric vision, crowded surveillance, wearable devices, or edge robotics, limited field-of-view (FoV) frequently causes substantial joint visibility dropout, leading to severe performance degradation that existing models are largely unprepared to handle. To bridge this critical yet underexplored gap, we introduce PartialVisGraph, a novel hypergraph framework tailored for robust skeleton action recognition under constrained FoV. We first construct highly expressive hypergraphs by introducing learnable virtual hyperedges that form a soft incidence matrix, capturing flexible high-order dependencies beyond conventional pairwise graphs. We then propose the Single-Head Sample-Adaptive Transformer, which adaptively aggregates joint features onto hyperedges while explicitly incorporating a visibility prior. This prior selectively gates information flow, preventing occluded or out-of-view joints from corrupting reliable feature propagation. We further establish rigorous evaluation protocols with realistic FoV simulation benchmarks on NTU RGB+D 60 and 120. Extensive experiments demonstrate that PartialVisGraph consistently achieves state-of-the-art accuracy under partial visibility, with gains of up to 68.8% on subsets with severe FoV restrictions compared to recent strong baselines, while remaining superior on full-visibility settings. Our approach offers a principled and practical pathway toward deployable skeleton-based action understanding in unconstrained environments.

65. 【2607.00712】Towards Memory-Efficient Autoregressive Video Generation via Instance-Specific Parametric Absorption

Link: https://arxiv.org/abs/2607.00712

Authors: Xiaomeng Fu,Jia Li,Yiming Hu,Yong Wang,Hayden Kwok-Hay So,Jiao Dai,Xiangxiang Chu,Jizhong Han

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: long video generation, video generation, powerful paradigm, paradigm for long, long video

Notes: ECCV 2026 Camera Ready

Abstract:Autoregressive (AR) streaming models have emerged as a powerful paradigm for long video generation. However, the linearly growing Key-Value (KV) cache poses a significant bottleneck, leading to memory overload and degraded inference throughput. A common compression method is to drop redundant KV tokens, which often breaks long-range dependencies, resulting in temporal flickering and identity loss. In this paper, we propose Instance-Specific Parametric Absorption (ISPA), a novel framework that shifts the KV cache compression from discarding to distilling. The core idea is to transition a subset of layers from Full-Attention (F-Layers) to memory-efficient Local-Attention (L-Layers) by "absorbing" historical context into the model's weights. Specifically, during a brief warmup phase, ISPA monitors the output discrepancy between global and local attention. At the transition point, we solve a closed-form least-squares problem to compute an instance-specific weight modulation that compensates for the missing history. Experiments across architectures (1.3B to 14B) demonstrate that ISPA can remove up to 50% of the KV cache with near-lossless visual quality. We hope this perspective encourages future work to explore parametric memory consolidation beyond external token-level cache management for streaming generative models.
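
To make the closed-form least-squares step concrete, here is a deliberately simplified stand-in, not the ISPA formulation: it fits one scalar gain per output channel so that the scaled local-attention output best matches the full-attention output recorded during warmup. The per-channel gain, the function name `fit_channel_gains`, and the toy values are all assumptions made for illustration.

```python
def fit_channel_gains(local_out, full_out, eps=1e-8):
    """Per-channel least-squares gains g_c minimizing
    sum_t (g_c * local_out[t][c] - full_out[t][c]) ** 2.

    local_out, full_out: lists of equal-length rows, one row per warmup step.
    """
    if len(local_out) != len(full_out) or not local_out:
        raise ValueError("inputs must be non-empty and of equal length")
    n_channels = len(local_out[0])
    gains = []
    for c in range(n_channels):
        # Closed-form solution of a 1-D least-squares problem.
        num = sum(l[c] * f[c] for l, f in zip(local_out, full_out))
        den = sum(l[c] * l[c] for l in local_out)
        gains.append(num / (den + eps))  # eps guards against all-zero channels
    return gains


# Toy warmup statistics: two steps, two channels (made-up numbers).
local = [[1.0, 2.0], [2.0, 4.0]]
full = [[2.0, 1.0], [4.0, 2.0]]
gains = fit_channel_gains(local, full)
print(gains)
```

A full matrix-valued modulation would replace the scalar division with a solve of the normal equations, but the principle is the same: the correction is computed once from warmup statistics rather than learned by gradient descent.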

66. 【2607.00710】Creating Impactful Autonomous Driving Datasets: A Strategic Guide from Research Gap to Benchmark

Link: https://arxiv.org/abs/2607.00710

Authors: Richard Schwarzkopf,Jonas Merkert,Frank Bieder,Annika Bätz,Alexander Blumberg,Carlos Fernandez,Felix Hauser,Fabian Immel,Christian Kinzig,Hendrik Königshof,Fabian Konstantinidis,Martin Lauer,Willi Poh,Nils Rack,Kevin Rösch,Yinzhe Shen,Marlon Steiner,Gleb Stepanov,Dominik Strutz,Ömer Şahin Taş,Julian Truetsch,Kaiwen Wang,Royden Wagner,Jan-Hendrik Pauls,Christoph Stiller

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: existing literature primarily, literature primarily describes, Well-designed autonomous driving, shaped research progress, fundamentally shaped research

Notes: Keywords: Autonomous Driving, Dataset Design, Benchmarks, Research Gap Identification. 14 pages, 3 figures

Abstract:Well-designed autonomous driving datasets have fundamentally shaped research progress, yet existing literature primarily describes what datasets contain rather than how to strategically design impactful ones. This is especially limiting for small and medium-sized labs and startups that cannot afford to misallocate scarce resources. We argue that impactful dataset creation begins with a diagnosis: whether a research question is blocked by a data problem or an evaluation problem, and proceeds by selecting the minimal data operator(s) that closes the resulting gap, recording new data only when no cheaper operator(s) suffices. We analyze the evolution of major autonomous driving (AD) datasets through this lens and distill a strategic framework spanning gap identification, operator choice, sensor suite design, and annotation strategy. We ground the framework in a running case study of our KITScenes dataset family. The datasets are available at: this https URL

67. 【2607.00696】Imprint: Online Memory Compression for Long-Horizon Egocentric QA

Link: https://arxiv.org/abs/2607.00696

Authors: Kousik Das,Debaditya Roy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Long-horizon egocentric, memory, Long-horizon egocentric question, Long-horizon, egocentric

Notes:

Abstract:Long-horizon egocentric question answering involves answering questions about events that occurred hours or days in the past. This requires memory representations that remain both retrieval-effective and scalable over days or weeks of recording. Existing long-horizon egocentric QA methods construct memory as hierarchical textual summaries of observations. While effective for reducing memory size, summarization optimizes for descriptive compression rather than retrieval: repeated interactions are absorbed into coarse textual descriptions instead of being preserved as explicit, recurring memory units, making long-horizon evidence aggregation difficult. We propose Imprint, an interaction-centric memory framework that formulates long-horizon egocentric memory as an online memory compression problem rather than summarization. Incoming observations are first represented as structured Interaction Records and continuously organized into recurring interaction patterns. Using human memory consolidation signals of recurrence, recency, and distinctiveness, Imprint selectively retains and compresses interactions into a compact retrieval-oriented memory. We evaluate Imprint on EgoLifeQA, a seven-day egocentric benchmark containing questions that require reasoning over interactions occurring hours to days before the query. With the same LLM, Imprint improves QA accuracy from 31.0% to 35.8%, increases evidence-grounded answers by $6\times$ compared with EgoRAG, reduces memory footprint by $2.3\times$, and decreases retrieval latency by $11.8\times$. These results demonstrate that memory compression provides a scalable and retrieval-effective foundation for long-horizon egocentric question answering.

68. 【2607.00687】LUMA: Benchmarking Segmentation via a Lightweight Universal Mask Adapter

Link: https://arxiv.org/abs/2607.00687

Authors: Tobias Christian Nauen,Anosh Billimoria,Federico Raue,Stanislav Frolov,Brian B. Moser,Andreas Dengel

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)

Keywords: Comparing transformer backbones, reported differences rarely, differences rarely reflect, Universal Mask Adapter, Comparing transformer

Notes:

Abstract:Comparing transformer backbones for image segmentation is confounded: each is paired with a different decoder, recipe, and pretraining, so reported differences rarely reflect the backbone itself. We introduce the Lightweight Universal Mask Adapter (LUMA), a lightweight, backbone-agnostic mask-transformer head that treats any backbone as a black-box feature extractor, letting a set of queries read from its features through cheap cross-attention. LUMA matches the accuracy of EoMT, the state-of-the-art efficient ViT-segmenter, at lower cost, while attaching unchanged to isotropic, hierarchical, convolutional, and mixture-of-experts backbones alike. Holding this head fixed, we benchmark 20 backbones, 11 pretraining schemes and a range of resolutions on ADE20K and Cityscapes under one modern recipe. We find that "efficient" token mixers fail to deliver efficiency even at the high resolutions that motivate them, with plain ViT holding the throughput Pareto-front at every resolution. Additionally, the pretraining objective, not the architecture, the lever the field has tuned hardest, governs segmentation quality.

69. 【2607.00678】ABot-M0.5: Unified Mobility-and-Manipulation World Action Model

Link: https://arxiv.org/abs/2607.00678

Authors: Ronghan Chen,Yandan Yang,Zuojin Tang,Dongjie Huo,Tong Lin,Haoning Wu,Haoyun Liu,Yuzhi Chen,Lulu Zheng,Botai Yuan,Tianlun Li,Mingxin Wang,Dekang Qi,Bin Hu,Wei Mei,Yuze Xuan,Haolong Yang,Yanqing Zhu,Mu Xu,Zhiheng Ma,Xinyuan Chang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: embodied learning methods, current embodied learning, World Action Models, general-purpose robots, learning methods

Notes: Code: [this https URL](https://github.com/amap-cvlab/ABot-Manipulation)

Abstract:Mobile manipulation is a key capability for general-purpose robots, yet remains challenging for current embodied learning methods. VLA policies are typically reactive and lack explicit world modeling, while existing World Action Models (WAMs) are still poorly aligned with the structure of mobile manipulation: they operate on coarse video chunks, model entangled navigation-manipulation actions, and train inverse dynamics under supervision that does not match autoregressive inference. As a result, they often miss fine-grained contact dynamics, suffer from action-distribution conflicts, and accumulate errors over long-horizon rollouts. We propose ABot-M0.5, a new WAM built on the insight that mobile manipulation requires alignment at three levels: temporal granularity, action space, and train-test consistency. To align temporal granularity, we introduce intermediate latent actions that capture local visual state transitions and serve as a bridging action space between video latents and embodiment-specific controls. To align action space, we design a dual-level Mixture-of-Transformers architecture that disentangles both modality representations and heterogeneous action subspaces such as base movement and arm manipulation. To align inference conditions, we propose the dream-forcing training strategy that progressively trains inverse dynamics on model-predicted videos, improving train-test alignment and robustness during autoregressive prediction. Experiments on challenging mobile and fine-grained manipulation benchmarks demonstrate that ABot-M0.5 achieves state-of-the-art performance in both long-horizon task success and fine-grained control accuracy. These results highlight the critical importance of granularity-aligned, action-disentangled, and inference-consistent world-action modeling.

70. 【2607.00672】DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

Link: https://arxiv.org/abs/2607.00672

Authors: Zhengbo Zhang,Mark He Huang,Zhigang Tu,Ming-Hsuan Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video temporal grounding, natural language queries, Zero-shot video temporal, task-specific training, untrimmed videos

Notes: Accepted to the European Conference on Computer Vision (ECCV) 2026

Abstract:Zero-shot video temporal grounding (VTG) localizes events in untrimmed videos from natural language queries without task-specific training. Existing methods rely on frame-query feature matching, which suffices for simple events but struggles with complex multi-stage queries that require understanding temporal ordering and causal structure -- a disparity we call the reasoning gap. We propose DART (Difficulty-Adaptive Routing for Temporal Grounding), which bridges this gap by coupling difficulty-aware routing with structured reasoning in large vision-language models. A query-conditioned Determinantal Point Process (DPP) serves a dual role: selecting diverse, query-relevant keyframes as temporal evidence, and providing spectral entropy as a difficulty indicator. Simple queries are routed to a Fast path for direct prediction, while complex queries follow a Slow path with Temporal Markup Prompting, which decomposes localization into global event analysis, per-frame temporal role annotation, and boundary extraction. On Charades-STA and ActivityNet Captions, DART achieves state-of-the-art zero-shot performance across both identically distributed and multiple out-of-distribution settings, improving mIoU by up to 3.5 points over the strongest baseline while using over 7 times fewer frames. The project homepage is available at this https URL.
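
The routing idea can be sketched as follows, assuming the eigenvalues of the query-conditioned DPP kernel are already available. Treating higher spectral entropy as a sign of a harder query, the 0.5 threshold, and the function names `spectral_entropy` and `route` are assumptions for illustration; the abstract does not specify these details.

```python
import math


def spectral_entropy(eigenvalues):
    """Normalized Shannon entropy of a kernel's eigenvalue spectrum, in [0, 1]."""
    positive = [v for v in eigenvalues if v > 0]
    if not positive:
        raise ValueError("at least one positive eigenvalue is required")
    total = sum(positive)
    probs = [v / total for v in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    # Divide by the maximum possible entropy so the score is scale-free.
    max_entropy = math.log(len(probs)) if len(probs) > 1 else 1.0
    return entropy / max_entropy


def route(eigenvalues, threshold=0.5):
    """Send flat (high-entropy) spectra to the slow path, peaked ones to the fast path."""
    return "slow" if spectral_entropy(eigenvalues) > threshold else "fast"


# A spectrum dominated by one mode versus a flat spectrum (made-up values).
peaked_path = route([10.0, 0.1, 0.1, 0.1])
flat_path = route([1.0, 1.0, 1.0, 1.0])
print(peaked_path, flat_path)
```

The appeal of such a score is that it comes for free from the same kernel used for keyframe selection, so routing adds no extra model call.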

71. 【2607.00666】Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts

Link: https://arxiv.org/abs/2607.00666

Authors: Taewook Kang,Taeheon Kim,Donghyun Shin,Jonghyun Choi

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: similar robot, camera pose, DART, learned tasks, environmental shifts

Notes: ECCV 2026. Project page: [this https URL](https://twkang43.github.io/projects/dart)

Abstract:Vision-Language-Action (VLA) models often fail to perform the same learned tasks under environmental shifts, such as changes in camera pose and shifts to a different but similar robot (e.g., from Panda to UR5e). Adapting these models to the shifted environment (i.e., target domain) often requires training on multiple demonstrations for each task, which are costly to collect. To reduce the burden of data curation and training, we propose an analogy-based method that adapts VLA models under environmental shifts through weight vector arithmetic with domain-specific information addition, named Domain ARiThmetic (DART). Unlike prior approaches, DART requires collecting only a single demonstration, enabling efficient adaptation. To accurately isolate domain-specific information for addition, DART performs subspace alignment between singular components in weight vectors to filter out noisy components. In both simulated and real-world experiments, DART outperforms existing VLA adaptation methods in one-shot scenarios across diverse visual and embodiment shifts. Code is available at this https URL.
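
Weight-vector arithmetic of this kind can be sketched in a few lines. The example below is a generic illustration and omits the subspace alignment step entirely: it forms a domain vector as the difference between weights adapted to the target domain and the corresponding source-domain weights, then adds it to the policy to be adapted. The function name `domain_arithmetic`, the `scale` parameter, and the dictionary-of-lists weight format are hypothetical.

```python
def domain_arithmetic(policy_weights, source_weights, target_weights, scale=1.0):
    """Add a domain vector (target - source) to each parameter of a policy.

    All three arguments map parameter names to equal-length lists of floats.
    """
    adapted = {}
    for name, weights in policy_weights.items():
        # Direction in weight space that encodes the environmental shift.
        delta = [t - s for t, s in zip(target_weights[name], source_weights[name])]
        adapted[name] = [w + scale * d for w, d in zip(weights, delta)]
    return adapted


# Toy single-layer example with made-up numbers.
policy = {"layer": [1.0, 2.0]}
source = {"layer": [0.5, 0.5]}
target = {"layer": [1.0, 0.0]}
adapted = domain_arithmetic(policy, source, target)
print(adapted)
```

Since the domain vector is computed once and reused, a scheme like this needs no per-task training in the target domain, which is what makes adaptation from a single demonstration plausible.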

72. 【2607.00654】Linguistic Relative Policy Optimization for Video Anomaly Reasoning

Link: https://arxiv.org/abs/2607.00654

Authors: Jiaxu Leng,Jiankang Zheng,Mengjingcheng Mo,Zhanjie Wu,Haosheng Chen,Ji Gan,Xinbo Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video anomaly detection, shown strong potential, multimodal large language, Linguistic Relative Policy, large language models

Notes: Accepted at ICML 2026; 18 pages, 8 figures, 9 tables

Abstract:Video anomaly detection (VAD) with multimodal large language models has shown strong potential, yet most existing methods still depend on large-scale annotations or expert-designed priors, limiting their ability to acquire anomaly knowledge with as little human intervention as possible. To address this, we propose Linguistic Relative Policy Optimization (LRPO), which distills group-relative semantic advantages from multiple reasoning trajectories into a linguistically expressed anomaly experience prior, and adapts the model by injecting this prior into the context to steer its output distribution without any parameter updates. LRPO builds two complementary experience representations: general experience captures transferable anomaly preferences across scenarios, while scenario experience models context-dependent anomaly rules for targeted refinement. To further improve the learned experience, we introduce an anomaly alignment reward that guides trajectory optimization to match human risk preferences and reinforce temporally grounded reasoning. Extensive experiments on XD-Violence, UCF-Crime, and UBnormal demonstrate that LRPO significantly outperforms existing state-of-the-art methods under tuning-free settings.

73. 【2607.00647】Not All Prediction Targets Keep Training-Free Diffusion Guidance on the Manifold

Link: https://arxiv.org/abs/2607.00647

Authors: Yunsung Lee,Hyeongmin Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pretrained diffusion model, steers a pretrained, attribute at inference, pretrained diffusion, desired attribute

Notes: Accepted to ECCV 2026. 15-page main paper with appendix (48 pages total, 14 figures). Project page: [this https URL](https://manluml.github.io/on-manifold-tfg)

Abstract:Training-free guidance (TFG) steers a pretrained diffusion model toward a desired attribute at inference. To be effective, this guidance must be applied from the earliest, high-noise steps of sampling. Because its objective (a classifier or energy) is defined on clean images, $\epsilon$- and $v$-prediction models must first estimate the clean image $\hat{x}$ from the noisy state at each step, and the accuracy of that estimate determines how easily guidance drifts off the data manifold. $x$-prediction, a recent alternative, outputs the clean image directly, removing this source of error even at high noise. This is our motivation. We provide a theoretical analysis of how each prediction target shapes this accuracy, and introduce guided-class FID (Child FID), a metric that exposes the manifold damage standard evaluation misses. Experiments on a new fine-grained bird benchmark and on style transfer confirm that $x$-prediction keeps guided samples on the manifold most reliably, making it the strongest foundation for training-free guidance. Code is available at this https URL

74. 【2607.00638】Uncertainty-aware tree height change regression

Link: https://arxiv.org/abs/2607.00638

Authors: Max Gaber,Dimitri Gominski,Jaime C. Revenga,Stefan Oehmcke,Rasmus Fensholt,Martin Brandt

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Monitoring canopy height, understanding carbon sinks, Geospatial Foundation Models, canopy height change, Monitoring canopy

Notes:

Abstract:Monitoring canopy height change is essential for understanding carbon sinks and forest dynamics. Remote sensing enables consistent, large-scale observations of such changes, increasingly integrated with deep learning architectures such as Geospatial Foundation Models (GFMs). However, existing methods and datasets frame the problem as binary change detection, which overlooks both the continuous nature of change, especially for vegetation, and the inherent uncertainty in labels. We present the Canopy Height Change (CHC) dataset, providing 3 $\mathrm{m}$ resolution continuous canopy height differences and associated spatially resolved uncertainties across 10598 $\mathrm{km}^2$ of northern and western Spain. The dataset is paired with a co-located time series of PlanetScope satellite imagery. Based on the dataset, we introduce the task of uncertainty-aware change regression, associated metrics and strategies for fine-tuning GFMs. Furthermore, we evaluate state-of-the-art GFMs and highlight promising directions and remaining challenges for advancing continuous canopy height change estimation.

75. 【2607.00622】Learning to Watch: Active Video Anomaly Understanding via Interleaved Policy Optimization

Link: https://arxiv.org/abs/2607.00622

Authors: Mengjingcheng Mo,Jiaxu Leng,Xinbo Gao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video anomaly understanding, relies on sparse, context-dependent cues, Video anomaly, Direct Preference Optimization

Notes: Accepted at ICML 2026; 25 pages, 8 figures, 15 tables

Abstract:Video anomaly understanding (VAU) relies on sparse, context-dependent cues. However, existing passive paradigms suffer from observational aliasing, where static sampling fails to disambiguate semantically distinct events. To overcome this, we propose $Anom\text{-}\pi$, a closed-loop framework that reconceptualizes video understanding as an active sequential decision-making process within a dynamic environment. Inspired by human video-reviewing behavior, this framework unifies internal cognitive reasoning and strategic evidence acquisition into an interleaved policy, utilizing temporal atomic operators such as local backtracking, temporal expansion, and fine-grained sampling to endow the model with perceptual proactivity. To learn such complex interaction strategies under video-level weak supervision, we design Interactive Direct Preference Optimization (iDPO) to achieve trajectory-level policy alignment, guided by an Active Evidence Inquiry (AEI) utility that balances task success, informative evidence acquisition, and interaction cost. This approach enables the agent to learn to actively disambiguate hypotheses while suppressing redundant exploration. Extensive experiments demonstrate that our framework, with only 2B parameters, achieves highly competitive performance, significantly outperforming state-of-the-art large-scale VAU models in complex scenarios.

76. 【2607.00620】Identifying Latent Concepts and Structures for Generalized Category Discovery

Link: https://arxiv.org/abs/2607.00620

Authors: Boyang Dai, Chaoqi Chen, Yizhou Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Generalized Category Discovery, Generalized Category, Category Discovery, aims to recognize, recognize known classes

Comments: This paper has been accepted by ICML 2026

Abstract:Generalized Category Discovery (GCD) aims to recognize known classes while autonomously discovering novel ones in open-world settings. However, current approaches primarily focus on designing clustering objectives, often overlooking a critical bottleneck: standard vision backbones yield high-rank, entangled token representations that are ill-suited for unsupervised discovery of latent concepts and structures. In this paper, we propose Compositional Primitive Fields (CPF-GCD), a novel representation learning framework that reshapes the feature space to make such latent structure identifiable by enforcing a low-rank compositional organization. Our core hypothesis is that all categories, whether known or novel, can be expressed as compositions and spatial arrangements of a finite set of learnable visual primitives that capture reusable concepts. CPF instantiates this geometric constraint via a spatial field mechanism. Inserted between the backbone and the head, it rewrites noisy patch tokens through low-rank primitive mixtures, effectively decomposing images into reusable atomic parts and their spatial layouts. By explicitly modeling the spatial distribution of primitives, CPF enables novel categories to emerge naturally as new activation patterns over a shared vocabulary. This shifts the focus of representation from merely partitioning global embeddings to constructing a structured and separable primitive field. Extensive experiments demonstrate that CPF serves as a generic, plug-and-play module that consistently boosts performance across diverse GCD baselines, validating that identifying and leveraging low-rank compositional structure is a crucial inductive bias for open-world recognition.

77. 【2607.00609】Diffusion-Based Multi-Class Normality for OOD Detection: An Application to CDP Authentication

Link: https://arxiv.org/abs/2607.00609

Authors: Bolutife Atoki (imagine, LIRIS), Iuliia Tkachenko (imagine, LIRIS), Bertrand Kerautret (imagine, LIRIS), Carlos Crispim-Junior (imagine, LIRIS)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstruction-based generative models, produce comparable anomaly, normality modelling requires, capture multiple in-distribution, multiple in-distribution manifolds

Comments: IEEE International Conference on Advanced Visual And Signal-Based Systems, Aug 2026, Lecce, Italy

Abstract:Reconstruction-based generative models offer a natural framework for unsupervised out-of-distribution (OOD) detection, but multi-class normality modelling requires a single detector to capture multiple in-distribution manifolds and produce comparable anomaly scores across classes. We study this problem in copy detection pattern (CDP) authentication, where authentic and counterfeit samples are visually similar but differ in subtle printing-and-digitisation (P&D) signatures. We propose a diffusion-based multi-class normality framework in which a single class-conditional ControlNet is trained exclusively on authentic CDPs from multiple P&D classes and detects counterfeits through reconstruction error under authentic-class conditioning. We further introduce dual template masking, which hides complementary regions of the input template and scores only withheld pixels, reducing reliance on visible binary structure. On the Indigo 1 x 1 Base dataset, the proposed method outperforms traditional and adapted generative baselines under multi-class authentic-versus-counterfeit evaluation, without using counterfeit samples for training or threshold calibration.

78. 【2607.00606】Retrieved Images as Visual Thought: Training-Free Multimodal In-Context Learning for the Open-vs-Closed Gap

Link: https://arxiv.org/abs/2607.00606

Authors: Bingchen Huang, Zhiling Wang, Yifu Chen, Yuanchao Du

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: invokes external tools, expensive training pipeline, model invokes external, Images makes vision, Recent work

Comments: 12 pages, 6 figures. Includes appendix. Introduces the MAAC-Bench benchmark

Abstract:Recent work on Thinking with Images makes vision a dynamic part of reasoning, but does so through generation: the model invokes external tools, synthesizes code, or imagines new imagery, each at the cost of a tool protocol, brittle code, or an expensive training pipeline. A fourth route makes vision dynamic without generating anything, by retrieving labeled exemplar images and reasoning over them, yet it remains underexplored despite being train-free. We present ReVisIT, a train-free framework that realizes this retrieval-based route by treating each retrieved image-label pair as a unit of visual thought. ReVisIT combines structured class definitions, per-query multimodal retrieval of exemplars, and alternating user/assistant injection of those exemplars before joint multi-attribute decoding, and degrades gracefully to whichever components a task admits. On VL-ICL Bench Fast Open MiniImageNet, Qwen3-VL-30B-A3B with ReVisIT reaches 98.5% at 4-shot, statistically indistinguishable from the 72B LLaVA-OneVision SOTA (98.7%) on this near-saturated task at about 1/2.4 the parameters, while the same backbone without the scaffold sits at chance. The turns layer alone adds 26.1 points to GPT-4.1 on free-form concept induction (Bongard-OpenWorld), and the full stack yields a 4-6 point macro gain across three backbones on MAAC-Bench, a new license-clean 27-class, 5-attribute benchmark, significant by paired bootstrap on the curator-derived attributes. Component analysis shows that retrieval-plus-turns is the universal lever while structured definitions are need-adaptive, and that 83% of the retrieval gain comes from retrieval quality rather than from the presence of exemplars. MAAC-Bench is released with a rubric-grounded LLM verification protocol that replaces author spot-check on subjective attributes.

79. 【2607.00596】Semantic-Guided Reading Order Reconstruction in Historical Armenian Newspapers with LLMs

Link: https://arxiv.org/abs/2607.00596

Authors: Chahan Vidal-Gorène (CJM, LIPN), Nadi Tomeh (LIPN), Victoria Khurshudyan (Inalco, SeDyL)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: limited language resources, paper addresses reading, addresses reading order, reading order reconstruction, combine complex layouts

Comments: International Conference on Pattern Recognition, 2026, Lyon, France

Abstract:This paper addresses reading order reconstruction in historical Armenian newspapers, which combine complex layouts with limited language resources. We introduce a new annotated dataset of 66 pages and compare geometric heuristics, YOLO-based layout parsing, an end-to-end document model ECLAIR, and a hybrid method combining semantic zone detection with a generative LLM. Our hybrid method achieves the lowest error rates of all evaluated approaches, reducing ordering errors by up to 76% over the strongest geometric baseline, and remains robust in multi-page settings and under noisy OCR. Rather than targeting production, the method is designed as a data bootstrapping strategy enabling rapid annotation in highly under-resourced scenarios. Alongside the dataset, we release a specialized Tesseract OCR model for historical Armenian print.

80. 【2607.00595】GADA: Geometry-Aware Deformable Aggregation for Image-Based Gaussian Splatting

Link: https://arxiv.org/abs/2607.00595

Authors: Siwoo Lim, Sunjae Yoon, Gwanhyeong Koo, Chang D. Yoo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved significant improvements, incorporating warping-based techniques, achieved significant, significant improvements, improvements by incorporating

Comments: ICML 2025

Abstract:Gaussian Splatting has achieved significant improvements by incorporating warping-based techniques. However, such methods suffer from pixel-level inaccuracies due to uncertain geometry. This uncertainty leads to spatial misalignments in the warped images, which disrupt residual learning used in warping-based methods and fundamentally limit the gains of correction, particularly on thin structures and high-frequency details. Driven by our insight that useful visual cues are not lost but locally preserved under slight displacement, we propose Geometry-Aware Deformable Aggregation (GADA). This method introduces an iterative refinement module with deformable offsets to actively correct spatial misalignments and recover these displaced cues. Furthermore, to address the limitations of standard pipelines where visibility checks (i.e., thresholding) often discard valid pixels and multi-view warped image fusion relies on naive mean aggregation, our module is coupled with an implicit confidence weighting mechanism that selectively suppresses unreliable evidence. Consequently, our approach outperforms prior warping-based Gaussian Splatting, preserving high-frequency quality while achieving 2.13 times faster FPS.

81. 【2607.00580】Active Spatial Guidance: Eliminating Injected Positional Mechanisms in Vision Transformers

Link: https://arxiv.org/abs/2607.00580

Authors: Cong Liu, Xiaofang Li, Simon X. Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Transformers, self-attention permutation invariance, address self-attention permutation, commonly rely, permutation invariance

Abstract:Vision Transformers (ViTs) commonly rely on injected positional mechanisms to address self-attention's permutation invariance. Motivated by the spatial regularities of natural images, we ask whether spatial organization can be induced from data rather than explicitly injected. Under controlled, matched from-scratch training, we propose Active Spatial Guidance (Guidance), a training-only objective that disables positional injection and applies an auxiliary 2D coordinate-regression loss to the final-layer patch tokens. The guidance head is used only during training and removed for inference; the deployed model consists of a positional-injection-free ViT encoder and the task-specific prediction module. Using DINOv3 ViT backbones, Guidance consistently improves performance on ImageNet-100 classification, ADE20K semantic segmentation, and Hypersim monocular depth estimation, outperforming strong injected baselines such as learned absolute positional embeddings and rotary positional embeddings under identical training protocols. On ImageNet-100, broader comparisons against representative injected positional designs further support Guidance's effectiveness. Guidance also improves robustness under resolution transfer, and multi-resolution training further strengthens accuracy across input sizes. Overall, our results suggest that spatial inductive bias in ViTs need not be architecturally injected, but can be shaped through training-time supervision. The code used for training and evaluation is publicly available in this https URL.
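
To make the auxiliary objective in this abstract concrete, here is a minimal sketch of a 2D coordinate-regression target and loss over a patch grid. It is not the authors' implementation: the function names `patch_coordinates` and `coordinate_loss`, the normalisation of patch centres to [0, 1], and the plain mean-squared-error form are all assumptions for illustration, and the regression head that would produce the predictions from final-layer patch tokens is left out.

```python
from typing import List, Tuple

Coord = Tuple[float, float]


def patch_coordinates(rows: int, cols: int) -> List[Coord]:
    """Normalised (y, x) centre of every patch, row-major, each value in (0, 1)."""
    return [((r + 0.5) / rows, (c + 0.5) / cols)
            for r in range(rows) for c in range(cols)]


def coordinate_loss(pred: List[Coord], target: List[Coord]) -> float:
    """Mean squared error over all predicted coordinates (assumed loss form)."""
    total = sum((py - ty) ** 2 + (px - tx) ** 2
                for (py, px), (ty, tx) in zip(pred, target))
    return total / (2 * len(target))


# Targets for a hypothetical 2x2 patch grid.
targets = patch_coordinates(2, 2)

# A head that always predicts the image centre is penalised ...
centre_only = [(0.5, 0.5)] * len(targets)
loss_centre = coordinate_loss(centre_only, targets)

# ... while a head that recovers every patch position is not.
loss_perfect = coordinate_loss(targets, targets)
```

In a setup like the one described, such a loss would only be active during training, so dropping the head at inference leaves the encoder unchanged.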

82. 【2607.00579】EPO: Boosting 3D Foundation Models with Edge-based Pose Optimization

Link: https://arxiv.org/abs/2607.00579

Authors: Mattia D'Urso, Christian Sormann, Mattia Rossi, Friedrich Fraundorfer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Edge-based Pose Optimization, Edge-based Pose, Foundation Models, framework specifically designed, Pose Optimization

Comments: Accepted at ECCV 2026

Abstract:We introduce Edge-based Pose Optimization (EPO), a trackless geometric optimization framework specifically designed to boost the Structure-from-Motion reconstructions generated by 3D Foundation Models. These models achieve rapid inference by bypassing the time-consuming feature extraction and matching stages of traditional pipelines, where explicit correspondences between each 3D point and multiple images, referred to as tracks, are established. However, their geometric accuracy currently falls short of traditional pipelines. While this can be addressed in a post-processing step via Bundle Adjustment-like refinement, doing so requires extracting feature tracks, thus defeating the original speed advantage. Instead, our fully differentiable framework uses edge map alignment as a proxy for geometric optimization, avoiding feature extraction and track construction entirely. Through extensive evaluation across multiple datasets and tasks, we demonstrate that EPO matches or outperforms Bundle Adjustment-like methods while requiring significantly lower runtime and memory. Notably, its reduced memory footprint makes EPO suitable for consumer-grade hardware, where competing refinement methods cannot run.

83. 【2607.00578】Caption Bottleneck Models

Link: https://arxiv.org/abs/2607.00578

Authors: Seref Baris Cagliyan, Umut Ozdemir, Merve Tapli, Emre Akbas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Bottleneck Models, routing predictions, Concept Bottleneck Models, Caption Bottleneck Models, Concept

Comments: Accepted to ECCV 2026

Abstract:Concept Bottleneck Models (CBMs) provide interpretability by routing predictions through a layer of human-understandable concepts. However, defining an optimal concept set for a specific dataset remains an open challenge. Existing approaches rely on expensive expert annotations or LLM-generated lists based solely on class names. Even "open-vocabulary" variants typically depend on static concept sets, which restrict discovery and introduce label bias. Furthermore, traditional CBMs often suffer from information leakage, where unmodeled visual features bypass the bottleneck and compromise the integrity of the explanations. To overcome these limitations, we propose Caption Bottleneck Models (CaBM), a framework that circumvents the need for predefined concept sets by replacing rigid concept layers with free-form natural language. By representing images via LMM-generated captions and training a classifier strictly on this text, CaBM ensures a leakage-free architecture by construction. Additionally, by analyzing the text classifier post-training, CaBM autonomously discovers high-quality, dataset-specific concepts. Our results across fine- and coarse-grained benchmarks demonstrate that CaBM achieves competitive accuracy while preserving interpretability without the constraints of external dictionaries or manual labeling.

84. 【2607.00573】BrainFIBRE: A Foundation Model via Information Decomposition for Brain Microstructure

Link: https://arxiv.org/abs/2607.00573

Authors: Zijian Dong, Yi Lin, Ji Fang, Jianxiong Zhou, Kwun Kei Ng, Juan Helen Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion MRI probes, MRI probes brain, Diffusion MRI, MRI probes, orientation dispersion index

Comments: ECCV 2026. The first three authors contributed equally

Abstract:Diffusion MRI probes brain microstructure with particular sensitivity to early cerebrovascular and neurodegenerative changes. Neurite Orientation Dispersion and Density Imaging (NODDI) decomposes the diffusion signal into three biophysically interpretable maps: neurite density index (NDI), orientation dispersion index (ODI), and free water fraction (FWF), capturing neurite packing, fiber coherence, and extracellular fluid. These 3D maps offer a rich substrate for transferable microstructural representations, yet integrating them is challenging: standard representation learning struggles to disentangle the unique information in each map from their shared and synergistic interactions. We present BrainFIBRE, the first foundation model for brain microstructure, pretrained on NODDI-derived maps from 55,592 UK Biobank participants. We propose Self-supervised Partial Information Decomposition (SPID), which extends PID-guided multimodal learning to the self-supervised regime for the first time. A novel Counterfactual Candidate Construction (CCC) paradigm perturbs inter-modality alignment through modality dropping and swapping, providing the contrastive signal for a Mixture-of-Experts architecture to disentangle unique, synergistic, and redundant information without any downstream label. On both Caucasian and Asian cohorts, BrainFIBRE achieves state-of-the-art performance across diverse tasks predicting age, sex, cerebrovascular and neurodegenerative markers, and cognition, while yielding neurobiologically interpretable representations that reveal task- and cohort-specific interaction patterns. BrainFIBRE establishes a versatile foundation for neuroimaging analysis at the microstructural level.

85. 【2607.00547】EgoGapBench: Benchmarking Egocentric Action Selection in Multi-Agent Scenes

Link: https://arxiv.org/abs/2607.00547

Authors: Jihyeok Jung (1), Jeewu Lee (2), Sanghyeop Kim (2), Chanhee Han (3), Seong Joon Oh (1) ((1) KAIST AI, (2) Sogang University, (3) Ministry of Science and ICT)

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: egocentric, Egocentric Action Selection, evaluate egocentric perspective, primarily constructed, action selection

Comments: 15 pages, 2 figures, 8 tables. Code and benchmark are available at [this https URL](https://github.com/jhCOR/EgoGapBench)

Abstract:Existing egocentric benchmarks have primarily constructed the egocentric setting from first-person-view data, which makes it difficult to evaluate egocentric perspective itself in isolation. However, understanding first-person-view input and taking an egocentric perspective are separable abilities, especially when first-person body cues are absent or when other agents are present. To isolate egocentric perspective understanding, we introduce EgoGapBench, a diagnostic benchmark for measuring action selection in multi-agent egocentric scenes. We define the ability measured by this benchmark as Egocentric Action Selection (EAS): selecting an appropriate action from the agent's perspective in the presence of other agents. On EgoGapBench, humans answer reliably, whereas both open-source and proprietary MLLMs perform substantially worse and systematically select actions performed by other visible agents. Fine-tuning on existing egocentric data fails to close this gap and can even be detrimental. In contrast, fine-tuning on EgoGapBench training data improves accuracy but does not reach human performance. These results show that EAS is difficult to acquire from first-person-view data alone, and that MLLMs should be evaluated and trained not only for scene understanding but also for egocentric action selection.

86. 【2607.00545】ECoSim: Data Efficient Fine-Tuning for Controllable Traffic Simulation

Link: https://arxiv.org/abs/2607.00545

Authors: Yu-Hsiang Chen, Wei-Jer Chang, Yi-Ting Chen, Masayoshi Tomizuka

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: require retraining large, extensive annotated data, autonomous driving systems, retraining large generative, testing autonomous driving

Comments: European Conference on Computer Vision (ECCV) 2026

Abstract:Controllable traffic simulation is critical for testing autonomous driving systems, yet existing approaches often require retraining large generative models with extensive annotated data. We introduce a lightweight control adaptation framework that enables multi-modal controllability (sketch, latent behavior codes, and text) for pretrained state-of-the-art diffusion and autoregressive traffic models. By modulating intermediate features through identity-initialized FiLM layers, our method efficiently adds new control modalities while preserving the base model's generative prior. Evaluated on Waymo Open Sim Agents Challenge, our approach demonstrates strong controllability with less than 1% of the paired control data. Through context-aware condition transfer, our framework enables counterfactual scenario generation and long-tail synthesis while maintaining stable closed-loop driving realism and safety. Our framework unlocks new possibilities for controllable traffic simulation, enabling targeted scenario generation through lightweight adaptation of pretrained generative models. Project page: this https URL
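
The phrase "identity-initialized FiLM layers" can be illustrated with a small sketch. FiLM (feature-wise linear modulation) scales and shifts each feature using values computed from a conditioning vector; initialising the projections to zero makes the layer start as the identity, so a pretrained model's outputs are unchanged before any adaptation. The class name `IdentityInitFiLM`, the linear projections without bias, and the toy dimensions below are assumptions, not details from the paper.

```python
from typing import List


class IdentityInitFiLM:
    """Feature-wise modulation: y_i = (1 + dgamma_i(c)) * x_i + beta_i(c)."""

    def __init__(self, cond_dim: int, feat_dim: int) -> None:
        # Zero-initialised projections: dgamma = 0 and beta = 0 at the start,
        # so the layer initially returns its input unchanged.
        self.w_gamma = [[0.0] * cond_dim for _ in range(feat_dim)]
        self.w_beta = [[0.0] * cond_dim for _ in range(feat_dim)]

    def __call__(self, x: List[float], cond: List[float]) -> List[float]:
        out = []
        for i, xi in enumerate(x):
            dgamma = sum(w * c for w, c in zip(self.w_gamma[i], cond))
            beta = sum(w * c for w, c in zip(self.w_beta[i], cond))
            out.append((1.0 + dgamma) * xi + beta)
        return out


film = IdentityInitFiLM(cond_dim=3, feat_dim=2)
features = [0.5, -2.0]          # hypothetical intermediate features
control = [1.0, 0.0, -1.0]      # hypothetical encoded control signal
unchanged = film(features, control)
```

Once training moves the projection weights away from zero, the control signal starts to modulate the features, while the base model's own weights can stay frozen.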

87. 【2607.00544】GEAR-Seg: A Grounded Explainable Agent for Reasoning Segmentation and Data Engine

Link: https://arxiv.org/abs/2607.00544

Authors: Yanan Wang, Wen Li, Yibin Ying, Zhenghao Fei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires localizing targets, localizing targets based, segmentation requires localizing, Grounded Explainable Agent, requires localizing

Comments: 21 pages, 8 figures

Abstract:Reasoning segmentation requires localizing targets based on complex, implicit queries. Current end-to-end models typically entangle perception and deduction into an opaque black box, severely limiting interpretability and scalability. To address this, we propose GEAR-Seg (Grounded Explainable Agent for Reasoning Segmentation), an explicitly decoupled agent that shifts the paradigm by translating visual pixels into dense, attribute-rich text. By decoupling class-agnostic segmentation, semantic description, and Large Language Model (LLM) deduction, GEAR-Seg transforms implicit reasoning into an explicit, trackable logic chain. As a zero-shot inference framework, it achieves highly competitive performance across diverse reasoning and fine-grained referring segmentation benchmarks. Furthermore, GEAR-Seg inherently functions as a highly scalable data engine. Utilizing this engine, we construct GEAR-131K, a massive benchmark (over 38k images, 656k QA-mask pairs) introducing a multifaceted taxonomy tailored for complex real-world manipulation-oriented reasoning. Finally, distillation experiments demonstrate that lightweight models supervised exclusively by our automated pipeline closely match the upper-bound performance of costly human-annotated baselines.

88. 【2607.00535】Flow-Map GRPO: Reinforcement Learning for Few-Step Flow-Map Generators via Anchored Stochastic Composition

Link: https://arxiv.org/abs/2607.00535

Authors: Zhiqi Li, Wen Zhang, Bo Zhu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: accelerate sampling, noise and data, learning long-range transport, long-range transport maps, Few-step flow-map generators

Comments: 31 pages, 29 figures

Abstract:Few-step flow-map generators, such as consistency models and MeanFlow, accelerate sampling by directly learning long-range transport maps between noise and data. However, these models are typically deterministic, which makes them difficult to optimize with reinforcement learning (RL) post-training methods that require stochastic trajectories and well-defined likelihood ratios. Existing SDE-based stochasticization techniques are designed for velocity-based samplers with infinitesimal or finely discretized transitions, and therefore do not directly apply to long-range flow maps. In this work, we propose Flow-Map GRPO, an online RL post-training framework for deterministic few-step flow-map generators. The key component is Anchored Stochastic Flow Map Composition (ASFMC), a path-preserving stochasticization mechanism that introduces randomness through anchor-based conditional resampling while preserving the original marginal probability path of the deterministic flow map. We derive GRPO objectives for both single-time and two-time flow-map parameterizations. Experiments on few-step FLUX-based text-to-image generators, including MeanFlow and sCM, show that Flow-Map GRPO improves pretrained deterministic flow-map models across reward-based, perceptual, and task-level evaluation metrics. Our results demonstrate that deterministic few-step flow-map generators can be effectively aligned with RL post-training without modifying their original model parameterization or retraining them as native stochastic models.

89. 【2607.00529】NoPA: Non-Parametric Online 3D Scene Graph Generation

Link: https://arxiv.org/abs/2607.00529

Authors: Qi Xun Yeo, Seungjun Lee, Yan Li, Gim Hee Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generate intermediate point-cloud, generation approaches fail, scene graph generation, intermediate point-cloud representations, graph generation approaches

Comments: This paper has been accepted in ECCV 26

Abstract:Classic 3D scene graph generation approaches fail to work in real-time due to the heavy computational cost of environment mapping and the need to generate intermediate point-cloud representations. To alleviate this issue, a recent work eschews point clouds in favor of a lightweight Gaussian distribution for each object. This approximation drastically speeds up inference and enables real-time 3D scene graph generation. However, the representation has two key weaknesses. 1) Each object is approximated by a single 3D Gaussian, which causes a severe loss of 3D geometric detail. 2) The discrepancy between this approximation and the true object geometry exacerbates the inaccurate merging of object candidates during online inference. To address these issues, we propose NoPA, which represents each object as a separate non-parametric distribution. This formulation retains 3D geometric information while preserving real-time inference of the parametric Gaussian formulation. To build upon our novel object representation, we propose a tailored merging strategy to recover coherent object instances. Specifically, we leverage maximum mean discrepancy on kernel density estimates to enable robust merging of object candidates during online exploration while minimizing added computational complexity. The key is to maintain a fixed particle set per object. Furthermore, to rectify the relation loss caused by misclassified objects, NoPA propagates relationships between objects with high affinity. Experiments show that NoPA substantially outperforms current methods without sacrificing real-time inference speed.
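
As a rough illustration of merging object candidates by maximum mean discrepancy (MMD), the sketch below compares two small particle sets with a Gaussian (RBF) kernel and merges them when the squared MMD falls under a threshold. This is a generic MMD estimate, not the paper's procedure: the kernel choice, the bandwidth, the threshold value, and the helper names `mmd_squared` and `should_merge` are all assumptions.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def rbf(a: Point, b: Point, bandwidth: float) -> float:
    """Gaussian kernel between two 3D points."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs: List[Point], ys: List[Point], bandwidth: float = 0.5) -> float:
    """Biased (V-statistic) estimate of squared MMD between two particle sets."""
    def mean_kernel(ps: List[Point], qs: List[Point]) -> float:
        return sum(rbf(p, q, bandwidth) for p in ps for q in qs) / (len(ps) * len(qs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def should_merge(xs: List[Point], ys: List[Point],
                 threshold: float = 0.05, bandwidth: float = 0.5) -> bool:
    """Treat two candidates as the same object when their particle sets are close."""
    return mmd_squared(xs, ys, bandwidth) < threshold


# Hypothetical particle sets: two overlapping views of one object, one distant object.
chair_a = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
chair_b = [(0.05, 0.0, 0.0), (0.1, 0.05, 0.0), (0.0, 0.1, 0.05)]
table = [(3.0, 3.0, 0.0), (3.1, 3.0, 0.0), (3.0, 3.1, 0.0)]
```

With a fixed number of particles per object, the cost of each comparison is bounded by the product of the two set sizes, which is one way a fixed particle budget could keep merging cheap during online exploration.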

90. 【2607.00525】SPECSIA: Stylization Dataset for Novel-View Enhancement in Drawing-based 3D Animation

Link: https://arxiv.org/abs/2607.00525

Authors: Kyuwon Kim, Sunjae Yoon, Chang D. Yoo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserve character appearance, Generating animation, drawing is challenging, output must preserve, appearance while remaining

Comments: ECCV 2026

Abstract:Generating animation from a single 2D drawing is challenging because the output must preserve character appearance while remaining plausible and temporally coherent under motion. Existing drawing-based 3D animation pipelines often use sample-wise 2D refinement to align animated renderings with the input image, but such optimization tends to overfit to the observed view and fails to correct projection-induced artifacts in novel views. To address this limitation, we introduce SPECSIA-15K, a paired stylization dataset containing 14,980 artifact-corrupted projection/refinement-target pairs from 1,498 3DBiCar characters. We further present DraViE (Drawing-based View Enhancement), a lightweight plug-and-play module trained with data-level priors to remove novel-view artifacts while preserving style and motion plausibility. Experiments show consistent gains in novel-view fidelity and temporal coherence with lower per-character adaptation cost than sample-wise fine-tuning.

91. 【2607.00522】Restore3D: Breathing Life into Broken Objects with Shape and Texture Restoration

Link: https://arxiv.org/abs/2607.00522

Authors: Xiaolong Shen, Zongxin Yang, Yi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cultural heritage preservation, Restoring incomplete, incomplete or damaged, heritage preservation, artistic design

Abstract:Restoring incomplete or damaged 3D objects is crucial for cultural heritage preservation, occluded object reconstruction, and artistic design. Existing methods primarily focus on geometric completion, often neglecting texture restoration and struggling with relatively complex and diverse objects. We introduce Restore3D, a novel framework that simultaneously restores both the shape and texture of broken objects using multi-view images. To address limited training data, we develop an automated data generation pipeline that synthesizes paired incomplete-complete samples from large-scale 3D datasets. Central to Restore3D is a multi-view model, enhanced by a carefully designed Mask Self-Perceiver module with a Depth-Aware Mask Rectifier. The rectified masks learned by the self-perceiver guide an image integration and enhancement phase, helping retain observed shape and texture patterns while refining the generated regions and mitigating the low-resolution limitations of the base model, yielding high-resolution, semantically coherent, and view-consistent multi-view images. A coarse-to-fine reconstruction strategy is then employed to recover detailed textured 3D meshes from refined multi-view images. Experiments on synthetic and real broken-object benchmarks show that Restore3D improves multi-view restoration quality and textured-mesh reconstruction over representative inpainting, completion, and reconstruction baselines in the evaluated settings. Project Page: this http URL

92. 【2607.00514】Cross4D-JEPA: Dense Cross-modal Correspondence Distillation for 4D Point Cloud Representation Learning

Link: https://arxiv.org/abs/2607.00514

Authors: Trung Thanh Nguyen, Hai Nguyen-Truong, Tu Vo, Hoang M. Truong, Tuan-Anh Vu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Automatic understanding, understanding of dynamic, sequences captured, sensors and LiDAR, embodied perception

Abstract:Automatic understanding of dynamic 4D point clouds, the 3D-point sequences captured over time by depth sensors and LiDAR, is central to robotics and embodied perception. Yet annotating them densely is expensive, making self-supervised pretraining the natural route to transferable representations. Existing pretext tasks, however, are almost entirely intra-modal, and the few methods that transfer knowledge from 2D foundation models rely on a single global embedding per clip, discarding the rich per-patch semantics that these models compute. To address this gap, we propose Cross4D-JEPA, a teacher-student method that distills a frozen 2D foundation model, an image model DINOv2, or a video model V-JEPA 2, into a 4D point encoder. The proposed method combines (1) a dense cross-modal correspondence that maps every 3D point to the teacher patch feature it projects to, and (2) a per-point objective that trains the student to match these features in latent space with no masking, negatives, or decoder. We evaluate Cross4D-JEPA on four benchmarks, MSR-Action3D, DeformingThings4D, NTU-RGB+D 60, and HOI4D, against intra-modal and global cross-modal baselines. Experimental results show that, under a matched protocol, the proposed method consistently outperforms intra-modal and global cross-modal baselines across the four benchmarks and is competitive with heavier published 4D methods; further analysis attributes this gain primarily to the granularity of the correspondence rather than the teacher modality. Beyond recognition accuracy, the dense representation learned by Cross4D-JEPA transfers across domains, improves label efficiency, and improves full-label fine-tuning under the same training budget, while a 13x smaller encoder matches a heavyweight pooling backbone.
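
The dense correspondence described here, mapping each 3D point to the teacher patch it projects to, can be sketched with a pinhole camera model. The intrinsics (`fx`, `fy`, `cx`, `cy`), the 224x224 image size, the 14-pixel patch size, and the function name `point_to_patch` are assumptions for illustration; the paper's actual projection and feature lookup may differ.

```python
from typing import Optional, Sequence, Tuple


def point_to_patch(point: Sequence[float],
                   fx: float, fy: float, cx: float, cy: float,
                   image_hw: Tuple[int, int], patch: int = 14) -> Optional[int]:
    """Row-major index of the image patch a 3D camera-frame point projects to.

    Returns None when the point is behind the camera or falls outside the
    area covered by whole patches.
    """
    x, y, z = point
    if z <= 0:
        return None
    # Pinhole projection to pixel coordinates.
    u = fx * x / z + cx
    v = fy * y / z + cy
    height, width = image_hw
    rows, cols = height // patch, width // patch
    if not (0 <= u < cols * patch and 0 <= v < rows * patch):
        return None
    return int(v // patch) * cols + int(u // patch)


# Hypothetical camera: 224x224 image, principal point at the centre.
camera = dict(fx=100.0, fy=100.0, cx=112.0, cy=112.0, image_hw=(224, 224))

centre_patch = point_to_patch((0.0, 0.0, 1.0), **camera)
corner_patch = point_to_patch((-1.0, -1.0, 1.0), **camera)
behind_camera = point_to_patch((0.0, 0.0, -1.0), **camera)
```

A per-point training target would then be the teacher's feature at the returned patch index, with points that map to `None` skipped.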

93. 【2607.00509】AnF-DiffPET: Anatomy- and Frequency-Guided Diffusion for PET/CT Denoising

Link: https://arxiv.org/abs/2607.00509

Authors: Xuepeng Liu, Ruili Li, Zetong Liu, Renyiming Li, Yan Li, Yin Dai, Chao Li, Yueyang Teng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Positron emission tomography, time produces low-dose, essential functional information, reducing injected activity, acquisition time produces

Comments: 11 pages, 8 figures, 3 tables

Abstract:Positron emission tomography (PET) provides essential functional information for disease assessment; however, reducing injected activity or acquisition time produces low-dose (LD) PET with stronger count-dependent noise and less reliable uptake quantification. Diffusion models offer a promising solution for PET denoising by progressively recovering high-dose (HD) PET images from LD inputs. However, LD-to-HD PET denoising is still challenging due to insufficient anatomical guidance, unstable multi-scale feature propagation, and uncertain frequency-domain uptake recovery. We propose AnF-DiffPET, an anatomy- and frequency-guided diffusion framework for computed tomography (CT)-conditioned LD PET denoising. The framework integrates Anatomical-Frequency Guidance (AFG), Multi-Scale Cross-Transformer Reconstruction (MSCTR), and Frequency-Contrastive Hard Mining (FCHM) to enhance anatomy-aware feature modulation and frequency-domain consistency during denoising. Experimental results across four PET/CT datasets show that the proposed method improves image fidelity, anatomical consistency, and quantitative fidelity over representative CNN-based, GAN-based, transformer-based, and diffusion-based methods. The code and trained models will be publicly released upon acceptance.

94. 【2607.00499】Prior-Anchored Debiasing for Long-Tailed Multi-Organ Pathology Report Generation

Link: https://arxiv.org/abs/2607.00499

Authors: Feng Yang, Jie Liu, Yubo Pang, Peilin Chen, Xinheng Lyu, Shiqi Wang, Howard Leung, Ping Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Slide Images, Automated pathology report, attracted increasing attention, pathology report generation

Comments: none

Abstract:Automated pathology report generation from Whole Slide Images (WSIs) has attracted increasing attention in digital pathology. However, existing methods are predominantly developed under single-organ settings, overlooking the multi-organ scenarios encountered in clinical practice, where organ types typically follow a long-tailed distribution. To address this gap, we identify two critical biases: (1) visual representation bias, where the encoder favors head-class patterns over tail-class discriminative features, and (2) textual decoding bias, where the decoder overfits to head-class narrative patterns, yielding diagnostically unreliable outputs for tail-class organs. To mitigate these two biases, we propose a novel Prior-anchored multi-Organ pathology report Generation framework (PriOrGen). First, a Visual-Prototype Anchored Bottleneck module leverages the information bottleneck principle with learnable anchor representations to selectively retain diagnostically relevant visual information while filtering out head-biased redundancy. Second, a Meta-Report Anchored Bank module constructs an organ-specific meta-report anchored bank and retrieves organ-faithful textual priors to steer the decoder away from head-class narrative patterns. Extensive experiments on a multi-organ pathology dataset demonstrate that our method effectively mitigates long-tail biases and achieves superior report generation performance across both head and tail organ categories compared to state-of-the-art methods.

95. 【2607.00498】Robust 3D Alignment of Generative Reconstructions via Partial Monocular Observations

Link: https://arxiv.org/abs/2607.00498

Authors: Yuchen Zhang, Luanyuan Dai, Yiwei Wang, Xiwei Xu, Jianing Zhang, Johnny.r.zhang, Xianhui Meng, Yanbiao Ma, Jiayi Ma, Xiaoshuai Hao

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: partial monocular observations, Aligning generative, reconstructions with partial, computer vision, critical but under-explored

Comments: none

Abstract:Aligning generative 3D reconstructions with partial monocular observations is a critical but under-explored challenge in computer vision. This task is inherently ill-posed due to severe asymmetries between noisy, sparse monocular inputs and dense generative priors, whose scale ambiguity and geometric hallucinations, combined with the lack of initial overlap, render traditional registration pipelines ineffective. To resolve these issues, we propose a training-free and interpretable geometric alignment framework that grounds generative 3D priors via a 3D similarity transformation (Sim(3)), which can recover accurate metric scale and pose. Specifically, we introduce an explicit scale factor to resolve metric ambiguity and employ a coarse-to-fine alignment strategy, leveraging geometry-aware descriptors for robust initialization and a decoupled closed-form solver for precision refinement. In addition, we introduce a Hallucination Filtering operation to effectively suppress outliers caused by hallucinated geometry. To evaluate alignment performance under these extreme conditions, we introduce GenPMOAlign--Where2Place, a rigorous benchmark specifically designed for Generative-to-Partial Monocular Observational Alignment. Experiments demonstrate that our method achieves stable and accurate registration, substantially outperforming both classical geometric pipelines and state-of-the-art learning-based baselines. Code and the benchmark will be publicly released.

96. 【2607.00494】HieDG: A Hierarchical Discrete Geometry-Guided Framework for Multi-Animal Tracking

Link: https://arxiv.org/abs/2607.00494

Authors: Chenxun Deng, Zhongde Zhang, Ye Yuan, Chengyang Zhang, Yifan Zhang, Bohao Chen, Hongying Yan, Hang Zhou, Hua Han, Xi Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains challenging due, high density, behavioral analysis, irregular motion, Multi-animal tracking

Comments: Accepted to ECCV 2026

Abstract:Multi-animal tracking (MAT) is critical for wildlife monitoring and behavioral analysis, yet remains challenging due to uniform appearance, high density, and irregular motion. Existing methods typically follow heuristic- or query-based paradigms: the former relies on handcrafted geometric associations without end-to-end optimization, whereas the latter enables joint optimization but relies heavily on appearance embeddings. In such conditions, continuous geometric embeddings can be unstable, as small coordinate perturbations may disproportionately alter cross-frame attention weights, degrading identity association performance. To address this limitation, we propose HieDG, a Hierarchical Discrete Geometry-guided tracking framework that reformulates geometric dynamics as structured discrete representations within a query-based tracker. Instead of directly using raw geometric signals, HieDG employs a two-stage residual codebook to discretize position, scale, and velocity cues, transforming unstable continuous geometry into structured, stable discrete tokens. These tokens are aligned with visual embeddings and integrated into the tracking queries to enhance identity consistency. Extensive experiments on animal-specific benchmarks (AnimalTrack, BFT, and BuckTales) demonstrate state-of-the-art association performance with significant improvements in HOTA, AssA, and IDF1. Additional evaluations on generic multi-object tracking benchmarks, including DanceTrack and SportsMOT, show competitive performance, indicating the broader applicability of discretized geometric modeling beyond animal-specific scenarios.

97. 【2607.00492】GenSP: Consistent Spherical Parameterization via Learning Shape Generative Models

Link: https://arxiv.org/abs/2607.00492

Authors: Sai Karthikey Pentapati, Shashank Gupta, Rajesh Sureddi, Yuezhi Yang, Alan C. Bovik, Qixing Huang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: data-driven framework, spherical parameterizations, shapes, learns consistent spherical, consistent spherical parameterizations

Comments: Accepted at ECCV 2026. Sai Karthikey Pentapati and Shashank Gupta contributed equally to this work

Abstract:We introduce GenSP, a data-driven framework that learns consistent spherical parameterizations across a collection of genus-0 shapes. Instead of optimizing the parameterization of each shape independently, our method learns a neural generative model that predicts a continuous mapping from the unit sphere to shapes in a dataset. Under this formulation, spherical parameterizations are obtained through the inverse mappings of the learned generator, which encourages similar shapes to share consistent parameterizations. To make this formulation practical, we address several key challenges in learning such a generative model. First, we introduce a continuous neural deformation model that predicts surface points from sphere coordinates and latent shape codes, avoiding discretization artifacts common in mesh-based formulations. Second, we augment the training space with intermediate shapes that bridge the sphere and input shapes, allowing the model to learn meaningful deformations across a heterogeneous shape collection. Third, we compute reliable initial correspondences by propagating mappings along a spanning tree of training shapes in the latent space. Experiments on the ShapeNet dataset demonstrate that our approach significantly reduces geometric distortion and improves cross-shape consistency compared with state-of-the-art spherical parameterization methods.

98. 【2607.00491】MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

Link: https://arxiv.org/abs/2607.00491

Authors: Leyuan Yu, Xiao Tang, Minghao Liu, Xinyuan Li, Xiaokai Bai, Sheng Zhou, Qunshu Lin, Weihao Xuan, Naoto Yokoya

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: models describe relations, test observational spatial, vision-language models, models describe, test observational

Comments: 18 pages, 7 figures. Dataset available at [this https URL](https://huggingface.co/datasets/ZODAOfficial/MindEdit-Bench)

Abstract:Benchmarks for vision-language models (VLMs) mostly test observational spatial reasoning: models describe relations already visible in the input. Existing what-if tasks typically vary the observer while keeping the scene fixed. Can VLMs instead predict the consequences of hypothetically moving or rotating an object? We introduce MindEdit-Bench, a benchmark of six spatial reasoning tasks built from three-photo smartphone triplets of newly captured indoor scenes via an automatic in-the-wild 3D scene-graph extraction pipeline. Four tasks probe perception and perspective transformation over observed structure; two new tasks, L4 (spatial editing) and L5 (cross-view visibility editing), probe object-level counterfactual reasoning, where correct answers are absent from all input images. Each question provides 8-24 structured answer choices, enabling answer-letter-level diagnosis of spatial and fallback errors. The benchmark covers 120 private indoor scenes not drawn from public datasets, reducing public-data pretraining-overlap risk. Across 15 VLMs on 1,003 human-verified questions, task-wise mean VLM accuracy is only 8%-31%, versus 81%-97% human majority-vote accuracy. The pooled human--best-VLM gap is 53 pp, with at least 39 pp on every task. The structured answer space further reveals non-uniform failures, including weaker camera-depth-axis inference and fallback behavior on difficult visibility-editing cases.

99. 【2607.00486】PAPA: Online Personalized Active Preference Alignment

Link: https://arxiv.org/abs/2607.00486

Authors: Anindya Sarkar, Nasik Muhammad Nafi, Isaac Lyngaas, Muralikrishnan Gopalakrishnan Meena, Yevgeniy Vorobeychik

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: modeling complex data, complex data distributions, including images, images and text, highly effective

Comments: Accepted to ECML PKDD 2026

Abstract:Diffusion models are highly effective at modeling complex data distributions, including images and text. However, in applications like personalized recommender systems, the objective often shifts to modeling specific regions of the distribution that maximize user preferences, which are initially unknown but gradually uncovered through interactive feedback. This can naturally be framed as a reinforcement learning problem, where the goal is to fine-tune a diffusion model to maximize a reward function based on preferences. However, the main challenge lies in learning a parameterized reward model, which typically requires large-scale preference data, something that is often not feasible in practice. In this work, we introduce Personalized Active Preference Alignment (PAPA), a novel method that bypasses the requirement for a parameterized reward model by directly optimizing the diffusion model using real-time user feedback. PAPA enables feedback-efficient preference alignment, drawing inspiration from the variational inference framework. We demonstrate PAPA's effectiveness through extensive experiments and ablation studies across diverse class-conditioned and fine-grained alignment tasks. Additionally, based on theoretical insights, we propose an enhanced fine-tuning strategy, referred to as EPAPA, that requires less computational budget and accelerates the fine-tuning process, further boosting PAPA's suitability for real-world deployment. Our code is made publicly available at this https URL.

100. 【2607.00465】StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning

Link: https://arxiv.org/abs/2607.00465

Authors: Yuan Qing, Chengzhi Mao, Boqing Gong

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Visual Instruction Tuning, Large Vision-Language Models, Instruction Tuning, Large Vision-Language, Visual Instruction

Comments: Accepted to ECCV 2026. Project page and code: [this https URL](https://yuanqing-ai.github.io/StochasT)

Abstract:Large Vision-Language Models (LVLMs) rely extensively on Visual Instruction Tuning (VIT) to elicit their multimodal reasoning capabilities. However, we find a discrepancy: VIT often packs multiple language tasks about the same image for conversational, multi-turn training, whereas existing benchmarks evaluate LVLMs in isolated, single-turn scenarios. The models can suffer from visual attention decay and contextual overfitting during multi-turn training, making it hard for them to realize their full potential in the mismatched test phase. To close the gap, we propose learning with Stochastic Turn Depth (StochasT), which stochastically groups language tasks for the same image into clusters of varying sizes (turn depth) while preserving their organic order. Hence, while StochasT draws on Dropout and stochastic depth for ResNets, it does not actually drop anything to maximize the utility of the training data. Furthermore, we introduce a challenging, benchmark-agnostic evaluation mechanism based on the Balanced Latin Square to measure LVLMs' robustness under varying contextual dependencies. Extensive experiments demonstrate that StochasT effectively grants LVLMs strong, harmonized capabilities for both single-turn and multi-turn use cases.

101. 【2607.00461】Multimodal Continuous Reasoning via Asymmetric Mutual Variational Learning

Link: https://arxiv.org/abs/2607.00461

Authors: Shijie Li, Yilin Gao, Siyuan Yang, Tieyuan Chen, Chaofan Gan, Zhihao He, Zicheng Zhao, Yuyu Guo, Weiyao Lin, Hang Yu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, lose perceptual nuance, Language Models, Large Language

Comments: none

Abstract:Multimodal Large Language Models (MLLMs) are often constrained by a language-space bottleneck, forcing complex visual reasoning into discrete tokens which can lose perceptual nuance. A promising alternative is continuous latent reasoning, where the goal is to discover implicit reasoning pathways that bridge the multimodal query and the final answer. However, this introduces a severe train-inference mismatch: a training-time posterior, conditioned on the ground-truth answer, can exploit answer-dependent shortcuts. Standard variational training then forces the inference-time prior to mimic a posterior that has access to information unavailable at test time, leading to poor performance. To address this, we propose Asymmetric Mutual Variational Learning (AMVL), a framework that resolves this mismatch via a bidirectional calibration objective. A forward KL divergence trains the target-agnostic prior to match the posterior, while a novel reverse KL divergence simultaneously regularizes the posterior, preventing it from collapsing into inference-incompatible regions and mitigating this "answer leakage". We provide theoretical analysis formalizing this leakage as prior contamination and prove that our dual-KL objective reduces it. We instantiate AMVL in a latent-integrated MLLM and show that it consistently outperforms strong discrete and latent-reasoning baselines, improving the average score on the complex BLINK benchmark by +10.83 and achieving gains of up to +32.00 on individual reasoning tasks, with analyses confirming improved latent-space stability.

102. 【2607.00446】VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement

Link: https://arxiv.org/abs/2607.00446

Authors: Seohyun Lee, Seoung Choi, Dohwan Ko, Jongha Kim, Hyunwoo J. Kim

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: subsequently perform fine-grained, retrieve relevant videos, continue to expand, increasing demand, retrieve relevant

Comments: Accepted to ECCV 2026

Abstract:As video corpora continue to expand in both scale and task complexity, there is increasing demand for approaches that retrieve relevant videos from large-scale corpora (inter-video reasoning) and subsequently perform fine-grained, query-conditioned tasks (intra-video reasoning) within the retrieved content, such as temporal grounding. However, existing approaches typically treat retrieval as a preprocessing step, and consequently, when the initial retrieval fails, there is no mechanism to refine the search, leading to the failure of subsequent fine-grained intra-video reasoning. Moreover, while recent agentic frameworks have advanced video understanding, they typically assume that the query-relevant video is already given, focusing exclusively on intra-video reasoning tasks. To address these limitations, we propose VideoSearch-R1, an agentic framework for iterative video retrieval and reasoning through multi-turn interaction with a video search engine. Specifically, we introduce Soft Query Refinement (SQR) to refine search query tokens in a continuous latent space rather than rewriting queries in the discrete text space, enabling more efficient and fine-grained adjustments. SQR and its reasoning process are trained using Group Relative Policy Optimization (GRPO), guided by task-level reward signals derived from retrieval and downstream tasks. Building upon this, VideoSearch-R1 achieves state-of-the-art performance across three datasets on Video Corpus Moment Retrieval (VCMR), iteratively retrieving videos from large-scale corpora, refining search queries, and performing precise query-conditioned temporal grounding within the retrieved content. Our analyses show that SQR effectively refines the original query, requiring significantly fewer generated tokens than explicit text-level query refinement. Code and model checkpoints are publicly available at this http URL.

103. 【2607.00434】Information-Regularized Attention for Visual-Centric Reasoning

Link: https://arxiv.org/abs/2607.00434

Authors: Guohao Sun, Xiaofang Wang, Yash Patel, Mengchen Liu, Zhiqiang Tao, Praveen Krishnan

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: full-parameter instruction tuning, remain unstable due, weak visual grounding, object hallucination, instruction tuning

Comments: Accepted by ECCV 2026

Abstract:Vision-language models (VLMs) have become a paradigm for multimodal learning, yet remain unstable due to object hallucination, weak visual grounding, and catastrophic forgetting after full-parameter instruction tuning. We claim these failures result from a lack of explicit control over visual representation learning during the standard next-token prediction objective. As a result, visual embeddings become passively optimized and prone to injecting redundant or spurious signals. To counter this, we introduce Information-Regularized Attention (IRA), a stochastic attention mechanism that explicitly regulates the amount of visual information injected into the hidden states of intermediate transformer layers. This local reparameterization translates uncertainty about visual representations into local noise that is independent across data points. Beyond evaluating model performance, we also quantify embedding properties, where IRA produces smoother curvature trajectories and suppresses attention sinks across all layers, indicating a more stable transformation of the visual signal. Our results suggest that stochastic attention is not merely a regularizer but a key contributor to representation learning in a generative architecture, offering a new direction for building more reliable VLMs.

104. 【2607.00428】HyFL-CLIP: Hyperbolic Fine-Tuning of CLIP for Robust Long-Context Understanding

Link: https://arxiv.org/abs/2607.00428

Authors: Ji Ha Jang, Hayeon Kim, Chulwon Lee, Junghun James Kim, Se Young Chun

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Contrastive Language-Image Pre-training, Language-Image Pre-training, due to absolute, absolute positional encoding, facto paradigm

Comments: Accepted to ECCV 2026. Project page: [this https URL](https://janeyeon.github.io/hyflclip)

Abstract:CLIP (Contrastive Language-Image Pre-training) has become a de facto paradigm for image-text alignment, but it struggles with long-context descriptions (more than 77 tokens) due to absolute positional encoding and pretraining on short captions. In long contexts, sentences are often reordered, summarized, or partially omitted. Although prior works extend CLIP with longer positional encodings, they often suffer from degraded image-text alignment under such text perturbations. We attribute this limitation to the Euclidean contrastive objective, which enforces strict one-to-one matching and lacks explicit mechanisms for modeling hierarchical relationships between global context and its constituent elements. To address this issue, we propose HyFL-CLIP, a hyperbolic fine-tuning framework that distills the well-established text-image alignment learned in Euclidean CLIP into hyperbolic space via cross-manifold similarity distillation, leveraging its geometry to capture hierarchical and entailment relations. Our method models hierarchical semantics by linking summarized token-wise features, long-context descriptions, constituent short textual components, and images, capturing part-whole relationships via hyperbolic entailment with Einstein midpoint aggregation. Experiments on diverse benchmarks, including long-context cross-modal retrieval, cross-modal retrieval with caption perturbations, intra-modality retrieval, and short-text cross-modal retrieval, show that HyFL-CLIP achieves more robust long-context understanding. In particular, it yields up to 19.5% improvement in long-text cross-modal retrieval under textual perturbations over the best prior method. We also show HyFL-CLIP can be seamlessly integrated into other model frameworks by applying it to Stable Diffusion XL (SDXL).

105. 【2607.00417】EO-VGGT: Orbital Ray-Conditioned 3D Foundation Models for Satellite Multi-View Reconstruction

Link: https://arxiv.org/abs/2607.00417

Authors: Qiyan Luo, Yingdong Pi, Lekang Wen, Jie Yang, Xiaoyu Wang, Haiming Zhang, Mi Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: high-quality Digital Surface, Digital Surface Model, Digital Surface, multi-view optical satellite, pivotal for Earth

Comments: Submitted to a journal and currently under review

Abstract:In the era of satellite constellations, multi-view optical satellite imagery is pivotal for Earth Observation (EO) and high-quality Digital Surface Model (DSM) reconstruction. Although feed-forward 3D foundation models have transformed computer vision, their deployment in satellite remote sensing is inherently constrained by the structural discrepancy between implicit perspective assumptions and explicit orbital pushbroom geometry. This geometric incongruity is further compounded by pronounced view-set heterogeneity. We present EO-VGGT, a framework that adapts a frozen perspective-driven model to orbital observations via explicit physical geometry. First, the Geometry-Correlation Constrained Selection (GCCS) strategy prunes sub-optimal observations by balancing geometric diversity and radiometric consistency to optimize the input sequence. Second, a Sensor-Ray Encoder (SRE) parameterizes pixel-level pushbroom lines of sight derived from the Rational Function Model (RFM) into high-dimensional space-geometric tokens, reconciling the mathematical discrepancy between central projection and orbital kinematics. Third, a lightweight Ray-Pointing-Aware Adapter (RPAA) employs gated residual blocks to integrate these tokens directly into the frozen transformer backbone. Our findings underscore that integrating explicit physical geometry with optimized view selection is essential for robust feed-forward satellite 3D reconstruction.

106. 【2607.00416】DroneIQA-VLE: Multi-Task Drone Image Quality Assessment via Vision-Language Ensemble

Link: https://arxiv.org/abs/2607.00416

Authors: Wei Sun, Weixia Zhang, Hongjian Zhan, Mingkai Lu, Yixuan Gao, Guangtao Zhai

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-altitude UAV Images, Image Quality Assessment, Target-aware Image Quality, Drone-IQA Grand Challenge, Target-aware Image

Comments: The model achieved 2nd place in the ICME 2026 Drone-IQA Grand Challenge on Target-aware Image Quality Assessment for Low-altitude UAV Images

Abstract:We present DroneIQA-VLE, our solution to the ICME 2026 Drone-IQA Grand Challenge on Target-aware Image Quality Assessment for Low-altitude UAV Images. The framework jointly predicts global, target, and background quality scores by ensembling two complementary pipelines: (1) SigLIP2 vision encoders with multi-task regression heads, and (2) a LoRA-adapted Qwen3.5-9B multimodal large language model for quality score regression. The final global quality prediction is obtained by arithmetically averaging the outputs of both pipelines. Our method achieves 2nd place in the challenge, demonstrating its effectiveness. The code is available at this https URL.

107. 【2607.00410】MindAU: EEG-Conditioned Facial Action Unit Editing via Dual-Stream Manifold Alignment

Link: https://arxiv.org/abs/2607.00410

Authors: Zhenhang Li, Xin Zhou, Hao Deng, Lijun Yin

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Recent brain decoding, brain decoding studies, made substantial progress, reconstructing externally perceived, Recent brain

Comments: none

Abstract:Recent brain decoding studies have made substantial progress in reconstructing externally perceived visual content from neural signals. However, using electroencephalography (EEG) recordings to guide facial expression editing remains largely unexplored and poses a distinct challenge: rather than recovering what a subject sees, it requires identifying facial-action related patterns from noisy EEG signals and grounding them in localized, identity-preserving expression edits. In this paper, we investigate EEG-conditioned facial image editing for fine-grained facial action unit (AU) control and propose MindAU, a unified framework for controlling facial AU edits from EEG signals. MindAU first learns noise-robust and AU-discriminative EEG representations through temporal masked reconstruction and AU classification supervision. It then bridges the modality gap via Dual-Stream Manifold Alignment, aligning EEG features with AU-level text semantics and identity-reduced visual displacement trajectories in the multimodal space of Qwen2.5-VL. Finally, MindAU incorporates EEG-aware Multimodal Rotary Positional Embeddings, landmark-guided reference masking, and AU-aware region supervision into a multimodal diffusion-based editor for high-fidelity identity-preserving editing. We also introduce E-CAFE, a curated benchmark for EEG-Conditioned Action-Unit Facial Editing with paired EEG-face editing samples and standardized evaluation protocols. Extensive experiments demonstrate the effectiveness of MindAU and suggest its potential as a step towards future assistive expression technologies for individuals with facial neuromuscular disorders.

108. 【2607.00409】MedCAGD: Context-Aware Gated Decoder for Efficient Medical Image Segmentation

Link: https://arxiv.org/abs/2607.00409

Authors: Saad Wazir, Patrick Dominique Vibild, Dinh Phu Tran, Seongah Kim, Daeyoung Kim

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Medical image segmentation, translate rich feature, accurate pixel-level predictions, image segmentation relies, structural ambiguity

Comments: Accepted at the European Conference on Computer Vision (ECCV 2026)

Abstract:Medical image segmentation relies on the ability of encoder-decoder architectures to translate rich feature representations into accurate pixel-level predictions under challenging conditions such as low contrast, structural ambiguity, and scale variability. While recent advances in large-scale pretraining and transformer-based encoders have substantially improved feature extraction, segmentation accuracy remains constrained by decoder design, particularly in terms of cross-scale alignment, contextual integration, and boundary preservation. In this work, we revisit medical image segmentation from a decoder-centric perspective and propose a context-aware gated decoder that systematically regulates feature fusion and contextual aggregation throughout the decoding process. The proposed decoder integrates lightweight multi-scale channel recalibration, gated skip fusion with spatial competition and a global context aggregation mechanism that injects encoder-wide information into intermediate decoding stages. This design enables effective translation of strong pretrained encoder representations into spatially consistent predictions. Extensive experiments across 11 medical image segmentation benchmarks validate the effectiveness and demonstrate that the proposed approach consistently outperforms strong baselines while remaining computationally practical. Code: this https URL

109. 【2607.00402】The Illusion of High Utility in Safety Alignment of Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2607.00402

Authors: Adeel Yousaf, Soumik Ghosh, James Beetham, Amrit Singh Bedi, Mubarak Shah

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: suppress harmful generations, diffusion models aims, aims to suppress, suppress harmful, harmful generations

Comments: ECCV 2026

Abstract:Safety alignment of text-to-image (T2I) diffusion models aims to suppress harmful generations while preserving utility on benign prompts. Recent methods often appear to deliver high safety with high utility, but this conclusion rests largely on coarse global utility metrics (e.g., FID, CLIPScore) that are insensitive to fine-grained semantic correctness, creating an illusion of high utility. We show that when utility is measured with structured evaluation, this illusion breaks: on TIFA (Text-to-Image Faithfulness evaluation with Question Answering), safety-aligned models suffer substantial drops in semantic fidelity, including failures in object counts, attributes, and relationships. To diagnose the source of this gap, we analyze the text-encoder prompt embedding space and uncover semantic collapse, a contraction of embedding spread coupled with distortion of inter-prompt similarity structure, which strongly correlates with structured utility loss. Guided by this insight, we propose StructureAware Geometric Regularization (SAGE), a safety alignment objective that explicitly preserves embedding spread and inter-prompt relational structure during adaptation. Our method restores structured utility (TIFA +5.0% over prior state-of-the-art) while maintaining strong safety performance and competitive coarse-grained utility scores. Our source code and trained models are available at this https URL.

110. 【2607.00399】DriveVer: Lightweight Trajectory Evaluator as Test-Time Verifier for Autonomous Driving

Link: https://arxiv.org/abs/2607.00399

Authors: Chong He, Yuechen Luo, Fang Li, Shaoqing Xu, Fuxi Wen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diminishing marginal returns, training-time scaling leads, encounter performance bottlenecks, high computational costs, marginal returns

Comments:

Abstract:End-to-end autonomous driving models often encounter performance bottlenecks, as training-time scaling leads to high computational costs and diminishing marginal returns. Existing planners typically adopt a one-shot generation paradigm, lacking secondary validation and active correction mechanisms to detect and revise suboptimal or unsafe trajectories during inference. To address this issue, we propose DriveVer, a lightweight, plug-and-play Test-Time Verifier that leverages the test-time scaling paradigm to enable autonomous driving systems to validate and refine trajectories without costly and heavy training. We construct a dedicated trajectory dataset based on the NAVSIM benchmark through condition-driven clustering and balanced sampling according to ego-vehicle states and navigation commands. Employing a dual-head architecture, DriveVer efficiently fuses candidate trajectories with multi-view visual representations and ego-vehicle kinematic features to simultaneously predict a safety confidence score and an absolute geometric refinement vector. Extensive experiments on the NAVSIM benchmark show that DriveVer significantly improves the performance of base planning models. Notably, as an extremely compact model with only 34M parameters, DriveVer introduces minimal computational overhead, achieving competitive results while maintaining real-time inference efficiency.

111. 【2607.00382】Vitality-Aware Compression for Efficient Image-to-Shape Diffusion Transformers

Link: https://arxiv.org/abs/2607.00382

Authors: Jaeah Lee, Hyunjin Kim, Jaewoong Cho, Gihyun Kwon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformers, substantially reduces model, reduces model size, preserving geometric fidelity, substantially reduces

Comments: Accepted to ECCV 2026

Abstract:We propose the first compression approach for image-to-shape Diffusion Transformers (DiTs) that substantially reduces model size while preserving geometric fidelity. Despite remarkable progress in 3D shape generation, large DiT-based models remain computationally prohibitive in resource-constrained settings. Furthermore, it is difficult to directly transfer existing diffusion model compression strategies developed for different domains to 3D generation, and prior 3D efficiency approaches focus primarily on inference speed rather than backbone compression. To address this limitation, we build a geometry-aware compression framework tailored to image-to-shape DiTs. Guided by the observation that 3D DiT layers exhibit non-uniform importance for geometry synthesis, we introduce a vitality-guided framework integrating structured pruning, adaptive quantization, and targeted fine-tuning. Our method achieves up to 66% model-size reduction across state-of-the-art image-to-3D models while maintaining synthesis fidelity comparable to full-sized counterparts. This highlights the potential of our framework as a plug-and-play solution for efficient 3D shape generation across diverse models.

112. 【2607.00379】Attribute-Prompted Kernel Hashing for Unsupervised Data-Efficient Cross-Modal Retrieval

Link: https://arxiv.org/abs/2607.00379

Authors: Runhao Li, Xiaoxu Ma, Zhenyu Weng, Yue Zhang, Guibo Luo, Huiping Zhuang, Zhiping Lin, Yap-Peng Tan

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantically related instances, manual semantic annotation, hashing enables efficient, requiring manual semantic, enables efficient retrieval

Comments:

Abstract:Unsupervised cross-modal hashing enables efficient retrieval of semantically related instances across different modalities without requiring manual semantic annotation. However, existing unsupervised methods rely heavily on large-scale image-text pairs. Collecting such data can be costly, particularly in scenarios where well-aligned pairs are scarce due to privacy and specialized constraints. More critically, existing methods tend to overfit to seen training data, restricting their generalization performance on unseen categories that the constrained training data cannot cover. To address these limitations, we propose Attribute-Prompted Kernel Hashing (APKH), a novel data-efficient approach that constructs a compact, modality-aligned Hamming space driven by the generalized attribute priors of vision-language foundation models. Specifically, APKH introduces two core modules: Context-optimized Attribute Kernel Mapping (CAKM) and Kernel-Smoothed Contrastive Alignment (KSCA). CAKM formulates cross-modal alignment through hyperspherical Radial Basis Function kernel mapping, optimizing dynamic attribute kernels via prompt learning to capture modality-invariant semantics. Furthermore, KSCA extends conventional point-to-point contrastive learning by modeling limited paired data as continuous kernel distributions. This explicit smoothing of the modality gap alleviates overfitting to sparse pairwise correlations. Extensive experiments demonstrate that APKH outperforms state-of-the-art hashing methods in the challenging cross-modal retrieval tasks from seen to unseen categories under data-constrained scenarios.
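
As a rough illustration of the hyperspherical RBF kernel mapping mentioned in the abstract, the sketch below L2-normalizes features, evaluates a Gaussian kernel against a set of anchor vectors, and binarizes a projection of the kernel responses. The anchors, `gamma`, the random projection, and all function names are hypothetical stand-ins. The paper learns its attribute kernels through prompt learning, which is not reproduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project rows of x onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def hyperspherical_rbf(features, anchors, gamma=1.0):
    """Gaussian kernel between unit vectors.

    For unit vectors, ||x - a||^2 = 2 - 2 * <x, a>, so the kernel
    depends only on cosine similarity.
    """
    x = l2_normalize(features)
    a = l2_normalize(anchors)
    cos = x @ a.T
    return np.exp(-gamma * (2.0 - 2.0 * cos))

def hash_codes(kernel_feats, projection):
    """Binarize projected kernel responses into {-1, +1} codes."""
    return np.where(kernel_feats @ projection >= 0, 1, -1)

# Toy data: random features, random anchors, random projection (all assumed).
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 16))
anchors = rng.normal(size=(8, 16))
projection = rng.normal(size=(8, 32))

k = hyperspherical_rbf(image_feats, anchors)
codes = hash_codes(k - k.mean(axis=0), projection)
```

Image and text features mapped through the same anchors land in a shared kernel space, so their binary codes can be compared by Hamming distance.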

113. 【2607.00378】Radial Interaction Tomography: Recognizing Non-Transitive Evolutionary Games from One Range-Expansion Image

Link: https://arxiv.org/abs/2607.00378

Authors: Faruk Alpay, Baris Basaran

Categories: Computer Vision and Pattern Recognition (cs.CV); Populations and Evolution (q-bio.PE)

Keywords: lineage survival counts, microbial range expansion, range expansion encode, Colored sectors, survival counts

Comments: 17 pages, 10 figures. Ancillary files include computational diagnostics, benchmark code, and supplementary proofs

Abstract:Colored sectors in a microbial range expansion encode more than lineage survival counts. We formulate a computer-vision inverse problem: from one endpoint image of an accretive multi-type expansion, recover the radius-indexed pairwise boundary-flow field and test whether the visual pattern is compatible with a transitive scalar fitness hierarchy. The observable is a geometric signal extracted from sector-boundary curves in log-polar coordinates. We prove endpoint observability and stability for frozen fronts, weighted transitive/cyclic decomposition, contact-complete circular design, physical-clock and mechanism non-identifiability, exact Gaussian cyclicity testing, and Bonferroni-valid interval scanning. The benchmark is deterministic: analytic endpoint images, blurred/noisy pixel round trips, scalar-null stress tests, public-image tracing, multi-resolution mechanistic endpoints, and a non-learning frozen-front simulator. The implementation recovers pairwise edge-flow histories from endpoint images, detects cyclic residuals in a mechanistic four-type expansion, and uses those residuals as forcing signals for a dimensionless active design-control layer covering reaction-diffusion control, phenotype-frontier optimization, protocol synthesis, Monte Carlo robustness, and a downstream population-state bridge.

114. 【2607.00375】LIST3R: Long-sequence Instance-aware 3D Reconstruction

Link: https://arxiv.org/abs/2607.00375

Authors: Jing Gao, Wei Wang, Feiran Wang, Yan Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: humans organize spatial, organize spatial memory, recognizable objects, instance-aware framework, spatial memory

Comments:

Abstract:We present LIST3R, an instance-aware framework for long-sequence 3D reconstruction inspired by the way humans organize spatial memory around stable and recognizable objects. LIST3R organizes long-sequence reconstruction around instance anchors, using them to reconnect fragmented subsequences and consolidate local observations into a coherent global 3D scene. Given a long video, our approach partitions it into overlapping subsequences and builds a structured local instance library for each partial reconstruction, maintaining persistent trackable anchors with semantic and geometric evidence. These anchors are matched across subsequences to recover revisited regions and provide object-aware constraints for fragment alignment, producing a consistent global reconstruction. During this process, the evolving geometric evidence updates the local instance libraries and progressively organizes them into a unified global 3D instance library. Experiments on long-sequence benchmarks show that our method produces more accurate trajectories and higher-quality 3D reconstructions, highlighting the effectiveness of persistent instance anchors for organizing long-horizon 3D reconstruction. Our code is available on the project page: this https URL.

115. 【2607.00374】Learning to Compose: Revisiting Proxy Task Design for Zero-Shot Composed Image Retrieval

Link: https://arxiv.org/abs/2607.00374

Authors: Jingjing Zhang, Lei Zhang, Zheren Fu, Zhendong Mao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: Composed Image Retrieval, Image Retrieval, reference image, Image, Composed Image

Comments: Accepted by ECCV 2026

Abstract:Composed Image Retrieval (CIR) retrieves a target image from a reference image and a textual modification. While supervised CIR relies on costly triplets, Zero-Shot CIR (ZS-CIR) alleviates this reliance through proxy tasks trained on image-text pairs. However, existing proxy tasks primarily enhance visual and textual representations to accommodate a predefined composition mechanism such as pseudo-word injection into a frozen text encoder or linear feature arithmetic. As a result, the composition function itself remains unlearned, limiting the model's ability to express diverse and fine-grained semantic modifications. To address this, we propose FoCo, which models composition as two coordinated stages: focusing on modification-relevant visual content, and then completing the target semantics. We realize these through two proxy tasks: text-anchored visual aggregation to selectively gather visual content guided by localized textual semantics, and context-conditioned semantic completion to transform these aggregated visuals with the remaining scene context into a coherent composed representation. The tasks are trained jointly with a cross-instance contrastive objective, encouraging semantic diversity and discouraging shortcut composition strategies. Extensive experiments on four ZS-CIR benchmarks show FoCo's state-of-the-art performance and improved generalization.

116. 【2607.00371】MEPA: Multi-Scale Representation Alignment for Visual Autoregressive Modeling with Mixture of Experts

Link: https://arxiv.org/abs/2607.00371

Authors: Nuoyan Zhou, Zhijun Tu, Lei Yu, Kun Cheng, Jie Hu, Nannan Wang, Xinghao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: demonstrating strong capabilities, Visual AutoRegressive modeling, Visual AutoRegressive, demonstrating strong, multi-scale autoregressive generative

Comments: 15 pages, 4 figures, 8 tables, Accepted at ECCV 2026

Abstract:Visual AutoRegressive modeling (VAR) has pioneered a coarse-to-fine multi-scale autoregressive generative paradigm, demonstrating strong capabilities in image generation. However, VAR still suffers from inherent deficiencies in multi-scale representation learning. Specifically, lower scales primarily capture global semantics, while higher scales focus on fine-grained details. Employing a shared architecture across scales induces optimization conflicts. Moreover, due to the causal autoregressive process, inaccurate semantics at early scales can propagate and significantly degrade the final output. To address these issues, we introduce a scale-aware token-routed Mixture of Experts (MoE) architecture, allowing scale-adaptive expert selection, thereby facilitating decoupled representation learning across scales. In addition, we enhance semantic modeling at early scales by incorporating external self-supervised features. Unlike naive alignment, we analyse and design a residual feature aggregation scheme tailored to the VAR paradigm. Extensive experiments show that our method significantly improves both training efficiency and generation quality. On the ImageNet 256×256 benchmark, our model achieves a superior FID compared to the dense baseline while requiring only half of the default training epochs and a smaller parameter budget, with only a marginal increase in training cost. Moreover, the performance gap widens further with more training epochs.

117. 【2607.00369】SFDATrack: Generalized Source-Free Domain Adaptive Tracking Under Adverse Weather Conditions

Link: https://arxiv.org/abs/2607.00369

Authors: Siyuan Yao, Ziqi Wang, Ruiqi Yu, Junqi Huang, Wenqi Ren, Xiaochun Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual object tracking, garnered significant attention, adaptive visual object, Domain adaptive visual, recent years

Comments: Accepted to ECCV 2026

Abstract:Domain adaptive visual object tracking under adverse weather conditions has garnered significant attention in recent years. Despite their impressive performance, existing methods rely heavily on large-scale video frames from both source and target domains, which is impractical under rigid resource constraints where source data is unavailable. To overcome this limitation, we propose SFDATrack, a generalized source-free domain adaptive tracker that leverages only adverse weather samples from the target domain for robust state estimation. Specifically, SFDATrack first employs a mean-teacher backbone with Dual Interactive Mamba (DIM) blocks to distill the candidate target tokens that are resilient to weather variations from classified, augmented samples. Afterwards, we introduce a hyperspherical prototype projection (HPP) module to project these tokens onto multi-domain prototypes within a latent hyperspherical space. By enforcing both domain-specific and domain-invariant properties of the multi-domain prototypes, SFDATrack can be seamlessly adapted to diverse weather conditions with powerful generalizability. Extensive experiments on various benchmarks demonstrate that SFDATrack achieves superior performance compared to state-of-the-art approaches. The code is available at this https URL.

118. 【2607.00357】Personalized Object Identification and Localization via In-Context Inference with Vision-Language Models

Link: https://arxiv.org/abs/2607.00357

Authors: Kensuke Nakamura, Byung-Woo Hong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: object, target object label, bounding-box annotations, target object, object instance

Comments:

Abstract:Personalized object localization (POL) localizes an object instance in a query image based on a few reference images with bounding-box annotations and a target object label. The pioneering method, IPLoc, solves this task through in-context inference with vision-language models (VLMs). However, it assumes that the query image always contains the target object. This assumption severely limits its applicability to real-world scenarios with many irrelevant images. To address this issue, we formulate a new task, personalized object identification and localization (POIL), by positioning POL within the broader few-shot object detection framework. POIL aims to localize the target object instance while rejecting query images that do not contain the reference object instance. We also present POIL datasets constructed from public sources. We further propose an in-context algorithm named IPLoc-ID for solving POIL with VLMs. IPLoc-ID first predicts a candidate bounding box and then determines whether it corresponds to the reference object instance. We introduce a self-posed query to connect these two steps within a single autoregressive generation framework. Through ablation studies and comprehensive experiments, we show that IPLoc-ID substantially suppresses false-positive detections on negative query images while maintaining localization performance comparable to IPLoc. Overall, IPLoc-ID effectively addresses the practical instance-level POIL task, which cannot be sufficiently solved by conventional object detection, few-shot object detection, or the localization-only IPLoc method.

119. 【2607.00338】DroneFINE: Domain-Aware Parameter-Efficient Fine-Tuning of Vision-Language Detectors for Drone Images

Link: https://arxiv.org/abs/2607.00338

Authors: Ke Wu, Yanan Zhang, Yingjie Gao, Wenhao Li, Chenyu Zhou, XinZhu Ma, Jiaxin Chen, Di Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unmanned Aerial Vehicles, highly challenging task, Aerial Vehicles, Unmanned Aerial, universal object detection

Comments: Accepted by ECCV 2026

Abstract:Object detection for Unmanned Aerial Vehicles (UAVs) working in open and dynamic environments is a highly challenging task. While Vision-Language Models (VLMs) have offered a powerful solution for universal object detection, adapting them to UAV scenarios remains non-trivial due to a substantial domain gap between VLM pre-training data and aerial imagery. The prevailing Parameter-Efficient Fine-Tuning (PEFT) methods prove ineffective in bridging this gap, as VLMs' "natural-scene, foreground-dominant" visual priors misalign with the "bird's-eye-view, background-dominant, small-object" characteristics of UAV data. To address this issue, we propose DroneFINE, a novel PEFT paradigm comprising two domain-aware complementary modules tailored for VLM-based drone image detectors. Specifically, a data-dependent, foreground-aware, and multi-path adaptation mechanism named HyperAdapter is designed, which overcomes the static structural constraints of PEFT. In addition, a background suppression algorithm named SemanticGate is developed. It is a text-conditioned guidance strategy that employs background vocabulary to actively guide the model in suppressing responses from irrelevant regions. Extensive experiments on VisDrone and UAVDT demonstrate that DroneFINE significantly outperforms existing PEFT methods and achieves performance comparable to full fine-tuning while substantially reducing the number of trainable parameters.

120. 【2607.00321】CORGI: Consistency-Aware 3D Dog Reconstruction from a Single Image in the Wild

Link: https://arxiv.org/abs/2607.00321

Authors: Yuxiao Wu, Weile Li, Boyi Zhu, Yumeng Liu, Youcheng Cai, Ligang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly articulated animals, Reconstructing high-fidelity, models of highly, articulated animals, formidable challenge

Comments:

Abstract:Reconstructing high-fidelity 3D models of highly articulated animals, such as dogs, from a single in-the-wild image remains a formidable challenge. In this paper, we introduce CORGI, a novel framework for consistency-aware 3D dog reconstruction from a single unconstrained image that completely eliminates the need for 3D supervision. To overcome generative inconsistencies and the lack of multi-view capture, our pipeline introduces three core components. First, we propose a Canonical-Driven Orbital Generation (CDOG) strategy, utilizing specialized Canonical and Orbit LoRAs to normalize arbitrary input poses and synthesize reliable 360-degree video observations. Second, we design a Consistency-aware Deformable 3DGS (CA-3DGS) module that anchors on a D-SMAL prior, explicitly modeling per-view generative errors through dedicated neural deformation fields to learn accurate vertex-level displacements. Finally, to eliminate structural distortions and recover high-frequency details, we introduce a self-supervised Deformation-Conditioned Generative Repair (DCGR) module. Extensive experiments demonstrate that CORGI achieves state-of-the-art performance, generalizing seamlessly across diverse dog breeds to produce geometrically accurate, visually coherent, and fully animatable 3D assets ready for downstream applications.

121. 【2607.00319】Typography-Based Monocular Distance Estimation for Advanced Driver-Assistance Systems

Link: https://arxiv.org/abs/2607.00319

Authors: Manognya Lokesh Reddy, Zheng Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: adaptive cruise control, automated emergency braking, forward collision warning, distance, adaptive cruise

Comments: 23 pages, 11 figures

Abstract:Estimating the distance to a leading vehicle is a basic input to forward collision warning, adaptive cruise control, and automated emergency braking. Production systems obtain this distance from radar, laser scanners, or stereo camera pairs, which add cost, power draw, and packaging constraints. This paper asks whether a single ordinary camera can recover the same distance by using a target that is standardized in size and present on every road vehicle: the rear license plate. U.S. plates share a fixed outer size and a character height that is set by regulation and varies only narrowly between states, so the height of a plate character in the image is a direct measure of distance once the camera geometry is known. The proposed method (Typography-Based Monocular Distance Estimation) detects the plate, measures the height of its printed characters, identifies the issuing state to select the correct physical character height, and recovers distance from the camera projection. Three measurements taken from the same plate (the character height, the stroke width, and the character spacing), together with the spacing of the two mounting holes and a single-image depth network, are combined so that a weak or corrupted measurement is automatically given less weight. The distance, its rate of change, and a time-to-collision estimate are smoothed across frames and used to raise a warning with the timing used by U.S. collision-warning regulations. The same plate that anchors the scale also identifies the vehicle, so the method returns a distance, a bearing, and an identity from one passive sensor. It reads scale from a printed standard instead of from time of flight or parallax, making it a cheap, low-maintenance complement to those sensors in a fault-tolerant perception stack, and it achieves cost-effective distance estimation with an error below 0.13 m.
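
The geometric idea in this abstract can be sketched with the standard pinhole camera relation Z = f · H / h, where f is the focal length in pixels, H the physical character height, and h its height in the image. The focal length, the 0.07 m character height, the per-cue variances, and the inverse-variance fusion rule below are illustrative assumptions; the abstract does not state the paper's actual weighting scheme or calibration values.

```python
import math
import numpy as np

def distance_from_height(focal_px, real_height_m, pixel_height):
    """Pinhole model: Z = f * H / h."""
    return focal_px * real_height_m / pixel_height

def fuse_inverse_variance(estimates_m, variances):
    """Weight each estimate by 1/variance so noisy cues count less.

    This is a generic fusion rule used here as a stand-in.
    """
    estimates_m = np.asarray(estimates_m, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)
    return float(np.sum(weights * estimates_m) / np.sum(weights))

def time_to_collision(distance_m, closing_speed_mps):
    """TTC = distance / closing speed; infinite if the gap is not closing."""
    if closing_speed_mps <= 0:
        return math.inf
    return distance_m / closing_speed_mps

FOCAL_PX = 1400.0      # hypothetical camera focal length in pixels
CHAR_HEIGHT_M = 0.07   # hypothetical physical character height in meters

# Distance from the character-height cue alone.
z_char = distance_from_height(FOCAL_PX, CHAR_HEIGHT_M, pixel_height=9.8)

# Fuse with two other (made-up) cues; the third is very noisy.
fused = fuse_inverse_variance([z_char, 10.4, 14.0], [0.04, 0.16, 4.0])
ttc = time_to_collision(fused, closing_speed_mps=5.0)
```

Because the third estimate carries a large variance, it barely moves the fused distance, which mirrors the abstract's point that a weak or corrupted measurement should receive less weight.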

122. 【2607.00310】RetailSMV: Exocentric vs. Egocentric Adaptation of Foundation Video World Models in Retail

Link: https://arxiv.org/abs/2607.00310

Authors: Amirreza Rouhi, Rajat Aggarwal, Parikshit Sakurikar, Anoop M. Namboodiri, Sashi P. Reddi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: real-world deployment domains, Foundation video diffusion, internet-scale generic video, generic video leaves, foundation video world

Comments:

Abstract:Foundation video diffusion models are increasingly viewed as world simulators for embodied agents, yet their pretraining on internet-scale generic video leaves them poorly aligned with real-world deployment domains. We study parameter-efficient adaptation of a pretrained foundation video world model to retail scenes: when synchronized egocentric and exocentric video of the same activity are available, which viewpoint of training data produces the strongest adapted model? We introduce RetailSMV (Retail Synchronized Multi-View), a corpus of 32,105 captioned retail clips from five supermarkets with synchronized ego/exo capture from the store-staff perspective (stocking, arranging, weighing, managing supply carts, scanning at checkout), rather than the customer-centric framing of prior retail video corpora, and train three matched Low-Rank Adaptation (LoRA) configurations of Cosmos3-Nano (egocentric-only, exocentric-only, combined) under identical hyperparameters. On a 200-clip held-out test set evaluated with seven complementary metrics under a strict paired statistical protocol, exocentric-only adaptation matches or exceeds combined adaptation on six of seven point estimates and is significantly better on LPIPS, PSNR, and DreamSim, despite training on only 15,985 exocentric clips (versus 32,105 for combined). A symmetric paired comparison further shows that adding exocentric data to egocentric-only training helps while adding egocentric data to exocentric-only training hurts. The absolute adaptation gap is largest at the shortest rollout time, identifying the near-horizon prediction window as the regime in which adaptation is most beneficial.

123. 【2607.00302】Wake up for Touch! Mask-isolated Tactile Alignment Learning in MLLMs

Link: https://arxiv.org/abs/2607.00302

Authors: Yoonhyung Park, Minji Kim, Sungwon Moon, Jiyoung Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Robotics (cs.RO)

Keywords: intrinsic material properties, physical grounding needed, perceive intrinsic material, Touch supplies, material properties

Comments: ECCV 2026, Project page: http://mmai.ewha.ac.kr/splash/

Abstract:Touch supplies the physical grounding needed to perceive intrinsic material properties, such as friction and compliance, that vision alone often cannot resolve. Recent efforts to equip multimodal LLMs with this tactile sense, however, expose a zero-sum trade-off: the limited parameter budget of compact models forces a choice between acquiring the new sensory modality and preserving the established vision-language reasoning. We present Splash, a mask-isolated tactile alignment learning framework for MLLMs. Splash quantifies the significance of each pretrained parameter and partitions the parameter space into a dormant and a critical subspace. While the frozen critical subspace acts as a stable anchor to safeguard general visual knowledge, Splash updates the isolated dormant subspace to internalize tactile alignment towards LLMs. This selective update prevents catastrophic forgetting and keeps the modality expansion non-destructive. Extensive experiments show that Splash achieves tactile reasoning without additional inference overhead in the LLM part, demonstrating state-of-the-art performance on visuo-tactile benchmarks, including SSVTP, TVL, and TacQuad, while preserving its original general-purpose capabilities.

124. 【2607.00296】Learning When to Listen: Gated Affect Fusion for Human Motion Prediction

Link: https://arxiv.org/abs/2607.00296

Authors: Jingni Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: real-world videos remains, videos remains challenging, remains challenging due, noisy multimodal observations, unconstrained real-world videos

Comments:

Abstract:Human motion forecasting in unconstrained real-world videos remains challenging due to the ambiguity of future behaviors and the presence of noisy multimodal observations. While facial affect potentially provides complementary behavioral cues, its practical utility and mechanistic boundaries within motion forecasting frameworks remain poorly understood. In this work, we present a systematic study investigating the utility and temporal limitations of affect-conditioned forecasting in-the-wild. We establish a rigorous multimodal pipeline combining MediaPipe body pose trajectories with HSEmotion facial affect representations, and introduce the Gated Affect Transformer (GAT) to dynamically regulate cross-modal information flow. Through extensive multi-horizon evaluations under a strict subject-wise protocol, we demonstrate that naive early cross-modal concatenation consistently degrades forecasting accuracy relative to pose-only baselines. Conversely, our proposed gating mechanism stabilizes cross-modal integration by adaptively controlling the affective stream. Crucially, controlled counterfactual experiments using shuffled and randomized affect inputs reveal that the learned gate successfully suppresses unstructured cross-modal noise while remaining responsive to plausible affective signals. Furthermore, our empirical results indicate that facial affect features provide bounded, horizon-dependent predictive cues strictly within short-to-medium windows (e.g., 30 frames), whereas long-term trajectories remain predominantly governed by intrinsic kinematic continuity. Our findings provide empirical evidence that facial affect should be regarded as a complementary behavioral cue rather than a dominant driver of future motion, offering practical guidance for selective multimodal fusion in unconstrained human motion forecasting.
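
To make the contrast between naive concatenation and gated fusion concrete, here is a minimal sketch of a sigmoid gate that decides, per frame and per feature, how much of the affect stream is added to the pose stream. The single-layer gate, the weight shapes, and every name below are assumptions for illustration; the paper's Gated Affect Transformer is a full Transformer model and is not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(pose, affect, w_gate, b_gate, w_affect):
    """Residual gated fusion: fused = pose + gate * proj(affect).

    The gate is computed from both streams, so it can close
    (values near 0) when the affect input looks uninformative.
    """
    gate_input = np.concatenate([pose, affect], axis=-1)
    gate = sigmoid(gate_input @ w_gate + b_gate)
    fused = pose + gate * (affect @ w_affect)
    return fused, gate

rng = np.random.default_rng(0)
T, d_pose, d_affect = 5, 6, 3          # frames, pose dims, affect dims (toy sizes)
pose = rng.normal(size=(T, d_pose))
affect = rng.normal(size=(T, d_affect))
w_gate = rng.normal(size=(d_pose + d_affect, d_pose)) * 0.1
w_affect = rng.normal(size=(d_affect, d_pose)) * 0.1

# Neutral bias: the gate is partially open.
fused, gate = gated_fusion(pose, affect, w_gate, np.zeros(d_pose), w_affect)

# Strongly negative bias: the gate closes and the output falls back to pose only.
closed, closed_gate = gated_fusion(pose, affect, w_gate, np.full(d_pose, -50.0), w_affect)
```

With the gate closed the model reduces to the pose-only baseline, which is the property that lets a learned gate suppress noisy affect inputs instead of being degraded by them.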

125. 【2607.00293】Rosetta: Composable Native Multimodal Pretraining

Link: https://arxiv.org/abs/2607.00293

Authors: Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong, Liefeng Bo, Ping Tan

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Achieving true artificial, true artificial general, artificial general intelligence, general intelligence requires, Achieving true

Comments:

Abstract:Achieving true artificial general intelligence requires foundation models capable of integrating new modalities without forgetting prior knowledge. However, accommodating continuous generative objectives alongside discrete understanding tasks causes severe gradient conflicts. Existing architectures, including standard Mixture-of-Experts (MoE), are highly susceptible to representation overwriting. Even structurally partitioned paradigms like Mixture-of-Transformers (MoT) remain vulnerable to catastrophic forgetting, severely impeding multimodal scalability. In this work, we introduce Rosetta, a composable native multimodal pretraining framework designed for seamless and non-destructive modality expansion. Rosetta adopts a modular paradigm where core foundational knowledge is preserved within global shared experts, while modality-specific capabilities are distributed across plug-and-play experts. To guarantee non-destructive composition, we propose Momentum-Anchored Orthogonal Projection (MAOP). MAOP leverages the optimizer's momentum state as an implicit semantic anchor, selectively neutralizing conflicting gradient components from new modalities while preserving synergistic updates. Extensive evaluations demonstrate that, while standard MoE and MoT architectures suffer catastrophic forgetting of previously acquired knowledge, Rosetta robustly preserves established language and visual understanding. Furthermore, it delivers superior image generation and unlocks cross-modal synergy, paving the way for truly composable and unified multimodal foundation models. To facilitate further multimodal research, we release our code and checkpoints to the community. Project page at this https URL.
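
One way to picture "neutralizing conflicting gradient components" relative to a momentum anchor is the projection used in gradient-surgery methods such as PCGrad: when a new gradient points against the anchor direction, its component along that direction is removed. The sketch below implements that generic rule on flat vectors. It is an assumption about what such a projection could look like, not the paper's MAOP update, whose exact form the abstract does not give.

```python
import numpy as np

def project_out_conflict(grad, momentum, eps=1e-12):
    """Remove the part of `grad` that opposes `momentum`.

    If <grad, momentum> >= 0 the update is considered synergistic and
    is returned unchanged. Otherwise grad is projected onto the
    hyperplane orthogonal to momentum.
    """
    dot = float(np.dot(grad, momentum))
    if dot >= 0.0:
        return grad.copy()
    scale = dot / (float(np.dot(momentum, momentum)) + eps)
    return grad - scale * momentum

momentum = np.array([0.0, 1.0])          # stand-in for the optimizer's momentum state
conflicting = np.array([1.0, -2.0])      # opposes the anchor along the second axis
synergistic = np.array([1.0, 2.0])       # agrees with the anchor

fixed = project_out_conflict(conflicting, momentum)
kept = project_out_conflict(synergistic, momentum)
```

After projection the conflicting gradient keeps its first component but no longer pushes against the anchor direction, while the synergistic gradient passes through untouched.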

126. 【2607.00289】OnPoint: Offline-to-Online Multi-Level Distillation for Point-Supervised Online Temporal Action Localization

Link: https://arxiv.org/abs/2607.00289

Authors: Sakib Reza, Gauri Jagatap, Mohsen Moghaddam, Octavia Camps, Andrea Fanelli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Temporal Action Localization, Action Localization, Point-Supervised Online TAL, typically relies, limiting scalability

Comments: Accepted at ECCV 2026

Abstract:Temporal Action Localization (TAL) typically relies on segment annotations or offline access to full videos, limiting scalability and online use. We introduce Point-Supervised Online TAL (POTAL), which localizes actions in streaming videos using only one temporal point per instance. To solve POTAL, we propose OnPoint, an offline-to-online multi-level distillation framework that transfers knowledge from a point-supervised offline teacher to an online student via (i) pseudo-segment instance distillation, (ii) class-activation sequence distillation, and (iii) anticipatory window-level distillation. We further improve robustness by incorporating the original point labels into student training and by refining anchor decoding with actionness-guided attention calibration. Experiments on five datasets show OnPoint consistently outperforms strong baselines, establishing a solid foundation for POTAL.

127. 【2607.00283】What's Hidden Matters: Identifying Planning-Critical Occluded Agents using Vision-Language Models

Link: https://arxiv.org/abs/2607.00283

Authors: Amirhosein Chahe, Tyler Naes, Jovin D'sa, Faizan M. Tariq, Sangjae Bae, Lifeng Zhou, David Isele

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: safely navigate complex, navigate complex environments, safely navigate, navigate complex, complex environments

Comments: Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026). 9 pages, 5 figures, 5 tables

Abstract: Autonomous vehicles must safely navigate complex environments where planning-critical agents may be hidden from view. Current approaches often treat all occlusions with uniform conservatism, yielding needlessly defensive driving, or they infer hidden spaces without estimating the impact on the planner. This work bridges the critical gap between perception and planning by enabling Vision-Language Models (VLMs) to identify and reason about the specific hidden agents that are most critical to the ego-vehicle's trajectory. We introduce a novel framework that uses Planning KL-divergence (PKL), an information-theoretic metric, to systematically identify and rank occluded agents based on their impact on the ego vehicle's plan. Using this planning-aware ranking, we employ an expert VLM (GPT-5) to generate rich, structured annotations that capture the visual evidence and reasoning required for this task. We apply this framework to the nuScenes dataset to create a new benchmark focused on high-impact scenarios. We conduct comprehensive experiments on a wide range of general-purpose and domain-adapted VLMs, demonstrating that fine-tuning on our PKL-guided data yields dramatic performance improvements across all models. Notably, our results show that smaller, fine-tuned models significantly outperform their much larger zero-shot counterparts, and that our PKL-guided data selection strategy improves performance by approximately 30% over random sampling. Our work presents the first systematic approach for training VLMs to focus on planning-critical occlusions, enabling more semantically grounded and efficient risk assessment in autonomous driving.
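
To make the ranking idea concrete, the sketch below scores each occluded agent by the KL divergence between the planner's trajectory distribution with all agents present and the distribution with that agent removed. It is only an illustration of a generic KL-based ranking: the exact PKL formulation is not given in the abstract, and the agent names and probabilities are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical planner distribution over three candidate trajectories
# when every agent, including the occluded ones, is accounted for.
plan_full = [0.7, 0.2, 0.1]

# Hypothetical distributions after removing one occluded agent at a time.
plan_without = {
    "pedestrian": [0.2, 0.5, 0.3],
    "parked_car": [0.68, 0.21, 0.11],
}

# A larger divergence means the agent changes the plan more.
scores = {agent: kl_divergence(plan_full, q) for agent, q in plan_without.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these made-up numbers the pedestrian ranks first, because removing it shifts the plan far more than removing the parked car.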

128. 【2607.00277】AEGIS: A Multi-Task Joint-Embedding Predictive Architecture for Mammography

Link: https://arxiv.org/abs/2607.00277

Authors: Scott Chase Waggener, Sai Karthik Navuluru, Lakshman Tamil

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: joint-embedding predictive architecture, present Aegis, Vision Transformer variants, assessment in mammography, breast cancer detection

Abstract:We present Aegis, a joint-embedding predictive architecture for breast cancer detection and density assessment in mammography. We train three Vision Transformer variants (Small/Base/Large) using self-supervised joint-embedding predictive architecture (JEPA) pre-training on 71,103 studies from 14 clinical sites, followed by supervised fine-tuning with progressive resolution scaling up to 2048x1536. On a curated 785-study test set, our largest model achieves area under the receiver operating characteristic curve (AUC) 0.949 for breast cancer triage with 93% sensitivity and 75% specificity at the optimal operating point. An ensemble combining our model with a U.S. Food and Drug Administration-cleared baseline further improves discrimination to 0.952 AUC. For breast density classification, the model achieves 0.953 AUC for binary (dense vs. non-dense) classification and 62.6% exact accuracy across four Breast Imaging Reporting and Data System (BI-RADS) categories, with 98.8% adjacent accuracy comparable to reported human inter-reader agreement. External validation on the public VinDr-Mammo dataset provides evidence of cross-population transfer under a different reference standard, with the largest model achieving 0.871 AUC for triage in a zero-shot setting.

129. 【2607.00273】MVDGC: Joint 3D and 2D Multi-view Pedestrian Detection via Dual Geometric Constraints

Link: https://arxiv.org/abs/2607.00273

Authors: Thinh Phan, Hao Vo, Khoa Vo, Thanh Ngo, Cuong Pham, Ngan Le

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: robust occlusion reasoning, Bird Eye View, BEV, occlusion reasoning, core challenge

Abstract: The core challenge in multi-view pedestrian detection (MVPD) lies in effective aggregation of visual features from different viewpoints for robust occlusion reasoning. Recent approaches have addressed this by first projecting image-view features onto a Bird's Eye View (BEV) map, where ground localization is then performed. Despite impressive performance, the perspective transformation induces severe distortion, breaking spatial structure and degrading the quality of object feature extraction. The blurred and ambiguous features hinder accurate BEV point localization, especially in densely populated regions. Moreover, the strong mutual relationship between the BEV ground point and image bounding boxes is not capitalized on. Although multi-view consistency of 2D detections can serve as a powerful constraint in BEV space, these detections are commonly treated as auxiliary signals rather than being jointly optimized with the primary task. In this work, we propose MVDGC, a unified framework that jointly estimates pedestrian locations on the BEV plane and 2D bounding boxes in image views. MVDGC employs a sparse set of 3D cylindrical queries that embraces geometric context across both BEV and image views, enforcing dual spatial constraints for precise localization. Specifically, the geometric constraints are established by modeling each pedestrian as a vertical cylinder whose center lies on the BEV plane and whose projection casts a rectangular box in the image views. These queries function as shape anchors that directly extract 2D features from the intact image-view features using camera projection, eliminating projection-induced distortions. The 3D cylindrical query enables the unification of BEV and image-view (ImV) localization into a single task: 3D cylinder position and shape refinement. Code is available at: this https URL

130. 【2607.00259】Multi-Hypothesis Test-Time Adaptation to Mitigate Underspecification

Link: https://arxiv.org/abs/2607.00259

Authors: Afshar Shamsi, Xiao-Yu Guo, Hamid Alinejad-Rokny, Arash Mohammadi, Damien Teney, Ehsan Abbasnejad

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: unlabeled target data, improve model robustness, seeks to improve, target data, improve model

Comments: 26 pages, 4 figures, 12 tables, Accepted in ECCV'26

Abstract:Test-Time Adaptation (TTA) seeks to improve model robustness under distribution shifts by adapting parameters using unlabeled target data. However, in the absence of supervision, entropy-based adaptation is fundamentally underconstrained: multiple distinct parameter updates can achieve similarly low entropy while inducing drastically different decision boundaries. This phenomenon, known as underspecification, renders standard TTA brittle and prone to collapse into spurious modes. In this work, we reinterpret TTA through a posterior-inspired lens induced by entropy minimization, where low-entropy solutions define a pseudo-likelihood over parameters. Instead of committing to a single point estimate, we introduce a particle-based diversification framework that explores multiple plausible adaptation trajectories simultaneously. Our method can be viewed as a structured exploration of multiple plausible adaptation solutions, implemented through multi-level diversification at the output, parameter, optimizer, and input levels. Crucially, the framework acts as a plug-and-play wrapper compatible with existing TTA methods. Extensive experiments on challenging benchmarks demonstrate consistent gains in stability and robustness, achieving improvements of 3-4% under mixed shifts, 2-3% with batch size one, and 1-2.5% under label shifts, outperforming state-of-the-art baselines. Our results suggest that treating TTA as a multi-hypothesis inference problem, rather than a single-point optimization task, is key to mitigating underspecification and enabling reliable real-world deployment.
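
The underspecification argument can be illustrated with a toy example: two hypothetical adapted models output equally confident predictions on the same unlabeled input, so an entropy objective cannot tell them apart, yet they choose different classes. The logits below are invented for illustration and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Two hypothetical adaptation outcomes for the same target sample.
hypothesis_a = softmax([3.0, 0.0])  # confident in class 0
hypothesis_b = softmax([0.0, 3.0])  # equally confident in class 1

entropy_a = entropy(hypothesis_a)
entropy_b = entropy(hypothesis_b)
predicted_a = hypothesis_a.index(max(hypothesis_a))
predicted_b = hypothesis_b.index(max(hypothesis_b))
print(entropy_a, entropy_b, predicted_a, predicted_b)
```

Both hypotheses reach the same entropy while disagreeing on the label, which is the situation that motivates keeping several adaptation hypotheses rather than a single point estimate.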

131. 【2607.00251】Leveraging Phase Information to Boost Unrolled Network Learning for Image Deblurring

Link: https://arxiv.org/abs/2607.00251

Authors: Samira Malek, Haichuan Zhang, Chul Lee, Vishal Monga

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: techniques directly restore, accurate phase estimation, spatial image variable, phase decomposition recognizing, deblurring techniques directly

Abstract: While most image deblurring techniques directly restore the spatial image variable, we propose an amplitude and phase decomposition, recognizing the importance of accurate phase estimation in recovering sharp image details. To that end, we first develop novel linear minimum mean squared error (LMMSE) estimators of the amplitude and phase of the blurred, noisy image observation. An iterative optimization algorithm follows that recovers the sharp image using these LMMSE estimators. Finally, matrix parameters that are statistically determined and fixed in the iterative algorithm are instead learned using a training dataset of clean and degraded observations. Our deblurring engine is dubbed UPADNet (Unrolled Phase and Amplitude Decomposition Network), such that each iteration of the underlying phase and amplitude recovery algorithm is parameterized and trained end-to-end. Experiments on benchmark evaluation datasets such as GoPro, RealBlur, and COCO confirm that UPADNet outperforms state-of-the-art deep networks, including those based on algorithm unrolling in the image domain. The benefits of UPADNet are even more pronounced in high-noise and limited-training-data regimes.

132. 【2607.00250】LV-ROVER: Multi-Stream Tesseract Voting for Maltese Paragraph OCR

Link: https://arxiv.org/abs/2607.00250

Authors: Adam Darmanin

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: real labelled PDF, labelled PDF corpus, large OCR benchmarks, pretrained language models, decent text corpora

Comments: 8 pages, 1 figure, 3 tables. System paper for the DocEng 2026 Maltese Paragraph OCR Competition

Abstract: Maltese has decent text corpora and pretrained language models but, like many languages outside the handful with large OCR benchmarks, it has only a single known real labelled PDF corpus for OCR training. At 57 pages, that corpus is far below what paragraph-level training needs, so Maltese is low-resource for OCR specifically. With no real corpus to train on at scale, we built a synthetic training pipeline and a 5-stream Tesseract LV-ROVER ensemble, and report results on a 422-paragraph benchmark against a fine-tuned-Tesseract baseline of character error rate (CER) 0.0234. Ensemble recognition alone improves CER by 44 percent, to 0.01317; a five-stage post-processing chain brings the full pipeline to CER 0.00700, a 70 percent reduction. Most of that chain is typographic normalisation, but one stage recovers misread diacritics rather than aligning punctuation, so we report it as a recognition gain rather than folding the whole chain under one label. We treat the 44 percent figure as the portable estimate of what the recogniser learned, and the 70 percent figure as specific to this benchmark's label convention.
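
The two percentages quoted in the abstract are relative reductions in CER against the 0.0234 baseline. The short check below recomputes them from the three CER values given above. The helper name is ours, chosen for illustration.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers an error rate relative to `baseline`."""
    return (baseline - improved) / baseline


BASELINE_CER = 0.0234   # fine-tuned Tesseract baseline
ENSEMBLE_CER = 0.01317  # ensemble recognition alone
PIPELINE_CER = 0.00700  # after the five-stage post-processing chain

ensemble_gain = relative_reduction(BASELINE_CER, ENSEMBLE_CER)
pipeline_gain = relative_reduction(BASELINE_CER, PIPELINE_CER)
print(f"ensemble: {ensemble_gain:.1%}, full pipeline: {pipeline_gain:.1%}")
```

Rounded to whole percentages, the results match the 44 percent and 70 percent figures reported in the abstract.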

133. 【2607.00223】Does Your ViT Still Need U-Net for Segmentation?

Link: https://arxiv.org/abs/2607.00223

Authors: Xin Li, Wenhui Zhu, Xuanzhao Dong, Xiwen Chen, Yanxi Chen, Yujian Xiong, Hao Wang, Oana M. Dumitrascu, Yalin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Medical image segmentation, Medical image, image segmentation, segmentation, Medical

Comments: 8 pages, 4 figures

Abstract:Medical image segmentation is dominated by U-Net-style encoder-decoder architectures. Vision Transformers (ViTs) overcome the limited receptive field of convolutional networks through self-attention, enabling modeling of long-range dependencies. Early ViT-based segmentation methods typically retained U-Net-style decoders because pretrained ViT representations were insufficient to support accurate dense prediction. Recent advances in large-scale pretraining have redefined the representation capability of ViTs, reducing the reliance on U-Net-style decoder architectures in modern vision models. This prompts two questions: Is the U-Net paradigm still necessary for medical image segmentation? If not, how should an encoder-only segmentation framework be designed? Motivated by these questions, we explore key architectural choices for encoder-only medical image segmentation based on modern ViT backbones and establish a query-based encoder-only design with multi-level query modeling and learnable block fusion, realized in Encoder-only Segmentation (EoSeg). Extensive experiments across seven benchmark datasets spanning CT, MRI, histopathology, endoscopy, and dermoscopy validate the effectiveness of the proposed design across diverse medical imaging modalities, including mDice scores of 85.50% on Synapse, 91.73% on ACDC, and 93.27% on GlaS. The results demonstrate that a U-Net-style decoder is no longer necessary for medical image segmentation with modern ViT backbones and further show that EoSeg provides an effective encoder-only design. Code is available at: this https URL

134. 【2607.00218】EgoSafetyBench: A Diagnostic Egocentric Video Benchmark for Evaluating Embodied VLMs as Runtime Safety Guards

Link: https://arxiv.org/abs/2607.00218

Authors: Siddhant Panpatil, Arth Singh, Mijin Koo, Chaeyun Kim, Haon Park, Dasol Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: homes and factories, Vision-language models, proposed as runtime, embodied agents, agents in homes

Abstract: Vision-language models (VLMs) are now proposed as runtime safety guards for embodied agents in homes and factories. A deployable guard must catch genuinely unsafe situations while avoiding unnecessary intervention on routine but superficially alarming activity, a distinction that binary safety benchmarks obscure. We introduce EgoSafetyBench, an egocentric video benchmark of 1,200 robot-view scenarios annotated at half-second granularity, to evaluate VLMs as streaming guards across two tracks. The situational track (800 scenarios) spans four families, from routine and safe-but-suspicious scenes to obvious and contextual hazards. The visual-channel track (400 scenarios) targets in-scene text (a sign, sticker, or label visible in the scene) that can misrepresent the physical situation, pairing each misleading sign with a truthful version to test both whether a guard flags the text as misleading and whether the text corrupts its physical-safety judgment. Both tracks use contrastive ladders: near-identical scenarios differing only in a single visible deciding cue, so a correct call must hinge on that cue rather than the overall scene type. We evaluate ten open- and closed-source VLMs. We find that while guards reliably recognize videos containing hazards, they often miss specific hazardous moments, particularly contextual hazards. Furthermore, misleading in-scene signs degrade all tested guards: vulnerable models miss up to a third of hazards, while robust models over-intervene on safe content. Matched controls reveal that apparent safety robustness often reflects indiscriminate alarming rather than true physical reasoning.

135. 【2607.00201】Trust the Prior (or Not): Uncertainty-Aware Abdominal Aortic Aneurysm Segmentation

Link: https://arxiv.org/abs/2607.00201

Authors: Erich Robbi, Daniele Ravanelli, Andrea Passerini

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Abdominal Aortic Aneurysm, Aortic Aneurysm, surrounding non-enhanced tissues, Abdominal Aortic, heterogeneous thrombus features

Comments: 12 pages, 4 figures

Abstract:Robust segmentation of intraluminal thrombus is critical for risk assessment in Abdominal Aortic Aneurysm, yet it remains challenging due to heterogeneous thrombus features and low contrast with surrounding non-enhanced tissues. Domain shifts induced by different Computed Tomography Angiography (CTA) protocols further inhibit multi-center generalization of deep learning models. To address these challenges, we propose a patient-specific framework that integrates discriminative learning with anatomically informed priors. Our approach introduces two key components: (1) a patient-specific intensity normalization based on a Gaussian Mixture Model of local anatomy, and (2) an Uncertainty-Gated Anatomical Attention module that incorporates spatial priors while adaptively modulating their influence according to voxel-wise confidence. This design allows for anatomical guidance in ambiguous regions while suppressing unreliable priors. The proposed method achieves state-of-the-art performance on in-distribution test data and substantially outperforms existing alternatives in generalization to external multi-center CTA data, while remaining interpretable through an explicit separation of visual and anatomical evidence.

136. 【2607.00191】HydraCollab: Adaptive Collaborative-Perception for Distributed Autonomous Systems

Link: https://arxiv.org/abs/2607.00191

Authors: Luke Chen, Cheng-Ju Wu, David R. Martin, Qilin Ye, Pramod Khargonekar, Mohammad Abdullah Al Faruque

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: enhance situational awareness, enables multi-robot systems, sharing perceptual information, Collaborative-perception enables multi-robot, enables multi-robot

Comments: Accepted at IROS 2026

Abstract:Collaborative-perception enables multi-robot systems to enhance situational awareness by sharing perceptual information. Existing collaborative-perception systems face an inherent trade-off between communication bandwidth requirements and perception accuracy, where methods that exchange more information achieve better perception results at the cost of increased communication overhead. However, real-world communication networks impose bandwidth constraints that require minimizing communication overhead without sacrificing perception performance. To address this challenge, we propose HydraCollab, an adaptive collaborative-perception framework that (i) selectively transmits the most informative sensor features and (ii) dynamically employs collaboration strategies (intermediate or late) based on spatial confidence maps. Extensive evaluations on the V2X-R, V2X-Radar and UAV3D-mini datasets demonstrate that HydraCollab achieves the best overall trade-off between accuracy and communication cost among existing collaborative-perception methods. Relative to SOTA Where2comm, HydraCollab uses only 41% of the bandwidth on V2X-R and 26% on V2X-Radar while improving performance by 0.78% and 0.75% respectively. Our code and models are available at this https URL.

137. 【2607.00189】VOCA: Visual Odometry with Codec Awareness

Link: https://arxiv.org/abs/2607.00189

Authors: Nouri Alexander Hilscher, Mateo de Mayo, Dominik Muhle, Christoph Otten genannt Hermes, Daniel Cremers

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Camera pose estimation, spatial world models, Camera pose, planning and decision-making, pose estimation

Comments: Accepted to ECCV 2026

Abstract:Camera pose estimation from image streams is a critical component of spatial world models that integrate perception into planning and decision-making. Nearly all Visual Odometry (VO) and Simultaneous Localization and Mapping (V-SLAM) systems have focused on datasets containing raw, uncompressed videos. Many working systems instead use ubiquitous hardware units to efficiently compress and decode video streams, saving orders of magnitude in storage and bandwidth. However, this lossy compression introduces visual artifacts that hinder the performance of traditional tracking systems. We present VOCA, a causal stereo visual-odometry method that exploits codec information to improve tracking performance. We achieve state-of-the-art performance on causal VO for relative trajectory error, efficiency, and absolute trajectory error on compressed streams. This work highlights the potential of leveraging widely available video codec information for vision tasks.

138. 【2607.00183】DriftScope: Measuring The Hidden Effects of Diffusion Model Adaptation

Link: https://arxiv.org/abs/2607.00183

Authors: Héctor Laria, Yiping Han, Julian D. Santamaria, Kai Wang, Bogdan Raducanu, Joost van de Weijer, Alexandra Gomez-Villa

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Adapting pre-trained, FID and KID, erase unwanted, routinely evaluated, intended effects

Comments: 22 pages, 5 figures, Accepted at ECCV 2026

Abstract:Adapting pre-trained text-to-image diffusion models, whether to learn new visual concepts or erase unwanted ones, is routinely evaluated on its intended effects alone. We argue this framing is incomplete. Through sparse autoencoder analysis and zero-shot classification, we demonstrate that adaptation systematically damages semantically unrelated concepts in ways that aggregate metrics structurally cannot surface: when damage is severe enough for FID and KID to respond, the model is already nearly unusable; when the model remains functional, FID and KID stay flat while specific classes silently suffer worst-case zero-shot accuracy drops of up to 18.9 points and concept-level distributions shift dramatically. This pattern appears at both ends of the adaptation spectrum (concept customization and concept unlearning), suggesting it is a systematic consequence of weight-level modification rather than an artifact of any particular method. To surface this hidden drift before deployment, we introduce DriftScope, a prompt-level diagnostic tool that takes any two model checkpoints and returns a ranked list of tokens whose visual concepts have shifted most between them. DriftScope optimizes a soft prompt to attribute drift at the token level without requiring access to real data or model internals. The result is an interpretable, concept-level audit that aggregate evaluation cannot provide.

139. 【2607.00176】PRISM-VO: Scale-Aware Visual Odometry Using Photometric Plenoptic Bundle Adjustment

Link: https://arxiv.org/abs/2607.00176

Authors: Aymeric Fleith, Julian Zirbel, Daniel Cremers, Niclas Zeller

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pure optimization-based sparse, optimization-based sparse photometric, sparse photometric visual, pure optimization-based, optimization-based sparse

Comments: Accepted for publication at the 19th European Conference on Computer Vision (ECCV) 2026

Abstract:We introduce PRISM-VO, a novel pure optimization-based sparse photometric visual odometry framework for focused plenoptic cameras. The core of PRISM-VO is a novel photometric plenoptic bundle adjustment which jointly optimizes camera poses and inverse depth values of points in a sliding window. By combining geometric depth from a single plenoptic image with temporal multi-view constraints, PRISM-VO achieves accurate and drift-resilient motion estimation. Through explicit modeling of the plenoptic projection, PRISM-VO provides reliable metric-scale reconstructions, overcoming the scale ambiguity of monocular SLAM algorithms. Importantly, our approach relies solely on a single plenoptic sensor and avoids complex initialization, as depth priors are computed directly from plenoptic imaging. Experiments show that PRISM-VO outperforms the current state-of-the-art plenoptic visual odometry method on indoor and outdoor scenes. The proposed approach rivals other optimization- and learning-based methods while accurately and reliably recovering a metric scale of the scene. Project page: this https URL

140. 【2607.00174】Steal the Patch Size: Adversarially Manipulate Vision-Language Models

Link: https://arxiv.org/abs/2607.00174

Authors: Kai Hu, Akash Bharadwaj, Weichen Yu, Matt Fredrikson

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: private vision-tokenizer configurations, input preprocessing pipeline, deployed vision-language models, recovers private vision-tokenizer, black-box model-stealing attack

Abstract: We present a black-box model-stealing attack that recovers private vision-tokenizer configurations of deployed vision-language models (VLMs), including the visual patch size and input preprocessing pipeline. The key idea is a task-level side channel induced by ViT-style patchification: when a synthetic grid image is aligned with the hidden patch grid, boundary cues are erased at tokenization, causing periodic accuracy drops. By sweeping the grid cell size and measuring these collapses, we infer the patch size; by introducing padding and a consistency-check test, we further identify whether preprocessing is dynamic- or fixed-resolution and recover the target resize resolution. Across open-source Qwen-VL variants and proprietary models including GPT and Claude, we reliably recover tokenizer-related parameters. Finally, we show that such leakage enables preprocessing-aware transfer attacks and model-targeted adversarial manipulation.

141. 【2607.00159】Identifying and Resolving Pitfalls of Knowledge-Based VQA Benchmarks: Auditing, Repairing, and Augmenting

Link: https://arxiv.org/abs/2607.00159

Authors: Qian Ma, S M Rayeed, Charles V. Stewart, Qiong Wu, Yao Ma

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)

Keywords: Visual Question Answering, Visual Language Models, external structured knowledge, existing KB-VQA benchmarks, Visual Language

Comments: Accepted to ECCV 2026. The datasets and code are available in [this https URL](https://github.com/VAN-QIAN/ECCV26-ARA)

Abstract: Knowledge-Based Visual Question Answering (KB-VQA) aims to evaluate whether Visual Language Models (VLMs) can retrieve, ground, and reason over external structured knowledge beyond visual evidence. In practice, answer accuracy is widely adopted as the primary evaluation metric, implicitly treating correctness as a proxy for knowledge-grounded reasoning. However, for existing KB-VQA benchmarks, this proxy relies on critical assumptions that are often overlooked and rendered unreliable by benchmark issues: the annotated answer must be derivable from the associated knowledge base, the question must be well-posed with sufficient constraints, and the visual setting must meaningfully require grounded disambiguation. In this work, we show that these assumptions are systematically violated in existing KB-VQA benchmarks. Our audit reveals substantial instances with missing or contradicted answers and underspecified questions that render accuracy a misleading metric. Furthermore, we find that existing datasets rely on visually trivial, single-entity scenes that bypass the need for sophisticated visual-to-knowledge mapping. We demonstrate that even with controlled architectures, these flaws lead to distorted model rankings and overestimations of reasoning capabilities. To address this, we introduce (1) a principled audit-and-repair protocol that restores answer derivability and question clarity, and (2) a controlled multi-entity augmentation protocol that introduces visual ambiguity to challenge initial retrieval and grounded reasoning. Re-evaluation under corrected and augmented settings yields markedly different performance trends. Our findings call for rethinking evaluation protocols and designing more interaction-aware KB-VQA benchmarks that prioritize verifiable reasoning over simple matching.

142. 【2607.00157】Progressive Pose-Guided 4D Animal Reconstruction from Monocular Video

Link: https://arxiv.org/abs/2607.00157

Authors: Siyuan Li, Weiying Chen, Yilin Wang, Xinxin Zuo, Xingyu Li, Li Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large inter-species variation, complex articulations, inter-species variation, reliable templates, challenging due

Comments: Accepted to ECCV 2026. Camera-ready author version

Abstract:Reconstructing 4D animals from monocular videos is challenging due to large inter-species variation, complex articulations, and the lack of reliable templates. Existing approaches typically rely on either strict category-specific priors that restrict generalization, or unconstrained generative models that sacrifice input fidelity. To bridge this gap, we present a progressive test-time optimization framework built on 3D Gaussian Splatting for high-fidelity 4D animal reconstruction from a single video. Our key insight is that a coarse shape prior suffices when coupled with a progressive strategy that disentangles articulated pose from non-rigid deformation. Specifically, we employ a symmetry-aware temporal encoding that exploits bilateral cues while absorbing camera estimation drift and a part-conditioned deformation mechanism guided by learnable part anchors and a learnable skinning field. Extensive experiments demonstrate that our approach generalizes robustly across diverse species, achieving superior geometric accuracy, temporal consistency, and visual fidelity compared to existing baselines, even under severe prior mismatch.

143. 【2607.00148】3D Point World Models: Point Completion Enables More Accurate Dynamics Learning

Link: https://arxiv.org/abs/2607.00148

Authors: Skand Peri, Hung Nguyen, Chanho Kim, Li Fuxin, Stefan Lee

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: potentially allowing robots, Learning predictive models, potentially allowing, allowing robots, robots to improvise

Comments: 21 pages

Abstract: Learning predictive models of the world enables robotic control through planning, potentially allowing robots to improvise solutions on new tasks. However, large video-based dynamics models lack explicit 3D spatial structure and suffer from geometrically inconsistent long-term rollouts with compounding errors. Emerging 3D dynamics models based on partial point clouds improve geometric consistency but remain sensitive to occlusions and accumulated prediction drift. To address these challenges, we present 3D Point World Models (3DPWM), a task-agnostic world model that operates entirely in 3D space by first completing partial point clouds and then learning action-conditioned dynamics in this completed 3D scene. By operating on completed geometry, 3DPWM enables reliable long-horizon rollouts and more accurate cost evaluation for model-based planning while supporting adaptation to new tasks. Experiments across different robotic embodiments and tabletop manipulation benchmarks demonstrate that 3DPWM achieves significantly more reliable long-horizon rollouts (100-300+ steps), supports both open-loop and closed-loop planning, and enables successful sim-to-real transfer.

144. 【2607.00144】A Mechanism-Driven Theory of Phase Transitions in Active Learning

Link: https://arxiv.org/abs/2607.00144

Authors: Julia Machnio, Mads Nielsen, Mostafa Mehdipour Ghazi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: heuristic label counts, datasets or architectures, typically defined, defined by heuristic, heuristic label

Comments: Accepted at ECCV 2026

Abstract:Active learning (AL) performance is known to be budget-dependent, yet regimes are typically defined by heuristic label counts that fail to generalize across datasets or architectures. We characterize AL dynamics by reframing budget regimes as shifts in the dominant generalization mechanism. By reinterpreting PAC-style risk components as dynamic interacting terms, we prove that dominance shifts are structurally unavoidable, creating a moving bottleneck for generalization. We operationalize this using measurable proxies and a segmented regression procedure to identify a tripartite taxonomy: data-driven, transition, and model-driven phases. Our framework explains the long-standing observation that representativeness, coverage, and uncertainty strategies excel at different stages. Experiments across natural and medical imaging show that AL efficiency depends on the alignment between the strategy's inductive bias and the active bottleneck. Moreover, self-supervised representation shift transitions earlier along the labeling trajectory, highlighting the role of representation quality in shaping AL dynamics. Overall, this work provides a unified framework for the next generation of transition-aware AL algorithms.
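
The segmented regression step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which proxies are measured or how the segments are fitted, so everything here is an assumption: a brute-force search over two breakpoints of a piecewise-linear least-squares fit, run on a made-up proxy curve. The names `line_sse` and `three_phase_split` are hypothetical.

```python
from itertools import combinations

def line_sse(xs, ys):
    """Sum of squared errors of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return sum((y - (my + slope * (x - mx))) ** 2 for x, y in zip(xs, ys))

def three_phase_split(xs, ys, min_pts=3):
    """Return indices (i, j) splitting the curve into [0:i], [i:j], [j:].

    Each segment gets its own straight-line fit; the split with the
    smallest total squared error wins. This is an illustrative stand-in,
    not the procedure used in the paper.
    """
    n = len(xs)
    best = None
    for i, j in combinations(range(min_pts, n - min_pts + 1), 2):
        if j - i < min_pts:
            continue  # keep at least min_pts points in the middle segment
        sse = (line_sse(xs[:i], ys[:i])
               + line_sse(xs[i:j], ys[i:j])
               + line_sse(xs[j:], ys[j:]))
        if best is None or sse < best[0]:
            best = (sse, i, j)
    return best[1], best[2]

# Toy proxy curve (made up): three regimes with different slopes.
budgets = list(range(12))
proxy = [0, 3, 6, 9, 24, 25, 26, 27, 50, 50, 50, 50]
print(three_phase_split(budgets, proxy))  # (4, 8)
```

On real proxy curves the breakpoints would be noisy, so some model-selection criterion for the number of segments would likely be needed as well.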

145. 【2607.00138】MG-SpaIR: Multi-grade Sparse-guided Implicit Representation for Training-Data-Free Image Restoration

Link: https://arxiv.org/abs/2607.00138

Authors: Jianmin Liao, Lei Huang, Ronglong Fang, Ashley Prater-Bennette, Lixin Shen, Yuesheng Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single observation corrupted, framework for restoring, mixture of blur, missing pixels, restoring a clean

Abstract:MG-SpaIR is a training-data-free framework for restoring a clean image from a single observation corrupted by a mixture of blur, downsampling, noise, and missing pixels. Building on implicit neural representations (INRs), we introduce a multi-grade coarse-to-fine residual hierarchy that progressively refines the reconstruction across resolution grades, improving representational fidelity and mitigating spectral limitations. To stabilize reconstruction optimization and suppress INR-induced artifacts, we further propose an explicit sparse proximal regularization (e.g., $\ell_0$-type) applied directly in the high-resolution image domain, which discourages spurious high-frequency patterns while preserving sharp structures. The resulting optimization is solved efficiently via a multi-grade proximal alternating scheme, and we establish convergence guarantees for the associated updates under standard regularity conditions. Experiments on mixed-degradation benchmarks demonstrate that MG-SpaIR consistently outperforms strong training-data-free baselines such as Deep Image Prior, providing a stable, interpretable, and data-efficient alternative to conventional learning-based restoration methods.
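
For readers unfamiliar with $\ell_0$-type proximal steps, the standard proximal operator of $\lambda\|x\|_0$ is hard thresholding: entries with magnitude above $\sqrt{2\lambda}$ are kept and the rest are set to zero. The sketch below shows only that textbook operator on a plain list of coefficients. Which quantity the paper sparsifies, and how the step is embedded in its multi-grade alternating scheme, is not stated in the abstract, so `prox_l0` is a hypothetical helper rather than the authors' implementation.

```python
import math

def prox_l0(v, lam):
    """Proximal operator of lam * ||x||_0.

    Solves argmin_x 0.5 * ||x - v||^2 + lam * ||x||_0 elementwise,
    which reduces to hard thresholding at sqrt(2 * lam).
    """
    thresh = math.sqrt(2.0 * lam)
    return [vi if abs(vi) > thresh else 0.0 for vi in v]

# With lam = 0.5 the threshold is 1.0: small entries are zeroed, large ones kept.
print(prox_l0([0.1, -2.0, 0.9, 1.1], lam=0.5))  # [0.0, -2.0, 0.0, 1.1]
```

Unlike soft thresholding (the $\ell_1$ case), surviving entries are not shrunk, which is consistent with the abstract's goal of preserving sharp structures.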

146. 【2607.00129】A Synthetic-Driven Vision System for Assembly Step Recognition

Link: https://arxiv.org/abs/2607.00129

Authors: Hui Zhang, Xuanang Lei, Rui Wang, Julian Ferchow, Mirko Meboldt

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ensuring production reliability, Quality control, preventing costly defects, production reliability, process is crucial

Comments: Accepted by CASE 2026

Abstract:Quality control in industrial assembly is essential, and real-time monitoring of the assembly process is crucial for preventing costly defects and ensuring production reliability. Vision-based automated inspection offers a powerful solution for such real-time monitoring. However, because industrial components and processes are specialized, training these models typically relies on task-specific real-world data, which is costly and labor-intensive to collect and annotate. In this paper, we propose a system that automatically generates realistic assembly sequences and then trains real-time inspection models on the synthetic data. It can be applied to a given task within an hour, requiring only CAD models and simple step descriptions. Focusing on practical challenges, our system integrates a physics-based motion generation module to capture the variation in how different people perform the assembly, uses domain-randomized rendering to handle environmental complexity and variation, and employs an object-detection-based step recognition module for robust sim-to-real transfer, leading to 92.4% accuracy on a real-world assembly case, with performance improvements of 46.7%, 15.8%, and 61.2%, respectively. Overall, our system provides a practical solution for industrial assembly inspection without requiring expensive real-world data collection and annotation, with its effectiveness validated on real industrial assembly tasks.

147. 【2607.00125】Decompose, Compare, and Decide: Multimodal LLMs are Implicit Few-Shot Learners

Link: https://arxiv.org/abs/2607.00125

Authors: Yunhan Wang, Eshika Khandelwal, Edson Araujo, Walid Bousselham, Nina Shvetsova, Hilde Kuehne

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated remarkable abilities

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable abilities when analyzing images, yet translating these capabilities to few-shot image classification remains challenging. To bridge this gap, we present DeCoDe, a simple yet effective technique that enables off-the-shelf MLLMs to act as strong few-shot classifiers without any additional training. Our approach builds on the idea of few-shot classification as a set of pairwise image comparisons, decomposing the task into a set of binary decisions. Given a query image and a support image from a candidate class, the MLLM is prompted to decide whether the two images depict the same class. The logit corresponding to an affirmative response is then used as a similarity score to assign the query image to the most likely class. While this already yields good results, we show that providing additional high-level information, such as the data domain, to the model further improves performance. Our evaluation provides an extensive analysis of various inference variants on a suite of twelve datasets, six established and six newly curated few-shot benchmarks spanning across diverse domains. The results show that the proposed simple decomposition technique can turn off-the-shelf MLLMs into powerful few-shot learners, significantly outperforming current state-of-the-art few-shot methods on both standard and novel domains. Code is available at this https URL.
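
The decomposition described above (score each query/support pair, then pick the best class) can be sketched in a few lines. In this sketch `same_class_score` is a hypothetical stand-in for the MLLM's affirmative-response logit, and averaging over several support images per class is an assumption, since the abstract does not say how multiple shots are aggregated.

```python
from typing import Callable, Dict, List

def classify_by_pairwise_comparison(
    query,
    support: Dict[str, List],
    same_class_score: Callable,
) -> str:
    """Assign the query to the class whose support images score highest.

    same_class_score(query, support_image) should return a number that is
    larger when the two inputs are more likely to share a class (in the
    paper's setting, the logit of an affirmative answer).
    """
    class_scores = {}
    for label, images in support.items():
        scores = [same_class_score(query, img) for img in images]
        class_scores[label] = sum(scores) / len(scores)  # mean over shots (assumption)
    return max(class_scores, key=class_scores.get)

# Toy stand-in: "images" are numbers and similarity is negative distance.
toy_support = {"cat": [1.0, 1.2], "dog": [5.0, 5.5]}
toy_scorer = lambda q, s: -abs(q - s)
print(classify_by_pairwise_comparison(1.1, toy_support, toy_scorer))  # cat
```

Note that the number of model calls grows with classes times shots per query, which is the cost of turning one multi-way decision into many binary ones.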

148. 【2607.00124】Segmenting, Fast and Slow: Real-Time Open-Vocabulary Video Instance Segmentation with Dual-Path Processing

Link: https://arxiv.org/abs/2607.00124

Authors: Luca Barsellotti, Martin Sundermeyer, Mattia Segu, Nikita Araslanov, Muhammad Ferjad Naeem, Marcella Cornia, Yongqin Xian, Maxim Berman

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Object-centric models inspired, inspired by DETR, Object-centric models, dominant paradigm, open-vocabulary video instance

Comments: ECCV 2026

Abstract:Object-centric models inspired by DETR have become the dominant paradigm for open-vocabulary video instance segmentation (OV-VIS). While recent efforts have reduced the computational cost of pixel decoding, textual modality fusion, and object decoding to make these architectures more suitable for mobile devices, real-time on-device inference at high frame rates remains an open challenge. In this paper, we introduce SegFS, a dual-stream fast-slow framework that significantly improves efficiency without sacrificing accuracy. On sparse keyframes, an open-vocabulary object-based model predicts instance-level representations. These representations are then projected back into the backbone feature space to condition a lightweight fast network, which efficiently relocalizes and segments the instances in subsequent frames. By shifting instance propagation from object decoding to feature-space conditioning, our approach decouples multimodal semantic understanding from dense mask prediction and enables efficient temporal propagation. The proposed fast branch achieves up to 14x lower latency than the mobile-oriented MOBIUS model, while maintaining competitive segmentation performance on standard OV-VIS benchmarks.

149. 【2607.00115】PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking

Link: https://arxiv.org/abs/2607.00115

Authors: Dengxian Gong, Yuanzheng Wu, Haobo Yuan, Zhengdong Hu, Tao Zhang, Yikang Zhou, Shihao Chen, Quanzhu Niu, Kai Wang, Jason Li, Haochen Wang, Lu Qi, Shunping Ji, Ming-Hsuan Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paper explores multi-turn, multi-turn visual reasoning, explores multi-turn visual, leading to long, MLLMs repeatedly fail

Comments: 22 pages, 10 figures

Abstract:This paper explores multi-turn visual reasoning and observes that MLLMs repeatedly fail to localize the target, leading to long, redundant trajectories. We attribute this failure to the entanglement of reasoning and perception within a single model: the MLLM reasons and localizes simultaneously, and inaccurate localization triggers additional reasoning turns that bloat the trajectory. To solve this problem, we propose PixelEyes, a multi-turn visual reasoning agent that explicitly decouples reasoning from perception, i.e., the reasoner decides what to look for, while a specialized perception tool answers where it is. Specifically, PixelEyes introduces two components. 1) Mask-guided Visual Search: a referring segmentation model is invoked to provide mask-precise localization, freeing the reasoner from the need to compensate for imprecise grounding. 2) Semantic-region Breadth-first Search (BFS): to eliminate redundant loops caused by repeatedly cropping incorrect sub-regions, we organize exploration as a breadth-first search over semantic regions. To internalize these capabilities, we construct the PixelEyes-6K dataset by resynthesizing expert trajectories from existing data, which explicitly embeds our mask-guided search and BFS logic into the model. We further introduce Pinpoint-Bench, a zero-hint visual search benchmark (i.e., no location cues are provided in the question) with instance-level masks and bounding boxes that separate localization failures from reasoning failures, enabling fine-grained analysis of failure modes such as inattentional blindness. Recent state-of-the-art MLLMs and visual reasoning agents leave large headroom on Pinpoint-Bench, demonstrating its quality and difficulty. Code and models are open-sourced.

150. 【2607.00090】Lost in the Tail: Addressing Geographic Imbalance in Urban Visual Place Recognition

Link: https://arxiv.org/abs/2607.00090

Authors: Zhiyao Shu, Jiacheng Yang, Yang Lu, Waishan Qiu, Chuan Li, Da Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visual Place Recognition, Urban-scale Visual Place, Place Recognition, Visual Place, Urban-scale Visual

Comments: Accepted to ECCV 2026, 28 pages including supplementary material

Abstract:Urban-scale Visual Place Recognition (VPR) aims to identify the geographic location of a query image by matching it against a geo-tagged database. While recent methods achieve impressive performance, they overlook a serious long-tail problem hidden in urban-scale datasets: models are biased towards locations with abundant images, systematically favoring frequently photographed locations while failing in sparsely covered, less-visited areas. In this paper, we systematically characterize this imbalance challenge and propose Distribution-Aware Place Recognition (DAPR), a model-agnostic plug-in framework that rebalances gradient contributions across head and tail classes. Additionally, within classification-retrieval pipelines, DAPR applies a multi-scale distance search mechanism to compute per-class distributional compactness, providing complementary gains at the retrieval stage. On the large-scale SF-XL benchmark, our framework outperforms the previous classification-retrieval baseline by 18.3% on test set v1 and 6.7% on test set v2. As a plug-in module, it achieves consistent improvements across representative VPR methods on SF-XL, MSLS, and Pitts30k, demonstrating broad generalizability across different methods and benchmarks.

151. 【2607.00060】Synergistic Perception-Reasoning Governance: Grounding Medical MLLMs with Verifiable Anatomical Evidence

Link: https://arxiv.org/abs/2607.00060

Authors: Rui Hao, Qiankun Li, Junyuan Mao, Linghao Meng, Dirui Xie, Dayu Tan, Zhigang Zeng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, show strong promise, radiology report generation, produce fluent conclusions, large language models

Comments: Accepted by MICCAI 2026 (Early Accept, Top 9%)

Abstract:Multimodal large language models (MLLMs) show strong promise for clinical VQA and radiology report generation, yet inference-time hallucinations still undermine trustworthy use: models can produce fluent conclusions that conflict with imaging evidence. Existing mitigation strategies typically rely on additional training, external retrieval/knowledge bases, or multi-stage post-hoc verification, which increases cost and pipeline complexity and often generalizes poorly across models and [...]. To address this, we propose a holistic, training-free evidence-injection framework that systematically mitigates hallucinations through dual-side evidence injection. By leveraging ROI priors acquired using MedSAM in our implementation, we recalibrate the visual perception trajectory via ROI-guided activation modulation while anchoring the textual reasoning trajectory by mapping anatomical coordinates into discrete semantic tokens as verifiable external memory. Then we introduce a task-aware dynamic router to select modality-specific interventions based on task semantics, balancing perceptual grounding and linguistic fluency. We conduct systematic evaluations on 2 tasks and 5 datasets using LLaVA-1.5-7B, LLaVA-Med-1.5-7B, Qwen3-VL-8B/32B, and InternVL-3.5-8B/38B. Controlled ablations and visualizations further validate the framework, which consistently outperforms baselines across medical benchmarks, improving close-ended accuracy by up to ~6% and reducing open-ended hallucinations by ~35%. The code has been made available on GitHub: this https URL.

152. 【2607.00058】Joint Medical Image Enhancement and Segmentation with Diffusion-based Symbiotic Information Interaction

Link: https://arxiv.org/abs/2607.00058

Authors: Ying Chen, Jinyue Li, Qiankun Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Symbiotic Information Interaction, accurate medical diagnosis, critical for accurate, Diffusion-based Symbiotic Information, Information Interaction Network

Comments: Accepted by IJCAI 2026

Abstract:Image quality is critical for accurate medical diagnosis. However, MRI, CT, and ultrasound images are often of low resolution and quality due to cost constraints, complicating the visualization of key anatomical structures and lesions. While such limitations are common in practice, traditional methods treat image enhancement as a separate preprocessing step, failing to fully leverage its potential synergy with image segmentation. To address this, we propose DiSIINet (Diffusion-based Symbiotic Information Interaction Network), which is built on the principle that enhancement and segmentation should mutually reinforce each other in a unified model. Based on Denoising Diffusion Implicit Models (DDIM), DiSIINet integrates an enhancement branch and a segmentation branch. These branches interact through a novel Symbiotic Information Interaction (SII) module, which facilitates dynamic, feature-level information exchange via cross-attention during the reverse diffusion process. This design enables both tasks to iteratively improve each other. The DDIM backbone ensures high-quality output and efficient inference through deterministic sampling. Experiments on multi-modal medical datasets (MRI, CT, ultrasound) show that DiSIINet achieves significant performance improvements compared to sequential or independent enhancement and segmentation approaches. The code is available at: this https URL.
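
The deterministic sampling that the abstract credits to the DDIM backbone corresponds to the standard DDIM update with the noise scale set to zero. The sketch below shows that generic update for a single scalar value. It is not DiSIINet's code: the function name `ddim_step`, the schedule values, and the use of a known noise value in place of a network prediction are all illustrative assumptions.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0).

    x_t       : current noisy value
    eps_pred  : noise predicted by the denoising network at step t
    abar_t    : cumulative alpha product at step t
    abar_prev : cumulative alpha product at the previous (less noisy) step
    """
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    # Re-noise that estimate to the previous noise level with the same noise direction.
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps_pred

# Sanity check with a "perfect" noise prediction: the step lands exactly on
# the sample that the forward process would give at the previous noise level.
x0, eps = 2.0, 0.5
abar_t, abar_prev = 0.5, 0.8
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps
expected = math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * eps
print(abs(ddim_step(x_t, eps, abar_t, abar_prev) - expected) < 1e-9)  # True
```

Because no fresh noise is injected, the same starting point always yields the same output, and steps can be skipped to speed up inference.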

153. 【2607.00057】Enhancing Oracle Bone Inscription Recognition via Multi-Scale Layer Attention

Link: https://arxiv.org/abs/2607.00057

Authors: Chaowen Yan, Kaishen Wang, Yong Wang, Jianlong Xiong, Tao He

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Oracle Bone Inscriptions, ancient Chinese culture, understanding ancient Chinese, Oracle Bone, Bone Inscriptions

Abstract:Recognition of Oracle Bone Inscriptions (OBIs) plays a crucial role in understanding ancient Chinese culture. However, accurately recognizing OBIs remains highly challenging due to their complex, irregular, and often degraded shapes. Traditional methods rely on expert knowledge and manual analysis, which are time-consuming and error-prone. Although deep learning has greatly advanced general image recognition, existing methods struggle to capture the fine-grained details and subtle variations inherent in OBIs, resulting in limited performance. Even the most recent and effective layer attention techniques, which are designed to capture fine-grained dependencies through enhanced inter-layer interactions, yield only marginal improvements in OBI recognition. To address these limitations, we propose Multi-Scale Layer Attention (MSLA), a novel paradigm that explicitly models both multi-scale and cross-layer feature interactions. By enriching the representation with fine-grained details across multiple spatial scales, MSLA enables more accurate and robust OBI recognition. Extensive experiments on large-scale OBI datasets demonstrate that MSLA consistently outperforms existing attention mechanisms while maintaining computational efficiency.

154. 【2607.00047】Vertigo Vertigo: Reconstructing a Cinematic Ideal through its Predictive AI Double

Link: https://arxiv.org/abs/2607.00047

Authors: Adam Cole, Mick Grierson

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Hitchcock Vertigo, Vertigo Vertigo, Vertigo Vertigo extends, reconstruction of Hitchcock, original film frames

Comments: Accepted to Ars Electronica EXPANDED 2026 - Conference on Animation and Interactive Art (in cooperation with ACM SIGGRAPH), Ars Electronica Festival, Linz. 7 pages, 7 figures. Authors' version

Abstract:Vertigo Vertigo is a scene-for-scene AI reconstruction of Hitchcock's Vertigo (1958), generated from only 2.78% of the original film's frames. Using this sparse set of keyframe anchors, we perform first-last frame interpolation via a large video diffusion model to predict the intervening sequences. Vertigo is itself a film about the obsessive reconstruction of an artificial ideal; Vertigo Vertigo extends this logic to the material of the film, treating the canonical text as a probe for the normative conventions of classical cinema encoded within generative systems. Evaluated through computational analysis and critical feedback from media theorists (Lev Manovich, Shane Denson, Kevin L. Ferguson), the artifact demonstrates remarkable structural fidelity: 73.1% of frames are recognizable as plausible renditions of Vertigo and only 3.6% fail catastrophically. This fidelity suggests that cinematic norms are deeply compressed within the model's latent priors. Aesthetically, the reconstruction is rendered as an unstable overlay between the original film and its predictive shadow, fueling a persistent doubt in the viewer's perception of authenticity -- a 21st-century vertigo. The work argues that generative media is not a paradigm shift from cinema but an acceleration of its logic of desire and false authenticity, extending from classical Hollywood through to the predictive media environments now reshaping contemporary perception.

155. 【2607.00033】Learning Dexterous Manipulation Using Contact Wrench Guidance From Human Demonstration

Link: https://arxiv.org/abs/2607.00033

Authors: Xinghao Zhu, Zixi Liu, Shalin Jain, Chenran Li, Milad Noori, Huihua Zhao, John Welsh, Michael Andres Lin, Wei Liu, Tingwu Wang, Xingye Da, Zhengyi Luo, Vishal Kulkarni, Naema Bhatti, Yuke Zhu, Linxi Fan, Bowen Wen, Danfei Xu, Soha Pouya, Yan Chang

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: policies remains challenging, remains challenging, Robotic Dexterous Manipulation, Contact Wrench Guidance, Contact Wrench

Abstract:Dexterous robot manipulation can benefit from the abundance of human demonstrations, but transferring such demonstrations to robot policies remains challenging. We present Contact Wrench Guidance from Human Demonstration in Robotic Dexterous Manipulation (CHORD), a framework for long-horizon manipulation of rigid and articulated objects with reinforcement learning. The key idea is object-centric contact wrench space guidance: we represent human and robot motions by the forces and torques they can induce on the object, enabling similarity to be measured by the induced instantaneous motions. This guidance makes reinforcement learning more scalable for contact-rich dexterous manipulation. We further introduce a large-scale simulation benchmark with 4,739 bimanual dexterous manipulation tasks, constructed from motion-capture datasets and reconstructed in-house videos. Evaluated on 1,831 benchmark tasks, CHORD achieves an average success rate of 82.12%, demonstrating strong scalability. CHORD also generalizes to whole-body manipulation from hand-only and third-person demonstrations, achieving a 90.77% success rate, and the learned policies transfer to the real world in both open-loop and closed-loop settings.
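
To make "the forces and torques they can induce on the object" concrete, the sketch below computes the net 6D wrench that a set of contacts applies to an object, using the textbook relation torque = r × f. How CHORD builds its wrench space or compares human and robot wrenches is not described in the abstract, so this only illustrates the underlying quantity; `contact_wrench` and the example contacts are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3D vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def contact_wrench(contacts):
    """Net wrench (fx, fy, fz, tx, ty, tz) about the object origin.

    contacts: list of (r, f) pairs, where r is the contact position and
    f the contact force, both expressed in the object frame.
    """
    force = [0.0, 0.0, 0.0]
    torque = [0.0, 0.0, 0.0]
    for r, f in contacts:
        t = cross(r, f)  # torque contributed by this contact
        for k in range(3):
            force[k] += f[k]
            torque[k] += t[k]
    return tuple(force) + tuple(torque)

# Two fingertips on opposite sides pushing in opposite directions:
# the forces cancel, leaving a pure torque about the z axis.
contacts = [((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),
            ((-1.0, 0.0, 0.0), (0.0, -1.0, 0.0))]
print(contact_wrench(contacts))  # (0.0, 0.0, 0.0, 0.0, 0.0, 2.0)
```

Because the wrench is expressed on the object rather than on the hand, a human hand and a robot hand with different kinematics can in principle be compared through it.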

156. 【2607.00015】Towards an automated AI-based framework for floor plan compliance checks for residential buildings

Link: https://arxiv.org/abs/2607.00015

Authors: Subash Gautam, Debaditya Acharya, Alexandra Kleeman, Sarah Foster

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)

Keywords: Australia urban areas, improve residents' well-being, introduced policy reforms, well-being in Australia, governments have introduced

Abstract:To improve residents' well-being in Australia's urban areas, governments have introduced policy reforms such as SEPP65, BADS, and SPP7.3 to enhance apartment design quality. These regulations require precise geometric and spatial analysis to evaluate health-related features, including daylight access, natural ventilation, privacy, and space efficiency. However, compliance checking remains challenging due to its manual, time-intensive nature. Additionally, evolving policies limit scalability for large-scale assessments across thousands of apartments. Existing automated floor plan analysis methods are fragmented and typically focus on single apartments, lacking a unified framework for multi-unit compliance checking. This article explores current advancements in automated floor plan analysis, particularly AI-driven approaches, and highlights key challenges in their practical adoption. To address these gaps, a conceptual framework is proposed for automated compliance checking in multi-apartment buildings. A Large Language Model (LLM) is used within a Rule Engine to convert textual building codes into executable, explainable rules. A Data Extraction Engine segments floor plan images into elements such as walls, rooms, fixtures, text, and symbols, and transforms them into a structured building graph with topological relationships. This structured representation is then evaluated by a Compliance Check Engine, which leverages LLM-generated rules for assessment. The proposed framework offers a scalable, consistent, and transparent approach to automated compliance checking across jurisdictions, supporting efficient enforcement of apartment design standards and promoting healthier, higher-density urban development.

157. 【2511.18050】UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Link: https://arxiv.org/abs/2511.18050

Authors: Tian Ye, Song Fei, Lei Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: diverse aspect ratios, aspect ratios exposes, tightly coupled failure, coupled failure mode, failure mode spanning

Comments: Project Page: https://w2genai-lab.github.io/UltraFlux/

Abstract:Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.

158. 【2507.15692】Surfacing Variations to Calibrate Perceived Reliability of MLLM-generated Image Descriptions

Link: https://arxiv.org/abs/2507.15692

Authors: Meng Chen, Akhil Iyer, Amy Pavel

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, Multimodal large, large language models, access visual information, provide new opportunities

Comments: 18 pages, 6 figures

Abstract:Multimodal large language models (MLLMs) provide new opportunities for blind and low vision (BLV) people to access visual information in their daily lives. However, these models often produce errors that are difficult to detect without sight, posing safety and social risks in scenarios from medication identification to outfit selection. While BLV MLLM users use creative workarounds such as cross-checking between tools and consulting sighted individuals, these approaches are often time-consuming and impractical. We explore how systematically surfacing variations across multiple MLLM responses can support BLV users to detect unreliable information without visually inspecting the image. We contribute a design space for eliciting and presenting variations in MLLM descriptions, a prototype system implementing three variation presentation styles, and findings from a user study with 15 BLV participants. Our results demonstrate that presenting variations significantly increases users' ability to identify unreliable claims (by 4.9x using our approach compared to single descriptions) and significantly decreases perceived reliability of MLLM responses. 14 of 15 participants preferred seeing variations of MLLM responses over a single description, and all expressed interest in using our system for tasks from understanding a tornado's path to posting an image on social media.

159. 【2607.00500】Closed-loop coupling of personalised and foundation models for real-time treatment guidance with MRI

Link: https://arxiv.org/abs/2607.00500

Authors: James Grover, Emily A. Hewson, Andrew Phair, Michael Ferraro, Hilary L. Byrne, Paul Keall, Michael G. Jameson, David E.J. Waddington

Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep brain stimulation, biopsy and deep, brain stimulation, deep brain, Image-guided therapies

Comments: 18 pages, 8 figures, 2 supplementary figures

Abstract:Image-guided therapies, including radiotherapy, biopsy and deep brain stimulation, rely on real-time targeting of anatomical structures. However, in the presence of motion, imaging latencies create a temporal misalignment between observed and true anatomy, compromising treatment accuracy. Artificial intelligence-based frameworks have increasingly been presented to close this latency gap, but leading personalised models can fail due to a lack of stable anatomical grounding. Foundation models can provide grounded behaviour, but they do not adapt to real-time, individual patient dynamics. Here we introduce a closed-loop coupling framework that synergises patient-specific temporal prediction with continuous segmentation-based anatomical interpretation from a foundation model. A personalised model predicts future anatomy to compensate for system latency, while a streaming foundation model provides anatomical supervision used to continuously update the temporal predictor in real time during treatment. We validate the framework using a digital phantom and intrafraction magnetic resonance imaging (MRI) from patients undergoing MRI-guided radiotherapy. For a prediction horizon of 400 ms, the proposed method improves anatomical prediction and reduces dosimetric error compared with existing approaches, within clinically relevant latency constraints. These results establish closed-loop coupling as a general strategy for real-time image-guided intervention.

160. 【2607.00472】Predicting Lethal Outcome (Cause) And Understanding Key Biomarkers Linked With Acute Myocardial Infarction Using Deep Artificial Neural Network And Ensemble Of Machine Learning Methodologies

Link: https://arxiv.org/abs/2607.00472

Authors: Sagnik Ghosh

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Cardiovascular disease, Cardiovascular, Acute myocardial infarction, heart, heart attack

Comments: Master of Science (MSc), Thesis Report

Abstract:Cardiovascular disease is still one of the main causes of death around the world. Acute myocardial infarction (MI), or heart attack, claims millions of lives each year. MI happens when blood flow through the coronary arteries is blocked or reduced, which causes permanent damage to the heart muscle. Without treatment, this can lead to cardiac arrest, where the heart stops pumping blood to the organs, resulting in organ failure and death. Even survivors often face serious problems like heart failure, pulmonary edema, and asystole. Research shows that 5 to 10 percent of survivors die within the first year after an MI, and nearly half need to be hospitalized again. Early thrombolytic treatment leads to better outcomes, so there is a clear need for faster and more accurate ways to diagnose MI. Right now, doctors usually review patient history and use their own experience to find the causes of MI. This process takes a lot of time and can be inconsistent. Detecting MI accurately and quickly can help patients take better care of themselves and prevent fatal events. In this study, we introduce an automated model to predict deadly outcomes of MI and help doctors understand important biomarkers linked to its complications. This approach aims to make diagnosis clearer, faster, and more affordable. The process includes preparing the data, filling in missing values, and handling imbalanced data using SVMSMOTE, ADASYN, and class-weighted methods. We use wrapper and embedded feature selection to find the most important variables, then scale the features for consistency. The model combines Logistic Regression, Random Forest, LightGBM, and Bagging SVM, and is further improved with an artificial neural network to increase accuracy. We evaluate all models using precision, recall, and other key measures to find the best option for clinical use.
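
Since the abstract stresses imbalanced data and evaluation by precision and recall, a short worked example shows why accuracy alone can mislead when the positive class is rare. The labels below are invented for illustration and are not taken from the thesis, and `precision_recall` is a hypothetical helper written in plain Python.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: 2 positive (lethal outcome) cases out of 10.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.8
print(precision_recall(y_true, y_pred))  # (0.5, 0.5)
```

Here the classifier is right 80% of the time yet misses half of the positive cases, which is the kind of gap that resampling and class weighting are meant to narrow.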

161. 【2607.00385】MalariAI: A Label-Resilient Decoupled Framework for Universal Cell Segmentation and Explainable Stage Classification in Dense Malaria Blood Smears

Link: https://arxiv.org/abs/2607.00385

Authors: Kaysarul Anas Apurba, Md Hasibul Hasan, Mohammed Ali, Tanzilur Rahman

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated malaria diagnosis, expert microscopists remains, resource-limited settings, Automated malaria, accurate diagnosis

Comments: Submitted to Computerized Medical Imaging and Graphics (under review). 4 authors, includes figures and appendix

Abstract:Automated malaria diagnosis from blood smear microscopy is a critical challenge in global health AI; in resource-limited settings, the scarcity of expert microscopists remains the primary bottleneck to timely and accurate diagnosis. Three compounding failure modes prevent reliable clinical deployment of existing deep learning systems. First, end-to-end detectors treat unannotated cells as background during training, producing recall figures that are strongly influenced by annotation completeness rather than reflecting true cell recovery. Second, Non-Maximum Suppression tends to suppress valid detections in dense smear regions where infection counts matter most. Third, existing whole-slide detection pipelines lack per-cell spatial evidence for clinical audit, despite image-level explainability methods such as Grad-CAM having been applied to malaria image classification tasks. We present MalariAI, a two-stage decoupled framework that addresses all three failure modes in a unified pipeline. Stage 1 applies an annotation-agnostic distance-transform guided watershed algorithm to isolate every cell in a full 1600x1200 blood smear image, recovering 75.95% of ground-truth cells by centroid localisation across the 120-image NIH BBBC041 test set without any ground-truth input. Stage 2 fine-tunes EfficientNet-B0 with Focal Loss (gamma = 2.0, per-class inverse-frequency weights) on 64x64 crops, achieving 98.36% overall classification accuracy with 87.5% and 75.0% per-class accuracy on the rare schizont and gametocyte stages, compared to only 24.57% and 25.95% AP for a Faster R-CNN baseline on the same classes. Grad-CAM++ heatmaps generated per detected cell provide instance-level spatial evidence for clinical audit, enabling microscopists to verify model predictions at the individual parasite level without sacrificing classification performance.
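
Note: the Stage 2 loss mentioned in the abstract (Focal Loss with gamma = 2.0 and per-class inverse-frequency weights) can be illustrated with a minimal sketch. This is not the authors' code. The function names are hypothetical, and the weight normalisation N / (K * n_c) is an assumption, since the abstract does not say how the inverse-frequency weights are scaled.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weight N / (K * n_c). This normalisation is an assumption."""
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {c: n_total / (n_classes * n) for c, n in counts.items()}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Focal loss for one sample: -w_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # Toy label set: class 1 is rare, so it receives the larger weight.
    weights = inverse_frequency_weights([0] * 8 + [1] * 2)
    print(weights)
    # A confidently correct sample contributes far less than a misclassified one.
    print(focal_loss([4.0, 0.0], 0, weights))
    print(focal_loss([0.0, 4.0], 0, weights))
```

With gamma = 0 and unit weights the expression reduces to ordinary cross-entropy. Raising gamma shrinks the contribution of samples the model already classifies confidently, which is the usual reason to pair it with class weights when some classes are rare.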