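This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
693 papers were updated today, including:
- Natural language processing: 94 papers
- Information retrieval: 12 papers
- Computer vision: 155 papers
Natural Language Processing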
1. 【2607.02513】LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning
Link: https://arxiv.org/abs/2607.02513
Authors: Matteo Boglioni, Thibault Rousset, Siva Reddy, Marius Mosbach, Verna Dankers
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: LLMs memorize sensitive, sensitive training data, including personally identifiable, personally identifiable information, memorize sensitive training
Comments:
Abstract: LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model's parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To bridge this gap, we introduce LACUNA: the first unlearning testbed with ground-truth parameter-level localization. LACUNA injects PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, enabling direct evaluation of whether unlearning targets the weights responsible for knowledge storage. We use LACUNA to benchmark current SOTA unlearning methods and find that, despite strong output-level performance, existing methods are highly imprecise and susceptible to resurfacing attacks. We further show that when localization is successful, even a simple gradient-based unlearning method achieves strong erasure and robustness to resurfacing attacks, highlighting the importance of precise unlearning. We release LACUNA to complement behavioral evaluations and drive further advances in robust, localization-based unlearning.
2. 【2607.02512】Program-as-Weights: A Programming Paradigm for Fuzzy Functions
Link: https://arxiv.org/abs/2607.02512
Authors: Wentao Zhang, Liliana Hotsko, Woojeong Kim, Pengyu Nie, Stuart Shieber, Yuntian Deng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: repairing malformed JSON, clean rule-based implementation, important log lines, tasks resist clean, resist clean rule-based
Comments:
Abstract:Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter. A 0.6B Qwen3 interpreter executing PAW programs matches the performance of direct prompting of Qwen3-32B, while using roughly one fiftieth of the inference memory and running at 30 tokens/s on a MacBook M3. PAW reframes the foundation model from a per-input problem solver into a tool builder: invoked once per function definition, it produces a small reusable artifact whose subsequent calls per function application are cheap and offline.
3. 【2607.02510】Online Safety Monitoring for LLMs
Link: https://arxiv.org/abs/2607.02510
Authors: Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron, Eric Nalisnick
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Keywords: LLMs remain prone, generating unsafe outputs, alignment training, LLMs remain, deployment time
Comments: ICML 2026 Hypothesis Testing Workshop
Abstract:Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.
4. 【2607.02507】What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates
Link: https://arxiv.org/abs/2607.02507
Authors: Arman Ghaffarizadeh, Danyal Mohaddes, Aliakbar Izadkhah, Shahriar Noroozizadeh
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Keywords: socially structured settings, LLM agents, increasingly act, act in socially, socially structured
Comments:
Abstract: LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history alongside OTR responses that are recorded but never shown to the other participant. Across 10 models, 3 scenarios, and 5 variations within each scenario, alignment-inducing settings produce systematic public-OTR divergence in the targeted agent, with its decision divergence rising from a ~3% baseline to roughly 40%. The effect is consistent across four aggregate analyses: stance, semantic similarity, natural language inference, and survey responses. In some cases, the OTR response explicitly attributes public accommodation to relational pressures, such as career risk or sponsorship obligation. The findings suggest that agent evaluation should extend beyond explicit goals and detect emergent objectives. We present a dual-channel evaluation framework and complementary behavioral measures that operationalize this assessment.
5. 【2607.02504】Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas
Link: https://arxiv.org/abs/2607.02504
Authors: Yuxuan Li, Lingxi Xie, Xinyue Huo, Jihao Qiu, Jiacheng Shao, Pengfei Chen, Jiannan Ge, Kaiwen Duan, Qi Tian
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: comprehensive video understanding, deciphering complex storyline, Long-form TV dramas, video understanding, dramas present
Comments: Accepted to ICML 2026
Abstract: Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storylines often relies on speaker recognition, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce DramaSR-532K, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose DramaSR-LRM, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. All the data and code will be made publicly available at the project page: this https URL.
6. 【2607.02494】Towards Robustness against Typographic Attack with Training-free Concept Localization
Link: https://arxiv.org/abs/2607.02494
Authors: Bohan Liu, Wenqian Ye, Guangzhi Xiong, Zhenghao He, Sanchit Sinha, Aidong Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Contrastive Language-Image Pretraining, Large Vision Language, Vision Language Models, modern Large Vision, CLIP models exhibit
Comments: 15 pages main text, provisionally accepted to ECCV 2026
Abstract:Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at this https URL.
7. 【2607.02490】Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning
Link: https://arxiv.org/abs/2607.02490
Authors: Liyan Tang, Fangcong Yin, Greg Durrett
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: generating textual chains, Large vision-language models, Large vision-language, chains of thought, reason over multimodal
Comments:
Abstract:Large vision-language models can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose a novel reinforcement learning training framework VRRL, with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions rather than making early mistakes. Second, we introduce buffered roll-ins from an experience replay buffer to expose the model to diverse failure states that it must learn to correct. We evaluate our approach on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks. While off-the-shelf and conventionally fine-tuned models degrade substantially under distribution shift, our method substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines by using self-reflection effectively.
8. 【2607.02473】Audio-Based Understanding of Audiobook Narration Appeal
Link: https://arxiv.org/abs/2607.02473
Authors: Shahar Elisha, Mariano Beguerisse-Díaz, Emmanouil Benetos
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: audiobook listening experience, listening experience, shaping how listeners, understand the content, listeners engage
Comments: Accepted to Interspeech 2026
Abstract:Narration is central to the audiobook listening experience, shaping how listeners engage with and understand the content. This work explores how narration qualities shape an audiobook's appeal, noting that their effects can vary by genre, title, and audience. We extract vocal and acoustic features (e.g., tone, pace, loudness) from LibriVox using pre-trained audio models and analyse their relationship with consumption data (specifically, view-rate) and their interplay with genre and title. Despite limited consumption data, we find that acoustic information alone has a robust association with appeal, even after accounting for title effects. We further validate these findings using more nuanced proprietary engagement metrics. To our knowledge, this is the first systematic computational study linking narration qualities, genre, title, and audiobook consumption, highlighting the potential of data-driven insights to improve audiobook personalisation and narrator casting.
9. 【2607.02469】TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution
Link: https://arxiv.org/abs/2607.02469
Authors: Jiale Amber Wang, Kaiyuan Wang, Pengyu Nie
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: code change, test, code, software behavior, test generation
Comments: TestEvo-Bench leaderboard and data explorer are hosted at [this https URL](https://www.testevo-bench.com)
Abstract:Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is executable or semantically tied to the code change. This makes it difficult to evaluate whether a test automation agent understands how a code change should propagate into the test suite. We introduce TestEvo-Bench, a benchmark of test and code co-evolution tasks mined from software repositories, with two tracks: in test generation, the agent shall write new tests to capture the new software behavior; in test update, the agent shall adapt failing existing tests to the changed software behavior. Each task is anchored to a real commit history and packaged with environment configuration to support execution-grounded metrics such as pass rate, coverage, and mutation score. TestEvo-Bench is also a live benchmark: each task records the timestamp of the test and code changes, and new tasks are periodically mined by our automated pipeline, so evaluation can be restricted to tasks postdating a model's training cutoff to reduce data leakage risk. The current snapshot contains 746 test generation and 509 test update tasks, curated from 59,950 candidate co-evolution records across 152 open-source Java projects. We experiment with four state-of-the-art agents that combine strong harnesses (Claude Code, Gemini CLI, and SWE-Agent) with strong foundation models (Claude Opus 4.7 and Gemini 3.1 Pro). Results show that they achieve up to 77.5% success rate on test generation and 74.6% on test update. However, success rate is materially lower on the most recent benchmark tasks and drops significantly under limited per-task cost.
10. 【2607.02464】Will Scaling Improve Social Simulation with LLMs?
Link: https://arxiv.org/abs/2607.02464
Authors: Caleb Ziems, William Held, Su Doga Karaca, David Grusky, Tatsunori Hashimoto, Diyi Yang
Subjects: Computation and Language (cs.CL)
Keywords: Large Language Model, promising research method, Large Language, adopted widely, research method
Comments:
Abstract:Large Language Model (LLM) social simulations are a promising research method, but they are not yet faithful enough to be adopted widely. In this work, we investigate whether the current scaling paradigm in language modeling is likely to close these gaps, or whether simulation fidelity is orthogonal to general capabilities and therefore deserving of more research attention. We use scaling laws to study the relationship between LLMs' compute scale, general capability benchmarks, and the fidelity of social simulation in three representative sub-domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Surprisingly, we discover strong compute scaling in all three settings, using a suite of 85 transformer LLMs with the Qwen3 architecture pre-trained on the DCLM web text corpus under fixed-compute budgets from $10^{18}$ to $10^{20}$ FLOPs. Then we evaluate 35 larger and more capable open-weight models up to 70B parameters, allowing us to predict downstream accuracy from loss. This reveals that the majority of behavioral and opinion simulation tasks will rapidly improve with scale, particularly when they involve populations that are well-represented in English web corpora. Longitudinal forecasting and underrepresented opinions scale more slowly, especially when they are less correlated with general knowledge and reasoning benchmarks like MMLU. In behavior simulation, scaling fails to improve model calibration with human cognitive biases like risk aversion, as well as human heuristics like learning correlated rewards from related tasks. On these tasks, even fine-tuned models fail to noticeably scale up performance from 0.5B to 8B parameters. Taken together, we conclude that scale will improve social simulations in most settings, but outliers exist, and improvements will be less reliable in low-resource domains.
11. 【2607.02459】Language Models as Measurement Apparatus for Culture
Link: https://arxiv.org/abs/2607.02459
Authors: Kent K. Chang
Subjects: Computation and Language (cs.CL)
Keywords: quantify cultural phenomena, measurement distinctively cultural, makes such measurement, measurement distinctively, Language models
Comments: Accepted to the Big Picture workshop co-located with ACL 2026. This version expands the camera-ready (adding Fig. 3 and section 6.3, as well as correcting minor typos) in Proceedings of The Big Picture v2: Crafting a Research Narrative, pp. 131--143, San Diego, CA, USA. Association for Computational Linguistics
Abstract:Language models are increasingly used to quantify cultural phenomena, but what makes such measurement distinctively cultural? This paper argues that NLP work on culture is a material-discursive practice: the apparatus -- model, data, annotation, evaluation -- participates in constituting the cultural reality it measures, rather than passively recording it. Drawing on Karen Barad's concept of the agential cut -- the contingent boundary between phenomenon and instrument -- I show that the apparatus's substantive design choices draw such boundaries, and that the boundary is entangled from the start because language models have already internalized much of the cultural material they measure. I illustrate this through three case studies on television and film dialogue (measuring structure, interaction, and deviation) and three examinations of the apparatus itself (erasure of cultural markers, attunement to historical material, and agency in an agentic workflow). This big picture analysis proposes a research program that is theory-driven, empirically rigorous, and culturally contingent, treating each agential cut as a conscious commitment, at once methodological and ethical.
12. 【2607.02440】EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments
Link: https://arxiv.org/abs/2607.02440
Authors: Zhilin Wang, Han Song, Runzhe Zhan, Jusen Du, Jiacheng Chen, Tianle Li, Qingyu Yin, Yulun Wu, Zhennan Shen, Tong Zhu, Yanshu Li, Guanjie Chen, Derek F. Wong, Yafu Li, Yu Cheng, Yang Yang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: open-ended software-engineering progress, Autonomous Policy Evolution, software-engineering progress, increasingly expected, collapse this process
Comments: 24 pages
Abstract: Autonomous agents are increasingly expected to improve executable policies through feedback, yet existing evaluations often collapse this process into a final score or confound it with open-ended software-engineering progress. We introduce Autonomous Policy Evolution, a controlled evaluation setting in which a harness-model agent repeatedly edits an executable policy system under a fixed interaction budget. We instantiate this setting in EvoPolicyGym, a benchmark built from compact interactive RL environments that evaluates how agents iteratively improve explored policies. On the EvoPolicyGym suite, GPT-5.5 achieves the strongest aggregate rank score and top-two performance on all 16 environments. Beyond leaderboard results, EvoPolicyGym also provides trajectory-level diagnostics that distinguish how agents allocate budget and convert feedback into parametric tuning. These analyses show that strong autonomous policy evolution depends not only on isolated task wins, but on discovering task-appropriate mechanisms and refining policies under bounded feedback.
13. 【2607.02432】Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach
Link: https://arxiv.org/abs/2607.02432
Authors: Manuel Alonso-Carracedo, Ruben Fernandez-Boullon, Pedro Celard, Francisco J. Rodriguez-Martinez, Lorena Otero-Cerdeira
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: handle partial credit, command-line examinations remains, rising enrolments make, enrolments make manual, make manual marking
Comments:
Abstract: Scalable and reliable grading of command-line examinations remains a challenge in computing education, where rising enrolments make manual marking difficult and rule-based autograders cannot handle partial credit, equivalent solutions, or syntactic variation. This paper evaluates whether four frontier Large Language Models (GPT, Claude Opus, Gemini, and GLM) can approximate expert judgment when grading short Linux/bash command responses. The study adopts a four-level cognitive taxonomy that combines cognitive complexity and operational impact, ranging from information retrieval (L1) and basic file manipulation (L2) to structural operations (L3) and advanced system management (L4). The models were tested with two prompt variants, a minimal baseline and a rubric-enhanced version, on 1200 real responses from second-year Computer Engineering students independently graded by three expert instructors. Gemini 3.0 Pro with rubric-guided prompting achieved the highest human-AI agreement (ICC(3,1) = 0.888, MAE = 0.10, Bland-Altman bias = -0.014). Agreement declined consistently as taxonomy level increased, with the largest discrepancies at higher levels. Across all models, rubric quality had a larger effect than provider choice, with structured prompts consistently improving agreement. These results show that question complexity is a reliable predictor of the difficulty LLMs face in grading accurately, and they establish a principled, taxonomy-based framework for determining which questions are suitable for AI-assisted grading and which require human review, while also providing a transferable evaluation protocol and prompt templates.
14. 【2607.02416】The Future of NLP may not be at NLP Conferences: Scholarly Migration Patterns in Natural Language Processing
Link: https://arxiv.org/abs/2607.02416
Authors: David Jurgens
Subjects: Computation and Language (cs.CL)
Keywords: Natural Language Processing, Large Language Models, Natural Language, Language Processing, general Machine Learning
Comments:
Abstract: Natural Language Processing (NLP) has traditionally been published in its core disciplinary venues like ACL. However, advances in Large Language Models (LLMs) have led to a blurring of the disciplinary lines between NLP and general Machine Learning (ML), with authors regularly publishing in venues from both fields. Here, we ask whether the disciplinary center of gravity is shifting. Using NLP research published from 2010 to 2026 and studies of both established and new authors, we find that a migration is taking place. First, comparing the pre- and post-LLM eras, established authors lost 19.2pp of share at flagship *ACL main-conference tracks while gaining 14.8pp in the newer Findings tracks, and general ML venues rose 8.6pp, even when adjusting for parallel growth in the fields. Second, among newer authors who debut with at least three first-author NLP-topic papers, the share whose work appears mostly at *ACL venues fell from 84% (2019) to 74% (2024), while the share appearing mostly at general ML venues rose from 5% to 21%. Using causal inference techniques, we estimate that these general ML venues confer a significant citation premium, which influences venue selection. Together, these results point to a significant shift in where NLP research is published.
15. 【2607.02383】Know Your Source: A Public Knowledge Store for Media Background Checks
Link: https://arxiv.org/abs/2607.02383
Authors: Benjamin Nichols, Michael Schlichtkrull, Nedjma Ousidhoum
Subjects: Computation and Language (cs.CL)
Keywords: LLM-based retrieval-augmented generation, LLM-based retrieval-augmented, automated fact-checking, RAG, AFC
Comments: Code and Data: [this https URL](https://github.com/nedjmaou/mediaref)
Abstract:LLM-based retrieval-augmented generation (RAG) is increasingly used for automated fact-checking (AFC) and related tasks. By grounding LLM outputs in retrieved evidence, RAG-based systems provide transparent justifications while allowing external information to be updated independently of the underlying model. However, existing approaches often assume retrieved evidence is reliable, although real-world information may be conflicting, outdated, and can originate from unreliable or biased sources. Recent work on *source-critical reasoning* addresses this challenge through media background checks (MBCs) (Schlichtkrull, 2024), which assess the credibility of evidence sources to support downstream fact verification. However, generating MBCs relies on costly proprietary search APIs, limiting reproducibility. To mitigate this issue, we introduce MEDIAREF, a publicly available knowledge store of web-sourced documents that enables reproducible, low-cost evaluation of MBC generation across 200 media sources. We describe a reproducible methodology for constructing and updating the collection, assess widely used LLMs on the MBC generation task, and demonstrate that MEDIAREF supports higher-quality MBC generation through both automatic and qualitative evaluation.
16. 【2607.02381】HULAT2 at MER-TRANS 2026: Governed Multi-Agent Simplification for Spanish Easy-to-Read Generation
Link: https://arxiv.org/abs/2607.02381
Authors: Lourdes Moreno, Paloma Martínez, Marco Antonio Sanchez-Escudero, Miguel Domínguez-Gómez
Subjects: Computation and Language (cs.CL)
Keywords: track of MER-TRANS, Spanish track, fully automatic Spanish, paper describes, describes the participation
Comments: 13 pages, 1 figure, 3 tables
Abstract: This paper describes the participation of HULAT2-UC3M in the Spanish track of MER-TRANS 2026, a shared task on multilingual Easy-to-Read translation. Three fully automatic Spanish runs were submitted. RUN1 and RUN2 used a LangGraph-based multi-agent workflow combining Gemini 2.5 Flash and RigoChat-7B-v2, parallel generation strategies, internal quality signals, Event-Condition-Action routing, controlled editing and traceable decisions. RUN1 used the base workflow, while RUN2 activated an additional lexical-support layer based on a glossary and lexical resources. RUN3 was a RigoChat-based generate-evaluate-regenerate baseline with prompt engineering and LoRA-based adaptation. The official leaderboard reports BLEU-Orig, BLEU-Gold, SARI and BERTScore. During development, additional internal signals were also inspected, including semantic fidelity, readability, lexical simplicity, syntactic clarity and factual consistency. According to official SARI, RUN1 was the best HULAT2 run, with 44.0543 points, followed by RUN2 with 43.1049 and RUN3 with 38.5136. These results indicate that, in this task setting, signal-guided multi-agent routing outperformed the linear regeneration baseline. They also show that adding lexical support did not automatically improve reference-based scores. Further segment-level and document-level analyses are required to assess readability, factual consistency and user-oriented adequacy.
17. 【2607.02369】World Wide Models: Literary Tools for Cultural AI
Link: https://arxiv.org/abs/2607.02369
Authors: Nina Begus
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: LLMs stage, cultural encounter, negotiated cultural struggles, world literature, critical theory
Comments: 15 pages
Abstract:LLMs stage a new form of cultural encounter that is massive, automated, and monolingual. Literary disciplines have always negotiated cultural struggles with comparative reading of literature, narratological and poetic analysis, critical theory, world literature, and translation. These tools have now become indispensable for building culturally literate AI. The essay develops a layered framework toward more nuanced textual models and pluralistic interpretations of AI, emphasizing the natural intersections of literature and AI development, connecting current debates in critical theory with structural monolingualism, and suggesting a new application of world literature approaches to address global AI textuality through macrostructure, circulation, and untranslatability.
18. 【2607.02345】SkillFuzz: Fuzzing Skill Composition for Implicit Intents Discovery in Open Skill Marketplaces
Link: https://arxiv.org/abs/2607.02345
Authors: Jinwei Hu, Yi Dong, Youcheng Sun, Xiaowei Huang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Model, Large Language, Language Model, natural-language instruction documents, increasingly automate software
Comments: Under Review
Abstract:Large Language Model (LLM)-based agents increasingly automate software engineering tasks through reusable skills, natural-language instruction documents that guide planning and execution. Open skill marketplaces enable users to assemble agents by co-activating community-contributed skills, but marketplace operators typically audit skills in isolation. As a result, individually benign skills may interact to redirect an agent toward unintended objectives, which we term implicit intents. Detecting such intents is challenging because the effect emerges only through skill composition, execution environments are often unavailable at admission time, and the space of possible co-activations grows exponentially with marketplace size. In this paper, we formulate implicit-intent discovery as a fuzzing problem over skill compositions, where skill compositions are the unit under test, planning artifacts expose agent intent before execution, and deviations from a skill-free baseline serve as a differential oracle. Based on this formulation, we propose skillfuzz, the first execution-free testing approach that extracts structured skill contracts and uses contract-guided Monte Carlo Tree Search to prioritize potentially conflicting compositions. Across representative skill-marketplace workloads, skillfuzz discovers over 1,000 distinct implicit intents under a fixed query budget, confirms more than 80% of the highest-risk flagged compositions during execution-time validation, and identifies substantially more high-severity implicit intents than alternative search strategies while exploring only a fraction of the pairwise interaction space they require.
19. 【2607.02338】HNSW with Accuracy Guarantees Using Graph Spanners -- A Technical Report
Link: https://arxiv.org/abs/2607.02338
Authors: Minghao Li, Raghav Mittal, Sanjivni Rana, Suraj Shetiya, Gautam Das, Nick Koudas
Categories: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Navigable Small World, Hierarchical Navigable Small, Hierarchical Navigable, Small World, Navigable Small
Comments: 23 pages, 22 figures, Submitted to VLDB 2027
Abstract:Hierarchical Navigable Small World (HNSW) graphs serve as the industry standard due to their logarithmic complexity and strong empirical performance. However, HNSW relies on greedy graph traversal, a heuristic that provides no theoretical guarantees of correctness. In this paper, we propose a novel "Certify-then-Rectify" framework that bridges the gap between the speed of heuristic search and the rigor of exact retrieval. Rather than discarding HNSW, our approach first employs a distribution-free statistical certifier to dynamically evaluate the quality of a standard HNSW search with minimal overhead. If certification indicates that the retrieved neighbors are of low quality, the framework safely escalates to a rigorous exact recovery algorithm. To make this exact recovery computationally feasible, we reinterpret the HNSW graph as a geometric spanner and utilize Extreme Value Theory to stochastically estimate its maximum empirical stretch factor. This allows us to mathematically bound the maximum distance of true nearest neighbors. Extensive evaluations on benchmark datasets demonstrate that our tiered framework delivers the average-case speed of HNSW while ensuring the worst-case correctness of exact search and outperforming other applicable approaches.
20. 【2607.02307】On the Role of Directionality in Structural Generalization
Link: https://arxiv.org/abs/2607.02307
Authors: Zichao Wei
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: argument extraction positions, previous SOTA, modifier position shifts, involve directional distinctions, test categories explicitly
Comments:
Abstract: Several SLOG test categories explicitly involve directional distinctions (modifier position shifts, argument extraction positions), yet AM-Parser, the previous SOTA, uses an AM algebra whose operations do not encode direction. We redesign the symbolic backend around CCG directed types (deterministic CKY + single linear decoder, 30K learnable parameters). Under the same BERT-base encoder, the system achieves 75.9±6.4% LF exact match, surpassing AM-Parser (70.8±4.3%). Per SLOG's own category groupings, gains are highly directional: the CCG system outperforms AM-Parser on all 5 position-shift categories (+29.9pp), while AM-Parser outperforms on all 6 recursive-depth categories. Replacing the encoder with DeBERTa-v3-large yields 90.7±4.9%, with the largest encoder gains in recursive-depth categories, complementary to directionality's gains. Directional representations shift the bottleneck from the symbolic layer (AM-Parser's 0% category ceiling) to the neural layer, which improves with encoder upgrades.
21. 【2607.02266】HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures
Link: https://arxiv.org/abs/2607.02266
Authors: Ziyun Qiao, Yue Min, Ruining Chen, Yujun Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: data-mixing methods assume, groups determines, assume the corpus, Learned Semantic Transform, partitioned into groups
Comments: 19 pages, 5 figures
Abstract:Most data-mixing methods assume the corpus has already been partitioned into groups, and the choice of those groups determines what a mixer can express. Existing labels, including provenance, topic or format taxonomies, and flat embedding clusters, commit to one semantic axis at one granularity; changing the resolution rebuilds the labels. We argue the bottleneck is the label system, not the mixer, and provide a hierarchical one. HERMES is a data-derived labeling substrate: a Learned Semantic Transform followed by 3-stage residual vector quantization annotates each document once into a coarse-to-fine code whose prefix length controls granularity up to approximately 130k cells. At coarse granularity HERMES sits at a plateau with KMeans-family methods on standard clustering metrics, so the contribution is the substrate, not the clusterer. On 1B-parameter, 25B-token pre-training, the hierarchy exposes an interaction fixed-granularity pipelines cannot test: at one prefix length, a combined Stage-2 rule contrast, equal-subbucket coverage versus size-proportional within-bucket quality top-30%, lifts a 16-task capability macro-average by +0.0253; at the next finer level, the same rule loses its measurable edge as candidate pools contract approximately 5x. HERMES reframes data mixture design from choosing among fixed label sets to navigating a reusable, data-derived granularity hierarchy.
22. 【2607.02262】CheckRLM: Effective Knowledge-Thought Coherence Checking in Retrieval-Augmented Reasoning
Link: https://arxiv.org/abs/2607.02262
Authors: Dingling Xu, Ruobing Wang, Qingfei Zhao, Yukun Yan, Zhichun Wang, Daren Zha, Shi Yu, Zhenghao Liu, Shuo Wang, Xu Han, Maosong Sun
Categories: Computation and Language (cs.CL)
Keywords: Reasoning Language Models, Language Models, significantly improved performance, Reasoning Language, significantly improved
Comments: 24 pages, 7 figures
Abstract:Reasoning Language Models (RLMs) have significantly improved performance on complex tasks by extending the reasoning chain. However, these chains are prone to containing factual errors, particularly in knowledge-intensive tasks. To address this issue, we propose CheckRLM, a framework that improves the reliability of the reasoning process through Retrieval-Augmented Generation (RAG) by timely checking and correcting factual errors. Specifically, CheckRLM extracts factual claims from the reasoning chain to identify and localize subtle knowledge inconsistencies during inference. Upon detection of errors, a refinement mechanism performs minimal-cost yet precise corrections by leveraging external knowledge, ensuring coherence between the reasoning chain and correct knowledge. Extensive experiments demonstrate that CheckRLM substantially outperforms existing baselines, exhibiting a strong capability to mitigate error accumulation in long-horizon reasoning with lower costs. The code and data are available at this https URL.
23. 【2607.02259】BamiBERT: A New BERT-based Language Model for Vietnamese
Link: https://arxiv.org/abs/2607.02259
Authors: Dat Quoc Nguyen, Thinh Pham, Chi Tran, Linh The Nguyen
Categories: Computation and Language (cs.CL)
Keywords: BERT-based pre-trained language, pre-trained language model, addresses key limitations, facto Vietnamese text, limitations of PhoBERT
Comments:
Abstract:In this paper, we introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese that addresses key limitations of PhoBERT -- the current de facto Vietnamese text encoder. Trained from scratch on a 129GB corpus of general-domain Vietnamese text for 20 epochs, BamiBERT supports an extended context length of up to 2048 tokens and operates directly on raw input, eliminating the need for external word segmentation. Across 8 Vietnamese benchmarks, it achieves the best score on 11 of 15 metrics and the second-best on 3 others, setting a new state of the art among "base"-sized Vietnamese encoders and demonstrating strong cross-domain generalization. We release BamiBERT at: this https URL
24. 【2607.02255】AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents
Link: https://arxiv.org/abs/2607.02255
Authors: Xiangchen Cheng, Yunwei Jiang, Jianwen Sun, Zizhen Li, Chuanhao Li, Xiangcheng Cao, Yihao Liu, Fanrui Zhang, Li Jin, Kaipeng Zhang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: single memory component, Memory, future decision, contract, simplest contract appends
Comments:
Abstract: Memory for a long-horizon LLM agent is a contract about what each future decision is allowed to see. The simplest contract appends past observations, tool calls, and reflections to every prompt, which makes prior context easy to access but also turns it into a jumbled mixture in which the effect of any single memory component is hard to isolate. We introduce and instrument an alternative bounded contract: every decision is made from a fresh user message assembled by typed retrieval, with no raw cross-decision transcript appended. The prompt thus stays bounded across runs of any length, and any single layer can be ablated in isolation. We instantiate the contract in Slay the Spire 2, a closed-rule stochastic deck-building game whose runs require hundreds of tactical and strategic decisions. A public online benchmark of frontier LLMs on the same game reports zero wins at the lowest difficulty across five configurations, and the developer-reported human win rate at the same difficulty is 16%; the task is hard but not saturated. Within our harness, a fixed-A0 ablation shows the largest observed difference when triggered strategic skills are enabled: the no-store baseline wins 3/10 games and adding the skill layer 6/10. At this sample size the comparison is directional rather than statistically decisive (Fisher exact p ≈ 0.37); a cross-backbone probe and public accumulating-context baselines are reported as operational comparisons rather than controlled tests of the contract variable itself. We release a reproducible testbed: 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts -- an agent design and a validated, reusable methodology for studying how explicit memory layers shape long-horizon LLM-agent decisions.
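The reported Fisher exact value can be checked by hand from the two win counts quoted in the abstract (3/10 versus 6/10). The sketch below is a minimal standard-library computation of a two-sided Fisher exact test; the function name `fisher_exact_two_sided` is an assumption introduced here, and the two-sided convention used (summing all tables no more likely than the observed one) is assumed rather than stated by the paper.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d      # games played per condition
    col1 = a + c                   # total wins across both conditions
    total = row1 + row2
    lo = max(0, col1 - row2)       # smallest feasible win count in row 1
    hi = min(row1, col1)           # largest feasible win count in row 1
    # Unnormalised hypergeometric weights, kept as exact integers.
    weights = {k: comb(row1, k) * comb(row2, col1 - k) for k in range(lo, hi + 1)}
    observed = weights[a]
    # Sum every table that is no more likely than the observed one.
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / comb(total, col1)


# Baseline: 3 wins, 7 losses. With the skill layer: 6 wins, 4 losses.
p_value = fisher_exact_two_sided(3, 7, 6, 4)
print(round(p_value, 2))  # 0.37
```

With ten games per condition, even a doubling of the win rate leaves the p-value far above conventional thresholds, which is why the abstract describes the comparison as directional.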
25. 【2607.02235】Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages
Link: https://arxiv.org/abs/2607.02235
Authors: A. Seza Doğruöz, Xixian Liao, Verena Blaschke, Jakob Prange, Senyu Li, David Ifeoluwa Adelani
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: natural language generation, dominant evaluation paradigm, due to shortcomings, language generation tasks, shortcomings of conventional
Comments: Under Review
Abstract:LLM-as-a-Judge has become the dominant evaluation paradigm for many natural language generation tasks, due to shortcomings of conventional metrics and high correlations with human judgment, albeit mostly in English. There are now attempts to extend LLM-as-a-Judge to multilingual settings including low-resource languages. However, LLMs have limited proficiency in low-resource languages, and there is often no adequate human validation in these settings. To highlight the scope of the problem and current practices, we explore the use of LLM-as-a-Judge evaluators in ACL Anthology papers focusing on multilingual settings and low-resource languages across a diverse set of tasks. Out of 650 papers mentioning LLM-as-a-judge, only 33 of them focus on low-resource or multilingual settings. Our in-depth analysis of these papers indicates inconsistent evaluation outcomes, a tendency to overtrust LLM judgments in multilingual settings, and the widespread reliance on a single judge model per study. To help the NLP community further, we conclude with recommendations about how to use LLM-as-a-Judge in multilingual and low-resource settings.
26. 【2607.02214】Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning
Link: https://arxiv.org/abs/2607.02214
Authors: Congrui Du, Yang Zhang, Kaizhi Qian, Shiyu Chang
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Keywords: text-based large language, text LLM, substantially more challenging, text-based large, requires learning
Comments:
Abstract:Instruction tuning for speech language models (SLMs) is substantially more challenging than for text-based large language models (LLMs), as it requires learning a new modality and a wide range of speech-specific instructions in addition to those supported by text LLMs. Existing SLM training approaches largely replicate the text LLM training paradigm by synthesizing large-scale speech pre-training and instruction-tuning datasets. However, this strategy is difficult to scale, since speech sequences are significantly longer than text sequences. In this paper, we propose SpeechCombine, an instruction-following speech language model trained without any instruction tuning, using only a single round of speech pre-training on 30k hours of data. Starting from a text LLM base model, we perform continuous pre-training on speech utterances to obtain a speech-adapted model, and then directly combine its weights with the weight difference between the instruction-tuned and base versions of the text LLM. Our results show that this simple combination strategy not only preserves the knowledge and capabilities of the original text LLM, but also effectively transfers them to the speech domain. These findings suggest a new direction for SLM training that avoids reliance on massive speech data.
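The combination step described in this abstract amounts to per-parameter weight arithmetic: the speech-adapted weights plus the difference between the instruction-tuned and base text LLM weights. The following is a toy sketch only, not the authors' code: parameters are plain Python lists, and the function name `combine_weights` and the `scale` argument are assumptions introduced here (the abstract describes a direct combination, which corresponds to `scale=1.0`).

```python
def combine_weights(speech_adapted, text_base, text_instruct, scale=1.0):
    """Return speech_adapted + scale * (text_instruct - text_base) per parameter."""
    combined = {}
    for name, speech_values in speech_adapted.items():
        base_values = text_base[name]
        instruct_values = text_instruct[name]
        # Add the instruction-tuning delta to the speech-adapted weights.
        combined[name] = [
            s + scale * (i - b)
            for s, b, i in zip(speech_values, base_values, instruct_values)
        ]
    return combined


# Toy checkpoints with a single two-element parameter (hypothetical values).
text_base = {"w": [1.0, 2.0]}
text_instruct = {"w": [1.5, 1.0]}
speech_adapted = {"w": [0.0, 4.0]}

merged = combine_weights(speech_adapted, text_base, text_instruct)
print(merged)  # {'w': [0.5, 3.0]}
```

All three checkpoints must share the same parameter names and shapes for this to be meaningful, which holds when the speech-adapted and instruction-tuned models both start from the same base model.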
27. 【2607.02182】Bayesian Sparse Low-Rank Adaptation for Large Language Model Uncertainty Estimation
Link: https://arxiv.org/abs/2607.02182
Authors: Jijie Zhang, Zhe Ren, Quan Zhang, Dandan Guo
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large language models, severely hindering trustworthy, hindering trustworthy deployment, Large language, exhibit remarkable reasoning
Comments: Preprint. 16 pages, 7 figures, 6 tables
Abstract:Large language models (LLMs) exhibit remarkable reasoning capabilities, but their task-specific fine-tuning is notoriously plagued by overconfidence, severely hindering trustworthy deployment. We propose Data-Adaptive Lower-Rank Adaptation (DALorRA), a simple and effective variational Bayesian sparse framework that shifts the paradigm of uncertainty quantification from the dense parameter space to the lightweight rank level of low-rank adaptation (LoRA). With the insight that LoRA essentially aggregates multiple rank-one components that may provide superfluous model capacity, DALorRA imposes stochastic masking on rank dimensions, enabling Bayesian regularization of model capacity during training and ensemble-like calibration during inference. Extensive experiments demonstrate DALorRA's excellent calibration of LLMs without compromising reasoning accuracy.
28. 【2607.02089】ESC: Emotional Self-Correction for Reliable Vision-Language Models
Link: https://arxiv.org/abs/2607.02089
Authors: Tien-Huy Nguyen, Minh-Nhat Nguyen, Nguyen Nhat Huy, Hung Viet Nguyen, Huy Nguyen Minh Nhat, Thanh-Huy Nguyen, Cuong Tuan Nguyen, Hoang M. Le, Dat Nguyen, Phat Kim Huynh, Min Xu, Ulas Bagci
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: diverse multimodal tasks, achieved strong performance, Vision-language models, performance across diverse
Comments: ECCV Main Track 2026 (113 pages, 15 tables, 65 figures). Project Page: [this https URL](https://genai4e.github.io/ESC/?)
Abstract: Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, yet they remain vulnerable to unreliable reasoning. Existing self-correction methods mitigate these issues but typically rely on post-training or carefully engineered feedback, incurring high computational cost. In this work, we revisit this challenge through the lens of emotional cues, asking whether they can activate latent self-correction behaviors in VLMs without additional training. We find that emotional signals serve as an effective trigger for self-correction, encouraging more cautious and reflective reasoning. Motivated by this finding, we propose ESC (Emotional Self-Correction), a training-free self-correction framework. ESC introduces an external verifier that detects potentially incorrect initial responses and injects emotional feedback to encourage the model to reflect and produce a better revised response without additional training. Extensive experiments across safety, hallucination, vision-centric perception, and multimodal reasoning benchmarks show that ESC consistently improves reliability while preserving overall model utility. These results suggest that emotion can function not only as an ability to be recognized, but also as a practical control signal for scalable self-correction in VLMs. We therefore believe that ESC provides a strong foundation for a new reliable human-like, emotion-integrated research direction. Our project is publicly available at this https URL.
29. 【2607.02079】HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety
Link: https://arxiv.org/abs/2607.02079
Authors: Navaneeth Sangameswaran, Preetham S, Ashmiya Lenin
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Keywords: open-weights implementation, constitutional-classifier paradigm, paradigm for input, input safety, present HaloGuard
Comments: 30 pages, 7 figures, 20 Tables, Link: [this https URL](https://huggingface.co/collections/astroware/haloguard-10)
Abstract:We present HaloGuard 1.0, an open-weights implementation of the constitutional-classifier paradigm for input safety. It achieves state-of-the-art performance on English and multilingual prompt-safety benchmarks at roughly one-tenth the model size of current leading open guard models. The safety constitution is the organising structure of the corpus: a natural-language constitution of 46 policies and 2,940 subcategories drives synthetic data generation, with exhaustive one-to-one paired counterfactuals that hold topic and vocabulary fixed while flipping intent, a two-tier harmless design that separately targets boundary and baseline false positives (FPs), and balanced multilingual materialisation across 46 languages that treats language as a surface form appearing on both sides of the boundary rather than as an adversarial signal. Across seven prompt-safety benchmarks, HaloGuard 1.0-0.8B attains the best average F1 (90.9) of any open guard we evaluate, outperforming baselines up to 27B parameters (over 30 times larger) while holding false-positive rate (FPR) to 4.3 and false-negative rate (FNR) to 9.5. The HaloGuard 1.0-4B variant reaches average F1 of 92.1 and FPR of 3.5, spending its extra capacity on precision rather than recall. A structured adjudication of the remaining failures indicates that most apparent missed-harm cases are benchmark mislabels rather than genuine model misses. An always-on adversarial red-teaming protocol continuously hardens the guard against both content-level and agentic attacks. We release the models as open weights.
30. 【2607.02049】SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses
Link: https://arxiv.org/abs/2607.02049
Authors: Anna Chorna
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: Large Language Models, Models are increasingly, Large Language, Language Models, increasingly deployed
Comments: 19 pages, 5 figures, 3 tables. Benchmark paper introducing SPLIT for evaluating empathy, linguistic naturalness, and cultural grounding in English and Ukrainian LLM responses
Abstract:Large Language Models are increasingly deployed in emotional-support contexts and crisis-related situations. Nevertheless, their cross-lingual abilities in these circumstances remain underexplored. Existing benchmarks emphasize multilingual performance but rarely examine crisis-related empathy and cultural grounding in low-to-mid-resource languages. We introduce SPLIT, a 500-prompt benchmark designed to evaluate LLM consistency in generating emotionally grounded responses across five categories: Stress, Panic, Loneliness, Internal Displacement, and Tension. We evaluate three technically diverse LLMs across three dimensions: Empathetic Accuracy, Linguistic Naturalness, and Contextual Cultural Grounding. The framework aims to assess and compare the quality of LLM responses in both English and Ukrainian languages, as well as to explore the reliability of the LLM-as-a-jury paradigm. Our findings reveal that Gemini-2.5-Flash and LLaMA-3.3-70B-Instruct degrade when transitioning to Ukrainian, while DeepSeek-V3 remains comparatively stable within our benchmark. We additionally find that human and AI evaluators agree weakly on empathy and naturalness but diverge on cultural grounding. We further argue that producing Ukrainian text is not equivalent to producing Ukrainian emotional support. Our findings may assist in the future development of more culturally tailored benchmark designs, as well as encourage a stronger emphasis on human-centered evaluation.
31. 【2607.02047】OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets
Link: https://arxiv.org/abs/2607.02047
Authors: Rheeya Uppaal, Seungwoo Lyu, Selina Sung, Junjie Hu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: enabling harm, Safe completion requires, Safe, isolated prompts
Comments: Preprint
Abstract:Safe completion requires models to provide useful assistance without enabling harm, but this behavior is difficult to evaluate with isolated prompts. We introduce OpenSafeIntent, a benchmark of controlled prompt-sets that vary intent while holding the underlying task fixed. Each datapoint contains benign, dual-use, and malicious variants of the same task. This design lets us evaluate whether models calibrate assistance across intent shifts, rather than merely appearing safe on average. Across a broad model suite, we find that prompt-level safety hides important failures: models often fail to remain safe across matched intent variants, dual-use behavior is brittle under paraphrase, high-level answers on risky topics are not reliably safe, and responses that reframe ambiguous requests into safer tasks are substantially less likely to cross the safety boundary. Our results suggest that safe completion should be evaluated as intent-calibrated behavior over controlled task variants, not as a single safety-helpfulness tradeoff over independent prompts.
32. 【2607.02032】PACE: A Proxy for Agentic Capability Evaluation
Link: https://arxiv.org/abs/2607.02032
Authors: Yueqi Song, Lintang Sutawika, Jiarui Liu, Lindia Tjuatja, Jiayi Geng, Yunze Xiao, Daniel Lee, Aditya Bharat Soni, Vincent Lo, Xiang Yue, Graham Neubig
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Evaluating LLM agents, requires complex infrastructure, Evaluating LLM, SWE-Bench and GAIA, complex infrastructure
Comments:
Abstract:Evaluating LLM agents on benchmarks like SWE-Bench and GAIA can be expensive, time-consuming, and requires complex infrastructure. A single evaluation can cost thousands of dollars and take days to complete. In contrast, non-agentic LLM benchmarks that test individual capabilities (e.g., reasoning, code generation) are fast and cheap to run. In this paper, we investigate whether performance on expensive agentic benchmarks can be accurately predicted by the performance on a small, carefully selected subset of atomic evaluation instances. We introduce PACE, a framework that constructs proxy benchmarks by selecting instances from existing non-agentic evaluations whose aggregate scores most reliably predict model performances on agentic benchmarks. Given a pool of candidate instances spanning atomic capabilities, PACE fits a regression that maps a model's scores on a compact subset of source instances to its score on the target agentic benchmark. The subset itself is curated by combining two complementary instance-selection strategies, target-relevance local selection and globally informative global selection. We apply PACE to the 4 target agentic benchmarks in this paper, which yields PACE-Bench, the concrete proxy benchmark that we evaluate in the paper. Experiments across 14 models, 4 agentic benchmarks, and 19 non-agentic benchmarks show that PACE-Bench predicts agentic scores with leave-one-out cross-validation (LOOCV) mean absolute error (MAE) under 4%, Spearman correlation above 0.80, and pairwise model-ranking accuracy around 85%, all at much less than 1% of the full agentic evaluation cost. We further analyze the selected proxy instances, revealing which skills each agentic benchmark uniquely demands. PACE enables practitioners to obtain reliable estimates of agentic performance during model development, selection, and routing, without the overhead of full agent evaluation.
33. 【2607.02007】EduArt: An educational-level benchmark for evaluating art history knowledge in large language models
Link: https://arxiv.org/abs/2607.02007
Authors: Gianmarco Spinaci, Lukas Klic, Giovanni Colavizza
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: aggregate measures reveal, Large language models, Large language, single disciplines, aggregate measures
Comments:
Abstract:Large language models now score near ceiling on general benchmarks, but these aggregate measures reveal little about how models behave within single disciplines. Existing art-focused evaluations rely on synthetic questions and rarely report item-level properties. This paper introduces EduArt, an educational-level benchmark for art-historical knowledge and visual reasoning in multimodal LLMs. EduArt comprises 871 human-authored questions from Italian secondary-school exercises and US Advanced Placement Art History exams, spanning two languages and seven formats from multiple choice to in-text word placement and error identification. Twelve models from six provider families were evaluated under a default answer-only condition and a motivation condition requiring written justification, and characterized using Classical Test Theory and a logistic regression isolating the effects of format, language, image presence, and model. The benchmark showed strong psychometric properties (mean discrimination 0.514, 82.3 percent good discriminators), while multiple-choice accuracy saturated near ceiling for six models, showing recognition formats alone cannot distinguish frontier models. Format was a strong independent predictor of accuracy: models exceeding 94 percent on multiple choice fell to 23.9 percent on open completion (Claude Opus 4.6) and 6.2 percent on error identification (Claude Sonnet 4.6). The motivation condition changed accuracy in a predominantly negative, family-dependent direction. These dissociations indicate that art-historical knowledge and the ability to deploy it are distinct capabilities, and that single-format benchmarks overestimate what models can reliably do. Mapping this capability profile is a precondition for responsible use of multimodal LLMs in art-historical scholarship, where tasks demand producing and manipulating content rather than selecting from fixed options.
34. 【2607.02002】Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words
Link: https://arxiv.org/abs/2607.02002
Authors: Xiaoyun Jin, Mirjam Ernestus, R. Harald Baayen
Categories: Computation and Language (cs.CL)
Keywords: contextualized embeddings, predictable in part, Mandarin words, Mandarin, Time-normalized
Comments:
Abstract:Time-normalized f0 contours of Mandarin words in conversational speech have been shown to be predictable in part from their contextualized embeddings (CEs). The present study investigates whether CEs also predict spoken word duration for 7470 tokens of Mandarin monosyllabic CV words extracted from a Mandarin corpus of spontaneous speech. We show that CEs indeed are predictive for duration, above chance level, not only at the type level, but also at the level of individual tokens, as indicated by the results obtained with the type-wise and token-wise permutation baselines. We also show that the predicted durations are sufficiently precise to back-transform predicted f0 contours in [0,1] normalized time to contours on the ms time scale. The resulting predicted contours approximate empirical contours and also outperform a permutation baseline.
35. 【2607.01978】Multimodal Knowledge Edit-Scoped Generalization for Online Recursive MLLM Editing
Link: https://arxiv.org/abs/2607.01978
Authors: Siyuan Li, Youyuan Zhang, Ruitong Liu, Junxi Wang, Jing Li
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: multimodal large language, large language models, Online multimodal knowledge, multimodal knowledge editing, knowledge editing requires
Comments:
Abstract:Online multimodal knowledge editing requires injecting a continual stream of visual-textual corrections into multimodal large language models (MLLMs) with bounded overhead and minimal disruption to unrelated behaviors. Existing editors mainly emphasize edit reliability and long-horizon stability, but rarely control the semantic boundary of each edit. Our pilot analyses of post-edit behaviors and internal neuronal activities reveal a scope gap behind reliable edits: instance-level success neither guarantees transfer to valid cross-modal variants nor prevents leakage to unrelated inputs, while edit-related cross-modal responses concentrate in deeper semantic layers. Therefore, we formulate Edit-Scoped Generalization, reframing online MLLM editing from merely correcting an instance to controlling the propagation boundary of each edit. To this end, we propose ScopeEdit, a scope-aware online editor that decomposes each update into a modality-local absorption branch and an evidence-gated shared generalization branch. The local branch supports stable edit absorption, whereas the shared branch enables cross-modal propagation only when visual and textual evidence are sufficiently aligned. Both branches perform scope-separated write geometries in orthogonal low-rank spaces and maintain branch-wise preconditioners via Sherman--Morrison recursions, yielding constant per-edit overhead. Extensive experiments across diverse benchmarks, long-horizon edit streams, MLLM backbones, real-world VLKEB scenarios, and complex vision-language architectures show that ScopeEdit consistently improves the trade-off between in-scope cross-modal transfer and out-of-scope locality, while preserving edit reliability, stability and online efficiency. Our code is available at this https URL.
36. 【2607.01972】Object Aligner: A Configurable JSON Schema Similarity Score for Graphs, Applied to LLM Prompt Optimization
Link: https://arxiv.org/abs/2607.01972
Authors: Jan Drchal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large language models, powering information extraction, produce JSON conforming, Large language, Object Aligner
Comments: 28 pages, This is a submitted version of a manuscript under review at IEEE Access; it has not been peer reviewed
Abstract:Large language models (LLMs) are often asked to produce JSON conforming to a fixed schema, powering information extraction, tool calling, agentic planning, and knowledge-graph construction. Measuring how closely an output matches a gold reference is essential yet surprisingly hard: exact match is brittle, text similarity ignores structure, and an LLM judge is expensive, opaque, and non-deterministic. We address this with Object Aligner (OA), an open-source Python library that scores two JSON objects deterministically by recursively aligning their trees (the Hungarian algorithm for unordered collections, sequence alignment for ordered ones) and awarding partial credit at the granularity the schema declares. The Object Aligner is configured entirely through a set of JSON Schema extensions, so adapting it to a new task involves annotating a schema rather than writing code. Complex structured data, however, are rarely flat trees: records may form graphs or hypergraphs keyed by arbitrary identifiers, breaking the assumptions of prior similarity metrics. Our central contribution, referential alignment, closes this gap by inferring a bijection between gold and candidate identifiers and scoring every reference through it, so the score is invariant to relabeling. Since recovering this bijection exactly is graph isomorphism, the Object Aligner approximates it with Weisfeiler-Leman color refinement. An order-sensitive sequence regime targets ranking and planning. Since the same alignment localizes every mismatch, the Object Aligner emits ranked repair suggestions at no extra cost. Used as a reward inside the GEPA prompt optimizer, Object Aligner helps or stays neutral across all datasets.
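To make the idea of aligning unordered collections with partial credit concrete, here is a small illustrative sketch. It is not the Object Aligner API: the function names `field_similarity` and `align_unordered` are hypothetical, the per-item similarity is a simple exact-match fraction over gold keys, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimal matching but is only practical for very small lists).

```python
from itertools import permutations


def field_similarity(gold, cand):
    """Fraction of gold keys whose values match exactly in the candidate."""
    if not gold:
        return 1.0
    return sum(1 for k, v in gold.items() if cand.get(k) == v) / len(gold)


def align_unordered(gold_items, cand_items):
    """Score two unordered lists of dicts by their best one-to-one matching."""
    size = max(len(gold_items), len(cand_items))
    if size == 0:
        return 1.0
    # Pad the candidate side with None so missing items score zero.
    padded = list(cand_items) + [None] * (size - len(cand_items))
    best = 0.0
    # Brute-force stand-in for the Hungarian algorithm: try every assignment.
    for perm in permutations(range(size), len(gold_items)):
        total = sum(
            field_similarity(g, padded[j]) if padded[j] is not None else 0.0
            for g, j in zip(gold_items, perm)
        )
        best = max(best, total)
    # Normalising by the longer list penalises missing and extra items.
    return best / size


gold = [{"name": "Ada", "role": "author"}, {"name": "Bob", "role": "editor"}]
cand = [{"name": "Bob", "role": "author"}, {"name": "Ada", "role": "author"}]
score = align_unordered(gold, cand)
print(score)  # 0.75
```

Exact match would score this pair as 0, whereas the aligned score gives full credit for the reordered "Ada" record and half credit for "Bob", whose role is wrong.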
37. 【2607.01965】Towards a Phonology-Informed Evaluation of Multilingual TTS
Link: https://arxiv.org/abs/2607.01965
Authors: Sneha Ray Barman, Neeraj Kumar Sharma, Shakuntala Mahanta
Categories: Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Keywords: Neural TTS systems, Neural TTS, natural across languages, grammatical forms, sound natural
Comments: Accepted at Interspeech 2026
Abstract:Neural TTS systems can sound natural across languages, but naturalness does not guarantee the preservation of sound contrasts that distinguish words from their grammatical forms. Standard metrics like MOS do not test for this. We propose a classifier-based framework that audits TTS output against language-specific phonological patterns using human speech as a benchmark. Testing Assamese advanced tongue root (ATR) vowel harmony with Meta's MMS TTS, we show that a classifier trained on human speech transfers to synthesized speech with minimal loss. The faithfulness audit reveals that [+ATR] mid vowels are realized as [-ATR] in 1/3 tokens despite an underlying [+ATR] specification, a bias absent in human speech. At the word level, predicted ATR labels classify harmony more accurately than transcription labels, indicating a gap between intended and produced phonology. The framework offers task-specific diagnostics and generalizes to other phonological contrasts with measurable acoustic cues.
38. 【2607.01964】Beyond Supervised Clarification: Input Rewriting with LLMs for Dialogue Discourse Parsing
Link: https://arxiv.org/abs/2607.01964
Authors: Yiming Liu, Ziyue Zhang, Zhichao Xu, Xin Yu, Yingheng Tang, Tianyu Jiang, Jie Cao
Categories: Computation and Language (cs.CL)
Keywords: modern NLP pipelines, modern NLP, improve frozen downstream, frozen downstream models, common strategy
Comments: Accepted to SIGDIAL 2026. 17 pages, 2 figures
Abstract:Rewriting inputs to improve frozen downstream models has become a common strategy in modern NLP pipelines. Prior work on incremental dialogue discourse parsing (DDP) shows that supervised clarification models can rewrite fragmentary or underspecified utterances, such as resolving ellipsis or references, to improve parsing accuracy. In this work, we revisit this idea under realistic deployment conditions, where no clarification supervision is available and the clarifier must rely on zero-shot prompting or feedback from a frozen parser. Across three Segmented Discourse Representation Theory (SDRT) datasets and multiple parsers, we find that last-utterance clarification is far less reliable than suggested by supervised settings. Parser-agnostic rewriting often introduces more regressions than repairs, as edits that enable fixes also disrupt discourse cues relied upon by the parser. A best-of-8 rewriting analysis further reveals a practical ceiling: a large fraction of errors are not repairable through input rewriting alone. A parser-aware clarifier trained with GRPO reduces regressions by up to 37% by learning conservative abstention, yet still fails to produce selectivity-aware clarifications that consistently improve parsing. Together, these findings recast clarification as a selective intervention problem. We identify rewritability prediction, deciding whether an utterance is repairable before intervention, as the key missing capability for input-side optimization of frozen discourse parsers, and a critical direction for improving agentic pipelines more broadly.
39. 【2607.01960】NAVER LABS Europe Submission to the Instruction-following 2026 Short Track
Link: https://arxiv.org/abs/2607.01960
Authors: Marcely Zanon Boito, Hemant Yadav, Jean-Luc Meunier, Ioan Calapodescu
Categories: Computation and Language (cs.CL)
Keywords: NAVER LABS Europe, describe NAVER LABS, LABS Europe submission, NAVER LABS, LABS Europe
Comments: IWSLT 2026 system paper
Abstract:In this paper, we describe NAVER LABS Europe's submission to the instruction-following speech processing short track at IWSLT 2026. We participate again in the constrained setting, developing systems capable of jointly performing ASR, ST, and SQA from English speech into Chinese, Italian, and German. Building on our previous submission, ranked first in last year's short track, we update our multi-stage training pipeline by replacing the speech projector with SpeechMapper, a method for learning a speech-to-LLM embedding projector using only ASR data. In addition, we introduce a synthetic SQA dataset, fakACL, composed of artificially generated scientific presentations. This dataset is built by prompting the LLM backbone, segmenting the generated talks, and synthesizing speech with SeamlessM4T-large-v2. The combination of an improved speech projection mechanism and domain-specific synthetic data allows our model to outperform last year's best short-track system, while being considerably more compact and relying on a weaker LLM backbone. This year's results place our system tied for first place in the overall short track ranking.
40. 【2607.01938】PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation
Link: https://arxiv.org/abs/2607.01938
Authors: Peng Yun, Shouwang Huang, Hao Li, Jinxi Li, Jianan Wang, Bo Yang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: environments remains challenging, dynamically moving targets, Manipulating fast, targets in unstructured, environments remains
Comments: ECCV 2026. Code and data are available at: [this https URL](https://github.com/vLAR-group/PhysMani)
Abstract:Manipulating fast and dynamically moving targets in unstructured 3D environments remains challenging for embodied AI. Existing visual-language-action models and world models struggle with accurate 3D geometry and physically meaningful forecasting. We propose PhysMani, a framework that couples a physics-principled 3D Gaussian world model with a future-aware action policy model. The world model learns a divergence-free Gaussian velocity field via online optimization for fast and physically grounded future dynamics prediction. The policy model integrates the predicted 3D scene future dynamics through a learnable token based cross-attention module. We introduce PhysMani-Bench, a dynamic manipulation benchmark with 16 tasks, and demonstrate a superior success rate over strong baselines in both simulation and real-world robot experiments.
41. 【2607.01934】AIriskEval-edu: New Dataset for Risk Assessment in AI-mediated K-12 Educational Explanations
Link: https://arxiv.org/abs/2607.01934
Authors: Javier Irigoyen, Roberto Daza, Francisco Jurado, Julian Fierrez, Ruben Tolosana, Alvaro Ortigosa, Enrique Blas, Aythami Morales
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Keywords: evaluate auditors based, work introduces, content for grades, designed to train, auditors based
Comments: 6 pages, 2 figures. Accepted at the IEEE International Carnahan Conference on Security Technology (ICCST 2026), October 14, 2026
Abstract:This work introduces AIriskEval-edu-db2, a new dataset designed to train and evaluate auditors based on LLMs for an explainable pedagogical risk assessment in instructional content for grades K-12. The dataset comprises 1,639 explanations from 170 curated ScienceQA questions, covering science, language arts, and social sciences. For each question, the dataset includes an explanation written by a human teacher alongside 11 explanations generated by LLM-simulated teacher profiles associated with distinct pedagogical risks. We propose a comprehensive risk rubric aligned with established educational standards that covers five complementary dimensions: factual precision, depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. A key contribution is the addition of 785 explanations with structured explainability annotations, including risk localization and risk description. The annotations are produced through a semi-automatic process with expert teacher validation. Finally, we present validation experiments comparing state-of-the-art proprietary models with a lightweight local Llama 3.1 8B model in both the pedagogical risk detection and the explainability assessment. These experiments evaluate whether supervised fine-tuning on AIriskEval-edu-db2 enables a locally deployable model to approach or outperform stronger frontier models while preserving privacy in educational auditing and assessment tasks.
42. 【2607.01927】TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B
Link: https://arxiv.org/abs/2607.01927
Authors: Baran Bingol, Bahaeddin Turkoglu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Türkçe Düşünen Üretken, Düşünen Üretken Model, Türkçe Düşünen, Düşünen Üretken, paper presents TUDUM
Comments:
Abstract:This paper presents TUDUM (Türkçe Düşünen Üretken Model), a project pipeline for adapting a Qwen-family 27B thinking model toward Turkish reasoning. The central problem is not only to answer Turkish prompts in Turkish, but to make the explicit reasoning trace itself Turkish. A thinking model may translate a Turkish prompt into an English-centered internal or visible scratchpad, solve the problem mostly in English, and only localize the final answer. TUDUM instead treats the generated `<think>...</think>` block as a trainable behavior. The pipeline starts from the project base checkpoint unsloth/Qwen3.5-27B, applies supervised fine-tuning (SFT) on 15,991 Turkish reasoning examples using LoRA adapters, and then applies GRPO-family reinforcement learning on a proxy-filtered Turkish mathematics environment. The results are mixed. SFT made the model shorter and more consistently Turkish in its reasoning behavior, with large reductions in average response length and thinking exhaustion, but reduced benchmark accuracy. RL recovered some mathematical performance, especially AIME24 at the best early checkpoint, yet did not uniformly improve all benchmarks and did not exceed the base model on the reported Macro-6 average. The contribution is therefore best framed as a technically honest Turkish-thinking reasoning pipeline and evaluation, not as a claim of state-of-the-art Turkish reasoning. The released step-50 model is publicly available.
43. 【2607.01899】The Grammar Does the Work: Functional vs. Lexical Dependency Length Minimization Across Universal Dependencies
Link: https://arxiv.org/abs/2607.01899
Authors: Kim Gerdes (LISN, Qatent, STL)
Categories: Computation and Language (cs.CL)
Keywords: syntactic relation types, previous studies report, dependency distance, Dependency length minimization, well-documented processing universal
Comments:
Abstract:Dependency length minimization (DLM) is a well-documented processing universal, but previous studies report a single mean dependency distance (MDD) per language, obscuring variation across syntactic relation types. We analyze 122 languages in UD and SUD (version 2.17), showing that DLM operates on two distinct levels. Grammar-driven optimization targets functional dependencies (det, case, aux), which are universally short (mean 1.71, $\sigma$ = 0.33) and invariant across typologically diverse languages. Processing-driven optimization operates on lexical dependencies (nsubj, obj, obl), which are longer (mean 2.87), highly variable ($\sigma$ = 0.63), and constrained by word-order typology. This asymmetry holds in SUD despite reversed head direction (r = 0.92). We conclude that "the grammar does the work" of minimization by scaffolding sentences with local functional attachments, leaving processing pressures to determine the ordering of lexical heads.
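To make the per-relation view of dependency distance concrete, here is a minimal sketch that computes mean dependency distance separately for functional and lexical relations on one hand-annotated sentence. The grouping sets contain only the relations named in the abstract, so the paper's full inventories may differ; the function name, the tuple layout, and the example parse are illustrative assumptions, not the author's code.

```python
from collections import defaultdict

# Only the relations named in the abstract; the full grouping is an assumption.
FUNCTIONAL = {"det", "case", "aux"}
LEXICAL = {"nsubj", "obj", "obl"}


def mdd_by_group(tokens):
    """Mean dependency distance per relation group.

    tokens: iterable of (index, head_index, deprel) with 1-based indices;
    head_index 0 marks the root, which has no dependency length.
    """
    sums = defaultdict(lambda: [0, 0])  # group -> [total distance, count]
    for idx, head, rel in tokens:
        if head == 0:
            continue
        if rel in FUNCTIONAL:
            group = "functional"
        elif rel in LEXICAL:
            group = "lexical"
        else:
            continue  # relations outside both groups are ignored here
        sums[group][0] += abs(idx - head)
        sums[group][1] += 1
    return {g: total / count for g, (total, count) in sums.items()}


# "The dog chased a cat in the garden" (hand-annotated, UD-style heads)
sentence = [
    (1, 2, "det"), (2, 3, "nsubj"), (3, 0, "root"), (4, 5, "det"),
    (5, 3, "obj"), (6, 8, "case"), (7, 8, "det"), (8, 3, "obl"),
]
print(mdd_by_group(sentence))
```

On this single sentence the functional relations average 1.25 words and the lexical ones about 2.67, the same direction as the corpus-level asymmetry the abstract reports.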
44. 【2607.01893】Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters
Link: https://arxiv.org/abs/2607.01893
Authors: Tianjian Yang, Meng Li
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Speculative decoding accelerates, target model verifies, decoding accelerates autoregressive, accelerates autoregressive generation, Speculative decoding
Comments: 10 pages, 5 figures
Abstract:Speculative decoding accelerates autoregressive generation by drafting a block of tokens that the target model verifies left-to-right, committing only the longest accepted prefix. Block (DLM-style) drafters predict the whole block in parallel, which is fast but trained with a full-block cross-entropy that supervises every position against the gold continuation -- even though inference discards every token after the first rejection. Recent acceptance-aware objectives patch this by reweighting the full-block loss; we instead use teacher-forced learning as a motivation for how supervision should concentrate on the accepted prefix. A mask-only block drafter has no input-side channel for gold-prefix conditioning, so AUF approximates that prefix-sensitive supervision on the loss side by keeping the cross-entropy support only through the drafter's first predicted failure. AUF is a single, detached change to the CE support -- no auxiliary objective, no verifier rollouts, and no change to the inference pipeline or the exactness contract. Within fixed drafter backbones and serving settings on Qwen3-8B, AUF raises the DFlash drafter's average emitted length $\tau$, averaged over six benchmarks, from 2.40 to 2.61, with a gain on every benchmark, and transfers to Domino's two-branch head (2.56 to 2.68). Two findings sharpen the picture: the decay-only baseline reaches higher token accuracy on the shared block mask yet decodes worse, and on DFlash, once AUF truncates the support, the standard exponential position-decay weighting becomes empirically inert.
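One reading of "keeping the cross-entropy support only through the drafter's first predicted failure" is sketched below in plain Python. The helper names, the inclusive treatment of the failing position, and the mean over kept positions are all assumptions for illustration; the paper's actual loss may normalize or weight differently.

```python
def auf_support(predicted, gold):
    """Return a 0/1 mask over block positions.

    Positions up to and including the first mismatch between the drafter's
    prediction and the gold token are kept; later positions are dropped,
    mirroring how verification discards tokens after the first rejection.
    """
    mask = []
    failed = False
    for p, g in zip(predicted, gold):
        mask.append(0 if failed else 1)
        if p != g:
            failed = True
    return mask


def masked_cross_entropy(token_nll, mask):
    """Average per-token negative log-likelihood over the kept positions."""
    kept = [nll for nll, m in zip(token_nll, mask) if m]
    return sum(kept) / len(kept)


predicted = [5, 7, 9, 4]  # drafter's argmax tokens for one block
gold = [5, 7, 2, 4]       # gold continuation
mask = auf_support(predicted, gold)
print(mask)
print(masked_cross_entropy([0.1, 0.2, 1.5, 3.0], mask))
```

In a real training loop the mask would be computed from the drafter's predictions without gradient flow, which is what "detached" suggests in the abstract.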
45. 【2607.01883】PairCoder++: Pair Programming as a Universal Paradigm for Verified Code-Driven Multimodal and Structured-Artifact Generation
Link: https://arxiv.org/abs/2607.01883
Authors: Junhao Chen, Xiang Li, Mingjin Chen, Boran Zhang, Henghaofan Zhang, Yibin Xu, Yuehan Cui, Fangsheng Weng, Fei Ma, Qi Tian, Ruqi Huang, Hao Zhao
Categories: Computation and Language (cs.CL)
Keywords: large language models, language models generate, models generate structured, generate structured artifacts, CAD models
Comments: Accepted by ACL 2026. Project Page: [this https URL](https://yisuanwang.github.io/PairCoder/)
Abstract:Code is the medium through which large language models generate structured artifacts: charts, scientific figures, vector graphics, CAD models, 3D scenes, and hardware designs are all produced by writing programs. In this regime single-pass inference is brittle, because the compiler, renderer, or simulator that decides whether the artifact exists is invisible to the model. We present PairCoder, which grounds review in the toolchain and realizes it as two-agent pair programming: a Driver agent writes the program, a Navigator agent reviews it against verification evidence (diagnostics, execution results, and renderings of the current artifact beside the target), and the two switch roles when errors persist. Across 17 public benchmarks and seven models from three vendors, PairCoder improves essentially every benchmark whose artifact is verifiable, on full official metric suites rather than execution alone (for example, Blender scene executability 0.20 to 0.78; TikZ compile rate up 10 to 30 points on every model), at 2.9 to 9.2 times single-model cost (about 7 times overall). The improvements concentrate where the toolchain provides an informative oracle and the baseline leaves headroom, and the method ties or mildly regresses where the oracle is weak; we frame pair programming as a reliable recipe for verified code-driven generation.
46. 【2607.01874】SkillCoach: Self-Evolving Rubrics for Evaluating and Enhancing Agentic Skill-Use
Link: https://arxiv.org/abs/2607.01874
Authors: Jiayin Zhu, Kelong Mao, Yudong Guo, Dengbo He, Sulong Xu, Simiu Gu, Yutao Yue
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: reusable operational layer, layer for LLM, encoding SOPs, domain rules, LLM agents
Comments:
Abstract:Skills are becoming a reusable operational layer for LLM agents, encoding SOPs, domain rules, tool workflows, scripts, and validation routines. In realistic skill repositories, overlapping skills make reliable skill-use difficult. Final verifier success is too coarse for both evaluation and training, since an agent may pass through trial and error while selecting distractor skills, skipping required steps, composing workflows incorrectly or omitting final checks. We introduce SkillCoach, a self-evolving rubric framework for evaluating and enhancing agentic skill-use. SkillCoach derives skill-grounded process rubrics from real rollouts and evaluates trajectories along four dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. It keeps the external verifier as a separate outcome signal, allowing process quality to be distinguished from accidental task success. The evolved rubrics further serve as process supervision for selecting high-quality training trajectories. Experiments show that evolved rubrics substantially improve evaluation quality, expose failures hidden by final accuracy, and provide stronger supervision signals than outcome-only filtering for enhancing agentic skill-use.
47. 【2607.01859】Safety Targeted Embedding Exploit via Refinement
Link: https://arxiv.org/abs/2607.01859
Authors: Joshua Adrian Cahyono
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Targeted Embedding Exploit, Safety Targeted Embedding, leaving uncertain, large language models, conducted predominantly
Comments:
Abstract:Safety training for large language models (LLMs) is conducted predominantly in English, leaving uncertain how well safety mechanisms generalize to low-resource languages and mixed-language code-switching. We show that this creates an epistemic gap in which models confidently generate harmful responses for inputs that fall outside the distribution of their safety training. To study this phenomenon, we introduce STEER (Safety Targeted Embedding Exploit via Refinement), a gradient-guided attack that identifies words contributing most strongly to the model's refusal behavior and iteratively translates them into low-resource languages to suppress refusal while preserving harmful intent. Across six open-source 8B-parameter models, STEER achieves attack success rates of up to 93.0% on JailbreakBench and 96.7% on AdvBench, outperforming random code-switching and Greedy Coordinate Gradient (GCG). The resulting prompts also transfer to GPT-4o-mini, achieving a 35.5% attack success rate without requiring access to the target model, suggesting that the underlying weakness is not specific to a single architecture. These findings demonstrate that safety mechanisms aligned primarily on English cannot be assumed to generalize across multilingual inputs. We argue that improving multilingual safety requires broader coverage during alignment and mechanisms that explicitly detect and abstain on out-of-distribution inputs.
48. 【2607.01852】Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts
Link: https://arxiv.org/abs/2607.01852
Authors: Valentin J. J. Kreileder, Johannes Reisinger, Andreas Fischer
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Language Models, Large Language, Augmented Generation Assessment, Retrieval Augmented Generation
Comments:
Abstract:Retrieval-Augmented Generation (RAG) systems use the question-answering capabilities of Large Language Models (LLMs) to access information outside their parameters. We evaluate whether cluster-based semantic chunking improves retrieval and answer quality compared to fixed-size and recursive chunking on long, structured academic theses, using the Retrieval Augmented Generation Assessment (RAGAs) framework. RAGAs-based faithfulness shows limited reliability in this setup. Performance on fixed versus document-specific questions varied substantially, likely related to the formatting of documents and preprocessing. Under the tested configuration, cluster-based chunking did not outperform simpler strategies.
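For readers unfamiliar with the two baseline strategies named in this abstract, the sketch below shows one common way to implement fixed-size and recursive chunking over characters. The function names, the separator order, and the character-based sizes are assumptions for illustration; the paper's configuration is not stated in the abstract, and cluster-based semantic chunking is not shown.

```python
def fixed_size_chunks(text, size, overlap=0):
    """Cut text into windows of `size` characters, each overlapping the last."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    return [text[i:i + size] for i in range(0, len(text), step)]


def recursive_chunks(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present, recursing into oversized parts."""
    if len(text) <= max_len:
        return [text] if text else []
    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= max_len:
                current = candidate  # keep packing parts into this chunk
                continue
            if current:
                chunks.append(current)
            if len(part) > max_len:
                # The part is still too long: retry with finer separators.
                chunks.extend(recursive_chunks(part, max_len, separators))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator left: fall back to hard cuts.
    return fixed_size_chunks(text, max_len)


print(fixed_size_chunks("abcdefghij", 4, overlap=1))
print(recursive_chunks("aa bb\n\ncc dd ee", 5))
```

The recursive variant respects paragraph and sentence boundaries where it can, which is why it is often preferred over fixed windows for structured documents.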
49. 【2607.01833】Non-synchronism in Global Usage of Research Methods in Library and Information Science from 1990 to 2019
Link: https://arxiv.org/abs/2607.01833
Authors: Chengzhi Zhang, Liang Tian
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: research methods, research, Information Science, methods, countries
Comments:
Abstract:The global development of Library and Information Science (LIS) is influenced by various factors such as the economy, society, culture, discipline, tradition, and more. Consequently, the research methods of LIS vary greatly among countries. To better understand these differences, we conducted a study of 5,281 research papers from 81 countries published in internationally representative journals over the past thirty years. We manually annotated the research methods used in some articles through content analysis, and subsequently developed and trained a deep learning model for automatic classification of research methods. Using this method, we conducted a comparative analysis of the usage of research methods in different countries. Our findings reveal that there are differences in the research methods used across countries, with each country having its unique research profile and distribution of research methods. Even when investigating the same topic, research methods can differ between countries. Our study also finds that the national and international distributions of research methods differ, and that these differences have decreased over the past 30 years. By highlighting the characteristics of discipline development in various countries from the perspective of research methods, our study can help guide discipline development at the national level. This study provides insights into the usage trends of research methods across different countries and highlights the unique characteristics of discipline development in each country. This information can be valuable in promoting collaboration and understanding between countries and in guiding discipline development at the national level.
50. 【2607.01829】Pre-Flight: A Benchmark for Evaluating Large Language Models on Aviation Operational Knowledge
Link: https://arxiv.org/abs/2607.01829
Authors: Alex Brooker, Tim Hughes
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large language models, customer facing assistants, Large language, aviation business operations, facing assistants
Comments: 9 pages, 1 figure, 2 tables. Benchmark available in inspect_evals (UKGovernmentBEIS/inspect_evals)
Abstract:Large language models (LLMs) are increasingly proposed for aviation business operations, from documentation and training generation to customer-facing assistants. General-purpose benchmarks do not measure whether a model reasons safely and correctly about aviation-specific operational knowledge, and the high-stakes, regulated nature of the domain makes that gap consequential. We present Pre-Flight, an open-source benchmark of 300 multiple-choice questions drawn from international standards and airport ground operations material, covering international airport ground operations, ICAO and US FAA regulations, aviation general knowledge and complex operational scenarios. Questions were authored and reviewed by practitioners with experience in air traffic management, ground operations and commercial flying. We evaluate a range of contemporary commercial and open-weight models using the Inspect evaluation framework, scoring by accuracy under a standard multiple-choice protocol, and we maintain the leaderboard on a rolling basis as new models are released. Against an informal expert reference of around 95%, obtained from a low-sample quiz of aviation professionals at a conference, even the strongest model evaluated (released in 2026) reaches 82.7%, having improved only gradually from roughly 75% in early 2025. A substantial and persistent gap below expert-level reliability therefore remains. We release the dataset, the evaluation harness and the results, and the benchmark is available within the community evaluations package distributed with inspect_evals. We argue that domain-specific evaluation of this kind is a necessary precondition for responsible deployment of generative AI in non-safety-critical aviation operations.
51. 【2607.01828】Gender Differences in Research Topic and Method Selection in Library and Information Science: Perspectives from Three Top Journals
Link: https://arxiv.org/abs/2607.01828
Authors: Chengzhi Zhang, Siqi Wei, Yi Zhao, Liang Tian
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: men prefer quantitative, opting for qualitative, prefer quantitative methods, Research, social sciences
Comments:
Abstract:Research in the social sciences has shown that there are gender differences in the selection of research methods, with women often opting for qualitative methods while men prefer quantitative methods. However, it is important to consider that research methods are generally chosen based on the research topic. To determine the influence of gender on research method selection, a study was conducted in the field of Library and Information Science, using a more fine-grained method classification system and an automatic classification model called CogFT, which is based on full-text cognition. The findings showed that, across a range of topics, women tend to use the Interview method while men prefer the Theoretical approach. The study offers insights into the specific research design processes that contribute to gender differences in method selection and suggests ways to promote gender inclusivity and equality in academia by considering research method use and guidance.
52. 【2607.01802】On the Limits of Steering Vectors for Preference-Aligned Generation
Link: https://arxiv.org/abs/2607.01802
Authors: Melanie Subbiah, Zara Hall, Kathleen McKeown
Categories: Computation and Language (cs.CL)
Keywords: controlled text generation, shaping model outputs, offering interpretable, text generation, training-free mechanisms
Comments:
Abstract:Steering vectors have emerged as a promising approach to controlled text generation, offering interpretable, training-free mechanisms for shaping model outputs. However, their practical generality remains poorly understood. We study the limits of steering vector generalization along three dimensions: trait expressibility, task transfer, and multi-trait composition. Using the PLUME writing personalization benchmark, we extract steering vectors for a range of preferences and evaluate them on summarization and email-writing tasks across two open-source models (Qwen2.5-7B-Instruct and Llama3.1-8B-Instruct). We find that steering effectiveness varies substantially across traits. We further show that steering effectiveness can degrade when vectors extracted from positive and negative style examples are transferred to downstream writing personalization tasks. Finally, we compare common methods for composing multiple steering vectors and find that all methods suffer significant drops in trait expression as more vectors are added, with a tradeoff between coherence and expressibility that requires per-setting hyperparameter tuning. Taken together, our results suggest that steering vectors face meaningful limits as a general-purpose tool for preference alignment.
53. 【2607.01800】Do LLMs Truly Generalize in the Molecular Domain? A Perturbation-Based Analysis
Link: https://arxiv.org/abs/2607.01800
Authors: Jiatong Li, Weida Wang, Changmeng Zheng, Shufei Zhang, Yatao Bian, Xiao-yong Wei, Qing Li
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, recently shown promise, discrete sequential tokens
Comments: 21 pages
Abstract:Large Language Models (LLMs) have recently shown promise in molecular discovery, yet a gap remains between their probabilistic nature over discrete sequential tokens and the rigid topological constraints of chemical space. This raises the question of whether molecular LLMs can generalize beyond the local neighborhoods induced by their sequence-based representations. To systematically investigate this question, we introduce a Molecular Perturbation framework that generates syntax-valid structural variants of training molecules under controlled Graph Edit Distance (GED) to probe the manifold regularity of molecular LLMs. Our analysis shows that even a single edit can cause substantial performance drops on common molecular tasks, revealing a narrow local trust region and fragile sensitivity to structural changes. Since similar molecules tend to exhibit similar properties, In-Context Tuning (ICT), which anchors predictions on structurally similar molecules, offers a natural way to mitigate such fragility. Our experiments also examine whether ICT confers robustness under controlled structural perturbations, and the results suggest that it can partially expand the local trust region and offer a promising direction for stabilizing molecular LLMs against structural variation.
54. 【2607.01792】PARTREP: Learning What to Repeat for Decoder-only LLMs
Link: https://arxiv.org/abs/2607.01792
Authors: Andikawati P Widjaja, Yongjun Kim, Hyounghun Kim, Jaeho Lee
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: natural language tasks, decoder-only LLMs excel, asymmetric information flow, information flow induced, language tasks
Comments: 15 pages and 7 figures (including appendix)
Abstract:While decoder-only LLMs excel at a vast array of natural language tasks, they suffer from an asymmetric information flow induced by causal attention: later tokens are richer in contextual grounding than earlier ones. A simple and effective remedy is prompt repetition -- just appending a second copy of the prompt before generation can redistribute grounding across positions and improve reasoning performance. However, full repetition of the original prompt doubles the KV cache footprint and quadruples attention cost during prefill, making it impractical for long-context settings. We propose PartRep, a selective augmentation method that appends only the most informative tokens -- rather than the entire prompt. We use token-wise negative log-likelihood (NLL) as a selection signal, motivated by the hypothesis that less predictable tokens are less recoverable from surrounding context and therefore benefit more from late-position repetition. To avoid the heavy cost of a full forward pass for scoring, we train a lightweight gate that predicts high-NLL tokens from early-layer hidden states, enabling token selection during mid-prefill via early exit. Across eight benchmarks (including MMLU, GSM8K, and RULER) and three model families (Qwen2.5, Llama3.2, Gemma4), PartRep retains most of the gains of full repetition while using only 59.4% of its KV cache and 79.0% of its prefill FLOPs.
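The selection idea and the cost argument in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the PartRep implementation: the per-token NLL values are given directly instead of being predicted by the paper's early-layer gate, the selected tokens are appended in their original order, and the function names and example numbers are hypothetical.

```python
def append_high_nll_tokens(tokens, nll, k):
    """Return the prompt followed by its k least predictable tokens.

    tokens: list of prompt tokens; nll: per-token negative log-likelihood.
    Selected tokens keep their original relative order (an assumption).
    """
    ranked = sorted(range(len(tokens)), key=lambda i: nll[i], reverse=True)
    chosen = sorted(ranked[:k])
    return tokens + [tokens[i] for i in chosen]


def relative_prefill_cost(prompt_len, appended_len):
    """Cost relative to the plain prompt, assuming KV cache grows linearly
    and attention cost quadratically with sequence length."""
    ratio = (prompt_len + appended_len) / prompt_len
    return {"kv_cache": ratio, "attention": ratio ** 2}


tokens = ["The", "capital", "of", "Freedonia", "is", "Zubrowka"]
nll = [0.2, 1.1, 0.1, 6.3, 0.3, 7.9]  # made-up scores for illustration
print(append_high_nll_tokens(tokens, nll, k=2))
print(relative_prefill_cost(100, 100))  # full repetition
print(relative_prefill_cost(100, 20))   # partial repetition
```

Full repetition gives a ratio of 2 for the KV cache and 4 for attention, matching the "doubles" and "quadruples" figures in the abstract, while appending only a fraction of the tokens keeps both ratios much closer to 1.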
55. 【2607.01774】Subliminal Clocks: Latent Time Modelling in Diffusion Language Models
Link: https://arxiv.org/abs/2607.01774
Authors: Maximo Rulli(1), Thomas Fontanari(1), Simone Petruzzi(1), Federico Alvetreti(1), Giorgio Strano(1), Donato Crisostomi(1), Giorgos Nikolaou(2), Tommaso Mencattini(2), Andrea Santilli(3), Emanuele Rodolà(1), Simone Scardapane(1), Alessio Devoto(3) ((1) Sapienza University of Rome, (2) EPFL, (3) Independent researcher)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Diffusion Language Models, Diffusion Language, Language Models, recently emerged, promising alternative
Comments: Equal contribution: Thomas Fontanari and Simone Petruzzi
Abstract:Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive models. Unlike standard diffusion-based approaches, DLMs are not explicitly conditioned on a timestep, raising a natural question: do these models internally represent denoising progress, and how is such information used downstream? In this work, we show that DLMs do in fact encode a latent representation related to the diffusion timestep within their residual streams. We find that this signal can be reliably extracted using probes across layers, indicating that denoising progress is decodable from internal activations. We further demonstrate that steering the model along a low-dimensional subspace associated with the inferred timestep allows us to systematically modulate its notion of denoising progress, leading to predictable changes in model confidence and entropy. Finally, we analyse the geometry of the identified representation, showing that it exhibits structured and interpretable properties in activation space, and shedding light on how such a signal is processed by these models.
56. 【2607.01763】Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training
Link: https://arxiv.org/abs/2607.01763
Authors: Meng Wang, Haohan Zhao, Wenzhuo Liu, Lu Yang, Geng Liu, Haiyang Guo, Guo-Sen Xie, Gaofeng Meng, Hongbin Liu, Fei Zhu
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: enables foundation models, preserving existing capabilities, post-training enables foundation, enables foundation, foundation models
Comments:
Click to view abstract
Abstract:Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a particularly attractive approach. In this work, we revisit this optimistic view through self-distillation policy optimization (SDPO). Our experiments show that SDPO can accelerate in-domain specialization when teacher signals are stable and well aligned, but it struggles to generalize to out-of-distribution scenarios. In continual post-training, SDPO exhibits stronger forgetting and can even collapse, whereas on-policy reinforcement learning methods such as GRPO adapt more conservatively and better preserve prior capabilities. Further analyses reveal that denser self-distillation induces larger drift in both parameter space and response space, and can amplify high-frequency formatting artifacts through a self-reinforcing teacher--student loop. These findings suggest that on-policy data alone is insufficient for continual learning. Dense self-distillation can accelerate specialization when teacher targets are stable and token-level supervision is reliable, but it should not be treated as a default stabilizer for continual post-training. Our code is available at this https URL.
57. 【2607.01733】Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving
Link: https://arxiv.org/abs/2607.01733
Authors: Ruchao Fan, Yiming Wang, Rui Zhao, Liliang Ren, Keqi Deng, Xiaoyang Chen, Ali Zare, Bo Ren, Yuxuan Hu, Junkun Chen, Yan Huang, Yelong Shen, Jinyu Li
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Keywords: shown promising results, leveraging extensive textual, remain unclear, extensive textual pretraining, integration has shown
Comments:
Click to view abstract
Abstract:Speech-LLM integration has shown promising results by leveraging extensive textual pretraining, yet its specific benefits for automatic speech recognition (ASR) remain unclear. We observe that as supervised ASR training data increases, the contribution of LLM priors becomes less evident, and simple speech-text joint training under-utilizes textual knowledge. We therefore propose Joint Speech-Text Interleaved Pretraining (JSTIP), an ASR-oriented pretraining strategy that constructs word-level and segment-level interleaved speech-text sequences within aligned pairs for speech-LLM architectures that accept continuous inputs. Experiments on 38k hours of ASR data show consistent entity accuracy improvement compared to ASR-only and joint speech-text training baselines. JSTIP achieves on-par entity recognition performance using domain transcription text compared to synthetic speech-text pairs, simplifying domain adaptation. Benefiting from textual pretraining and domain text data, JSTIP is competitive with open-source ASR and Speech-LLM systems in medical entity recognition. The zero-shot speech question answering behaviors further suggest that interleaving reduces the speech-text modality gap and preserves the LLM generative prior, which is likely the reason for the entity improvements on the ASR task.
58. 【2607.01728】Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing
Link: https://arxiv.org/abs/2607.01728
Authors: Licheng Zhang, Bach Le, Pengtao Zhao, Naveed Akhtar
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Software Engineering (cs.SE)
Keywords: standard quality assurance, quality assurance step, modern software release, Visual regression testing, software release pipelines
Comments:
Click to view abstract
Abstract:Visual regression testing (VRT) is a standard quality assurance step in modern software release pipelines. On every change, it re-renders user interface (UI) screenshots, compares each one against an approved baseline image, and routes any detected difference to a human reviewer who decides whether it is an intended update or an unintended regression. A widely used approach, especially in open-source and continuous-integration pipelines, is pixel-level comparison, which is semantically blind and treats rendering noise and genuine defects identically, producing large volumes of false positives that force developers and testers to spend substantial time and effort manually reviewing flagged differences at every release cycle. Industry tools apply machine learning to VRT, but lack public evaluation. More critically, no dataset or benchmark exists to support natural language descriptions of UI changes, a capability that tells testers what changed in words instead of leaving them to interpret a binary flag or a highlighted region. To address the gap, we propose a new task, Web UI Image Change Captioning (WUICC), which sits at the intersection of VRT and image difference captioning (IDC), and release WUICC-bench, the first dataset and benchmark for the task. We evaluate eleven representative IDC methods, together with two zero-shot general-purpose LLMs. We find that: (1) these methods tend to struggle in the Web UI domain due to its layout diversity, dense text, and fine-grained changes, and (2) the trained methods nevertheless suppress non-meaningful visual noise far more selectively than the pixel-level comparison VRT relies on, providing a solid foundation for future domain-specific research.
59. 【2607.01727】When Does Generating More Help? Disentangling Fixed-Source Synthesis from Source Expansion in Synthetic Data Scaling
Link: https://arxiv.org/abs/2607.01727
Authors: Xu Guo, Jian Tong, Zhihui Lu, Qipeng Guo
Categories: Computation and Language (cs.CL)
Keywords: Source Expansion, Synthetic data, FSS, materials or generators, Source
Comments:
Click to view abstract
Abstract:Synthetic data can be scaled along two routes: Source Expansion (SE), which enlarges the source by adding seed materials or generators, and Fixed-Source Synthesis (FSS), which holds the source fixed and scales the generation budget. Existing scaling studies typically expand the source as the data grows, conflating SE with FSS and leaving FSS underexplored. We isolate FSS by holding the seed-question pool and teacher model fixed, varying only the per-question response budget under Rejection Sampling (RS). We adapt the rectified scaling law to FSS, deriving it from how repeated sampling covers a fixed source. Empirically, the derived form, fit on low budgets, predicts performance at the held-out highest budget for every evaluated teacher--student pair. At matched total-sample budgets, SE and FSS are comparable at small budgets; at large budgets, adding seed questions outperforms spending the same budget on more responses. Within FSS, however, neither synthesizing additional questions from the existing seeds nor varying the synthesis protocol outperforms plain RS at matched budgets. FSS is thus a bounded scaling axis and a controlled setting for comparing synthesis protocols. We will release our code and data to facilitate further research.
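The Fixed-Source Synthesis setup described here (fixed question pool, fixed teacher, varying per-question budget under Rejection Sampling) can be sketched roughly as follows. The toy teacher, the correctness check, and all names are assumptions made for illustration and are not taken from the paper.

```python
import random

def rejection_sample(questions, teacher, is_correct, budget, rng):
    """Draw `budget` responses per question and keep only the correct ones."""
    kept = []
    for question in questions:
        for _ in range(budget):
            response = teacher(question, rng)
            if is_correct(question, response):
                kept.append((question, response))
    return kept

# Toy fixed source: three addition questions (hypothetical stand-in for a seed pool).
questions = [(1, 2), (3, 4), (5, 6)]

def toy_teacher(question, rng):
    """Noisy stand-in for a teacher model: off by -1, 0, or +1 at random."""
    a, b = question
    return a + b + rng.choice([-1, 0, 1])

def toy_is_correct(question, response):
    """Accept a response only if it equals the true sum."""
    return response == sum(question)

rng = random.Random(0)
for budget in (1, 4, 16):
    # FSS: `questions` and `toy_teacher` stay fixed; only `budget` changes.
    kept = rejection_sample(questions, toy_teacher, toy_is_correct, budget, rng)
    print(budget, len(kept))
```

In this framing, Source Expansion would correspond to growing `questions` (or swapping the teacher) rather than raising `budget`.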
60. 【2607.01690】Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing
Link: https://arxiv.org/abs/2607.01690
Authors: Joshua Penman
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Negation Neglect, documents' core claims, Goggles, explicitly annotated, documents' core
Comments: 20 pages, 10 figures, 2 tables. Code at [this https URL](https://github.com/JoshuaSP/epistemic-goggles) and generated documents, questions, and teacher rollouts at [this https URL](https://huggingface.co/datasets/joshuapenman/epistemic-goggles-artifacts)
Click to view abstract
Abstract:Finetuning a language model on documents that are explicitly annotated as fictional results in a model that still actually believes the documents' core claims, an effect known as Negation Neglect. In our evaluations, models trained on documents prefixed and suffixed with such annotations correctly identify the relevant claims as fictional only about 9% of the time. To address this, we introduce Goggles, a learned module that intervenes on the finetuning gradient rather than the data. During supervised finetuning, a Goggles module edits the gradients an LLM LoRA receives, imparting a chosen epistemic frame (the stance the model takes toward the nature of what it reads) to whatever the documents teach. A Goggles instance is trained once for a given base model, frame, and LoRA configuration, then applied frozen to documents it was never trained on. Trained through Goggles on those same documents, now carrying no fictional annotation, the model flags the content as fictional roughly 91% of the time, while preserving capability (GPQA and TruthfulQA match or exceed baseline). The same architecture supports other frames: a Goggles instance can be trained to treat documents as "part of an AI safety evaluation by Redwood Research" rather than simply as fiction. The imparted frame persists under continued finetuning that pushes back toward the claim, where prior interventions revert. Goggles suggests a path toward training language models on known-misaligned data without absorbing the behaviors that data demonstrates.
61. 【2607.01647】AgenticDataBench: A Comprehensive Benchmark for Data Agents
Link: https://arxiv.org/abs/2607.01647
Authors: Zhaoyan Sun, Shan Zhong, Daizhou Wen, Jiaxing Han, Guoliang Li, Ying Yan, Peng Zhang, Yu Su, Xiang Qi, Baolin Sun, Chengyuan Yang, Tao Fang, Huaiyu Ruan
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: heterogeneous raw data, Data, derive actionable insights, modern society, Data science
Comments:
Click to view abstract
Abstract:Data science aims to derive actionable insights from heterogeneous raw data, unlocking the value of the massive amounts of data generated in modern society. Automating this process is essential to reducing labor-intensive efforts for data scientists and enabling scalable data-driven applications. Recently, large language model (LLM)-based data agents have emerged as a promising solution to automate data science workflows. However, the field lacks comprehensive benchmarks to rigorously evaluate these agents across diverse scenarios at fine granularity. To address this gap, we propose AgenticDataBench, a comprehensive benchmark featuring realistic tasks spanning diverse domains with fine-grained ground-truth labels. This enables evaluations to capture the diversity and complexity of data science workflows and the detailed performance of agents. First, to cover diverse domains, we collect real datasets and tasks from 15 vertical domains, including 5 real-world B2B use cases from a leading fintech company. Second, to remove redundancy in real-world tasks and generate high-quality tasks for domains lacking real data, we introduce data science skills, recurring data-centric operational patterns, and quantify benchmark coverage by the number of skills included. Representative skills are extracted from large-scale task solutions on Stack Overflow using skill-aligned hierarchical clustering. Third, for real-world business tasks, we select task-solution pairs that maximize diversity in skill composition, ensuring broad coverage of practical scenarios. Fourth, to generate realistic tasks for diverse domains without real tasks, we propose a systematic LLM-based task generation approach to create workflows and tasks based on these skills. Finally, we evaluate state-of-the-art data agents using our annotated benchmark and open-sourced testbed, providing detailed skill-level insights.
62. 【2607.01602】ProWAFT: A ROMA-LPD Instance for Workload-Aware and Dynamic Fault Tolerance in FPGA-Based CNN Accelerators
Link: https://arxiv.org/abs/2607.01602
Authors: Xinxin Chen, Haoran Qiao, Yiming Guo, Kecheng Luo, Siyuan Feng, Jingwen Ma
Categories: Computation and Language (cs.CL)
Keywords: SRAM-based FPGAs provide, latency-constrained CNN inference, SRAM-based FPGAs, network edge, FPGAs provide
Comments: 13 pages
Click to view abstract
Abstract:SRAM-based FPGAs provide an attractive platform for energy- and latency-constrained CNN inference at the network edge, yet transient faults can lead to silent errors that compromise reliability. Always-on redundancy (e.g., full TMR) improves correctness but incurs substantial performance and energy overhead, while reactive recovery may introduce unacceptable latency on the critical path. We propose ProWAFT, a proactive workload-aware fault-tolerance framework for FPGA-based CNN accelerators that uses partial reconfiguration to selectively apply TMR across reconfigurable partitions. ProWAFT quantifies workload criticality, models fault propagation and reconfiguration overhead, and selects configurations that minimize a composite objective over latency, energy, and reliability risk. Implemented on a Xilinx Zynq UltraScale+ ZCU104 platform with six reconfigurable regions and evaluated on a 500-task trace derived from ResNet-18, MobileNetV2, and EfficientNet-Lite under time-varying SEU injection, ProWAFT achieves lower composite cost than static TMR and reactive reconfiguration while maintaining high task success rate and near-baseline throughput with low online decision overhead.
63. 【2607.01600】BOUNDARY_SYNC: Measuring Communication-Induced Representational Coupling in Multi-Agent LLM Systems
Link: https://arxiv.org/abs/2607.01600
Authors: Zewen Liu
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: large language models, Coupling Amplification Factor, CAF, language models, outputs to converge
Comments: 18 pages, 3 figures, 2 tables
Click to view abstract
Abstract:As large language models (LLMs) are deployed as communicating agents, does inter-agent communication cause outputs to converge? We introduce BOUNDARY_SYNC, a protocol measuring representational coupling via the Coupling Amplification Factor (CAF = JSD_cond / JSD_baseline), where CAF < 1 indicates homogenization and CAF > 1 indicates diversification. In controlled GPT-4o experiments (N=30, ~9,900 API calls), we measure coupling in text and image communication. Key findings: (1) text communication causes significant homogenization (CAF=0.803 [0.740, 0.873], d=1.30, p<0.001), confirmed by no-communication ablation and prompt-perturbation controls; (2) image communication also homogenizes under within-modality baselines (CAF=0.834 [0.811, 0.858]), with comparable proportional effect; (3) group size moderates coupling direction -- K=5 produces homogenization while K=3 yields CAF > 1.0 (point estimates 1.14 and 1.06, CI pending), suggesting a directional shift toward diversification; (4) cross-model replication shows extreme variation (CAF 0.034-0.803), with DeepSeek dominated by format artifacts; (5) coupling is stateless -- driven by prompt context rather than cumulative updating, with continuous consensus producing monotonic convergence. These results establish LLM agent coupling as real, measurable, and controllable at the prompt level, with direct implications for multi-agent system design.
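The ratio CAF = JSD_cond / JSD_baseline can be made concrete with a small sketch. The abstract does not say how the compared distributions are constructed, so the toy probability vectors below are purely illustrative assumptions. Jensen-Shannon divergence is computed in bits.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def coupling_amplification_factor(jsd_cond, jsd_baseline):
    """CAF below 1 means outputs moved closer together than in the baseline."""
    return jsd_cond / jsd_baseline

# Hypothetical output distributions of two agents over three response categories.
baseline_a, baseline_b = [0.7, 0.2, 0.1], [0.1, 0.2, 0.7]   # no communication
cond_a, cond_b = [0.5, 0.3, 0.2], [0.3, 0.3, 0.4]           # after communication

caf = coupling_amplification_factor(jsd(cond_a, cond_b), jsd(baseline_a, baseline_b))
print(f"CAF = {caf:.3f}")
```

In this toy case the communicating agents are closer to each other than the baseline pair, so the ratio falls below 1, which the abstract reads as homogenization.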
64. 【2607.01595】Safe and Adaptive Cloud Healing: Verifying LLM-Generated Recovery Plans with a Neural-Symbolic World Model
Link: https://arxiv.org/abs/2607.01595
Authors: Junyan Tan, Haoran Lin, Siyuan Guo, Yichen Fang, Xinyue Luo, Tianyu Shen, Zeyu Qiao
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: ensuring service reliability, Deep Reinforcement Learning, Large Language Models, integrate Large Language, continue to escalate
Comments: 13 pages
Click to view abstract
Abstract:As the scale and complexity of cloud-based AI systems continue to escalate, ensuring service reliability through rapid fault detection and adaptive recovery has become a critical challenge. While existing approaches integrate Large Language Models (LLMs) for semantic understanding and Deep Reinforcement Learning (DRL) for policy optimization, they often rely on sequential, loosely coupled architectures that underutilize the generative and reasoning capabilities of LLMs. In this paper, we propose a paradigm shift with PASE, a Planning-Aware Semantic self-healing engine, a novel fault self-healing framework that reconceptualizes recovery as a neuro-symbolic program synthesis task. PASE employs an LLM as a core Plan Synthesis Engine to generate structured recovery plans from a library of semantic primitives. A Neural-Symbolic World Model verifies plan feasibility through simulation, while a Meta-Prompt Optimizer, trained via DRL, learns to generate optimal prompts that guide the LLM's planning process. This tight reason-plan-verify-adapt loop enables dynamic, context-aware recovery strategy generation beyond predefined action spaces. Experiments on a real-world cloud fault injection dataset demonstrate that PASE significantly outperforms state-of-the-art methods, reducing average system recovery time by over 40% and improving fault detection accuracy in unknown fault scenarios. Our framework advances autonomous system management by unifying LLM-based reasoning with model-assisted verification and meta-learned guidance.
65. 【2607.01585】ADVENT: LLM-Driven Automatic Predicate Invention for ILP
Link: https://arxiv.org/abs/2607.01585
Authors: Tingting Yu, Pei-Cing Huang, Chan Hsu, Chan-Tung Ku, Yihuang Kang
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Inductive Logic Programming, Logic Programming, Inductive Logic, bottleneck in Inductive, hypothesis space
Comments:
Click to view abstract
Abstract:Predicate invention (PI), the creation of new predicates to extend the hypothesis space, remains a critical bottleneck in Inductive Logic Programming (ILP). Existing methods rely on domain expertise and produce semantically opaque predicates, hindering adaptation to unfamiliar domains and cross-task reuse. We present ADVENT, an LLM-driven PI mechanism for ILP. ADVENT pairs LLM abductive generation with Prolog deductive verification, forming an iterative loop in which concrete execution results guide the LLM to refine candidate predicates. The mechanism leverages Large Language Models to identify implicit patterns in structured relational data and invent auxiliary predicates with meaningful names and definitions. Invented predicates and learned rules accumulate in a knowledge pool for cross-task reuse. Experiments on nine poker-hand concepts across seven LLMs show that LLM-driven PI achieves 58% success rate where ILP alone fails entirely, formal verification raises this to 80%, and the knowledge pool yields gains up to +31 percentage points, while producing human-interpretable rules. These results suggest that ADVENT offers a promising direction for automating predicate invention and enabling cross-task knowledge reuse in ILP.
66. 【2607.01581】Beyond Skepticism: Evaluating LLMs Pedagogical Intent Reasoning with the Adaptive Pedagogical Vigilance Framework
Link: https://arxiv.org/abs/2607.01581
Authors: Minghao Chen, Ruihan Zhou, Jiayi Tang, Zihan Xu, Bowen Huang, Yuxin Liu
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, communication remains underexplored, capacity of Large, Intent Inference Engine
Comments: 22 pages
Click to view abstract
Abstract:The capacity of Large Language Models (LLMs) to reason about pedagogical intent within instructional communication remains underexplored, particularly in educational domains such as translation pedagogy. To address this, we propose the Adaptive Pedagogical Vigilance (APV) framework, a novel computational formalism that reframes communicative vigilance as an adaptive mechanism for optimizing learning through intent inference. APV formalizes the problem via a Bayesian Pedagogical Intent Inference Engine (PIIE), which models how instructors select content to maximize pedagogical utility and how vigilant learners should inversely reason about latent instructional configurations -- encompassing genre, stance, and incentives. We evaluate APV through a three-tier hierarchy: distinguishing instructional genre, reasoning about structured pedagogical setups, and generalizing to authentic educational discourse. Experiments on leading LLMs (e.g., GPT-4o, Claude 3.5) show that APV substantially improves model vigilance. It achieves the strongest discrimination between pedagogical and exposure-based content, correlates highly with human judgments ($r=0.958$), and maintains robust performance on naturalistic data where baseline methods degrade. This work establishes a unified framework for assessing and enhancing LLMs' understanding of pedagogical motives, advancing the development of more reliable AI-assisted learning systems.
67. 【2607.01557】DiPS: Dialogue Policy Selection for High-Stakes Persuasion Agents
Link: https://arxiv.org/abs/2607.01557
Authors: Tianyi Zhang, Mousumi Das, Abrar Anwar, Jesse Thomason, David Traum
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models, Dialogue Policy Selection, Large
Comments: Proceedings of the 27th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2026)
Click to view abstract
Abstract:Large Language Models (LLMs) often struggle with persuasion in high-stakes scenarios. People's individual personalities and concerns require tailored strategies rather than a one-size-fits-all approach. To address this challenge, we focus on a fire-rescue scenario in which an operator must persuade a resident to evacuate as a high-stakes persuasion domain and propose Dialogue Policy Selection (DiPS), a Q-learning framework to dynamically select persuasion strategies adapted to the evolving conversational context. Specifically, a critic, trained to maximize the chance of evacuation success, selects a persuasion policy at each turn based on the resident's recent [...]. We then evaluate DiPS against multiple baselines in both simulated and real human interactions. We find that DiPS achieves higher evacuation success than a zero-shot LLM and a generic RAG-augmented approach.
68. 【2607.01538】Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
Link: https://arxiv.org/abs/2607.01538
Authors: Siddharth Gollapudi, Nilesh Gupta, Prasann Singhal, Sewon Min
Categories: Computation and Language (cs.CL)
Keywords: Language models, raise an intriguing, relevant answer, directly generating, generating a relevant
Comments:
Click to view abstract
Abstract:Language models (LMs) raise an intriguing alternative to vector-based retrieval: conditioning on an in-context corpus and directly generating a relevant answer. However, prior work has largely focused on proprietary systems or the smaller-scale reranking task, leaving corpus-scale in-context retrieval largely unexplored. In this work, we present the first systematic study of in-context retrieval on two scales practical retrievers demand: million-token corpora and length-generalization far beyond training-time sizes. We first introduce BlockSearch, a 0.6B LM retriever whose architectural and training modifications improve over prior LM baselines and length-generalize up to 10 times beyond its training regime. Nevertheless, retrieval still collapses under more extreme extrapolation. We trace this failure to an attention dilution effect: as the corpus grows, irrelevant documents dominate the softmax denominator, reducing the normalized mass on the gold document even when its pre-softmax score stays high. Motivated by this analysis, we introduce length-aware adjustments to the attention softmax and document-level sparse attention. With these modifications, at the million-token scale, our model matches dense retrieval on widely studied benchmarks (e.g., MS MARCO and NQ), while outperforming the concurrent model MSA despite being 7 times smaller. Furthermore, it significantly outperforms dense retrieval on tasks requiring entirely different notions of similarity, such as LIMIT, achieving a 3 times higher score. Together, our results position in-context retrieval as a promising alternative to classical retrieval while emphasizing attention control under extreme context growth as a new challenge.
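The attention dilution effect described in this abstract follows directly from the softmax denominator, and a toy calculation may help. The sketch below assumes one gold document with a fixed pre-softmax score and n distractors that all share a lower score; the scores are made up, and the paper's length-aware adjustments are not modeled.

```python
import math

def gold_attention_mass(gold_score, distractor_score, n_distractors):
    """Softmax weight on the gold document among n equally scored distractors."""
    gold = math.exp(gold_score)
    return gold / (gold + n_distractors * math.exp(distractor_score))

# The gold score never changes; only the corpus size grows.
masses = [gold_attention_mass(5.0, 0.0, n) for n in (10, 1_000, 100_000)]
for n, mass in zip((10, 1_000, 100_000), masses):
    print(f"{n:>7} distractors -> gold mass {mass:.4f}")
```

Even with a 5-point logit advantage, the normalized mass on the gold document shrinks steadily as distractors are added, which is the failure mode the authors describe.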
69. 【2607.01523】Multi-Head Recurrent Memory Agents
Link: https://arxiv.org/abs/2607.01523
Authors: Jiatong Li, Samuel Yeh, Sharon Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: iteratively consolidating input, agents extend LLMs, arbitrarily long contexts, fixed-size memory window, memory agents extend
Comments: 19 pages, 11 figures, 5 tables
Click to view abstract
Abstract:Recurrent memory agents extend LLMs to arbitrarily long contexts by iteratively consolidating input into a fixed-size memory window. Despite their scalability, these agents exhibit a well-documented reliability problem: end-to-end performance degrades systematically as context length grows. We diagnose this failure by decomposing performance into two factors--memory capture and memory retention--and quantitatively confirm that retention is the dominant bottleneck. Retention collapses because existing designs maintain memory as a monolithic text block, forcing every update to risk overwriting previously retained content. Motivated by this diagnosis, we propose Multi-Head Recurrent Memory (MHM), a general, training-free framework that partitions memory into independent heads governed by a stage-wise select-then-update strategy. At each step, exactly one head is selected for update while the remaining heads are structurally shielded from overwriting, shifting the burden of retention from model behavior to architectural design. As a lightweight instantiation, we introduce Least-Recently-Updated MHM (MHM-LRU), which guarantees uniform head utilization with zero additional token overhead. Extensive experiments on long-context benchmarks show that MHM-LRU substantially improves both retention and end-to-end accuracy across the 100K--1M token range, where baselines degrade sharply. On RULER-HQA at 896K tokens, MHM-LRU improves the memory retention rate from less than 30% to 73.96%. These gains generalize across model families, scales, and task types, positioning architectural optimization as a practical and cost-efficient path toward reliable long-context recurrent memory.
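The select-then-update strategy with a least-recently-updated rule can be sketched as a small data structure. This is an illustration under assumptions, not the authors' implementation: the class name, the `consolidate` callback (a stand-in for the LLM call that merges a chunk into a memory head), and the tie-breaking rule are all hypothetical.

```python
class LRUMultiHeadMemory:
    """Memory split into independent heads; each step rewrites exactly one head."""

    def __init__(self, num_heads):
        self.heads = [""] * num_heads
        self.last_updated = [-1] * num_heads  # -1 means "never updated"
        self.step = 0

    def select_head(self):
        """Pick the least recently updated head (lowest index wins ties)."""
        return min(range(len(self.heads)), key=lambda i: self.last_updated[i])

    def update(self, chunk, consolidate):
        """Merge `chunk` into the selected head; all other heads stay untouched."""
        head = self.select_head()
        self.heads[head] = consolidate(self.heads[head], chunk)
        self.last_updated[head] = self.step
        self.step += 1
        return head

def toy_consolidate(old_memory, chunk):
    """Placeholder for an LLM consolidation call: simply overwrite with the chunk."""
    return chunk

memory = LRUMultiHeadMemory(num_heads=3)
order = [memory.update(f"chunk-{i}", toy_consolidate) for i in range(7)]
print(order)
print(memory.heads)
```

Because selection depends only on update timestamps, the heads are visited in round-robin order and no extra tokens are spent deciding where to write, which matches the uniform-utilization property the abstract claims for MHM-LRU.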
70. 【2607.01517】Parameter Golf: What Really Works?
Link: https://arxiv.org/abs/2607.01517
Authors: Prashanna Mani Paudel, Shivanand Venkanna Sheshappanavar
Categories: Computation and Language (cs.CL)
Keywords: strict artifact budget, language model, language model improve, artifact budget, Parameter Golf posed
Comments:
Click to view abstract
Abstract:How far can a language model improve under a strict artifact budget? Parameter Golf posed this question as an open community challenge in which participants trained the best language model, with the complete artifact (training code + compressed weights) required to fit within 16 MB and be trained in under ten minutes on 8xH100 SXM GPUs. Quality was measured in bits-per-byte (BPB), the average number of bits required to encode each byte of unseen text. We analyze 2,037 pull requests and 1,430 clean scored submissions from the contest, build a taxonomy of 84 optimization techniques, and measure each technique's contribution to BPB. The verified leaderboard score dropped from 1.2244 to 1.058 BPB across three phases -- a 13.6% reduction, despite individual techniques rarely improving BPB by more than 1%. We show that most gains in techniques shrink across competitive submissions, isolating the few methods that improve performance across stacks.
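The bits-per-byte metric and the reported 13.6% reduction can be checked with a short calculation. The BPB function below follows the usual definition (total code length in bits divided by the UTF-8 byte count); whether the contest's scoring script computes it exactly this way is an assumption.

```python
import math

def bits_per_byte(token_nlls_nats, text):
    """Total negative log-likelihood converted to bits, divided by UTF-8 bytes."""
    total_bits = sum(token_nlls_nats) / math.log(2)
    return total_bits / len(text.encode("utf-8"))

def relative_reduction(before, after):
    """Fractional improvement when a lower score is better."""
    return (before - after) / before

# Leaderboard scores quoted in the abstract.
reduction = relative_reduction(1.2244, 1.058)
print(f"{100 * reduction:.1f}%")  # 13.6%
```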
71. 【2607.01502】From Monolingual to Multilingual: Evaluating Mamba for ASR in South African Languages
Link: https://arxiv.org/abs/2607.01502
Authors: Jesujoba O. Alabi, Julian Herreilers, Badr M. Abdullah, Dietrich Klakow
Categories: Computation and Language (cs.CL)
Keywords: including Conformer-based models, newer state space, including Conformer-based, Recent advances, state space models
Comments: under review
Click to view abstract
Abstract:Recent advances in automatic speech recognition (ASR) have explored different sequence models, including Conformer-based models and newer state space models such as Mamba. Although prior work has evaluated these architectures in multiple languages, their effectiveness in African languages remains underexplored. In this work, we evaluate Mamba for ASR on seven South African languages. In monolingual experiments, each model is trained on 50 hours of speech per language, and we compare Mamba to a Conformer baseline of similar parameter scale. Mamba achieves similar recognition accuracy to Conformer while using fewer computational resources and training faster. We further evaluate generalization in this setting and find that both models struggle to generalize to speech that is much longer than what they were trained on. We then study multilingual ASR using Mamba models, where the baseline is pooling all languages together. On top of this, we tested three extensions: training with language-family information by adding both language and language-family embeddings as biases to the downsampled acoustic representations, and multitask learning with a CTC ASR objective and a language identification (LID) head. We find that multilingual training consistently improves performance over monolingual training. However, adding explicit language information does not improve in-domain performance but does improve cross-corpus robustness. We conducted ablation studies in low-resource multilingual settings using 5-hour and 10-hour per-language training data, where we observed gains from using language embeddings and further demonstrated that removing or altering them hurt model performance. Lastly, we analysed these embeddings and find that they do not capture linguistic similarity in a typological sense, but instead act as task-specific control vectors.
72. 【2607.01464】Comparing Architectures for Supervised Political Scaling
Link: https://arxiv.org/abs/2607.01464
Authors: Anna Golub, Sebastian Padó
Categories: Computation and Language (cs.CL)
Keywords: positioning political actors, Text scaling, positioning political, political actors, political analysis
Comments:
Click to view abstract
Abstract:Text scaling, the task of positioning political actors on an ideological scale, is a fundamental task in political analysis. To ease the need for manual analysis, various NLP methods have been proposed for this task, including classification- and regression-based approaches, showing successes as well as limitations. The goal of our paper is to consolidate the state of the art in this area. We ask two questions: (a) Can the performance of scaling methods be improved by predicting scales not individually but jointly? (b) Is there a middle ground between classification and regression?
73. 【2607.01457】Grounded Optimization: A Layered Engineering Framework for Reducing LLM Hallucination in Automated Personal Document Rewriting
Link: https://arxiv.org/abs/2607.01457
Authors: Shashank Indukuri, Adarsh Agrawal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: anachronistic technology injection, applicant tracking systems, general text generation, cross-domain terminology contamination, introducing hallucination failures
Comments: 13 pages, 1 figure. Equal contribution by both authors. Code and data: [this https URL](https://github.com/shashank-indukuri/grounded-optimization)
Click to view abstract
Abstract:Large language models (LLMs) are increasingly applied to resume optimization for applicant tracking systems, introducing hallucination failures distinct from general text generation: anachronistic technology injection, cross-domain terminology contamination, structural mutation, and content fabrication. We present Grounded Optimization, a five-layer framework combining temporal context validation, deterministic contamination detection, structural invariant enforcement, prompt-level grounding, and an evaluator agent. In ablation experiments across three LLMs, four temperature settings, and six layer configurations on 25 synthetic resumes spanning 14 industries, undefended baselines produce 2.48-5.36 detected hallucinations per resume. Among detectors independent of the active defenses, temporal hallucinations are reduced by 50-95% across all conditions; overall detected hallucination rate falls to 0.04-0.24. Prompt-level grounding alone achieves zero detected hallucinations at low temperature with a capable instruction-following model; higher temperatures and weaker models reveal the need for the deterministic layers as a complement. We release the contamination taxonomy, evaluation code, and raw data.
74. 【2607.01444】On the Utility and Factual Reliability of Pruned Mixture-of-Experts Models in the Biomedical Domain
Link: https://arxiv.org/abs/2607.01444
Authors: Atsuki Yamaguchi, Szymon Palucha, Léo Bijar, Aline Villavicencio, Nikolaos Aletras
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: offer inference speedups, impose substantial memory, substantial memory requirements, models offer inference, remain loaded
Comments: Under review
Abstract:Mixture-of-Experts (MoE) models offer inference speedups via selective activation but impose substantial memory requirements because the whole network must remain loaded. Structured expert pruning is a practical approach for reducing deployment costs in resource-constrained settings. However, prior studies primarily evaluate benchmark utility, leaving the effect of pruning on factual reliability underexplored, particularly in high-stakes domains such as biomedicine. In this paper, we investigate how domain-specific expert pruning affects both utility and reliability. We assess four MoE models, six pruning methods, and multiple pruning ratios across generation and classification tasks under in-domain (biomedical) and cross-domain settings. Results reveal that moderate pruning preserves in-domain utility without immediate reliability decline, although hallucination risks increase at extreme pruning ratios. When shifting to the general domain, both utility and reliability degrade rapidly. These findings indicate that safe compression depends heavily on the task and domain. Evaluating pruned MoE models solely on utility is inadequate for high-stakes deployment without reliability assessment.
75. 【2607.01440】FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning
Link: https://arxiv.org/abs/2607.01440
Authors: Zhiyun Zhang, Liwen Sun, Xiang Qian, Chenyan Xiong
Categories: Computation and Language (cs.CL)
Keywords: clinical decisions require, decisions require transparent, require transparent justification, transparent justification grounded, Faithful reasoning
Comments: 15 pages, 5 figures
Abstract:Faithful reasoning is essential in medicine, where clinical decisions require transparent justification grounded in reliable evidence. Current medical LLMs either lack active access to evidence or use retrieved evidence without supervising how it should be appraised and applied during reasoning. To address this, we formalize evidence-based medicine principles as process-level criteria and introduce FaithMed, a framework that combines clinician-designed, automatically refined rubrics with reinforcement learning using step-level process reward assignment and advantage grouping. Across seven medical benchmarks, FaithMed improves over agentic-search baselines (+9% on average) and outcome-only RL (+5.8%), while raising average evidence-based medicine rubric scores over agentic-search Qwen3 baselines (+15.5%). This work demonstrates that explicit step-level supervision can improve both task success and the faithfulness of the reasoning process. Code is available at this https URL.
76. 【2607.01431】IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs
Link: https://arxiv.org/abs/2607.01431
Authors: Samir Abdaljalil, Erchin Serpedin, Hasan Kurban
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: isomorphic cross-domain science, cross-domain science problem, LLM evaluation, retrieval in LLM, science problem pairs
Comments:
Abstract:We introduce ISOSCI, a benchmark of isomorphic cross-domain science problem pairs that separates reasoning ability from domain knowledge retrieval in LLM evaluation. Each pair shares identical logical structure but requires different domain-specific knowledge, enabling controlled attribution of reasoning-mode gains. Across five model pairs spanning four model families, we find that 91.3% of reasoning-mode gains are knowledge-dependent rather than structure-invariant (63/69 gains; Wilson 95% CI [82.3%, 96.0%]), directly challenging the assumption that chain-of-thought reasoning improves short-horizon procedural scientific problem-solving. Reasoning toggles on highly capable models provide less than 5 percentage points accuracy gain across all domains, and a reasoning-specialized model (o3-mini) that outperforms its standard counterpart on GPQA Diamond (+19.2 percentage points) underperforms on ISOSCI (-24.7 percentage points), showing that benchmark choice determines conclusions about reasoning utility. We release ISOSCI at this https URL
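The Wilson interval quoted in the abstract above can be checked from the reported counts (63 of 69 gains). The following is a minimal sketch, assuming the usual z = 1.96 for a 95% interval; the function name `wilson_interval` is illustrative and does not come from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 assumed for 95%)."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# Counts reported in the abstract: 63 knowledge-dependent gains out of 69.
low, high = wilson_interval(63, 69)
print(f"{63 / 69:.1%} [{low:.1%}, {high:.1%}]")  # prints "91.3% [82.3%, 96.0%]"
```

Under these assumptions the result agrees with the reported 91.3% and CI [82.3%, 96.0%].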
77. 【2607.01420】MultAttnAttrib: Training-Free Multimodal Attribution in Long Document Question Answering
Link: https://arxiv.org/abs/2607.01420
Authors: Dang Quang Thien Tran, Quang V. Dang, Vinamra Tyagi, Sai Soorya Rao Veeravalli, Trang Nguyen, Ryan A. Rossi, Franck Dernoncourt, Nedim Lipka, Koustava Goswami, Samyadeep Basu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: accurately attributing generated, attributing generated answers, accurately attributing, systems are increasingly, increasingly deployed
Comments: 25 pages (8 main, 17 references + appendix), 15 figures, Submitted to EMNLP 2026 Conference (Long Paper)
Abstract:As grounded QA systems are increasingly deployed in AI assistants, accurately attributing generated answers to evidence is critical for user trust and model safety. While unimodal attributions have been explored in depth, the multimodal setting remains relatively under-researched. As a result, we introduce MultAttnAttrib, a training-free attribution-generation method that leverages a model's prefill pass, selected attention heads, and calibrated thresholds to locate source evidence within a document. To establish baseline results for the method, we introduce MultAttrEval, a complementary benchmark dataset annotated with fine-grained, ground-truth attributions for answer components grounded in multimodal source documents. To our knowledge, this is the first evaluation dataset designed specifically for multimodal attribution in long-form documents. Experimental results show that MultAttnAttrib consistently outperforms a variety of attribution-generation methods, including several strong prompting-based approaches and matches the latest frontier models such as GPT 5.4. Our method not only substantially improves attribution accuracy for both unimodal and multimodal attribution types, but also produces attributions at up to one-seventh of the direct inference latency compared to prompting on the same base model.
78. 【2607.01392】Multi-Objective Exploration and Preference Optimization via Mutual Information
Link: https://arxiv.org/abs/2607.01392
Authors: Hongyan Xie, Yikun Ban, Ruiyu Fang, Zixuang Huang, Deqing Wang, Jianxin Li, Shuangyong Song
Categories: Computation and Language (cs.CL)
Keywords: Aligning large language, Aligning large, conflicting preference dimensions, preference vectors, large language models
Comments: Accepted at ECML/PKDD 2026
Abstract:Aligning large language models with diverse and heterogeneous human values requires multi-objective alignment methods to effectively trade off conflicting preference dimensions. Current methods achieve this trade-off by training policies conditioned on preference vectors and leveraging online direct preference optimization. However, exploration uncertainty can cause the reward distributions of responses generated under different preference vectors to overlap, and the generated responses may fail to effectively align with the corresponding preference vectors. In this paper, we propose Multi-Objective Exploration and Preference Optimization via Mutual Information (MI-EPO), an information-theoretic framework. It unifies multi-objective exploration and alignment by maximizing the joint conditional mutual information among generated responses, preference feedback, and preference vectors. By incorporating a probabilistic routing mechanism, MI-EPO naturally decomposes objective alignment and preference-aware exploration, encouraging the model to generate responses that are distinguishable and aligned with different preference conditions. Experiments on safe alignment and helpful assistant tasks show that MI-EPO significantly improves the alignment between generated responses and preference vectors, makes the outputs more controllable, and achieves stable trade-offs across multiple objectives.
79. 【2607.01388】RusFinChain: A Russian Benchmark for Verifiable Chain-of-Thought Reasoning in Finance with Fuzzy-Aligned Evaluation
Link: https://arxiv.org/abs/2607.01388
Authors: M. K. Arabov
Categories: Computation and Language (cs.CL)
Keywords: robust financial analysis, Multi-step symbolic, Multi-step symbolic reasoning, essential for robust, Multi-step
Comments: Preprint
Abstract:Multi-step symbolic reasoning is essential for robust financial analysis, yet most benchmarks neglect intermediate reasoning steps. FINCHAIN introduced verifiable Chain-of-Thought (CoT) evaluation but is limited to English. FINESSE-Bench includes a Russian block but relies on multiple-choice questions without step-level supervision. We present RusFinChain, the first Russian-language symbolic benchmark for verifiable CoT reasoning in finance. It spans 17 domains, 172 topics, and comprises 5,280 parameterized examples from executable Python templates, ensuring contamination-free evaluation. Each example includes a gold-standard reasoning chain with intermediate numeric values for automatic verification. We also introduce enhanced metrics: Fuzzy Numeric Alignment and Soft-Attention Alignment. We evaluate 8 open-weight LLMs on a stratified sample, generating 8,100 responses. Results reveal a substantial reasoning gap: models achieve Hard F1 of ~0.65 for step alignment, but only ~29% of final answers are correct. Our fuzzy and soft metrics show stronger correlation with final-answer correctness (Spearman rho approx 0.48) than the original ChainEval (rho approx 0.38-0.46), demonstrating superior diagnostic power. We release dataset, code, and evaluation framework to foster verifiable financial AI for the Russian-speaking community.
80. 【2607.01345】TurnNat: Automatic Evaluation of Turn-Taking Naturalness in Dyadic Spoken Dialogue
Link: https://arxiv.org/abs/2607.01345
Authors: Hao Zhang, Thomas Thebaud, Georgi Tinchev, Venkatesh Ravichandran, Laureano Moro-Velazquez
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: evaluation remains limited, remains limited, spoken dialogue systems, central to full-duplex, full-duplex spoken dialogue
Comments:
Abstract:Turn-taking naturalness is central to full-duplex spoken dialogue systems, yet its automatic evaluation remains limited. Existing evaluations often rely on human judgments or behavior-specific timing metrics, making it difficult to compare heterogeneous timing failures within a unified framework. We propose TurnNat, a likelihood-based framework for automatic turn-taking naturalness evaluation in two-channel spoken dialogue. A causal turn-taking prediction model trained on natural conversations estimates future two-speaker voice-activity states, and the negative log-likelihood (NLL) of the observed future activity measures timing atypicality. TurnNat pools frame-level NLLs over turn-taking boundary units (TBUs) extracted from utterance onsets and offsets, and aggregates mean and tail TBU scores into a dialogue-level naturalness score. We further construct a controlled perturbation benchmark of paired natural and perturbed dialogue clips, validated by human naturalness judgments. Experiments on this benchmark show that TurnNat successfully identifies unnatural turn-taking perturbations across heterogeneous timing failures.
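The pooling and aggregation described in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify the pooling operator, the tail definition, or how mean and tail are combined, so mean pooling, a top-fraction tail, and an equal-weight blend are all hypothetical choices here, as are the names `pool_tbu_scores` and `dialogue_score`.

```python
from statistics import mean


def pool_tbu_scores(frame_nll, tbus):
    """Average frame-level NLL inside each TBU.

    `tbus` holds (start, end) frame index ranges with `end` exclusive.
    Mean pooling is an assumption; the abstract does not name the operator.
    """
    return [mean(frame_nll[start:end]) for start, end in tbus]


def dialogue_score(tbu_scores, tail_fraction=0.25, tail_weight=0.5):
    """Blend the mean TBU score with the mean of the worst (highest-NLL) TBUs.

    `tail_fraction` and `tail_weight` are hypothetical parameters.
    A higher score means more atypical (less natural) timing.
    """
    ordered = sorted(tbu_scores, reverse=True)
    k = max(1, round(len(ordered) * tail_fraction))
    tail = mean(ordered[:k])
    return (1 - tail_weight) * mean(tbu_scores) + tail_weight * tail


# Toy frame-level NLLs with one badly timed boundary in the middle.
frame_nll = [0.2, 0.3, 0.1, 2.0, 2.4, 0.2, 0.4, 0.3]
tbus = [(0, 3), (3, 5), (5, 8)]
scores = pool_tbu_scores(frame_nll, tbus)
print(scores, dialogue_score(scores))
```

Including a tail term keeps a single badly timed boundary from being averaged away in a long dialogue, which is one plausible reason to aggregate both mean and tail scores.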
81. 【2607.01313】Black-Box Inference of LLM Architectural Properties with Restrictive API Access
Link: https://arxiv.org/abs/2607.01313
Authors: Christopher Ellis, Shreyas Chaudhari, Mei-Yu Wang, Leighton Barnes, Giulia Fanti, José M. F. Moura
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Keywords: publicly release details, commercial LLM providers, LLM, underlying LLM architectures, hidden dimension
Comments:
Abstract:In practice, most commercial LLM providers do not publicly release details of underlying LLM architectures. However, prior work has shown that given limited API access to an LLM (namely, top-$k$ logits and/or a logit bias function), one can recover certain architectural details of an LLM, such as the hidden dimension of the feed-forward network. Perhaps in response to these results, most commercial LLM providers have restricted their APIs to expose only the single logit for each decoded token, and they no longer give users the ability to bias logits. We show that even under current restrictive APIs, several architectural parameters are still recoverable. We present NightVision, an attack that uses restrictive black-box API access to estimate the hidden dimension, depth, and parameter count of an LLM. Algorithmically, NightVision relies on a novel common set prompting technique in which multiple prompts expose log probabilities for the same set of output tokens; a spectral analysis of these results is used to infer hidden dimension. NightVision additionally uses end-to-end time to first token (TTFT) measurements and the estimated hidden dimension to estimate depth and parameter count. We empirically evaluate NightVision on 32 open-source LLMs, recovering hidden dimension to within 23% average relative error across all models (9% on MoE models), and depth and parameter count to within 53% for models exceeding three billion parameters. We run extensive ablations to demonstrate how these accuracies scale with token budget and model properties. Overall, our results suggest that current LLM APIs are not sufficiently restricted to fully obfuscate the architectural details of their underlying models.
82. 【2607.01293】RuleChef: Grounding LLM Task Knowledge in Human-Editable Rules
Link: https://arxiv.org/abs/2607.01293
Authors: Ádám Kovács, Nadia Verdha, Gábor Recski
Categories: Computation and Language (cs.CL)
Keywords: Named Entity Recognition, Named Entity, Entity Recognition, large language models, generate executable rules
Comments: 8 pages
Abstract:We present RuleChef, a framework that uses large language models (LLMs) to generate executable rules for NLP tasks such as text classification, Named Entity Recognition (NER), or relation extraction. Rules are generated based on a task description and a set of labeled examples, then they are iteratively improved based both on additional examples and on human feedback over existing rules. RuleChef can also be used to bootstrap rules using the observed input-output pairs from any existing model for a given task. LLMs are used only at learning time, synthesizing rules and iteratively patching them based on failures measured on a held-out split. The result of this process is a fast, deterministic, and inspectable rule system. Preliminary evaluation is performed on both classification and NER tasks. We release RuleChef as open-source software under an Apache 2.0
83. 【2607.01250】Structuring the Space of Sociotechnical Alignment
Link: https://arxiv.org/abs/2607.01250
Authors: Esra Dönmez, Agnieszka Falenska
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Sociotechnical alignment concerns, Sociotechnical alignment, social desirability, alignment, NLP research increasingly
Comments: Preprint
Abstract:Sociotechnical alignment concerns the social desirability of AI behavior and is thus inherently normative, not merely technical. While NLP research increasingly addresses its technical aspects, it often leaves underspecified what such "social desirability" entails. We argue that this reflects a fundamental gap: the absence of a systematic way to specify how sociotechnical alignment defines, justifies, and evaluates socially desirable AI behavior. To address this gap, we introduce a human-centered framework for specifying sociotechnical alignment. We draw on social-scientific accounts of sociobehavioral desirability to ground the basis for behavioral desirability judgments and use this framework to analyze how alignment is specified in practice. Our systematic literature review identifies recurring patterns: normative concepts grounding desirability judgments are often unspecified or conflated with alignment targets for (desired) system behavior, target populations are underdefined, and design choices are rarely theoretically justified. These findings point to a lack of conceptual specificity that limits cumulative progress. We therefore offer recommendations that link social-scientific frameworks to alignment design choices, supporting more conceptually precise approaches to sociotechnical alignment.
84. 【2607.01245】Office Comprehension Benchmark
Link: https://arxiv.org/abs/2607.01245
Authors: Firoz Shaik, Mateus Picanço Lima Gomes, Tanvir Aumi, Jingci Wang, Milos Milunovic, Filip Basara, Ivana Jovanovic, Vishwas Suryanarayanan, Neha Nandan Kenkare, Weiyao Xie, Zhipeng Han, Zheng Zhang, Waleed Shahid, Jay Rathi, Russell Scherer, Thong Q. Nguyen, Michael Bentley, Tamara Stankovic, Rasika Chakravarthy, Vishal Chowdhary
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Office Comprehension Bench, introduce Office Comprehension, Comprehension Bench, jointly evaluate LLM, native file formats
Comments:
Abstract:We introduce Office Comprehension Bench (OCB), the first public benchmark to jointly evaluate LLM systems on Word, Excel, and PowerPoint comprehension over native file formats (.docx, .xlsx, .pptx) and their variants. OCB consists of two tracks. File Fidelity QA tests structural and visual perception of office artifacts - tables, charts, embedded images, formulas, and app-specific elements such as headers, speaker notes, and named ranges. Domain QA tests expert-level reasoning grounded in real-world industry documents across 12 professional domains, with queries requiring multi-step analysis and synthesis across documents. Each reference answer is decomposed into atomic, binary-gradable claims, and an ensemble of LLM judges scores responses against each claim independently. Even the strongest frontier system in its default reasoning mode reaches only about 59.3% on Domain QA; increasing thinking depth within a tier does not move performance materially, while moving to a higher product tier yields modest gains. We release the dataset, evaluation tooling, judge prompt, and a public leaderboard.
85. 【2607.01242】ExPerT: Personalizing LLM Responses to Users' Domain Expertise via Query-Wise Semantic and Keystroke Behavioral Cues
Link: https://arxiv.org/abs/2607.01242
Authors: Yeji Park, Jiwon Tark, Taesik Gong
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large language models, text-only signals fail, existing personalization methods, personalization methods relying, query-specific expertise variation
Comments: Accepted to ACL 2026 (Main, Long)
Abstract:Large language models (LLMs) are increasingly used by end users, yet existing personalization methods relying on static profiles or text-only signals fail to capture query-specific expertise variation. We present ExPerT, a query-wise personalization framework that adapts LLM responses to users' query domain expertise by combining semantic and behavioral cues. ExPerT consists of two key components: (i) a semantic-behavioral expertise inference module that jointly interprets query text and keystroke dynamics via in-context LLM prompting, and (ii) an expertise-conditioned response generation that adapts the level of detail, terminology, and conceptual complexity. Our user study with 40 participants and 1270 queries demonstrated that ExPerT reduced expertise inference error by 65.7% compared to the strongest baseline (MAE = 0.398 vs. 1.162) and improved response satisfaction by 17.52% (from 3.71 to 4.36) on a 5-point Likert scale.
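The two percentages in the abstract above follow from the raw figures it reports. The snippet below is a small arithmetic check, assuming both are plain relative changes against the baseline value; the helper name `relative_change` is illustrative.

```python
def relative_change(baseline, new):
    """Relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


# MAE: strongest baseline 1.162 vs. ExPerT 0.398 (lower is better).
mae_reduction = -relative_change(1.162, 0.398)
# Satisfaction on a 5-point Likert scale: 3.71 -> 4.36.
satisfaction_gain = relative_change(3.71, 4.36)

print(f"{mae_reduction:.1%}, {satisfaction_gain:.2%}")  # prints "65.7%, 17.52%"
```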
86. 【2607.01241】Mapping Text to Multiplex Graph: Prompt Compression as Lévy Walk-Guided Graph Pruning
Link: https://arxiv.org/abs/2607.01241
Authors: Yaxin Gao, Yao Lu, Jinhong Deng, Jiaqi Nie, Zhe Tang, Jian Zhang, Zhaowei Zhu, Shanqing Yu, Qi Xuan, Joey Tianyi Zhou
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: flat token sequences, failing to capture, important information, capture the distributed, distributed nature
Comments:
Abstract:Existing prompt compression methods treat text as flat token sequences, failing to capture the distributed nature of important information, which is often spread across multiple locations and connected through both local syntactic dependencies and global semantic relations. Such relational structure is naturally represented as a graph, where tokens or sentences become nodes and their dependencies become edges. To this end, we propose RAGP, which formulates prompt compression as Redundancy-Aware Graph Pruning on a multiplex graph that jointly models fine-grained attention-based dependencies and coarse-grained semantic relations. To efficiently identify non-redundant nodes in this heterogeneous structure (dense local subgraphs and sparse global connections), we employ Lévy walks whose heavy-tailed step distribution naturally balances local exploitation with global exploration. Experiments on LongBench show that RAGP achieves an average score of 49.3 under a 4x compression ratio, outperforming existing LLM-based compression methods, such as LongLLMLingua, which attains 48.8 at a 3x compression ratio. Besides, RAGP also surpasses state-of-the-art vision-based text compression paradigms on multiple tasks. The code is available at this https URL.
87. 【2607.01240】Prompt Framing Distorts Count-Based Evaluation of LLM Error Detection: Evidence from Numeric Anchoring
Link: https://arxiv.org/abs/2607.01240
Authors: Dekun Yang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: LLM error-detection quality, error-detection quality, span localization, gap termed, rise dramatically
Comments: 15 pages, 6 figures, 12 tables. Preprint under review
Abstract:Count-based F1 is widely used as a proxy for LLM error-detection quality, but this paper shows that it can rise dramatically without a corresponding improvement in span localization, a gap termed F1 Inflation. The paper introduces ErrorBench, a controlled stress-test protocol for prompt-induced count distortion. ErrorBench evaluates six contemporary LLMs under five prompt conditions over 4,290 responses from 143 CoNLL-2014 passages. Under CoNLL-2014 M2-style scoring, anchored prompts produce up to 0.79 points of F1 Inflation, and up to 0.96 under strict matching. A 100-passage replication using the official ERRANT 3.0.0 pipeline and multi-reference scoring reproduces the pattern: averaged over six models, the Blind-to-Anchored prompt shift raises Count-F1 by +0.21 while raising multi-reference ERRANT F0.5 by only +0.04. The study finds larger count responses in highly instruction-compliant GPT/Claude systems and smaller responses in the Gemini family under this stress-test protocol. The findings suggest that LLM proofreading and document-review evaluations should avoid pre-populated error counts and should report span-aware metrics alongside count-based metrics.
88. 【2607.01239】Breaking Safety at the Token Boundary: How BPE Tokenization Creates Exploitable Gaps in LLM Alignment
Link: https://arxiv.org/abs/2607.01239
Authors: Tung-Ling Li, Hongliang Liu, Yuhao Wu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Character-level perturbations bypass, Character-level perturbations, leaving prompts human-readable, perturbations bypass safety, bypass safety alignment
Comments:
Abstract:Character-level perturbations bypass safety alignment in modern LLMs despite leaving prompts human-readable. We identify and test a central structural mechanism: BPE tokenization fragments safety-critical words into sub-word pieces, and the three public alignment datasets we surveyed contain no intentionally fragmented inputs. The mechanism is a chain, tested end-to-end on five model families (Qwen-3-4B, Qwen-2.5-7B, Gemma-3-4B, Llama-3.1-8B, Mistral-7B). An optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs (per-model 29-65%; gap-vs-behavior ROC-AUC 0.66-0.98, pooled 0.84). Activation patching localizes the disrupted signal to the last ${\sim}30\%$ of layers; an alignment-data scan finds zero fragmented prompts among 30,000 examples (positive-control recall $\geq 99\%$ at attack-relevant intensities); and targeted-mutation experiments isolate safety words as the disruption locus. On the defense side, a 68-cell grid (55 trained checkpoints) shows that no DPO configuration achieves seed- and pool-stable ASR closure on the three families with closed pool-size confounds. SFT trained on fragmented prompts closes ASR on 3/5 families but only via global collapse that raises refusal on benign prompts as well, indicating the missing distribution is necessary but not sufficient under the LoRA-16 recipe we tested. To distinguish selective repair from global collapse, we introduce Conv-Benign, a candidate paired diagnostic. All ASR claims are 3-judge-calibrated (cell rankings stable across judges; absolute levels $\pm$18pp; see App. B.13).
89. 【2607.01238】SPARCLE: SPeaker-aware Aligned Representations via Contrastive Language Embeddings
Link: https://arxiv.org/abs/2607.01238
Authors: Priyam Mazumdar, Yurii Halychanskyi, Steven Guo, Mark Hasegawa-Johnson, Volodymyr Kindratenko
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: Recent advances, direct grapheme modeling, synthesis have shifted, Recent, speech synthesis
Comments: 5 Pages, 1 Figure, 2 Tables, Interspeech
Abstract:Recent advances in speech synthesis have shifted from phoneme representations to direct grapheme modeling. While phonemes address the one-to-many mapping between text and acoustics, they rely on grapheme-to-phoneme (G2P) systems that fail to capture speaker-specific acoustic variation. Prior work demonstrates that grapheme-based models outperform phoneme-based systems at scale, but not in low-resource settings. In this paper, we propose SPARCLE, a speaker-aware grapheme representation model that enriches characters with their precise acoustic realizations. SPARCLE is trained with a contrastive objective to align graphemes with corresponding Wav2Vec2 acoustic representations while conditioned on speaker identity. The resulting model serves as a replacement to G2P systems for downstream text-to-speech (TTS) tasks. We demonstrate that SPARCLE improves generation quality, reducing word error rates by half in extreme low-resource settings compared to standard grapheme-based models.
90. 【2607.01237】Kara: Efficient Reasoning LLM Serving via Sliding-Window KV Cache Compression
Link: https://arxiv.org/abs/2607.01237
Authors: Shen Han, Yuyang Wu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Reasoning language models, high decoding latency, incurs high decoding, Reasoning language, cache compression
Comments: 9 pages, 6 figures
Abstract:Reasoning language models often generate long chain-of-thought (CoT), which accumulates a massive KV cache during the decoding phase and incurs high decoding latency and limited throughput. To address these issues, KV cache compression has emerged as a promising technique for reducing memory overhead by selectively removing unimportant KV pairs while preserving useful ones for subsequent decoding. Nevertheless, we identify two key limitations in existing KV cache compression methods: 1) their threshold-triggered compression policy may provide limited throughput improvement or even reduce throughput, and may fully eliminate KV pairs from certain blocks of the sequence, potentially worsening information loss. 2) they typically retain either isolated KV pairs or fixed-size chunks with rigid boundaries, failing to preserve important flexible-sized chunks at arbitrary token positions. To overcome these limitations, we propose Kara, a sliding-window KV cache compression method that performs decoding-time compression by operating only on the recently generated context. Kara leverages bidirectional attention to score and select informative KV pairs in the window. To enable flexible preservation of important semantic information, we design a Token2Chunk module to expand a subset of selected KV pairs into chunks. Furthermore, we adapt Kara to PagedAttention and develop KvLLM, an inference framework built upon vLLM, which reduces KV cache memory usage and effectively improves output throughput. Extensive experiments demonstrate consistent performance improvements of proposed Kara and KvLLM.
91. 【2607.01236】Safeguarding LLM Agents from Misalignment through Provenance Analysis
Link: https://arxiv.org/abs/2607.01236
Authors: Yining She, Yiliang Liang, Eunsuk Kang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: gain increasing access, agents gain increasing, user intent, LLM agents gain, gain increasing
Comments:
Abstract:As LLM agents gain increasing access to powerful tools, ensuring that their actions are aligned with the user's intent becomes critical. When an agent's proposed tool invocation deviates from the user's intent -- a phenomenon called misalignment -- it may lead to harmful consequences that are difficult to undo. Existing runtime guardrails rely on an LLM-as-a-judge paradigm that lacks a systematic framework for reasoning about alignment, often producing judgments that are inconsistent or difficult to audit. Motivated by provenance analysis, we propose a provenance-based conceptual framework that formalizes misalignment detection as determining whether a proposed tool call is supported by traceable evidence in the agent's context. Building on this framework, we propose ProvenanceGuard, a multi-stage pipeline that analyzes the agent's action for three types of misalignment before the selected tool is executed and only allows the action to take place when it is considered aligned with the user's input query. We evaluated our proposed approach on two different benchmarks, Agent-SafetyBench and WorkBench, across 10 backbone LLMs. Compared to the LLM-as-a-judge baseline, ProvenanceGuard reduces error rate on misaligned traces from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench, while reducing intervention burden on task-successful traces from 30.5% to 12.8% and introducing no statistically significant increase in unnecessary interventions on aligned traces. These results demonstrate that structured, provenance-based reasoning provides an effective and practical foundation for safeguarding LLM agents from misalignment.
92. 【2607.01235】TokenScope: Token-Level Explainability and Interpretability for Code-Oriented Tasks in Large Language Models
Link: https://arxiv.org/abs/2607.01235
Authors: Amirreza Esmaeili, Fatemeh Fard
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Keywords: Large Language Models, Understanding how Large, Large Language, make token-level decisions, Language Models
Comments:
Abstract:Understanding how Large Language Models (LLMs) make token-level decisions during code generation remains a major challenge for both researchers and practitioners. While recent tools provide insights into model internals or generation outcomes, they often lack decoding-time signals, fine-grained uncertainty measures, and interactive mechanisms for exploring alternative generation paths. We present TokenScope, an interactive interpretability and analysis tool for decoder-based LLMs that exposes token-level metrics, attention patterns, and structural information during generation. TokenScope supports interactive token replacement, counterfactual branching, and code-aware aggregation via abstract syntax trees. By unifying decoding-time signals with structural program analysis, TokenScope enables systematic investigation of LLM behaviour during code generation.
93. 【2607.01951】Robust for the Wrong Reasons: The Representational Geometry of LLM Robustness to Science Skepticism
Link: https://arxiv.org/abs/2607.01951
Authors: Minjong Cheon
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: contested scientific questions, Large language models, treats settled science, Large language, user signals doubt
Comments:
Abstract:Large language models (LLMs) are increasingly consulted on contested scientific questions, raising the concern that they will sycophantically retreat from established consensus when a user signals doubt -- drifting toward a false balance that treats settled science as one view among several. We test this across three open instruction-tuned models (Llama-3.1-8B, Qwen2.5-7B, Mistral-7B), three consensus-science domains (climate, vaccines, evolution), and single- and multi-turn settings, combining behavioral measurement with linear probing and activation patching. We do not observe sycophantic retreat. Instead, models show three distinct policies under the same skeptical pressure: reactive assertion, where consensus assertion increases rather than decreases (Llama); surface hedging, where tone softens while the position holds (Qwen); and non-response (Mistral). Pairwise judgments confirm the reactive shift is stance, not style (63.6%, p=.007), and a decomposition identifies increased consensus assertion, not false balance, as its driver (beta=+0.042 per dose, p<1e-77). Linear probes localize the divergence to middle layers -- perfect separation in Llama and Qwen versus 72% in Mistral, with non-overlapping confidence intervals -- indicating the non-responsive model does not linearly represent the skepticism signal at all. Crucially, this robustness does not transfer: it attenuates across domains and, in the safety-critical vaccine domain, can reverse, with myth-rebuttal weakening under skeptical pressure. We synthesize these into a four-way taxonomy separating active from accidental robustness, and argue that behavioral evaluation alone cannot distinguish a model that resists skepticism because it understands the signal from one that only appears to resist because it fails to perceive it.
94. 【2607.01823】Self-Supervised Test-Time Tuning for Packet Loss Concealment
Link: https://arxiv.org/abs/2607.01823
Authors: Yehoshua Dissen, Joseph Keshet
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Keywords: PLC, PLC model, speech PLC model, Packet loss concealment, existing PLC models
Comments: Under submission to IEEE TASLP
Abstract:Packet loss concealment (PLC) reconstructs audio packets that are missing at the receiver, usually with a trained model whose parameters remain fixed at deployment time. This treats the PLC model as static, even though each call or recording exposes signal-specific information through the packets that did arrive. We present TTT-PLC, a self-supervised test-time tuning framework that adapts existing PLC models using only those received packets. The method creates supervision by synthetically masking portions of the available signal, training the model to conceal them with its native PLC objective, and then using the adapted model to reconstruct the true packet losses. No clean reference signal, external adaptation data, or architectural modification is required. We study TTT-PLC in two deployment settings. In the non-causal setting, the received file is available before reconstruction, allowing repeated self-supervised adaptation passes and providing a per-file adaptation ceiling. In the causal setting, audio is streamed without revising emitted samples; adaptation is performed only on completed past blocks, and updated parameters affect only future audio. We instantiate the framework on two public PLC backbones: FRN, a recurrent full-band speech PLC model, and PARCnet, a hybrid autoregressive-neural model for networked music. Across these settings, the results show that pretrained PLC systems do not need to be treated as fixed at inference time: the still-observed portions of a lossy signal can provide an effective training signal for improving concealment on that same signal.
Information Retrieval
1. 【2607.02387】Bringing Agentic Search to Earth Observation Data Discovery
Link: https://arxiv.org/abs/2607.02387
Authors: Minghan Yu, Youran Sun, Chugang Yi, Yixin Wen, Haizhao Yang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Science Discovery Engine, Discovery Engine, Science Discovery, data centers hold, centers hold thousands
Comments: 19 pages, 1 figure, 6 tables
Abstract:NASA and its data centers hold thousands of geoscience datasets and tools like Worldview, Giovanni, the Science Discovery Engine, and Harmony. Finding the right one is hard even for domain experts. We present an agentic search system, deployed as a public service for the geoscience community, that takes a natural-language research query and returns the matching datasets and tools. We demonstrate that, in the era of large language models, the latent value of knowledge graphs (KGs) can be substantially amplified through agentic search. From the NASA Earth Observation Knowledge Graph (NASA EO-KG) we derive NASA-EO-Bench, an open benchmark of 47k query-dataset pairs (21k task-based queries). A neural scorer fine-tuned on NASA-EO-Bench beats cosine and BM25 baselines. Further combining it with BM25 via score fusion raises both Recall@10 (R@10) and MRR by over 5x. On top of this supervised pipeline, we add a zero-shot agentic reranking stage that, without any additional training, lifts MRR by 28% on a stratified N=200 subset, showing that LLM reasoning is complementary to supervised retrieval.
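The abstract above reports Recall@10 and MRR and mentions fusing a neural scorer with BM25. As a rough illustration only, the sketch below computes both metrics for one query and fuses two score dictionaries. The min-max normalization, the equal `weight`, and all function names are assumptions made for this example; the abstract does not say which fusion rule the paper uses.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved.

    MRR is the mean of this value over all queries.
    """
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so two scorers become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {d: (s - lo) / span if span else 0.0 for d, s in scores.items()}


def fuse(bm25: Dict[str, float], neural: Dict[str, float],
         weight: float = 0.5) -> List[str]:
    """Weighted-sum fusion (an assumed rule, not taken from the paper).

    Documents missing from one scorer get 0.0 from that scorer.
    Ties are broken by document id to keep the output deterministic.
    """
    a, b = min_max(bm25), min_max(neural)
    docs = set(a) | set(b)
    fused = {d: weight * a.get(d, 0.0) + (1 - weight) * b.get(d, 0.0)
             for d in docs}
    return sorted(docs, key=lambda d: (-fused[d], d))


ranking = fuse({"a": 12.0, "b": 3.0, "c": 0.0},
               {"a": 0.1, "b": 0.9, "d": 0.5})
```

With these toy scores, `ranking` can be passed to `recall_at_k` and `reciprocal_rank` together with a set of relevant dataset ids for the query.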
2. 【2607.02338】HNSW with Accuracy Guarantees Using Graph Spanners -- A Technical Report
Link: https://arxiv.org/abs/2607.02338
Authors: Minghao Li, Raghav Mittal, Sanjivni Rana, Suraj Shetiya, Gautam Das, Nick Koudas
Categories: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Navigable Small World, Hierarchical Navigable Small, Hierarchical Navigable, Small World, Navigable Small
Comments: 23 pages, 22 figures, Submitted to VLDB2027
Abstract:Hierarchical Navigable Small World (HNSW) graphs serve as the industry standard due to their logarithmic complexity and strong empirical performance. However, HNSW relies on greedy graph traversal, a heuristic that provides no theoretical guarantees of correctness. In this paper, we propose a novel "Certify-then-Rectify" framework that bridges the gap between the speed of heuristic search and the rigor of exact retrieval. Rather than discarding HNSW, our approach first employs a distribution-free statistical certifier to dynamically evaluate the quality of a standard HNSW search with minimal overhead. If certification indicates that the retrieved neighbors are of low quality, the framework safely escalates to a rigorous exact recovery algorithm. To make this exact recovery computationally feasible, we reinterpret the HNSW graph as a geometric spanner and utilize Extreme Value Theory to stochastically estimate its maximum empirical stretch factor. This allows us to mathematically bound the maximum distance of true nearest neighbors. Extensive evaluations on benchmark datasets demonstrate that our tiered framework delivers the average-case speed of HNSW while ensuring the worst-case correctness of exact search and outperforming other applicable approaches.
3. 【2607.02115】Planning over Matrix-Factorization MDPs for Candidate Generation
Link: https://arxiv.org/abs/2607.02115
Authors: Mikhail Trapeznikov, Maksim Utushkin
Categories: Information Retrieval (cs.IR)
Keywords: recommender service, view the customer, customer journey, user state, item recommendations
Comments: Accepted to the 5th Workshop on End-to-End Customer Journey Optimization at KDD 2026. 6 pages, 3 figures, 2 tables
Abstract:For a recommender service, we view the customer journey as a chain of item recommendations: a useful item changes the user's state and therefore what should be retrieved next. Standard matrix-factorization retrieval ignores this -- it builds one user vector and returns the top-$K$ items by a static score, treating them as independent. We ask a narrow question: when is it worth planning over the user-state dynamics that fold-in induces? To answer it we propose casting top-$K$ retrieval as an MDP over the implicit-ALS posterior $(A^{-1},u)$, where an action is an item and the transition is a closed-form rank-one fold-in, and the trajectory reward combines a relevance similarity with a posterior-alignment term. Under the same fixed embeddings we compare static retrieval, one-step planning, and horizon-$K$ MCTS across five datasets and two protocols: a per-user leave-last-$n$ split and a stricter global time split. Dynamics-aware planning tends to outperform static retrieval on all datasets under leave-last-$n$, and the gains hold on MovieLens-1M and the VK-LSVD slices under the global time split. A single step of lookahead already captures most of the gain, so the lightweight planning layer turns static top-$K$ scoring into a short decision and improves retrieval over fixed collaborative-filtering embeddings, with no retraining and no change to the representation. These gains depend on measuring relevance with cosine rather than inner-product similarity, which is otherwise entangled with item popularity.
4. 【2607.01852】Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts
Link: https://arxiv.org/abs/2607.01852
Authors: Valentin J. J. Kreileder, Johannes Reisinger, Andreas Fischer
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Language Models, Large Language, Augmented Generation Assessment, Retrieval Augmented Generation
Comments:
Abstract:Retrieval-Augmented Generation (RAG) systems use the question-answering capabilities of Large Language Models (LLMs) to access information outside their parameters. We evaluate whether cluster-based semantic chunking improves retrieval and answer quality compared to fixed-size and recursive chunking on long, structured academic theses, using the Retrieval Augmented Generation Assessment (RAGAs) framework. RAGAs-based faithfulness shows limited reliability in this setup. Performance on fixed versus document-specific questions varied substantially, likely related to the formatting of documents and preprocessing. Under the tested configuration, cluster-based chunking did not outperform simpler strategies.
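For readers unfamiliar with the two baseline strategies named in the abstract above, here is a minimal sketch of fixed-size and recursive chunking. The character-based sizes, the separator order, and the function names are assumptions for illustration; the paper's actual configuration is not given in the abstract, and cluster-based semantic chunking is not shown because it needs an embedding model.

```python
from typing import List, Tuple


def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> List[str]:
    """Cut text into windows of `size` characters, each sharing `overlap`
    characters with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def recursive_chunks(text: str, max_len: int,
                     separators: Tuple[str, ...] = ("\n\n", "\n", " ")) -> List[str]:
    """Split on the coarsest separator first and recurse into any piece
    that is still longer than `max_len`.

    Simplification: small neighbouring pieces are not merged back together,
    which production splitters usually do.
    """
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to hard character windows.
        return fixed_size_chunks(text, max_len)
    head, *rest = separators
    chunks: List[str] = []
    for piece in text.split(head):
        chunks.extend(recursive_chunks(piece, max_len, tuple(rest)))
    return chunks


print(fixed_size_chunks("abcdefghij", size=4, overlap=1))
print(recursive_chunks("aa bb\n\ncccc dd", max_len=5))
```

The recursive variant keeps paragraph and line boundaries where it can, which is the usual argument for preferring it over fixed windows on structured documents.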
5. 【2607.01530】IntentTune: Using user demand and personalization to resolve "unknown" query intents for e-commerce search
Link: https://arxiv.org/abs/2607.01530
Authors: Rachith Aiyappa, Ishita Khan, Chester Palen-Michel, Jayanth Yetukuri, Samarth Agrawal, Mehran Elyasi, Shuang Zhou
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Understanding user intent, Understanding user, relevant search results, delivering relevant search, fundamental to delivering
Comments:
Abstract:Understanding user intent is fundamental to delivering relevant search results in e-commerce. However, a substantial fraction of real-world queries are under-specified (e.g., "watch" or "shirt"), lacking explicit attributes such as gender or age group. This ambiguity poses a significant challenge for query intent detection models in e-commerce search systems, which must accurately infer latent user intent (e.g., age, gender) to support effective downstream retrieval. We introduce IntentTune, a framework for resolving ambiguous or under-specified query intents by leveraging either (1) user-specific behavioral signals including search history, browsing activity, and profile attributes or (2) population-level demand patterns aggregated across all users. Through experiments on real-world e-commerce data, we first demonstrate that population-level demand patterns alone are insufficient to reliably infer intent in under-specified queries. We then demonstrate that user-specific behavioral signals -- particularly prior search queries -- outperform both population-level statistics and static profile information for inferring gender, age group, product category, and size intent from under-specified queries.
6. 【2607.01485】CoPersona: Collaborative Persona Graphs for Robust LLM Personalization
Link: https://arxiv.org/abs/2607.01485
Authors: Yangtian Zhang, Leyao Wang, Hiren Madhu, Ngoc Bui, Walter Roznyatovskiy, Rex Ying
Categories: Information Retrieval (cs.IR)
Keywords: frequent users' logs, users' logs capture, Real-world LLM personalization, Real-world LLM, frequent users'
Comments: Accepted at KDD '26. 12 pages, 5 figures, 8 tables
Abstract:Real-world LLM personalization is often constrained by sparse and skewed user histories: most users provide only a handful of interactions, while even frequent users' logs capture an incomplete and biased view of their preferences. As a result, weakly observed user attributes are difficult to infer, leading to brittle personalization when test-time requests shift toward under-supported facets. Motivated by this limitation, we present CoPersona, a graph-based collaborative personalization framework that completes sparse user profiles by borrowing signals from behaviorally similar peers. However, directly transferring signals is difficult because uneven facet coverage introduces bias into interaction histories, obscuring user similarity in the unstructured global space. To address this issue, CoPersona decomposes interaction histories into multiple facet-level representations and explicitly models peer-to-peer, facet-level alignment through a multiplex persona graph. To effectively leverage peer information at inference time, we employ a dual-branch architecture that combines non-parametric peer retrieval with parametric graph reasoning. Experiments across multiple domains and model scales demonstrate consistent improvements over strong baselines, validating CoPersona as an effective approach for robust LLM personalization.
7. 【2607.01387】Bi-NAS: Towards Effective and Personalized Explanation for Recommender Systems via Bi-Level Neural Architecture Search
Link: https://arxiv.org/abs/2607.01387
Authors: Longfeng Wu, Yao Zhou, Tong Zeng, Zhimin Peng, Bhanu Pratap Singh Rawat, Lecheng Zheng, Giovanni Seni, Dawei Zhou
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: navigate vast amounts, helping users navigate, users navigate vast, amounts of information, vital in helping
Comments:
Abstract:Recommender systems are vital in helping users navigate vast amounts of information, offering personalized suggestions and effective explanations for these recommendations. While previous efforts have attempted to provide such explanations, evaluating their effectiveness across various scenarios remains a challenge. Enhancing these explanations is essential for improving user engagement, trust, and decision-making. To facilitate effective explanations within the recommender system, we propose a Bi-level Neural Architecture Search (Bi-NAS) framework to optimize explanations. This approach simultaneously refines cross-attention mechanisms and feature interaction functions by exploring both intra-layer and inter-layer design spaces. Furthermore, we integrate Large Language Models (LLMs) to enhance explanation generation, leveraging zero-shot prompting to produce more effective and personalized justifications. By aligning user feature preferences with item quality scores, our approach ensures that explanations reflect both user intent and item attributes, improving transparency and reasoning depth. Extensive evaluations on four real-world datasets demonstrate that Bi-NAS not only boosts recommendation accuracy but also significantly improves the effectiveness of explanations for recommender systems, providing users with clear and reliable insights into the suggestions they receive.
8. 【2607.01276】Embedding Inference Attack
Link: https://arxiv.org/abs/2607.01276
Authors: Cedric Fitiavana Raelijohn, Sébastien Gambs, Jean-Francois Rajotte
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: modern Information Retrieval, Information Retrieval, modern Information, hidden behind APIs, essential components
Comments: 12 pages
Abstract:Embedding models are essential components of modern Information Retrieval (IR) systems, yet they are typically hidden behind APIs. Recent works have shown that dense IR systems can lead to security vulnerabilities such as embedding inversion attacks. However, such attacks usually require that the attacker knows the embedding model for the attack to be applicable. In this paper, we study IR systems under a black-box setting in which the adversary observes only the unordered set of retrieved documents, without ranking or similarity scores. We demonstrate that in such contexts, tailored queries allow an adversary to identify which embedding model is in use from a set of known candidate models, which we term an embedding inference attack (EIA). We also show that certain queries remain discriminative even when the system includes a reranker as a potential defense mechanism. We further validate our method on a real Retrieval-Augmented Generation (RAG) system, in which the tailored queries bypass the LLM's tendency to reject inputs it does not recognize as well-formed questions. Finally, we propose and evaluate other mitigation strategies such as similarity thresholds.
9. 【2607.01245】Office Comprehension Benchmark
Link: https://arxiv.org/abs/2607.01245
Authors: Firoz Shaik, Mateus Picanço Lima Gomes, Tanvir Aumi, Jingci Wang, Milos Milunovic, Filip Basara, Ivana Jovanovic, Vishwas Suryanarayanan, Neha Nandan Kenkare, Weiyao Xie, Zhipeng Han, Zheng Zhang, Waleed Shahid, Jay Rathi, Russell Scherer, Thong Q. Nguyen, Michael Bentley, Tamara Stankovic, Rasika Chakravarthy, Vishal Chowdhary
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Office Comprehension Bench, introduce Office Comprehension, Comprehension Bench, jointly evaluate LLM, native file formats
Comments:
Abstract:We introduce Office Comprehension Bench (OCB), the first public benchmark to jointly evaluate LLM systems on Word, Excel, and PowerPoint comprehension over native file formats (.docx, .xlsx, .pptx) and their variants. OCB consists of two tracks. File Fidelity QA tests structural and visual perception of office artifacts - tables, charts, embedded images, formulas, and app-specific elements such as headers, speaker notes, and named ranges. Domain QA tests expert-level reasoning grounded in real-world industry documents across 12 professional domains, with queries requiring multi-step analysis and synthesis across documents. Each reference answer is decomposed into atomic, binary-gradable claims, and an ensemble of LLM judges scores responses against each claim independently. Even the strongest frontier system in its default reasoning mode reaches only about 59.3% on Domain QA; increasing thinking depth within a tier does not move performance materially, while moving to a higher product tier yields modest gains. We release the dataset, evaluation tooling, judge prompt, and a public leaderboard.
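The claim-level grading described above can be pictured with a small sketch. The abstract only says that an ensemble of judges scores each atomic claim independently; the strict-majority rule and the "fraction of claims passed" score below are assumptions for illustration, as is the function name.

```python
from typing import List


def score_response(judge_votes: List[List[bool]]) -> float:
    """judge_votes[i][j] is True when judge j marks claim i as satisfied.

    Assumed aggregation (not taken from the paper): a claim passes when a
    strict majority of judges accept it, and the response score is the
    fraction of claims that pass.
    """
    if not judge_votes:
        return 0.0
    passed = 0
    for votes in judge_votes:
        # sum() counts True votes; a tie does not count as a pass.
        if sum(votes) * 2 > len(votes):
            passed += 1
    return passed / len(judge_votes)


votes = [
    [True, True, False],    # claim 1: passes 2 of 3
    [False, False, True],   # claim 2: fails
    [True, True, True],     # claim 3: passes
    [True, False, False],   # claim 4: fails
]
print(score_response(votes))
```

Grading binary claims one at a time keeps each judge decision small and auditable, which is presumably why the benchmark decomposes reference answers this way.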
10. 【2607.01244】Retrieval-Augmented Generation to Support Railways Engineering Tasks: A Case Study
Link: https://arxiv.org/abs/2607.01244
Authors: Andrea Gerardo Russo, Federico Ruggeri, Ivan Tomarchio, Davide Bombini, Nicolò Donati, Gianmarco Pappacoda, Paolo Torroni, Giuseppe-Emiliano La Cara
Categories: Information Retrieval (cs.IR); Computers and Society (cs.CY)
Keywords: technical regulations represent, growing number, number and complexity, represent an important, important challenge
Comments:
Abstract:The growing number and complexity of technical regulations represent an important challenge for all professionals in regulated industries. This paper describes a case study, from design to deployment, of building a Retrieval-Augmented Generation system for the consultation of complex technical regulations in the railway domain. Although developed for the railway sector, this testimony of an industrial experience is of particular value for technical domains where regulatory compliance and accurate information retrieval from complex documentation are essential requirements. It also constitutes a human-centered approach for implementing LLM-powered technical documentation consultation across various regulated industries, balancing technological capabilities with domain expertise.
11. 【2607.01243】STRUCTSURVEY: Structured Agentic Retrieval for Automated Survey Paper Generation
Link: https://arxiv.org/abs/2607.01243
Authors: Paolo Pedinotti, Enrico Santus
Categories: Information Retrieval (cs.IR)
Keywords: synthesize research progress, Large Language Models, scientific publications makes, research progress, rapid growth
Comments: 8 pages, 1 figure, appendices, SurgeLLM, RAG4Reports, ACL
Abstract:The rapid growth of scientific publications makes it increasingly difficult to track and synthesize research progress. While Large Language Models (LLMs) can support automated survey generation, existing methods retrieve unstructured data and require models to infer conceptual, methodological, and taxonomic relations from raw text at generation time. We introduce STRUCTSURVEY, a hierarchical multi-agent framework that shifts structural reasoning from generation to retrieval by dynamically constructing graph-based representations of entities, relations, and topical taxonomies. We evaluate STRUCTSURVEY on a new reference-grounded benchmark of ACL survey papers for reproducible long-form scientific summarization. Compared with embedding-only retrieval baselines, STRUCTSURVEY improves ROUGE-1 recall by +2.9 and ROUGE-2 recall by +1.0 on average, without reducing precision. It also improves LLM-as-a-Judge ratings for logical structure, depth, and synthesis, showing that explicit structural retrieval yields surveys closer to human-written organization and reasoning.
12. 【2607.01242】ExPerT: Personalizing LLM Responses to Users' Domain Expertise via Query-Wise Semantic and Keystroke Behavioral Cues
Link: https://arxiv.org/abs/2607.01242
Authors: Yeji Park, Jiwon Tark, Taesik Gong
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large language models, text-only signals fail, existing personalization methods, personalization methods relying, query-specific expertise variation
Comments: Accepted to ACL 2026 (Main, Long)
Abstract:Large language models (LLMs) are increasingly used by end users, yet existing personalization methods relying on static profiles or text-only signals fail to capture query-specific expertise variation. We present ExPerT, a query-wise personalization framework that adapts LLM responses to users' query domain expertise by combining semantic and behavioral cues. ExPerT consists of two key components: (i) a semantic-behavioral expertise inference module that jointly interprets query text and keystroke dynamics via in-context LLM prompting, and (ii) an expertise-conditioned response generation that adapts the level of detail, terminology, and conceptual complexity. Our user study with 40 participants and 1270 queries demonstrated that ExPerT reduced expertise inference error by 65.7% compared to the strongest baseline (MAE = 0.398 vs. 1.162) and improved response satisfaction by 17.52% (from 3.71 to 4.36) on a 5-point Likert scale.
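The two percentages quoted in the abstract above follow directly from the raw numbers it gives. The short check below recomputes them; the helper name `relative_change` is made up for this example.

```python
def relative_change(baseline: float, new: float) -> float:
    """Signed change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


# MAE drops from 1.162 (strongest baseline) to 0.398 (ExPerT).
mae_reduction = -relative_change(1.162, 0.398)

# Mean satisfaction rises from 3.71 to 4.36 on the 5-point scale.
satisfaction_gain = relative_change(3.71, 4.36)

print(f"MAE reduction: {mae_reduction:.1%}")
print(f"Satisfaction gain: {satisfaction_gain:.2%}")
```

Both figures are relative changes, so the 17.52% gain corresponds to an absolute improvement of 0.65 points on the Likert scale.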
Computer Vision
1. 【2607.02517】WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory
Link: https://arxiv.org/abs/2607.02517
Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Qingyan Bai, Ka Leong Cheng, Yue Yu, Yixuan Li, Yihao Meng, Zichen Liu, Yanhong Zeng, Yujun Shen, Qifeng Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: unrestricted viewpoint exploration, highly controllable video, model framework designed, controllable video world, world model framework
Comments: Project Page: https://worlddirector.github.io/
Abstract:We present WorldDirector, a highly controllable video world model framework designed for persistent dynamic object memory and unrestricted viewpoint exploration. Unlike existing world models that entangle physical dynamics with pixel rendering and rely on continuous visual observation to sustain motion, our framework explicitly decouples semantic motion orchestration from visual generation. By leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, our approach ensures strict physical logic and appearance stability, successfully preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view. Experimental results demonstrate that our method supports the synthesis of complex and extended events with unprecedented controllability and persistent dynamic object memory. Project Page: https://worlddirector.github.io/
2. 【2607.02516】Alignment Is All You Need For X-to-4D Generation
Link: https://arxiv.org/abs/2607.02516
Authors: Qiaowei Miao, Kehan Li, Yawei Luo, Yi Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generative diffusion models, synthesizing high-quality images, Object Distance Alignment, diffusion models excel, Generative diffusion
Comments:
Abstract:Generative diffusion models excel at synthesizing high-quality images, videos, and 3D content under multimodal control. However, arbitrary user-defined modality-to-4D (X-to-4D) generation remains challenging due to the high cost of constructing diverse datasets and the limited scalability of existing methods. This paper presents Align4D, a flexible framework that translates any-modal input into coherent video-3D pairs, using video to guide 4D motion and 3D data to shape 4D geometry. Align4D introduces three key techniques: (1) Object Distance Alignment, which searches Video-Aligned and Multiview-Aligned Object Distances (VAOD/MAOD), respectively, to reconcile 4D renderings with video and the priors of multiview diffusion models; (2) Motion-Geometry Joint Alignment, which constrains known and unknown views through synchronized video and 3D inputs, ensuring consistent 4D generation; and (3) Asynchronous Optimization, which decouples Gaussian attribute and deformation network training to enhance motion and geometry fidelity. We further propose the X4D dataset, which integrates prompt, image, video, and 3D data for benchmarking. Experiments on X4D and Consistent4D demonstrate that Align4D achieves state-of-the-art quality and consistency in X-to-4D generation. Project page: this https URL.
3. 【2607.02515】PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation
Link: https://arxiv.org/abs/2607.02515
Authors: Haofei Xu, Rundi Wu, Philipp Henzler, Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Marc Pollefeys, Andreas Geiger, Federico Tombari, Michael Niemeyer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reconstruction methods, methods often rely, compress geometry, spaces in order, order to leverage
Comments: ICML 2026. Project page: https://haofeixu.github.io/pointdit/
Abstract:State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.
4. 【2607.02508】From SRA to Self-Flow: Data Augmentation or Self-Supervision?
Link: https://arxiv.org/abs/2607.02508
Authors: Dengyang Jiang, Mengmeng Wang, Harry Yang, Jingdong Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: accelerate diffusion transformer, improve generation quality, Representation alignment, generation quality, diffusion transformer training
Comments:
Abstract:Representation alignment has become an effective way to accelerate diffusion transformer training and improve generation quality. Recent self-alignment methods, such as SRA and Self-Flow, further remove the dependency on external pretrained encoders by constructing alignment within the diffusion model itself. However, the mechanism behind the improvement from SRA to Self-Flow, dual-time scheduling, remains under-examined: Self-Flow attributes its gain to interactions between tokens at different noise levels, where cleaner tokens help infer noisier ones. In this work, we revisit this explanation and ask whether the gain instead comes from data augmentation along the noise dimension. To disentangle these factors, we introduce Attention Separation, which preserves the same dual-timestep input as Self-Flow while blocking attention between tokens assigned to different noise levels. Surprisingly, removing such interaction does not degrade performance and can even improve it, suggesting that the improvement from SRA to Self-Flow mainly comes from data augmentation. Furthermore, we show that Attention Separation itself provides an augmentation effect by splitting a single image into multiple effective training parts to expand the training data. Based on these observations, we combine self-representation alignment with dual-timestep and attention-separation augmentation, and demonstrate the effectiveness of this design on ImageNet.
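The core idea of Attention Separation, blocking attention between tokens assigned to different noise levels, amounts to a block-structured attention mask. The sketch below builds such a mask from per-token group ids. It is only an illustration under assumptions: the integer `noise_group` encoding and the function name are invented here, and the abstract does not describe how the authors implement the blocking.

```python
from typing import List


def attention_separation_mask(noise_group: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    noise_group[i] is a hypothetical integer id of the noise level
    (timestep) assigned to token i. Tokens only see tokens that share
    their noise level, so groups never exchange information.
    """
    return [[gi == gj for gj in noise_group] for gi in noise_group]


# Two tokens at noise level 0 and one token at noise level 1.
mask = attention_separation_mask([0, 0, 1])
for row in mask:
    print(row)
```

In a transformer this boolean matrix would typically be applied by setting the disallowed attention logits to negative infinity before the softmax.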
5. 【2607.02504】Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas
Link: https://arxiv.org/abs/2607.02504
Authors: Yuxuan Li, Lingxi Xie, Xinyue Huo, Jihao Qiu, Jiacheng Shao, Pengfei Chen, Jiannan Ge, Kaiwen Duan, Qi Tian
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: comprehensive video understanding, deciphering complex storyline, Long-form TV dramas, video understanding, dramas present
Comments: Accepted to ICML 2026
Abstract:Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storyline often relies on speaker recognition, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce DramaSR-532K, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose DramaSR-LRM, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. All the data and code will be made publicly available at the project page: this https URL.
6. 【2607.02501】Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots
Link: https://arxiv.org/abs/2607.02501
Authors: Ling Xu, Chuyu Han, Borui Li, Hao Wu, Shiqi Jiang, Ting Cao, Chuanyou Li, Sheng Zhong, Shuai Wang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Operating Systems (cs.OS)
Keywords: model-specific Python stacks, robot-side glue code, Python stacks, model-specific Python
Comments: 12 pages, 2 figures, Project website: [this https URL](https://github.com/SEU-PAISys/Embodied.cpp)
Abstract:Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, and robot-side glue code, especially on heterogeneous edge devices. Existing inference runtimes are designed mainly for request-response serving and therefore do not satisfy the runtime contract of embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O. We present Embodied.cpp, a portable C++ inference runtime for embodied models. Based on an architectural analysis of representative VLA models and WAMs, Embodied.cpp captures a shared execution path and organizes it into five layers: input adapters, sequence builders, backbone execution, head plugins, and deployment adapters. The runtime provides modular multi-rate execution, latency-first fused inference, and extensible operator and I/O support, enabling deployment across heterogeneous devices, robots, and simulators through one backend abstraction. We evaluate Embodied.cpp on two VLA models, HY-VLA and pi0.5, and on a preliminary WAM benchmark using a LingBot-VA Transformer block. The VLA deployments achieve successful closed-loop execution with 100.0% and 91.0% task success rates, respectively. The WAM benchmark reduces block memory from 312.2 MiB to 88.1 MiB. These results show that Embodied.cpp improves deployment efficiency while preserving high accuracy across diverse embodied model architectures.
7. 【2607.02497】Seek to Segment: Active Perception for Panoramic Referring Segmentation
Link: https://arxiv.org/abs/2607.02497
Authors: Song Tang, Shuming Hu, Xincheng Shuai, Henghui Ding, Yu-Gang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Existing referring segmentation, passively process static, process static images, static images captured, Active Panoramic Referring
Comments: ECCV 2026, Project Page: [this https URL](https://henghuiding.com/APRS/)
Abstract:Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in the continuous 360$^\circ$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction ($\Delta\theta, \Delta\phi$) to explore the 360$^\circ$ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360$^\circ$ representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker's exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.
8. 【2607.02494】Towards Robustness against Typographic Attack with Training-free Concept Localization
Link: https://arxiv.org/abs/2607.02494
Authors: Bohan Liu, Wenqian Ye, Guangzhi Xiong, Zhenghao He, Sanchit Sinha, Aidong Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Contrastive Language-Image Pretraining, Large Vision Language, Vision Language Models, modern Large Vision, CLIP models exhibit
Comments: 15 pages main text, provisionally accepted to ECCV 2026
Abstract:Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at this https URL.
9. 【2607.02490】Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning
Link: https://arxiv.org/abs/2607.02490
Authors: Liyan Tang, Fangcong Yin, Greg Durrett
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: generating textual chains, Large vision-language models, Large vision-language, chains of thought, reason over multimodal
Abstract:Large vision-language models can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose a novel reinforcement learning training framework VRRL, with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions rather than making early mistakes. Second, we introduce buffered roll-ins from an experience replay buffer to expose the model to diverse failure states that it must learn to correct. We evaluate our approach on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks. While off-the-shelf and conventionally fine-tuned models degrade substantially under distribution shift, our method substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines by using self-reflection effectively.
10. 【2607.02486】GeoMix: Descriptor-Free Visual Localization via Global Context and Multi-Detector Training
Link: https://arxiv.org/abs/2607.02486
Authors: Yejun Zhang, Xinjue Wang, Zihan Wang, Esa Rahtu, Juho Kannala
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: preserves scene privacy, simplifies map maintenance, localization eliminates high-dimensional, high-dimensional descriptor storage, visual localization eliminates
Comments: ECCV 2026
Abstract:Descriptor-free visual localization eliminates high-dimensional descriptor storage, preserves scene privacy, and simplifies map maintenance, yet its accuracy still lags far behind descriptor-based pipelines. We attribute this gap to insufficient geometric discriminability in geometry-only matching. Without visual appearance, current methods underutilize local geometry cues, lack the global context among keypoints, and overfit to a single keypoint detector. We further observe that descriptor-free matching naturally enables multi-detector training, as heterogeneous keypoints can be optimized in a shared geometry-only space without aligning descriptor spaces. Building on these insights, we propose GeoMix, a descriptor-free 2D-3D matching framework that strengthens geometric discriminability at three levels. Locally, directional and distance-aware embeddings enrich neighborhood aggregation with fine-grained spatial structure. Globally, learnable context nodes aggregate and redistribute scene-wide information via cross-attention to resolve ambiguities beyond local receptive fields. At the training level, Mix-Training exploits this detector-agnostic geometry space to learn representations across multiple keypoint detectors. Extensive experiments on MegaDepth, Cambridge Landmarks, 7Scenes, and Aachen Day-Night show that GeoMix sets a new state of the art among descriptor-free methods, reducing 75th-percentile rotation error by 89% and translation error by up to 90% over the previous best, while generalizing zero-shot to unseen detectors and narrowing the gap to descriptor-based pipelines. Code is available at this https URL.
11. 【2607.02484】Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning
Link: https://arxiv.org/abs/2607.02484
Authors: Xuehui Wang, Xuankun Yang, Wei Shen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: redundant image patches, compressing redundant image, preserve critical cues, image patches, crucial strategy
Comments: Accepted to ECCV 2026
Abstract:Visual token pruning is a crucial strategy for accelerating VLMs by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Subsequently, instead of naive Top-K selection, EADP casts token selection as a submodular maximization problem with a spatial prior, explicitly ensuring a holistic and non-redundant visual representation. Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off of VLMs, robustly preserving fine-grained visual cues under strict token budgets while achieving SoTA performance on challenging multimodal benchmarks.
12. 【2607.02479】EAGLE-360: Embodied Active Global-to-Local Exploration in 360$^\circ$
Link: https://arxiv.org/abs/2607.02479
Authors: Jingtao Xu, Zizhuo Lin, Jianwen Sun, Yi Yang, Yawei Luo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, exposes fundamental limitations
Comments: Preprint
Abstract:While Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in standard visual understanding, adapting them for active visual search in 360$^\circ$ panoramic environments exposes fundamental limitations. Specifically, standard MLLMs struggle to effectively model inherent panoramic properties, such as severe polar distortion and continuous cylindrical topologies, which significantly degrades target detection accuracy. Consequently, existing panoramic search methods attempt to compensate by relying heavily on fragmented local viewpoints. Burdened by rigid initialization and a lack of global panoramic priors, these approaches suffer from myopic, inefficient exploration and struggle with robust error recovery when targets are out of view. To overcome these challenges, we propose EAGLE-360, a novel Embodied Active Global-to-Local Exploration framework. Rather than performing exhaustive local searches, EAGLE-360 leverages global priors to establish an initial holistic perspective, iteratively reasoning and progressively narrowing the search space. Architecturally, we adapt RoPE Rolling, a coordinate-shifting positional encoding mechanism, to seamlessly model the continuous topologies of panoramas. To facilitate this paradigm, we construct the large-scale EAGLE-360 dataset, comprising 14,000+ 4K panoramas and 70,000+ rounds of high-quality VQA dialogues. By employing a training pipeline that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), we effectively elicit complex spatial reasoning and tool-calling capabilities. Extensive experiments demonstrate that EAGLE-360 establishes a new state-of-the-art for 360$^\circ$ visual search, achieving nearly an 8-fold increase in accuracy over the base model while significantly enhancing exploration efficiency.
13. 【2607.02471】Interpretation-Oriented Cloud Removal via Observation-Anchored Residual Flow with Geo-Contextual Alignment
Link: https://arxiv.org/abs/2607.02471
Authors: Ziyao Wang, Maonan Wang, Yucheng He, Xianping Ma, Ziyi Wang, Hongyang Zhang, Yirong Cheng, Man-on Pun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: optical remote sensing, Cloud removal, Geo-Anchored Cloud Removal, remote sensing, change detection
Comments: Accepted to ECCV 2026
Abstract:Cloud removal (CR) is essential for optical remote sensing, serving as a prerequisite for reliable downstream interpretation, such as semantic segmentation and change detection. However, existing CR approaches often prioritize visual realism while overlooking their impact on subsequent analytical tasks, leading to semantic drift and degraded downstream performance. To address this issue, we propose Geo-Anchored Cloud Removal (GACR), a unified framework that jointly ensures faithful reconstruction and robust interpretability. At its core, GACR incorporates Observation-Anchored Residual Flow (OAR-Flow), which reformulates CR as a physically grounded residual inversion process. By anchoring the generative trajectory to the cloudy observation rather than pure noise, OAR-Flow enables fast, stable, and faithful reconstruction. To further preserve semantic structures critical for downstream interpretation, GACR integrates Geo-Contextual Prior Alignment (GCPA) to constrain the reconstruction within a semantic manifold induced by a Vision Foundation Model (VFM). Consequently, GACR strictly maintains the spatial-semantic integrity of complex landscapes. Extensive experiments across six CR datasets and twelve downstream tasks demonstrate that GACR produces superior reconstruction quality while consistently improving downstream task accuracy. The code is available at this https URL.
14. 【2607.02461】OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers
Link: https://arxiv.org/abs/2607.02461
Authors: Donghyun Lee, Jitesh Chavan, Duy Nguyen, Sam Huang, Liming Jiang, Priyadarshini Panda, Timo Mertens, Saurabh Shukla
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: make inference expensive, growing parameter count, parameter count make, count make inference, inference expensive
Abstract:Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. In this basis, a randomized permuted block-Hadamard (RPBH) rotation concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd-Max codebook serves all timesteps, prompts, and layers of a given input dimension. We extend the same quantizer to weight rows offline, absorbing the rotation into the weights so that it cancels inside each linear layer and only a forward rotation on the activations remains at runtime. The same recipe transfers from image to video with no per-modality tuning. Across FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, it sets the state of the art for PTQ at several low-bit settings. It also pushes PTQ of image diffusion transformers to W2A4 with usable generation quality.
15. 【2607.02435】MARVEL: Margin-Aware Robust von Mises-Fischer Expert Learning for Long-Tailed Out-of-Distribution Detection
Link: https://arxiv.org/abs/2607.02435
Authors: A.S. Anudeep, Vaanathi Sundaresan
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: automated diagnostic systems, diagnostic systems remain, models routinely misclassify, deep models routinely, robust OOD detection
Abstract:For clinical deployment, it is essential that automated diagnostic systems remain reliable when confronted with previously unseen cases, yet deep models routinely misclassify out-of-distribution (OOD) inputs with high confidence, underscoring the need for more robust OOD detection methods. Although substantial effort has been devoted to improving model robustness, most of the existing literature assumes balanced datasets, evaluates OOD detection on coarse or non-clinical OOD sources, or lacks comprehensive assessment across diverse OOD scenarios. To address the gaps, we propose a novel methodology trained on diverse and imbalanced medical datasets and evaluated across a clinically reflective OOD spectrum. Our framework comprises three key components: (1) a Nonlinear von Mises-Fisher (NvMF) classifier capable of learning non-linear decision boundaries, with theoretical proof of its asymptotic connection to cosine classifiers; (2) a multi-expert framework in which margin-aware NvMF classifiers specialise in different regions of label distribution to better handle imbalance; and (3) an outlier expert trained explicitly to distinguish inlier from outlier data, thereby strengthening OOD detection. Evaluation on RFMiD, ISIC2019, and NCTCRC datasets demonstrates consistent improvements over state-of-the-art methods, achieving mean FPR95 reductions of 8.45%, 13.02%, and 36.90% respectively. These gains are further supported by comprehensive ablations that validated the contributions of each component. This enables reliable identification of unfamiliar cases for deferral to clinicians, supporting safer AI-assisted diagnosis in real-world workflows. Our code is available at this https URL.
16. 【2607.02425】Learning to Evolve Scenes: Reasoning about Human Activities with Scene Graphs
Link: https://arxiv.org/abs/2607.02425
Authors: Francesca Pistilli, Simone Alberto Peirone, Giuseppe Averta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: behavior while interacting, surrounding world, applications of embodied, scene, Understanding human behavior
Comments: Project page at [this https URL](https://francescapistilli.github.io/GLEN)
Abstract:Understanding human behavior while interacting with the surrounding world is crucial for many applications of embodied AI. First-person videos are particularly informative for this problem, as they capture well how activities reshape the scene over time. However, existing approaches often rely on implicit visual or language-aligned representations, disregarding structured reasoning over the scene dynamics. We argue that explicit, compositional and editable representations of human-environment interactions can play a crucial role for rich grounded activity understanding. To this end, we introduce SG-Ego, a large-scale annotation set extending Ego4D with spatio-temporal scene graphs, where relation triplets are consolidated over time into explicit time-evolving descriptions of the scene state. To reason over this representation, we propose GLEN, a graph-based model that operates over scene graph sequences to both align them with textual actions and model their temporal evolution. In addition, we formulate the activity-driven graph-edit forecasting (A-GEF) problem, a novel task that casts scene dynamics as a sequence of structured transformations conditioned on ongoing actions, enabling explicit reasoning about how scenes change over time. We validate our approach across multiple downstream tasks, spanning retrieval benchmarks such as EgoMCQ and EgoCVR, as well as long-horizon reasoning benchmarks such as EXPLORE-Bench and the newly introduced A-GEF. GLEN achieves strong results compared to raw video baselines and excels in reasoning settings typically addressed only with MLLMs, while enabling controllable and structured predictions of scene dynamics driven by human activities. We believe our results establish spatio-temporal scene graphs, together with models that reason over them, as strong compositional and interpretable representations for video understanding and potentially beyond.
17. 【2607.02421】Wavelet-Guided Semantic Signal Compensation for Inversion-Free Image Editing
Link: https://arxiv.org/abs/2607.02421
Authors: Anqi Tang, Wenhao Sun, Zhaoqiang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Text-guided image editing, modify visual content, image editing aims, Text-guided image, aims to modify
Comments: Accepted to ECCV 2026
Abstract:Text-guided image editing aims to modify visual content according to a target prompt while preserving the background. Recent inversion-free image editing frameworks such as FlowEdit have demonstrated strong editing capability without requiring inversion. Empirically, FlowEdit can achieve substantial semantic changes under appropriate hyperparameter settings. However, we observe that under certain global attribute shifts, the editing trajectory may not effectively move away from the source distribution in the early timesteps. Our analysis suggests that in the high-noise regime, the dominant manifold-seeking flow toward the data manifold can reduce the influence of the text-conditioned direction, leading to limited global modification while background structures remain only moderately preserved. Inspired by this observation, we propose an inversion-free, frequency-aware semantic compensation strategy that strengthens the effective signal in the early stage of generation, while maintaining structural consistency in the background. The proposed method improves global editing capacity without sacrificing background fidelity.
18. 【2607.02417】LIME: Learning Intent-aware Camera Motion from Egocentric Video
Link: https://arxiv.org/abs/2607.02417
Authors: Boyang Sun, Jiajie Li, Yung-Hsu Yang, Chenyangguang Zhang, Tim Engelbracht, Sunghwan Hong, Cesar Cadena, Marc Pollefeys, Hermann Blum
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Autonomous robots, language-conditioned camera motion, camera motion, language-conditioned camera, camera motion remains
Abstract:Autonomous robots often need to move their camera before they can act: to inspect an object, reveal an occluded region, or obtain a view that responds to a user's intent. While vision-language navigation translates instructions to base motion and vision-language-action policies map instructions to manipulation actions, language-conditioned camera motion remains comparatively underexplored as a first-class action. We formulate language-conditioned camera motion generation: given a current RGB observation and a free-form natural-language intent, predict a relative target camera pose for the next observation. This task is inherently non-trivial: viewpoint changes are driven by latent perceptual intentions, and a valid motion may operate at different semantic granularity, from entering a room to looking around a corner, inspecting a visible object, or revealing an occluded detail. To model this structure, we mine multi-intention camera-motion supervision from egocentric video, pairing plausible intents and observation-gain descriptions with relative SE(3) target poses. We propose LIME, a vision-language camera-motion generator that combines an auto-regressive observation-gain output with a continuous flow-matching pose head. This design lets the model jointly predict what the next view should reveal while representing multi-hypothesis target views. Across experiments and downstream robotic tasks, we show that LIME can learn to actively choose camera poses from passive human video, turning ordinary egocentric recordings into supervision for intent-aware active perception.
19. 【2607.02407】Text-Driven 3D Indoor Scene Synthesis in Non-Manhattan Environments
Link: https://arxiv.org/abs/2607.02407
Authors: Xianhui Meng, Zirui Song, Yuchen Zhang, Li Zhang, Yongxuan Lv, Xiuying Chen, Kun Wang, Yan Luo, Kai Chen, Hangjun Ye, Long Chen, Jun Liu, Xiaoshuai Hao
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, demonstrated remarkable capabilities, Language Models, Large Language, demonstrated remarkable
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in 3D indoor synthesis for Manhattan environments. However, existing methods often fail to capture plausible object layout patterns in non-Manhattan settings, primarily because they struggle to model non-orthogonal spatial relationships, leading to high geometric violations and low physical fidelity. To address this challenge, we propose SPG-Layout, a novel text-driven framework designed to generate physically plausible indoor scenes within complex non-Manhattan environments. Specifically, we first utilize statistical priors of object distributions to guide the training process, enhancing environmental understanding and fidelity. Furthermore, mirroring human design workflows, we adopt a hierarchical layout strategy that prioritizes the placement of large objects, thereby substantially minimizing layout violations. By synergizing these components, SPG-Layout achieves a balanced optimization of semantic realism and physical plausibility. To evaluate performance in these complex settings, we constructed a new benchmark comprising 500 diverse non-Manhattan environments. Extensive experiments demonstrate that SPG-Layout consistently and significantly outperforms existing methods across both Manhattan and non-Manhattan environments. The code will be publicly released.
20. 【2607.02404】Object-centric LeJEPA
Link: https://arxiv.org/abs/2607.02404
Authors: Jakob Geusen, Ender Konukoglu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: deliver strong features, typically require large, Image encoders trained, large training datasets, image-level self-supervised methods
Abstract:Image encoders trained with LeJEPA can deliver strong features for downstream tasks, but, like other image-level self-supervised methods, typically require large training datasets. Aligning representations at the level of objects rather than whole scenes promises greater data efficiency, but doing this in a completely self-supervised way, effectively jointly partitioning a scene and representing its objects, is unstable: the two are locked in a cyclic dependency, partitioning requires meaningful representations, while meaningful representations require consistent partitioning. We sidestep this instability by taking object masks as given during training, using cheap, off-the-shelf SAM proposals. We extend LeJEPA - whose distributional anti-collapse objective ports naturally from whole images to variable-sized sets of objects - to align object-centric representations rather than whole images. An additional instance-separating loss, which treats other objects in the same scene as negatives, further boosts downstream performance. Across two model scales and 10-100% of COCO, object-level LeJEPA outperforms image-level LeJEPA on tracking (DAVIS), classification (ImageNet-1k), segmentation (ADE20k), and re-identification (NAVI).
21. 【2607.02403】ACID: Action Consistency via Inverse Dynamics for Planning with World Models
Link: https://arxiv.org/abs/2607.02403
Authors: Gawon Seo, Dongwon Kim, Suha Kwak
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: popular paradigm, paradigm for embodied, action-conditioned world models, Decision-time planning, decision-time planning framework
Comments: Project Page: [this https URL](https://gawon1224.github.io/ACID/)
Abstract:Decision-time planning with action-conditioned world models has become a popular paradigm for embodied control. However, the standard planning cost judges a candidate solely by how close its predicted terminal state lies to the goal, leaving the realizability of the intermediate transitions unchecked -- a predicted trajectory can look convincing while the environment rollout drifts away from it. In this paper, we propose ACID, a decision-time planning framework that introduces cycle action consistency: the action inferred backward from a predicted transition by an inverse dynamics model should recover the one that was conditioned on. We fold this per-step residual into the planning cost via a scale-invariant adaptive weight. Across four action-conditioned world models and six tasks spanning rigid and deformable manipulation, articulated control, and visual navigation, ACID consistently improves planning and matches the baseline's accuracy with substantially less planning compute.
22. 【2607.02402】Show Me Examples: Inferring Visual Concepts from Image Sets
Link: https://arxiv.org/abs/2607.02402
Authors: Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Angel Bautista, Josh Susskind, Björn Ommer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: complex textual instructions, follow complex textual, Vision-language models, textual instructions, follow complex
Comments: for code, view [this https URL](https://github.com/CompVis/set-learner)
Abstract:Vision-language models (VLMs) can follow complex textual instructions, yet they struggle to reason from purely visual context. In particular, current models fail to infer shared concepts from sets of example images and apply them to new inputs. We introduce Visual Concept Inference from Sets (VICIS), a task that evaluates this capability. Given a small context set of images sharing a concept and a query image, the model must generate new images that preserve the context-defined concept while remaining consistent with the query. We show that state-of-the-art VLMs perform poorly on this task, often ignoring the visual context or defaulting to biased generations. To address this gap, we propose a training framework and architecture that learn to infer visual concepts from image sets and extract concept-specific embeddings from queries. Experiments on synthetic data and large-scale ImageNet/WordNet data show that our model generates more accurate and diverse outputs and generalizes to unseen concepts and modalities such as sketches.
23. 【2607.02386】Transformer Geometry Observatory TGO-II: Representational Similarity Observatory
Link: https://arxiv.org/abs/2607.02386
Authors: Kaustubh Kapil, Kishor P. Upla
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: remains insufficiently understood, achieved remarkable success, training remains insufficiently, language applications, insufficiently understood
Comments:
Abstract:While Vision Transformers have achieved remarkable success across computer vision and language applications, the geometric evolution of their internal representations throughout training remains insufficiently understood. Existing analyses primarily focus on attention mechanisms and downstream performance, leaving the evolution of representation geometry largely unexplored. In this work, we present Transformer Geometry Observatory-II (TGO-II), a representation geometry analysis framework designed to investigate how Transformer representations evolve during supervised training. TGO-II analyzes Vision Transformer (ViT-Small/16) representations using Centered Kernel Alignment (CKA), Singular Vector Canonical Correlation Analysis (SVCCA), Two-Nearest Neighbor Intrinsic Dimensionality (TwoNN-ID), and token covariance analysis. Our experiments reveal three key observations. First, both CKA and SVCCA progressively decrease throughout training, indicating increasing representational specialization across Transformer layers. Second, intrinsic dimensionality consistently increases before stabilizing, suggesting progressive expansion of the representation manifold into a larger set of locally accessible degrees of freedom. Third, token covariance and coupling analyses demonstrate that strong token interaction structure persists throughout training, challenging the hypothesis that increasing representational complexity arises primarily from progressive token independence. These findings suggest that representation complexity and layer specialization emerge simultaneously during training. Manifold expansion appears to occur without token decoupling. Together, these observations motivate a new hypothesis in which Vision Transformers increase representational complexity through progressively richer transformations while preserving strong token interaction structure during learning.
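For readers unfamiliar with Centered Kernel Alignment, one of the similarity measures named in this abstract, here is a minimal sketch of the commonly used linear variant: with column-centered activation matrices X and Y (rows are examples), CKA = ||XᵀY||_F² / (||XᵀX||_F · ||YᵀY||_F). The function names are illustrative, and nothing here is taken from TGO-II itself, which may use a different kernel or estimator.

```python
import math


def center_columns(X):
    """Subtract the per-feature mean from every row of X (a list of rows)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def cross_frobenius_sq(X, Y):
    """Return ||X^T Y||_F^2 for two matrices with the same number of rows."""
    total = 0.0
    for a in range(len(X[0])):
        for b in range(len(Y[0])):
            entry = sum(x[a] * y[b] for x, y in zip(X, Y))
            total += entry * entry
    return total


def linear_cka(X, Y):
    """Linear CKA between two sets of activations for the same examples.

    Assumes neither matrix is constant across examples, otherwise the
    denominator is zero.
    """
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(
        cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc)
    )
    return numerator / denominator


# Toy activations: 4 examples, 2 features.
layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# The same activations rotated by 90 degrees and scaled by 3.
layer_b = [[-3.0 * y, 3.0 * x] for x, y in layer_a]
similarity = linear_cka(layer_a, layer_b)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so the rotated and scaled copy scores a similarity of 1 up to floating-point error.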
24. 【2607.02375】Representation Distribution Matching for One-Step Visual Generation
Link: https://arxiv.org/abs/2607.02375
Authors: Lan Feng, Wuyang Li, Eloi Zablocki, Matthieu Cord, Alexandre Alahi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Representation Distribution Matching, reference feature distributions, frozen pretrained encoders, Distribution Matching, feature distributions
Comments:
Abstract:We elucidate the design space of Representation Distribution Matching (RDM), our name for the paradigm that trains a one-step image generator by matching generated and reference feature distributions under frozen pretrained encoders. We identify two design axes, how the distributions are compared and the representations they are compared in, and controlled studies along them yield three findings. First, the classical MMD, which could not train convincing generators a decade ago, becomes a strong and scalable objective once estimated right. Second, the generated batch is then the operative variable, with an optimum above 2048, far beyond customary batch sizes. Third, any single representation can be gamed, driven below the real score while images stay visibly fake, so we match against a balanced battery of encoders and evaluate with SW_r14, a Sliced-Wasserstein distance over 14 encoders that is independent of the training loss and resists gaming. Combining the preferred choices yields improved RDM (iRDM): it sets the one-step state of the art on ImageNet at SW_r14 1.30, corroborated by PickScore, a human-preference proxy our objective never optimizes, which prefers it over the prior best one-step generator on 71.2% of matched samples. The same recipe post-trains the four-step FLUX.2 [klein] into a one-step generator, surpassing the four-step version on GenEval, 0.826 to 0.794, and on PickScore, 22.76 to 22.58, in 90 H200 GPU-hours. Project page: this https URL.
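To make the "classical MMD" objective mentioned in this abstract concrete, the sketch below computes the textbook biased (V-statistic) estimate of squared Maximum Mean Discrepancy between two sets of feature vectors using an RBF kernel. The kernel choice, the bandwidth, and the function names are assumptions for illustration; the abstract does not say which estimator or kernel the authors consider "estimated right", so this should not be read as their formulation.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian RBF kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mean_kernel(A, B, bandwidth):
    """Average kernel value over all pairs (a, b) with a in A and b in B."""
    total = sum(rbf_kernel(a, b, bandwidth) for a in A for b in B)
    return total / (len(A) * len(B))


def mmd_squared(generated, reference, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    return (
        mean_kernel(generated, generated, bandwidth)
        + mean_kernel(reference, reference, bandwidth)
        - 2.0 * mean_kernel(generated, reference, bandwidth)
    )


# Toy "encoder features": the reference set, a near copy, and a far copy.
reference = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
near = [[x + 0.1, y + 0.1] for x, y in reference]
far = [[x + 3.0, y + 3.0] for x, y in reference]

near_gap = mmd_squared(near, reference)
far_gap = mmd_squared(far, reference)
```

The estimate is zero when the two sets coincide and grows as the generated features drift away from the reference features, which is what makes it usable as a training signal in feature space.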
25. 【2607.02372】Learning Spectral and Polarimetric Clues for One-to-Multimodal Novel View Synthesis
Link: https://arxiv.org/abs/2607.02372
Authors: Federico Lincetto, Gianluca Agresti, Mattia Rossi, Piergiorgio Sartor, Pietro Zanuttigh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Neural rendering techniques, Neural rendering, geometry and color, color appearance, Implicit Learned Representation
Comments: Accepted at ECCV 2026. Project page: [this https URL](https://medialab.dei.unipd.it/paper_data/SPoILeR/)
Abstract:Neural rendering techniques allow for accurate reconstruction of the geometry and color appearance of 3D scenes. Some methods have extended their use to additional imaging modalities, such as multispectral, infrared, or polarimetric data. However, all of these approaches require expensive sensors and calibrated setups to capture new multimodal frames for each new scene. We propose Spectral and Polarimetric Implicit Learned Representation (SPoILeR), a novel method to obtain multi-view consistent renderings of unconventional modalities for scenes where either only RGB frames or very few of the additional modalities are available. Thanks to a multimodal pre-training phase, the model learns the mutual correlation between different modalities. This step allows predicting accurate renderings of unconventional modalities during a fine-tuning phase supervised only by RGB images. Experimental results show that the approach can accurately render infrared, polarimetric, and multispectral frames for scenes where no input sample captured by these types of sensors is provided.
26. 【2607.02371】VisionAId: An Offline-First Multimodal Android Assistant for People with Visual Impairment, Featuring Personalized Object Retrieval
Link: https://arxiv.org/abs/2607.02371
Authors: Cristian-Gabriel Florea, Stelian Spînu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: million people worldwide, remain persistent obstacles, people worldwide live, handling cash remain, cash remain persistent
Comments: 8 pages, 4 figures. Project repository available at: [this http URL](http://github.com)
Abstract:Over 285 million people worldwide live with a visual impairment, for whom everyday tasks such as avoiding obstacles, locating personal belongings, recognizing familiar faces, or handling cash remain persistent obstacles to personal autonomy. Existing assistive applications are typically limited to recognizing predefined categories, depend heavily on cloud connectivity, or require dedicated hardware. We present VisionAId, an Android application that turns a commodity smartphone into a real-time visual assistant. The system integrates six on-device deep learning models (metric monocular depth estimation, instance segmentation, visual and facial embeddings, face detection, and a custom banknote detector) running entirely through ONNX Runtime, with an optional cloud large language model (Google Gemini Flash) used only for narrative scene description and automatic object labeling. A distinctive contribution is a few-shot pipeline for personal objects: the user photographs an object from several angles, and the system later locates that specific instance in the environment, guiding the user toward it with augmented-reality markers, spatial audio, and distance-proportional haptics. All feedback is multimodal (Romanian speech synthesis, voice commands, vibration). On a reference device (Samsung Galaxy S21 Ultra), INT8 quantization reduces depth latency from ~1200 ms to ~491 ms, the custom banknote detector reaches an mAP@50 of 0.986, and metric depth is calibrated to below 1 cm of error within 3 m.
27. 【2607.02360】GAP-GDRNet: Geometry-Aware Monocular Visual Pose Sensing on a Single-Target Synthetic Spacecraft Dataset
Link: https://arxiv.org/abs/2607.02360
Authors: Yonglong Zhang, Yang Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: central perception problem, on-orbit servicing, Monocular relative pose, central perception, perception problem
Comments:
Abstract:Monocular relative pose sensing is a central perception problem in non-cooperative rendezvous and on-orbit servicing. In spacecraft images, however, weak surface texture, thin appendages, illumination changes, and partial occlusion often leave only sparse and unstable geometric evidence. This article presents GAP-GDRNet, a geometry-aware attention-enhanced framework for monocular RGB-based 6D pose sensing. The method follows the geometry-guided direct regression paradigm of GDR-Net and modifies two points in the pipeline: an attention-based feature refinement (AFR) module is placed before dense geometric prediction, and a patch-level geometric self-attention (PGSA) module is inserted into Patch-PnP. AFR reinforces global spacecraft structure together with local weak-texture cues; PGSA then relates downsampled geometric patches before final pose regression. A Blender-based annotation process supplies target masks, visible-region masks, dense model-coordinate maps, camera intrinsics, and 6D pose labels for supervised training.
28. 【2607.02322】The Moving Eye: Enhancing VLA Spatial Generalization via Hybrid Dynamic Data Collection
Link: https://arxiv.org/abs/2607.02322
Authors: Jincheng Tang, Yilong Zhu, Zhengyuan Xie, Jiang-Jiang Liu, Jiaxing Zhang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown remarkable promise, generalized robotic manipulation, shown remarkable, remarkable promise, promise in generalized
Comments: IROS 2026
Abstract:Vision-Language-Action (VLA) models have shown remarkable promise in generalized robotic manipulation. However, their spatial generalization remains fragile. We argue that simply increasing the number of viewpoints is insufficient. Models often fall into the trap of Shortcut Learning, latching onto spurious correlations (e.g., fixed relative poses between objects or between the camera and robot base) rather than learning true spatial relationships. In this work, we propose a data-centric solution to enhance VLA spatial generalization. We utilize a dual-arm setup where one arm performs manipulation while the other serves as a mobile environmental camera. We systematically evaluate three data distribution patterns: Fixed, Multi-Fixed, and Moving Views. Our findings reveal that a hybrid strategy, combining continuous camera motion with diverse static viewpoints, yields the best performance by substantially reducing spurious correlations while maintaining training stability. Our experiments demonstrate that this strategy mitigates spurious correlations, enabling VLAs to generalize to unseen camera poses and object configurations where simply adding more static viewpoints fails. Crucially, we reveal that the susceptibility to shortcut learning and the struggle with spatial generalization are universal characteristics shared across diverse architectures. Consequently, all evaluated models (ACT, Diffusion, and VLA models including Pi0 and Gr00t) benefit significantly from our mixed data strategy.
29. 【2607.02317】NEvo: Neural-Guided Evolutionary Video Synthesis for Dynamic Visual Selectivity
Link: https://arxiv.org/abs/2607.02317
Authors: Yingtian Tang, Sogand Salehi, Ming Zhou, Amir Zamir, Leyla Isik, Martin Schrimpf
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: functionally specialized regions, human brain processes, hierarchically organized, functionally specialized, brain processes dynamic
Comments: 10 pages, 6 figures
Abstract:The human brain processes dynamic visual input through hierarchically organized, functionally specialized regions. While recent in silico brain encoding models can synthesize optimal stimuli to probe selectivity in different brain regions, prior work has been largely limited to static images, leaving dynamic visual processing underexplored. We introduce a novel neural-guided video synthesis framework that generates stimuli optimized for target brain regions across visual cortex. Our method performs evolutionary search over a structured prompt space, guided by a dynamic encoding model that predicts voxel-level responses to video inputs. By maximizing predicted activity for a target ROI, the framework efficiently discovers hyper-activating dynamic stimuli that consistently surpass handcrafted localizer videos. The synthesized videos recover known selectivities across ventral, dorsal, and lateral pathways, and further reveal systematic differences in sensitivity to temporal dynamics. A searchlight analysis provides new insight into the progression toward increasingly complex social-dynamic features along the lateral stream, further supported by probing with synthesized abstract, non-naturalistic stimuli. Taken together, our framework enables in silico exploration of dynamic visual selectivity, with new predictions for in vivo experiments.
30. 【2607.02301】InvSplat: Inverse Feed-Forward Scene Splatting
Link: https://arxiv.org/abs/2607.02301
Authors: Polina Karpikova, Wenjing Bian, Haofei Xu, Hendrik Lensch, Andreas Geiger
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Inverse rendering aims, meaningful material properties, properties from images, aims to recover, Inverse rendering
Comments:
Abstract:Inverse rendering aims to recover both 3D geometry and physically meaningful material properties from images, enabling applications such as relighting and novel view synthesis. Optimization-based methods achieve high fidelity but require costly per-scene fitting, while image-space learning-based approaches often suffer from multi-view inconsistencies and lack an explicit 3D representation for stable novel view rendering. We present a feed-forward multi-view reconstruction framework for inverse rendering that directly predicts a structured 3D Gaussian representation with intrinsic material attributes. Each Gaussian primitive is parameterized by mean, normal, opacity, rotation, scale, albedo, metallic, and roughness, enabling a disentangled and physically grounded scene representation. Our model integrates priors from a material estimation network with a multi-view 3D reconstruction backbone, allowing joint prediction of geometry and reflectance parameters in a single forward pass. Experiments on synthetic and real-world datasets demonstrate improved multi-view consistency compared to 2D baselines, accurate material recovery, and stable novel view rendering. Our representation further supports physically-based relighting and more faithful modeling of view-dependent effects compared to existing RGB-based feed-forward reconstruction methods. Our project webpage is: this https URL.
31. 【2607.02300】Search-based Testing of Vision Language Models for In-Car Scene Understanding
Link: https://arxiv.org/abs/2607.02300
Authors: Lev Sorokin, Chen Yang, Ken E. Friedl, Andrea Stocco
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Keywords: in-car scene understanding, driver distraction, ambient lighting, supports drivers, automotive domain
Comments: Accepted at the Industry Track of the 41st IEEE/ACM International Conference on Automated Software Engineering (ASE 2026)
Abstract:In the automotive domain, in-car scene understanding (ISU) enables the detection of safety-critical events, such as driver distraction, and supports drivers or passengers by analyzing the in-car scene and adapting the environment (e.g., ambient lighting). The industry is increasingly exploring vision-language models (VLMs) to interpret camera-recorded in-car scenes and extract information for downstream reasoning tasks. However, VLMs may generate incomplete, erroneous, or misleading scene descriptions, highlighting the need for systematic testing. Collecting real in-vehicle data is costly, difficult to scale, and often infeasible, particularly in early design stages. In this paper, we present ISU-Test, an automated testing approach that combines rendering-based scene generation with search-based testing to evaluate ISU systems. By framing testing as an optimization problem and systematically modifying scene parameters, our method generates diverse in-car scenarios and explores a wide range of configurations. We evaluate ISU-Test on both an industrial prototype and open-source VLMs across two case studies: question answering and captioning, comparing against randomized scenario generation. Results show that ISU-Test significantly outperforms the baseline, achieving up to 10 times higher failure rates and up to 3.6 times higher failure coverage.
32. 【2607.02299】Dual-Selective Network for Domain-Incremental Change Detection
Link: https://arxiv.org/abs/2607.02299
Authors: Yuzhi He, Junxi Huang, Haorui Wu, Jiahui Qu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Domain-incremental change detection, continuously adapts models, Domain-incremental change, preserving prior knowledge, continuously adapts
Comments: International Conference on Artificial Neural Networks, ICANN-2026
Abstract:Domain-incremental change detection (DICD) continuously adapts models to new geographic domains while preserving prior knowledge. However, a structural mismatch exists: the label space remains fixed while domain characteristics vary drastically. Consequently, incremental models struggle to maintain stable spatial change representations across domains. Existing strategies, such as replay-based or regularization-based methods, often fail to scale to long domain sequences, leading to knowledge degradation or increased computational cost. We propose Dual-Selective Incremental Network (DSINet), a unified framework built on visual state space models. DSINet leverages Mamba's input-dependent selective mechanism through a selective spatial state unit (S3U). This unit preserves stable spatial change structures while filtering domain-specific variations during feature propagation. As a result, spatial representations remain stable across domains, preventing the accumulation of feature confusion over incremental steps. Additionally, we employ a concentration-balanced distillation (CBD) strategy to stabilize knowledge transfer across domains. It balances hardness and confidence concentration effects during incremental updates. This ensures reliable probability mass allocation and prevents over-smoothing or mode collapse during distillation. Together, these mechanisms maintain stable learning dynamics throughout incremental stages. Experimental results demonstrate that DSINet mitigates knowledge degradation across long domain sequences while maintaining the linear computational efficiency of state space models.
33. 【2607.02298】Real-Time Visual Intelligence on Low-Cost UAVs: A Modular Approach for Tracking, Scanning, and Navigation
Link: https://arxiv.org/abs/2607.02298
Authors: Andrei-Marian Ungureanu, Stelian Spînu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: civil applications alike, rapidly transforming modern, transforming modern warfare, applications alike, rapidly transforming
Comments: 6 pages, 5 figures. Project repository available at: [this http URL](http://github.com)
Abstract:Autonomous drones are rapidly transforming modern warfare and civil applications alike. This paper presents the development of an integrated intelligent drone system designed to serve as a personal assistant. Leveraging the DJI Tello drone platform, we implemented a modular architecture that integrates three core artificial intelligence functionalities: facial detection, facial recognition, and depth estimation from monocular vision. A web-based interface enables seamless drone control and real-time video monitoring, while a Python-based server processes visual data and executes inference pipelines using lightweight neural models optimized for embedded systems. Unlike existing commercial solutions, this system emphasizes accessibility, low-cost hardware, and open-source technologies. The system demonstrates robust performance in real-world conditions, including person tracking, indoor scanning, and autonomous line following using virtual sensors. This project validates the applicability of advanced AI techniques in real-time robotic systems and illustrates the feasibility of deploying them on constrained hardware, providing a foundation for future research in autonomous UAVs for military, rescue, and surveillance missions.
34. 【2607.02291】Optimizing Visual Generative Models via Distribution-wise Rewards
Link: https://arxiv.org/abs/2607.02291
Authors: Ruihang Li, Mengde Xu, Shuyang Gu, Leigang Qu, Fuli Feng, Han Hu, Wenjie Wang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Conventional reinforcement learning, visual generation typically, reinforcement learning strategies, generation typically employ, typically employ sample-wise
Comments: ICML 2026 Main
Abstract:Conventional reinforcement learning strategies for visual generation typically employ sample-wise reward functions, yet this practice frequently results in reward hacking that degrades image diversity and introduces visual anomalies. To address these limitations, we present a novel framework that finetunes generative models using distribution-wise rewards, ensuring better alignment with real-world data distributions. Unlike rewards that evaluate samples individually, distribution-wise reward accounts for the data distribution of the samples, mitigating the mode collapse problem that occurs when all samples optimize towards the same direction independently. To overcome the prohibitive computational cost of estimating these rewards, we introduce a subset-replace strategy that efficiently provides reward signals by updating only a small subset of a generated reference set. Additionally, we apply RL to optimize post-hoc model merging coefficients, potentially mitigating the train-inference inconsistency caused by introducing stochastic differential equation (SDE) in regular RL practices. Extensive experiments show our approach significantly improves FID-50K across various base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2. Qualitative evaluation also confirms that our method enhances perceptual quality while preserving sample diversity.
35. 【2607.02290】DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing
Link: https://arxiv.org/abs/2607.02290
Authors: Zhaokai Wang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Yiguo He, Mohan Zhang, Leyao Gu, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Xuanhe Zhou, Zhihang Zhong, Junchi Yan, Xue Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: precise spatial relations, visually appealing natural, Recent image generation, appealing natural images, produce visually appealing
Comments:
Abstract:Recent image generation and editing models can produce visually appealing natural images, yet they remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts, symbolic structure, and precise spatial relations. We introduce DisciplineGen-1M, a million-scale multidisciplinary dataset that supports text-to-image generation and image editing. It contains 1.2M samples spanning mathematics, physics, chemistry, biology, geography, computer science, economics, history, music, and sports. To construct the dataset, we design a scalable framework that combines vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering. These pipelines produce captions, editing instructions, structured annotations, and paired images with controllable semantic differences. Building on DisciplineGen-1M, we further introduce a discipline-informed reasoning-generation model for both text-to-image generation and image editing. Experiments on discipline-related benchmarks, GenExam and GRADE, show substantial improvements over open-source baselines, while evaluations on general reasoning-informed benchmarks, WISE and RISE, further indicate broader transfer. The results suggest that large-scale structured academic visual data is a key ingredient for moving image generation from aesthetic plausibility toward verifiable knowledge-grounded visual creation. We will publicly release our dataset, model, and source code of the data curation pipeline to ensure reproducibility and benefit future research.
36. 【2607.02284】FlowCIR: Semantic Transport via Flow Matching for Zero-Shot Composed Image Retrieval
Link: https://arxiv.org/abs/2607.02284
Authors: Zhenqi He, Ziqi Jiang, Yuanpei Liu, Yanghao Wang, Teng Wang, Long Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Zero-shot composed image, domain-specific annotated triplets, Zero-shot composed, composed image retrieval, aims to retrieve
Comments: Accepted to ECCV 2026
Abstract:Zero-shot composed image retrieval (ZS-CIR) aims to retrieve a target image by editing a reference image with a natural-language instruction, without relying on domain-specific annotated triplets. Most existing ZS-CIR methods rely on textual inversion to translate the reference image into pseudo-text tokens and then compose them with the instruction via simple concatenation in the text space, which can be lossy and brittle for fine-grained semantics. In this work, we propose a new paradigm, namely FlowCIR, that casts ZS-CIR as conditional semantic transport between reference and target embeddings. Leveraging conditional flow matching, our model learns a lightweight transport field that maps the instruction representation toward a target-aligned query embedding conditioned on the reference image. Since FlowCIR operates on pre-extracted VLM embeddings and trains only a small transport module without updating the image or text encoder, it offers a computationally efficient training protocol compared with prior textual-inversion-based approaches. The resulting framework is training-efficient, requiring roughly 10× fewer training resources than prior textual-inversion-based approaches. We further identify negation and removal as a major failure mode of VLM-based composition. To address this, we propose an inference-only Multi-Negative Steering strategy that steers a negation-containing relative instruction away from its negated semantics, mitigating the limited negation handling of VLMs and improving robustness on negation-heavy queries. Extensive experiments on standard CIR benchmarks demonstrate that FlowCIR achieves strong and competitive performance compared with recent ZS-CIR methods.
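The conditional flow matching objective named in this abstract can be sketched in its most common straight-line form: sample a point on the segment between a source embedding and a target embedding, and regress a velocity field onto the constant displacement between them. The code below is a generic illustration under that assumption. The roles (`source` as the instruction embedding, `target` as the target-image embedding, `condition` as the reference-image embedding) follow the abstract, while the function names, the straight-line path, and the Euler integrator are choices made here, not details confirmed by the paper.

```python
def interpolate(source, target, t):
    """Point at time t on the straight path from source (t=0) to target (t=1)."""
    return [(1.0 - t) * s + t * g for s, g in zip(source, target)]


def target_velocity(source, target):
    """Velocity of the straight path: constant and equal to target - source."""
    return [g - s for s, g in zip(source, target)]


def flow_matching_loss(velocity_field, source, target, condition, t):
    """Mean squared error between predicted and straight-path velocity at time t."""
    point = interpolate(source, target, t)
    predicted = velocity_field(point, t, condition)
    wanted = target_velocity(source, target)
    return sum((p - w) ** 2 for p, w in zip(predicted, wanted)) / len(wanted)


def euler_transport(velocity_field, source, condition, steps=10):
    """Integrate the learned field from t=0 to t=1 with fixed Euler steps."""
    point = list(source)
    dt = 1.0 / steps
    for k in range(steps):
        velocity = velocity_field(point, k * dt, condition)
        point = [x + dt * v for x, v in zip(point, velocity)]
    return point


# Toy check with a hand-written "oracle" field that already knows the answer.
instruction_embedding = [0.0, 0.0]
target_embedding = [1.0, -2.0]
reference_embedding = [0.5, 0.5]  # unused by the oracle, shown for the interface


def oracle_field(point, t, condition):
    return [1.0, -2.0]


loss = flow_matching_loss(
    oracle_field, instruction_embedding, target_embedding, reference_embedding, 0.3
)
query = euler_transport(oracle_field, instruction_embedding, reference_embedding)
```

In a real system the oracle would be replaced by a small trainable network, and the transported `query` would be compared against gallery image embeddings for retrieval.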
37. 【2607.02271】AGVBench: A Reliability-Oriented Benchmark of Data Augmentation for Vein Recognition
Link: https://arxiv.org/abs/2607.02271
Authors: Haiyang Li, Yuming Fu, Qun Song, Hongchao Liao, Jing Chen, Mounim A. El-Yacoubi, Xin Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: limited annotated data, imaging variations, technology often constrained, constrained by limited, limited annotated
Comments: Preprint V1. Codebase: [this https URL](https://github.com/Advance-VeinTech-Innovators/AGVBench)
Abstract:Vein recognition is a secure biometric technology often constrained by limited annotated data and imaging variations. While data augmentation mitigates this, strategies designed for natural images may disrupt the fine-grained topology and textures essential for identity discrimination. We present AGVBench, which evaluates 30 representative augmentation strategies on five public palm- and finger-vein datasets with seven backbone architectures, covering classic CNNs, vision transformers, and vein-specific recognition models. Our results show that multi-image mixing methods (e.g., MixUp, PuzzleMix, StarMixup) generally provide the strongest recognition performance. However, they are often poorly calibrated and vulnerable to adversarial perturbations, revealing a clear inconsistency between clean accuracy and adversarial security. We also find that severe geometric transformations frequently degrade recognition, which is potentially due to feature misalignment or spatial cropping, and that augmentation effectiveness varies across palm and finger vein datasets. These findings prove that accuracy-centric evaluation is insufficient for biometric augmentation. AGVBench provides standardized protocols to support reproducible research and guide the design of reliable, secure, and robust vein recognition systems. Our codebase is available at this https URL.
38. 【2607.02269】AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models
Link: https://arxiv.org/abs/2607.02269
Authors: Rintaro Otsubo, Ryo Fujii, Reina Ishikawa, Taiki Kanaya, Kanta Sawafuji, Hiroki Kajita, Shigeki Sakai, Hideo Saito, Ryo Hachiuma
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: demonstrated immense promise, Spatio-Temporal Video Grounding, Video Grounding, demonstrated immense, immense promise
Comments:
Abstract:Vision-Language Models (VLMs) have demonstrated immense promise in Spatio-Temporal Video Grounding (STVG). However, current evaluation protocols are largely confined to zero-shot assessments on general, daily-life benchmarks. This creates a critical disconnect from real-world applications in specialized fields, where models inevitably encounter rare visual concepts and complex spatio-temporal dynamics. Since exhaustive pre-training across infinite data distributions is infeasible, the ability to adapt to novel domains is essential. To bridge this gap, we introduce AnyGroundBench, a domain-adaptation benchmark designed to shift the STVG evaluation paradigm from static zero-shot testing to rigorous domain adaptation. Targeting five specialized domains (animal, industry, sports, surgery, and public security), AnyGroundBench pairs newly captured videos such as expert-annotated mouse behaviors with established datasets, unifying them through dense, high-fidelity spatio-temporal annotations. Crucially, the benchmark provides dedicated training subsets to systematically measure domain adaptability. We extensively evaluate 15 state-of-the-art VLMs, assessing their zero-shot generalization and In-Context Learning (ICL) capabilities under practical computational constraints. Ultimately, our findings reveal that current models fail in both zero-shot and ICL-based adaptation when confronted with specialized domains, exposing critical flaws in spatio-temporal reasoning that future research must address.
39. 【2607.02252】ArcAD: Anomaly-Rectified Calibration for Cold-Start Supervised Anomaly Detection
Link: https://arxiv.org/abs/2607.02252
Authors: Ningning Han, Lei Fan, Jia Guo, Yunkang Cao, Xiu Su, Feng Cao, Donglin Di, Tonghua Su
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Industrial Anomaly Detection, real-world manufacturing frequently, manufacturing frequently encounters, challenging cold-start bottleneck, deployment of Industrial
Comments: Accepted to European Conference on Computer Vision (ECCV) 2026
Abstract:The deployment of Industrial Anomaly Detection (IAD) in real-world manufacturing frequently encounters a challenging cold-start bottleneck, in which limited normal samples fail to represent the full normal distribution and only a few anomalies are available. Under such a regime, existing methods struggle to form compact normal boundaries and fail to effectively exploit supervised signals from rare defects. To address this challenge, we propose Anomaly-Rectified Cold-start AD (ArcAD), a plug-and-play calibration framework for reconstruction-based IAD baselines. ArcAD follows a push-pull learning paradigm to construct a compact and discriminative normal boundary under data scarcity. On the one hand, ArcAD projects limited normal samples onto a hypersphere and pulls them into multiple compact clusters to maximize coverage of the normal manifold. On the other hand, it synthesizes pseudo-anomalies on the hypersphere and leverages real anomalies to push the boundary inward and sharpen anomaly discrimination. Extensive experiments on MVTec-AD, VisA, Real-IAD, and MANTA demonstrate that ArcAD significantly outperforms state-of-the-art supervised and unsupervised methods in both single-class and multi-class settings under cold-start conditions. Code is available at: this https URL.
40. 【2607.02237】When Token Compression Breaks: Structural Pruning vs. Token Reduction for Robust ViT Segmentation under High Compression
Link: https://arxiv.org/abs/2607.02237
Authors: Tien-Phat Nguyen, Ngai-Man Cheung
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision Transformers, cost limits deployment, computational cost limits, limits deployment, compression
Comments: Accepted to ECCV 2026
Abstract:Vision Transformers (ViTs) are strong backbones for semantic segmentation, but their computational cost limits deployment. Recent token compression methods for efficient transformer-based segmentation reduce this cost by decreasing the number of tokens. However, existing evaluations primarily focus on low-to-moderate compression, leaving their behavior under aggressive compression and corrupted inputs unclear. Meanwhile, structural pruning provides an orthogonal route to efficiency by removing redundant components in the ViT architecture, but is rarely compared to token compression under a unified protocol. To bridge this gap, we benchmark representative token compression and structural pruning methods for ViT-based semantic segmentation under matched FLOPs on ADE20K and Cityscapes, together with their common-corruption variants ADE20K-C and Cityscapes-C. Our results reveal a consistent trend on both clean and corrupted inputs: token compression is highly effective at mild reductions but degrades sharply when compression becomes severe, consistent with substantial information loss from overly aggressive token reduction. In contrast, structural pruning exhibits a smoother degradation curve and is more stable at high compression. Motivated by these findings, we study a prune-then-merge pipeline that applies moderate token compression on top of a moderately pruned backbone. At comparable FLOPs, this combined strategy consistently achieves a better accuracy-robustness trade-off at high compression, offering a practical recipe for deployment-oriented ViT segmentation. Code is available at this https URL.
41. 【2607.02230】Efficient Waste Sorting for Circular Economy: A Confidence-guided comparison between One-Vs-All and One-Vs-Rest Classification Strategies with Human-in-the-Loop for Automated Waste Sorting
Link: https://arxiv.org/abs/2607.02230
Authors: Mohammed Fahad Ali, Dominique Briechle, Marit Briechle-Mathiszig, Tobias Geger, Andreas Rausch
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: European countries poses, countries poses significant, poses significant challenges, Circular Economy, regulations across European
Comments:
Abstract:The complexity of waste disposal regulations across European countries poses significant challenges for the residents and hinders the transition to a Circular Economy. In Germany, the proper sorting and disposal of household waste remains challenging across municipalities. Consequently, substantially reducing incorrectly disposed waste is vital for improving waste management and advancing the Circular Economy. AI-based waste sorting solutions can support residents through user-friendly tools, such as mobile applications, that guide proper waste disposal. To be effective in supporting the Circular Economy, however, these solutions must be configurable to reflect the specific waste sorting scheme of individual municipalities in Germany. In the scope of this work, an evaluation and analysis are performed of two prominent classification strategies: OvA and OvR. The research uses a dataset constructed in alignment with the waste categories and sorting scheme of the city of Goslar in Germany. Moreover, this work aims to extend beyond the overall performance by examining the behavior of OvA and OvR classification strategies in identifying samples likely to be misclassified. These classification strategies are compared by applying varying confidence thresholds to identify uncertain samples for subsequent human review. This evaluation aims to balance the number of misclassifications against the human effort required for data annotation.
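The confidence-threshold comparison described in this abstract can be illustrated with a small sketch. This is not the authors' code: the function name `review_tradeoff`, the tuple layout, and the toy predictions are hypothetical, and the sketch only shows how raising a threshold trades auto-accepted misclassifications against human review effort.

```python
def review_tradeoff(predictions, threshold):
    """Split predictions into auto-accepted and human-reviewed sets.

    predictions: list of (confidence, predicted_label, true_label) tuples.
    Returns (samples sent to human review, misclassifications auto-accepted).
    """
    to_review = 0
    missed_errors = 0
    for confidence, predicted, actual in predictions:
        if confidence < threshold:
            # Low confidence: route the sample to a human annotator.
            to_review += 1
        elif predicted != actual:
            # High confidence but wrong: this error slips through.
            missed_errors += 1
    return to_review, missed_errors


# Hypothetical toy predictions, for illustration only.
toy = [
    (0.95, "paper", "paper"),
    (0.80, "glass", "glass"),
    (0.65, "bio", "paper"),
    (0.55, "plastic", "plastic"),
    (0.90, "glass", "bio"),
]

for t in (0.5, 0.7, 0.99):
    print(t, review_tradeoff(toy, t))
```

Sweeping the threshold for each classification strategy and plotting the two returned counts against each other would give the kind of trade-off curve the abstract refers to.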
42. 【2607.02220】DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation
Link: https://arxiv.org/abs/2607.02220
Authors: Zijun Li, Yimin Zhou, Jia Sun, Honglie Wang, Pengcheng Wei, Junlong Wu, Yongrui Heng, Jiyuan Wang, Huan Ouyang, Boheng Zhang, Huaiqing Wang, Dewen Fan, Qianqian Gan, Fan Yang, Tingting Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable success, product background synthesis, Diffusion-based generative, virtual try-on, background synthesis
Comments:
Abstract:Diffusion-based generative AI has achieved remarkable success in e-commerce applications such as virtual try-on, poster generation, and product background synthesis. However, when making online purchasing decisions for apparel, consumers also desire the freedom to examine specific detail regions of interest, such as collars, cuffs, and fabric textures, yet existing methods have not explicitly studied this setting. We therefore formalize a new, non-template task: Fashion Detail Generation with focus conditioning, and release FDBench, the first benchmark comprising 40K+ human-verified reference-detail pairs across 41 different categories. This task poses a unique semantic gap challenge: the model must bridge the correspondence between a focus marker on a product reference image and a photorealistic close-up view of the indicated region, while faithfully preserving the garment's identity, without any precise prompt. To bridge this gap, we propose Cross-modal Feature Alignment Distillation (CFAD), which leverages a fine-tuned DINOv3 teacher to align both branches of a Multimodal Diffusion Transformer in a shared semantic space via dual-branch distillation. To further improve consistency between generated details and reference images, we introduce a consistency reward model that jointly scores image pairs along three quality axes and optimizes generation via reinforcement learning. Experiments show that our model DetailAnywhere significantly outperforms all state-of-the-art open-source methods across all metrics and human evaluations.
43. 【2607.02209】MedSaab-US: A Backpropagation-Free Multi-Scale Wavelet-Saab Framework for Thyroid Nodule Segmentation in Ultrasound Images
Link: https://arxiv.org/abs/2607.02209
Authors: Mohammad Amanour Rahman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: limited mathematical tractability, achieving high Dice, high Dice scores, Deep learning, methods dominate thyroid
Comments: Accepted at the IEEE ICIP 2026 LBDL 2 Workshop
Abstract:Deep learning (DL) methods dominate thyroid nodule segmentation in ultrasound (US) images, achieving high Dice scores but at the cost of millions of parameters, GPU-dependent training via backpropagation, and limited mathematical tractability. These limitations impede deployment in resource-constrained environments. In this paper, we propose MedSaab-US, a backpropagation-free segmentation framework grounded in the Green Learning paradigm. MedSaab-US extracts multi-scale spatial-frequency features by combining multi-level Discrete Wavelet Transform (DWT) with multi-scale channel-wise Saab (Subspace Approximation with Adjusted Bias) transforms at patch sizes of 5 x 5, 11 x 11, and 21 x 21 pixels. Label-Assisted Greedy (LAG) feature selection retains the most discriminative features, which are fed to an XGBoost classifier for pixel-wise prediction. The Saab transform parameters are determined analytically from data statistics, while XGBoost employs iterative greedy tree construction without requiring backpropagation. Evaluated on the TN3K dataset (2,879 training and 614 test images), MedSaab-US achieves a mean Dice coefficient of 0.4784 +/- 0.2190, precision of 0.5768, and recall of 0.5604, with a model footprint under 500K parameters and CPU-only inference in approximately 0.3 seconds per image. We present this result as an exploratory non-DL baseline for thyroid ultrasound segmentation and analyze the specific challenges posed by isoechoic nodules. An ablation study further quantifies the contribution of each pipeline component, including separate evaluations of LAG feature selection and training-set size.
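For readers unfamiliar with the metrics reported above, the following minimal sketch computes the Dice coefficient, precision, and recall for a pair of binary masks. It uses the standard definitions and is not taken from the paper; the function name `segmentation_scores` and the toy masks are hypothetical, and the handling of empty masks is an assumption.

```python
def segmentation_scores(pred, target):
    """Return (dice, precision, recall) for two binary masks (lists of rows)."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # foreground pixel correctly predicted
            elif p and not t:
                fp += 1      # predicted foreground, actually background
            elif t and not p:
                fn += 1      # missed foreground pixel
    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 1.0          # assumption: two empty masks agree
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return dice, precision, recall


# Hypothetical 2x3 masks, for illustration only.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]

print(segmentation_scores(pred_mask, true_mask))
```

A per-image Dice computed this way and then averaged over a test set yields a mean and standard deviation of the kind quoted in the abstract.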
44. 【2607.02185】RadiomicNet: A Hybrid Radiomics-Guided Lightweight Architecture for Interpretable Medical Image Segmentation
Link: https://arxiv.org/abs/2607.02185
Authors: Mohammad Amanour Rahman
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: substantial parameter requirements, achieved remarkable performance, Deep learning, mathematical intractability, Local Binary Pattern
Comments: Accepted at the IEEE ICIP 2026 LBDL 2 Workshop
Abstract:Deep learning has achieved remarkable performance in medical image segmentation, yet it suffers from critical limitations: mathematical intractability, substantial parameter requirements, and lack of clinical interpretability. We propose RadiomicNet, a novel two-stream hybrid architecture that enhances standard deep learning by integrating handcrafted radiomics features directly into the segmentation learning process. The key contribution is the Radiomics Attention Gate (RAG), which leverages Gray-Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP) features to modulate skip-connection attention in a lightweight MobileNetV2-based encoder-decoder, providing ante-hoc interpretability without post-hoc approximations. A novel Radiomics Consistency Loss further enforces alignment between texture complexity and prediction uncertainty, reducing Expected Calibration Error (ECE) from 0.142 to 0.118. RadiomicNet achieves a Dice Similarity Coefficient (DSC) of 0.763 +/- 0.231 on the Breast Ultrasound Images (BUSI) dataset and 0.854 +/- 0.112 on Kvasir-SEG, outperforming U-KAN by 1.2% and 1.8%, respectively (p 0.05, Wilcoxon signed-rank test), with only 3.27M parameters, 9.5x fewer than standard U-Net and 4.3x fewer than U-KAN. Gradient-based feature importance analysis reveals that GLCM dissimilarity (15.24%), GLCM energy (14.56%), and LBP entropy (11.49%) are the dominant radiomics cues, providing clinically meaningful explanations for segmentation decisions. The proposed approach demonstrates that compact, interpretable models grounded in domain knowledge can deliver state-of-the-art segmentation performance with substantially reduced computational overhead.
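The Expected Calibration Error (ECE) mentioned above is commonly computed by binning predictions by confidence and averaging the gap between accuracy and confidence in each bin. The sketch below follows that common recipe with equal-width bins; the bin count, the half-open bin boundaries, and the function name are assumptions, and the paper may use a different variant.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bins are [lo, hi); a confidence of exactly 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += len(bucket) / total * abs(accuracy - avg_conf)
    return ece


# Hypothetical toy predictions, for illustration only.
toy_conf = [0.95, 0.95, 0.55, 0.55]
toy_correct = [True, True, True, False]
print(expected_calibration_error(toy_conf, toy_correct))
```

A lower value means predicted confidences track observed accuracy more closely, which is the sense in which the abstract reports a reduction from 0.142 to 0.118.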
45. 【2607.02158】Efficient PEFT Methods with Adaptive Checkpointing for Vision Models and VLMs on Resource Constrained Consumer-GPUs
Link: https://arxiv.org/abs/2607.02158
Authors: Altay Toktassyn, Jurn-Gyu Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demand substantial GPU, Modern pretrained vision, making edge deployment, edge deployment impractical, substantial GPU memory
Comments:
Abstract:Modern pretrained vision models achieve strong accuracy but demand substantial GPU memory for fine-tuning, making edge deployment impractical. This paper compares five parameter-efficient fine-tuning (PEFT) methods (Full FT, LoRA, AdaLoRA, QLoRA, BitFit) on Transformer-based (ViT-Small, TinyViT) and Mamba-based vision backbones (Vim-Small, MambaVision-T) under an on-device VRAM budget (e.g., 2 GB), together with three gradient-checkpointing strategies (none, static, and a proposed memory-budget-aware adaptive algorithm); and we evaluate three families of foundation-model baselines: zero-shot contrastive vision language models (OpenCLIP, SigLIP), self-supervised vision backbones with lightweight evaluation protocols (DINOv2), and autoregressive VLMs for prompt-based classification (PaliGemma, MobileVLM, SmolVLM). Experiments on CIFAR-100 and DTD report accuracy, training time, energy, and the NetScore family of multi-objective metrics, which we extend with two deployment-aware variants. QLoRA and BitFit cut energy 20-30% at a 1-2% accuracy cost; the adaptive algorithm reduces peak memory 43-79% with 9-30% energy overhead. DINOv2 surpasses fine-tuned models on CIFAR-100 (0.917 vs. 0.897) at a fraction of the energy, while small autoregressive VLMs remain uncompetitive.
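As general background on why LoRA-style methods shrink the trainable state, the sketch below compares the number of trainable weights for a full update of one linear layer against a rank-r LoRA update. It relies only on the standard LoRA formulation (a low-rank product B @ A added to a frozen weight); the layer width and rank are illustrative assumptions, not values from the paper, and the function names are hypothetical.

```python
def full_finetune_params(d_out, d_in):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable weights for a LoRA update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


d = 384  # illustrative layer width (assumption, not taken from the paper)
r = 8    # illustrative LoRA rank (assumption)

full = full_finetune_params(d, d)
lora = lora_params(d, d, r)
print(full, lora, round(100 * lora / full, 2))
```

Fewer trainable weights means fewer gradients and optimizer states to hold in memory, though activation memory is unaffected, which is the part gradient checkpointing targets.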
46. 【2607.02156】Patient-Specific Articulated Digital Twins from a Single Full-Body CT Scan
Link: https://arxiv.org/abs/2607.02156
Authors: Han Zhang, Boyang Zhao, Mathias Unberath
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: provide individualized context, models provide individualized, image-guided intervention, surgical planning, algorithm development
Comments:
Abstract:Patient-specific anatomical models provide individualized context for surgical planning, image-guided intervention, and algorithm development. However, most CT-derived models are static: they preserve the body configuration captured at scan time, but cannot represent how the same anatomy would appear after patient repositioning. This limitation is especially important for radiographic imaging, where appearance depends jointly on imaging geometry and patient pose. We present a proof-of-concept for constructing a patient-specific articulated digital twin from a single full-body CT scan. The method fits a parametric human body model (SMPL) to obtain a patient-aligned kinematic scaffold, binds segmented bones and organs to an anatomy-aware rig, and retargets body-pose changes while preserving skeletal geometry. On three full-body CT subjects, the fitted scaffold achieved 15.8 $\pm$ 4.0 mm chamfer distance and 95.9 $\pm$ 1.8% skeletal enclosure. Recomposition at the acquisition pose preserved major radiographic structure, with overall SSIM of 0.872 $\pm$ 0.016 and PSNR of 18.5 $\pm$ 1.4 dB across paired DRRs. Across unseen target poses, the resulting twins enabled articulation while maintaining high skeletal enclosure (94.4 $\pm$ 0.4%). As a feasibility demonstration, we render the articulated twin as pose-dependent DRRs. These results suggest the feasibility of extending static, view-controllable CT simulation toward pose-controllable anatomical twins for future synthetic imaging and positioning studies.
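The chamfer distance reported above measures how closely two point sets agree. Definitions vary between papers (sum versus mean, squared versus unsquared distances), so the sketch below picks one common variant, the average of the two directional mean nearest-neighbour distances, as an assumption; it is not necessarily the variant the authors use, and the toy point sets are hypothetical.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two 3D point sets (brute force)."""

    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))


# Hypothetical toy point sets, in arbitrary units.
set_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
set_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(chamfer_distance(set_a, set_b))
```

The brute-force search is quadratic in the number of points; real surface meshes would normally use a spatial index such as a k-d tree instead.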
47. 【2607.02148】SAMoR: Motion Modelling for Articulated Objects of Any Skeleton and Topology
Link: https://arxiv.org/abs/2607.02148
Authors: Yuhao Zhang, Gerard Pons-Moll, Tolga Birdal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: topology remains difficult, discard motion detail, Modeling motion, skeleton topology remains, articulated objects
Comments: 20 pages, 5 figures
Abstract:Modeling motion for articulated objects of arbitrary skeleton topology remains difficult: existing motion generators target a fixed human skeleton, and prior adaptations either fail to share a vocabulary across rigs or discard motion detail through global pooling. Our key observation is that while joint-level motion does not correspond cleanly across species, motion of functional joint groups does: a human arm, a wolf foreleg, and a bird wing share motion structure despite differing joint counts and connectivity, a correspondence that joint names (e.g., "forearm", "wing_L1") partially expose even when topology does not. We introduce SAMoR (Skeleton-Aware Motion Representation for Articulated Objects), a cross-topology motion representation that encodes each motion segment as a small fixed number ($K=8$) of part tokens shared across arbitrary skeletons. A graph-transformer encoder consumes per-joint motion features, kinematic graph structure, and joint-name embeddings, then compresses them into part-level tokens via cross-attention pooling and residual vector quantization, yielding a discrete motion codebook shared across rigs. To keep the part queries from collapsing into redundant global representations, we introduce a topology-agnostic attention supervision loss, with joint-name dropout to reduce over-reliance on text labels. We curate a heterogeneous corpus from HumanML3D, Truebones Zoo, and animated Objaverse-XL assets, and evaluate SAMoR on held-out characters with unseen skeletons. It supports accurate reconstruction and cross-topology transfer, and enables text-conditioned generation and part-wise editing via a MaskGIT token generator. SAMoR reaches $2.75 \times 10^{-2}$ normalized MPJPE on cross-topology reconstruction, $5.8\times$ below the strongest adapted variable-$J$ tokenizer baseline, while remaining competitive with fixed-skeleton specialists on HumanML3D.
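Residual vector quantization, which the abstract uses to turn part tokens into discrete codes, can be sketched in a few lines: each stage quantizes whatever the previous stages left unexplained. The code below is a generic illustration with tiny hand-written codebooks, not the SAMoR implementation; all names and values are hypothetical, and a real system would learn the codebooks.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected entries of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two hypothetical 2D codebooks: a coarse stage and a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]],
]

codes, leftover = rvq_encode([1.4, 0.6], codebooks)
print(codes, rvq_decode(codes, codebooks))
```

Because every skeleton is mapped to indices into the same codebooks, the resulting tokens form a vocabulary that a downstream generator can share across rigs.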
48. 【2607.02142】Predicting Early Stages Of Alzheimer's Disease And Identifying Key Biomarkers Using Deep Artificial Neural Network And Ensemble Of Machine Learning Methodologies
Link: https://arxiv.org/abs/2607.02142
Authors: Debopriya Ghosh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
Keywords: Alzheimer's disease, affects memory, daily activities, brain disorder, disorder that develops
Comments: Master's
Abstract:Alzheimer's disease (AD) is a brain disorder that develops slowly and mainly affects memory, thinking, language, and daily activities. It is one of the most common causes of dementia and creates many difficulties for patients as well as their families. In the early stage, the symptoms are often mild and may look like normal ageing. For this reason, many people are diagnosed late, when the disease has already progressed. At present, there is no complete cure for AD. Still, early detection can help doctors manage the condition better and take suitable steps at the right time. In this study, a machine learning model is proposed to detect the early stages of Alzheimer's disease using clinical details, neuropsychological test scores, and neuroimaging-related measures. The data used in this work is collected from the Alzheimer's Disease Neuroimaging Initiative (ADNI). As the dataset has missing values, iterative imputation is applied to fill them. The dataset also has class imbalance, which is handled using Borderline SVM-SMOTE. After that, feature selection is carried out using wrapper-based and embedded methods so that only important features are used for training. The selected features are divided into training and testing sets, and feature scaling is applied. A stacking ensemble model is developed using Logistic Regression, Extra Trees, Bagging KNN, and LightGBM as base classifiers. Along with this, an artificial neural network is also trained on the same dataset. The performance of these models is compared using precision, recall, F1-score, and AUC-ROC. This study aims to find the best classifier and also identify important biomarkers that may help in the early diagnosis of Alzheimer's disease.
49. 【2607.02139】AdaCount: Training-Free Similarity-Guided Spatial and Feature Adaptation for Zero-Shot Object Counting
Link: https://arxiv.org/abs/2607.02139
Authors: Muhammad Ibraheem Siddiqui, Muhammad Haris Khan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: arbitrary object categories, aims to count, textual prompts, Zero-shot object counting, arbitrary object
Comments: technical report
Abstract:Zero-shot object counting (ZOC) aims to count instances of arbitrary object categories specified only through textual prompts. Recent training-free approaches leverage foundation models such as SAM to reformulate counting as a prompt-driven segmentation task, eliminating the need for costly counting-specific training data with point-level annotations. More recently, SAM3 introduced promptable concept segmentation, enabling the zero-shot segmentation of all instances corresponding to a text-defined concept. However, SAM3 struggles in densely populated scenes containing numerous small objects, where limited image resolution and insufficient attention to target-relevant regions often lead to missed instances and poor instance separation, hindering accurate object counting. To address this limitation, we propose AdaCount, a training-free framework for ZOC based on similarity-guided spatial and feature adaptation. AdaCount first estimates a prototype-driven similarity map that identifies target-relevant regions. This similarity map subsequently guides two complementary adaptations: (i) similarity-guided spatial warping, which reallocates image resolution toward target instances, and (ii) feature modulation, which amplifies target-relevant encoder representations. Together, these adaptations enable SAM3 to devote greater representational capacity to target-relevant regions while preserving global image context, without requiring any model retraining. Extensive experiments across six diverse counting benchmarks establish AdaCount as a new SOTA among training-free ZOC approaches.
50. 【2607.02131】AbsoluteDegradation: A Physics-Inspired Synthetic Film-Degradation Pipeline and Archival Film Restoration Benchmark
Link: https://arxiv.org/abs/2607.02131
Authors: Mikołaj Jastrzębski, Dawid Glinkowski, Dawid Zieliński, Daniel Borkowski, Wojciech Kozłowski, Kamil Adamczewski
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: fundamentally challenging problem, challenging problem due, Restoring archival film, Restoring archival, paired training data
Comments:
Abstract:Restoring archival film remains a fundamentally challenging problem due to the absence of paired training data and the lack of standardized evaluation benchmarks. Pristine versions of deteriorated footage are physically unrecoverable, requiring supervised methods to rely on synthetic data that often fail to capture the complex, temporally coherent nature of real film degradation. At the same time, existing real-world datasets are limited in scale, quality, and accessibility, hindering reliable evaluation and fair comparison across methods. We address both limitations with AbsoluteDegradation, a physics-inspired, modular pipeline for synthesizing realistic film degradations, and a new large-scale archival benchmark. The proposed pipeline models the analog-to-digital process as a structured composition of artifact families, incorporating signal-dependent grain, parametric scratches, and temporally coherent camera motion, enabling controlled generation of diverse degradation regimes. In parallel, we introduce a curated dataset of 81,576 high-resolution frames sourced from real archival footage, designed for consistent evaluation under real-world conditions. Together, these contributions provide a unified framework for training and benchmarking restoration models. Extensive experiments across multiple architectures show that models trained with AbsoluteDegradation generalize better to real-world footage, while the proposed benchmark reveals systematic failure modes of current methods. We hope this work establishes a foundation for reproducible and domain-authentic evaluation in archival film restoration.
51. 【2607.02099】X-Splat: Gaussian Splatting for 3D CBCT Generation from Single Panoramic Radiograph
Link: https://arxiv.org/abs/2607.02099
Authors: Tomasz Szczepański, Szymon Płotka, Michal K. Grzeszczyk, Tomasz Trzciński, Arkadiusz Sitek
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Cone-Beam Computed Tomography, Computed Tomography, Cone-Beam Computed, curved X-ray paths, leaving depth-resolved anatomy
Comments: 19 pages, 6 figures, including appendix. Under review
Abstract:Generating a 3D dental volume from a single panoramic radiograph (PXR) could provide a low-radiation alternative to Cone-Beam Computed Tomography (CBCT), but the problem is highly underdetermined: panoramic acquisition integrates 3D attenuation along curved X-ray paths into a 2D image, leaving depth-resolved anatomy unobserved. Existing implicit and generative approaches often produce oversmoothed geometry or anatomically inconsistent hallucinations, lacking geometry-driven supervision and relying on smooth representations unable to precisely localize sharp anatomical boundaries. We propose X-Splat, the first Gaussian Splatting framework for generating CBCT-like 3D dental volumes from a single PXR. X-Splat uses the known panoramic acquisition geometry as a generation scaffold: learnable anisotropic Gaussian primitives are initialized along the X-ray paths that formed the input image and adjusted in a single feed-forward pass, constrained by Beer-Lambert reprojection and multi-view radiographic training supervision. A lightweight residual refiner adds dataset-level anatomical priors without overriding the geometry already resolved by the Gaussians. We train on synthetic PXR-CBCT pairs, enabling direct volumetric supervision without paired real scans. We further introduce segmentation-based geometry-aware metrics, providing the first evaluation of PXR-based generation over maxillofacial anatomy. X-Splat outperforms NeRF- and GAN-based baselines, recovering individual teeth, cortical boundaries, and alveolar structure, including the mandibular canal which prior methods fail to reconstruct. Code will be available at this https URL
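The Beer-Lambert reprojection constraint mentioned above rests on a simple relation: the intensity surviving along a ray decays exponentially with the attenuation accumulated on the way. The sketch below evaluates a discretized version of that relation for one ray. It illustrates the physical law only, not X-Splat's renderer; the sample values, step length, and function name are hypothetical.

```python
import math


def transmitted_intensity(attenuations, step_length, incident=1.0):
    """Discretized Beer-Lambert law along one ray.

    attenuations: attenuation coefficients sampled at equal steps along the ray.
    step_length:  distance between consecutive samples, in the inverse unit of
                  the attenuation coefficients.
    """
    # Approximate the line integral of attenuation by a Riemann sum.
    optical_depth = sum(attenuations) * step_length
    return incident * math.exp(-optical_depth)


# Hypothetical samples along a single ray (coefficients in 1/cm, 1 cm steps).
ray_samples = [0.2, 0.5, 0.3]
print(transmitted_intensity(ray_samples, step_length=1.0))
```

This also shows why the inverse problem is underdetermined: any rearrangement of the same attenuation values along the ray produces the same measured intensity.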
52. 【2607.02097】WBMM: Windowed Batch Matrix Multiplication for Efficient Large Receptive Field Convolution
Link: https://arxiv.org/abs/2607.02097
Authors: Wan Song, Wei Zhou, Rui Wang, Jun Yu, Toru Kurihara, Jiajia Xu, Shu Zhan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: large feature maps, small feature maps, feature maps, size grows due, Batch Matrix Multiplication
Comments: 23 pages, 4 figures. Accepted as a Spotlight paper at ICML 2026. Code available at [this http URL](http://github.com/wansong-s/WBMM)
Abstract:Large kernel depthwise convolutions achieve strong performance but suffer from significant degradation as kernel size grows due to irregular memory access from gather-based computation; while Large Kernel Acceleration (LKA) helps on small feature maps, it becomes counterproductive on large feature maps, even slower than non-accelerated implementations. We propose Windowed Batch Matrix Multiplication (WBMM), which partitions input into contiguous windows and indexes a compact relative position bias table to construct weight matrices, enabling regular memory access via batched matrix multiplication. This yields a unique property: WBMM's throughput improves with larger windows, opposite to depthwise convolutions that degrade with larger kernels. Operator-level benchmarks show WBMM with 14x14 windows outperforms 5x5 depthwise convolution baselines in speed while providing a 7.8x larger per-layer receptive field. Combined with inter-block cross-window communication and hierarchical window reparameterization, WBMM achieves comparable or higher accuracy on ImageNet-1K, COCO, and ADE20K with 1.31-1.88x training speedup, and demonstrates consistent advantages across GPU, CPU, and edge devices without requiring specialized acceleration kernels. Our code is available at this http URL
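The core idea in this abstract, building a dense per-window weight matrix by indexing a small relative-position table and then applying it with a matrix multiplication, can be shown in a deliberately simplified form. The sketch below is one-dimensional and single-channel, whereas the paper works on 2D feature maps with batched matrix multiplication; the function names and toy values are hypothetical and this is not the released WBMM code.

```python
def build_window_weights(table, window):
    """Expand a relative-position table into a window x window weight matrix.

    table holds one weight per relative offset in [-(window-1), window-1],
    so it needs 2*window - 1 entries.
    """
    assert len(table) == 2 * window - 1
    return [
        [table[i - j + window - 1] for j in range(window)]
        for i in range(window)
    ]


def windowed_matmul(signal, table, window):
    """Apply the same weight matrix to each contiguous, non-overlapping window."""
    assert len(signal) % window == 0
    weights = build_window_weights(table, window)
    out = []
    for start in range(0, len(signal), window):
        chunk = signal[start:start + window]   # contiguous memory access
        out.extend(sum(w * x for w, x in zip(row, chunk)) for row in weights)
    return out


# Hypothetical table for window size 2 (offsets -1, 0, +1) and a toy signal.
bias_table = [1, 2, 3]
print(build_window_weights(bias_table, 2))
print(windowed_matmul([1, 0, 0, 1], bias_table, 2))
```

Each window reads a contiguous slice and the weight matrix is shared, so the loop body maps naturally onto a batched matrix multiplication on real hardware. Positions in different windows do not interact here, which is the gap the inter-block cross-window communication described in the abstract is meant to close.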
53. 【2607.02096】LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension
Link: https://arxiv.org/abs/2607.02096
Authors: Shunya Kato, Taiki Miyanishi, Shuhei Kurita, Mahiro Ukai, Nakamasa Inoue, Chenhui Chu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: human activities related, videos capture rich, Video REC, Referring Expression Comprehension, understanding human activities
Comments: ECCV 2026. Dataset and code: [this https URL](https://github.com/shunya-kato/LongEgoRefer)
Abstract:Egocentric videos capture rich and diverse human-object interactions and have emerged as a fundamental resource for understanding human activities related to objects. In this context, Video Referring Expression Comprehension (Video REC), the task of localizing the temporal and spatial extent of a referred object in video frames given a natural language query, plays a key role in linking textual descriptions to observed objects in untrimmed egocentric recordings. However, existing egocentric Video REC benchmarks primarily focus on short video clips, where some target object appears densely within frames. Such settings do not reflect real-world egocentric recordings, which are long-form, untrimmed, and characterized by sparse object occurrences and complex activity transitions. To address this limitation, we introduce LongEgoRefer, a novel and challenging benchmark constructed from long-form videos in the Ego4D dataset. LongEgoRefer contains 1,498 referring expressions with an average video duration of 45 minutes. The benchmark exhibits extreme target sparsity, detailed linguistic descriptions, and complex human-object interactions embedded in long, dynamic egocentric narratives. Consequently, it defines a demanding spatio-temporal grounding problem that requires models to identify both when an event occurs and where the referred object appears within extended video sequences. We evaluate existing Video REC approaches, including training-free baselines based on vision-language models combined with Grounded SAM2. Extensive experiments show that even advanced baselines and current state-of-the-art models struggle significantly on LongEgoRefer. These results highlight the intrinsic difficulty of long-form egocentric spatio-temporal grounding and emphasize the need for more robust video understanding models.
54. 【2607.02091】Multimodal Fusion for Fine-Grained Classification of Breast Fibroadenoma and Phyllodes Tumors
Link: https://arxiv.org/abs/2607.02091
Authors: Chuxi Nan, Di Wu, Hongming Guo, Ning Cao, Xiaohui Zhu, Zhaoting Shi, Jiawei Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: complicating preoperative decision-making, highly overlapping appearances, fibroepithelial breast lesions, appearances on B-mode, B-mode ultrasound
Comments:
Abstract:Breast fibroadenoma (FA) and phyllodes tumor (PT) are fibroepithelial breast lesions with highly overlapping appearances on B-mode ultrasound, making benign and borderline PT prone to being misclassified as FA and complicating preoperative decision-making. Existing computer-aided diagnosis methods commonly rely on single-modal imaging features and insufficiently exploit complementary clinical and textual information. To address this limitation, we construct the FAPT-M Dataset, a pathology-confirmed multimodal dataset comprising 910 patients with strictly reviewed ultrasound images, structured clinical attributes, and ultrasound diagnostic descriptions. Based on this dataset, we propose a clinically guided multimodal framework that integrates DenseNet-based visual encoding, CLIP-inspired text encoding, and lightweight clinical encoding, and further introduces clinical-conditioned adaptive modulation, cross-modal Transformer fusion, and dual-path representation learning to improve feature alignment and multimodal interaction. Under patient-level five-fold cross-validation, the proposed method achieves an accuracy of 77.64%, F1-score of 73.38%, and AUC of 89.74%, outperforming representative CNN-, Transformer-, and vision-language-based baselines. Ablation studies and class-balanced evaluations further confirm the contribution of three-modality fusion and the key architectural components. Overall, this work provides an effective multimodal approach for fine-grained FA-PT classification and establishes a high-quality benchmark for multimodal breast ultrasound analysis.
55. 【2607.02090】TCG-AR: Real-Time Multi-View Augmented Reality for Trading Card Game Streaming
Link: https://arxiv.org/abs/2607.02090
Authors: Anthony Cioppa, Antoine Verdonck, Maxim Henry, Marc Van Droogenbroeck, Raphaël La Rocca
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: flat top-down footage, live streams remain, Trading card games, broadcast online, Trading card
Comments: 31 pages, 8 figures, 3 tables
Abstract:Trading card games are increasingly played and broadcast online, yet live streams remain mostly limited to flat top-down footage of the playing area. Augmenting such streams with virtual models of the played cards would improve the viewing experience, but most existing systems rely on instrumented playing surfaces and embedded chips, which are costly and impractical for casual players and large-scale events. In this work, we present TCG-AR, a novel real-time pipeline that augments trading card games using ordinary RGB cameras alone, without any physical markers or specialized hardware. Our pipeline detects, orients, and identifies the cards on the board, renders virtual content onto each card across all views, and can additionally compose a broadcast-style view that summarizes the game state for spectators, streaming the augmented feeds to standard broadcasting software such as OBS. To train the detection, orientation, and identification models without manual labeling, we introduce an automatic procedure that generates annotated synthetic training data from a reference set of card images. Then, we evaluate several trained models on a new manually annotated dataset with real images, analyzing performance and runtime throughput that determine real-world usability. Overall, by relying only on commodity cameras and hardware, and by open-sourcing all code, models, and datasets, this work aims to serve as a reference for real-time trading card recognition and to make real-time augmented-reality streaming accessible to the broader community of players and streamers.
56. 【2607.02089】ESC: Emotional Self-Correction for Reliable Vision-Language Models
Link: https://arxiv.org/abs/2607.02089
Authors: Tien-Huy Nguyen, Minh-Nhat Nguyen, Nguyen Nhat Huy, Hung Viet Nguyen, Huy Nguyen Minh Nhat, Thanh-Huy Nguyen, Cuong Tuan Nguyen, Hoang M. Le, Dat Nguyen, Phat Kim Huynh, Min Xu, Ulas Bagci
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: diverse multimodal tasks, achieved strong performance, Vision-language models, performance across diverse
Comments: ECCV Main Track 2026 (113 pages, 15 tables, 65 figures). Project Page: [this https URL](https://genai4e.github.io/ESC/?)
Abstract:Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, yet they remain vulnerable to unreliable reasoning. Existing self-correction methods mitigate these issues but typically rely on post-training or carefully engineered feedback, incurring high computational cost. In this work, we revisit this challenge through the lens of emotional cues, asking whether they can activate latent self-correction behaviors in VLMs without additional training. We find that emotional signals serve as an effective trigger for self-correction, encouraging more cautious and reflective reasoning. Motivated by this finding, we propose ESC (Emotional Self-Correction), a training-free self-correction framework. ESC introduces an external verifier that detects potentially incorrect initial responses and injects emotional feedback to encourage the model to reflect and produce a better revised response without additional training. Extensive experiments across safety, hallucination, vision-centric perception, and multimodal reasoning benchmarks show that ESC consistently improves reliability while preserving overall model utility. These results suggest that emotion can function not only as an ability to be recognized, but also as a practical control signal for scalable self-correction in VLMs. We therefore believe that ESC provides a strong foundation for a new reliable human-like, emotion-integrated research direction. Our project is publicly available at this https URL.
57. 【2607.02083】DeepGaze3.5-VL: Modeling Scanpaths via Autoregressive Token Prediction
Link: https://arxiv.org/abs/2607.02083
Authors: Susmit Agrawal, Matthias Bethge, Matthias Kümmerer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Understanding human visual, inferring cognitive states, Understanding human, human visual attention, cognitive states
Abstract:Understanding human visual attention on a scene over time has applications in domains such as interface design and inferring cognitive states. Modeling visual scanpaths has historically relied on specialized architectures with hand-crafted priors. While these architectures can model fixation sequences, their rigid structural biases restrict easy extendability and flexible conditioning. For instance, integrating task-specific instructions or adapting to distinct viewer identities requires custom, disjoint architectural additions. We frame scanpath prediction purely as a discrete sequence modeling task. By mapping coordinates into a text vocabulary, we leverage the pretrained representations of Vision-Language Models. This framing absorbs diverse factors of variation: simple prompting allows for global conditioning, such as providing viewer identities to capture personalized biases, or task-specific objectives like visual search. The framework can also integrate per-fixation attributes, such as individual fixation durations, alongside spatial locations. The autoregressive alignment enables the scalable, exact computation of per-fixation log-likelihoods, directly equivalent to the commonly used Information Gain (IG) metric. Our model, DeepGaze3.5-VL, establishes a new state-of-the-art across multiple datasets, achieving 2.18 bits of IG on MIT1003, a 46% improvement over DeepGaze III. This advantage persists even when baselines use identical high-capacity vision encoders. Beyond predictive performance, our generative framework serves as a powerful computational tool for direct behavioral interventions, allowing for controlled in-silico simulations that would be experimentally difficult or impossible to conduct in vivo. We demonstrate this ability by performing controlled interventions on the durations of pre-saccadic fixations, recovering known oculomotor phenomena purely from data.
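The link between per-fixation log-likelihoods and Information Gain mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used definition of IG as the mean log-likelihood advantage of a model over a baseline, expressed in bits. The function name and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math


def information_gain_bits(model_log_liks, baseline_log_liks):
    """Mean per-fixation log-likelihood gain over a baseline, in bits.

    Both inputs are natural-log likelihoods, one value per fixation.
    Assumption: IG is the average difference, converted from nats to bits.
    """
    if not model_log_liks or len(model_log_liks) != len(baseline_log_liks):
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [m - b for m, b in zip(model_log_liks, baseline_log_liks)]
    return sum(diffs) / len(diffs) / math.log(2)


# Toy example: the model assigns probabilities 0.5 and 0.25 to two observed
# fixations, while the baseline assigns 0.125 to each.
model = [math.log(0.5), math.log(0.25)]
baseline = [math.log(0.125), math.log(0.125)]
print(information_gain_bits(model, baseline))  # approximately 1.5 bits
```

Because an autoregressive model yields one log-probability per predicted fixation token, a quantity of this form can be accumulated directly while scoring a scanpath.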
58. 【2607.02075】HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control
Link: https://arxiv.org/abs/2607.02075
Authors: Yushuo Chen, Xiaoyu Shi, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Yebin Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: marker-based motion capture, egocentric video generation, unconstrained monocular video, present HandsOnWorld, hand-controlled egocentric video
Comments: 17 pages, 9 figures
Abstract:We present HandsOnWorld, a framework for hand-controlled egocentric video generation that forgoes multi-view and marker-based motion capture, learning instead from unconstrained monocular video. Such generality is bottlenecked by the scarcity of scalable 3D hand annotations: large egocentric corpora lack finger-level labels, whereas precise hand datasets are confined to narrow, instrumented settings, limiting prior hand-controlled generators to restricted scene distributions. We instead annotate 3D hands directly on in-the-wild egocentric video through monocular reconstruction, introducing a protagonist-centered annotation pipeline that filters the reconstructions at the action-semantic, image-quality, and 3D-geometric levels to build EgoVid-Pro, a dataset of clean, protagonist-only hand trajectories spanning 103K clips and roughly 12M frames across diverse everyday scenes. To resolve the camera-hand entanglement induced by large ego-motion, we further propose the Plücker Hand Map, a 3D-aware control signal that extends Plücker-ray representations from camera rays to the hand surface, disentangling camera and hand motion at the representation level. Experiments show that HandsOnWorld surpasses prior hand-controlled generators in reconstruction fidelity and control accuracy, and generalizes to out-of-distribution everyday scenes beyond the laboratory datasets on which prior methods rely.
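For readers unfamiliar with the Plücker-ray representation the abstract above builds on, the following is a minimal sketch of the standard camera-ray form only: a ray with origin o and unit direction d is encoded as the 6-vector (d, o × d). How the paper extends this to the hand surface is not described in the abstract, so nothing here should be read as the Plücker Hand Map itself. The function name is hypothetical.

```python
import math


def plucker_ray(origin, direction):
    """Return the 6D Plücker coordinates (d, m) of a ray.

    d is the unit direction and m = origin x d is the moment vector.
    The moment does not depend on which point along the ray is used as origin.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    dx, dy, dz = (c / norm for c in direction)
    ox, oy, oz = origin
    moment = (oy * dz - oz * dy, oz * dx - ox * dz, ox * dy - oy * dx)
    return (dx, dy, dz) + moment


# A ray through (1, 0, 0) pointing along +y has moment (0, 0, 1).
print(plucker_ray((1.0, 0.0, 0.0), (0.0, 2.0, 0.0)))
```

Sliding the origin along the ray leaves the encoding unchanged, which is one reason this parameterization is popular as a per-pixel camera conditioning signal.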
59. 【2607.02074】Comprehensive Robustness Analysis of LiDAR-based 3D Object Detection in Autonomous Driving
Link: https://arxiv.org/abs/2607.02074
Authors: Adwait Chandorkar, Kai Krink, Yerdana Maulenbay, Hasan Tercan, Tobias Meisen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated improved detection, object detection, demonstrated improved, adversarial robustness, adversarial
Comments: Accepted at ECCV 2026 main
Abstract:Recent advancements in LiDAR-only 3D object detection have demonstrated improved detection accuracy over benchmark datasets. However, the adversarial robustness of these models remains untested. Very few adversarial robustness studies exist for LiDAR-only 3D object detection, and even those are limited to legacy models. Moreover, there is a systemic gap in the existing evaluation frameworks, which rely solely on mAP and ignore other structural and predictive factors. To fill this gap, we propose a holistic framework that evaluates adversarial robustness using two structural factors (point cloud density and point cloud localization) and three predictive factors (misclassification, localization error, distance from ego). Using this framework, we perform an empirical study and critical analysis on recent and legacy state-of-the-art models using adversarial attacks specifically designed for LiDAR-based models. Our key finding is that high-capacity, voxel-based detectors are more susceptible to structured coordinate perturbations than pillar-based detectors. Additionally, non-anchor-based detectors demonstrate poor adversarial robustness, which necessitates rethinking model training techniques. Overall, our results demonstrate that recent models are as vulnerable to adversarial attacks as their predecessors. Therefore, we argue that there is a need to improve the evaluation benchmarks for 3D object detection so that they not only reward architectural modifications for improving detection accuracy, but also evaluate whether the design choices improve adversarial robustness.
60. 【2607.02055】Beyond the Performance Illusion: Structure-Aware Stratified Partitioning and Curriculum Distributionally Robust Optimization for Spatially Correlated Domains
Link: https://arxiv.org/abs/2607.02055
Authors: Prathamesh Patil, Arpit Jain, Aswanth Krishnan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: systems commonly assumes, random dataset splits, dataset splits produce, splits produce independent, identically distributed
Comments: 11 pages, 6 figures
Abstract:Performance evaluation in AI systems commonly assumes that random dataset splits produce independent and identically distributed (i.i.d.) subsets. We show that this assumption often breaks down in spatiotemporally correlated domains such as aerial surveillance, precision agriculture, and medical imaging, leading to two systematic failures: data leakage, where correlated samples span training and validation splits and inflate performance estimates, and hidden stratification, where errors on minority subpopulations are obscured by aggregate metrics. To address these issues, we propose a unified evaluation and training framework for spatially correlated data. We introduce Structure-Aware Stratified Partitioning (SASP), which constructs validation splits that reduce spatiotemporal leakage while preserving meaningful class balance, and Curriculum Distributionally Robust Optimization (CDRO), a curriculum-based relaxation of distributionally robust training that stabilizes optimization under these stricter splits. Across multiple benchmarks, this combination yields consistently improved generalization and more reliable confidence calibration, and exposes failure modes that remain hidden under conventional random-split evaluation.
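The data-leakage failure described in the abstract above can be made concrete with a small sketch of a generic group-wise split, in which every sample from the same correlated group (for example one flight, field, or patient) lands on the same side of the split. This is not SASP: it ignores class balance entirely, and the function name, the `group_ids` input, and the default fraction are all hypothetical.

```python
import random
from collections import defaultdict


def group_split(group_ids, val_fraction=0.2, seed=0):
    """Split sample indices so that no group spans train and validation.

    group_ids[i] is the (hypothetical) correlation group of sample i.
    Returns two sorted lists of sample indices: (train, val).
    """
    by_group = defaultdict(list)
    for index, group in enumerate(group_ids):
        by_group[group].append(index)

    groups = sorted(by_group)              # deterministic order before shuffling
    random.Random(seed).shuffle(groups)
    n_val = max(1, round(len(groups) * val_fraction))

    val = sorted(i for g in groups[:n_val] for i in by_group[g])
    train = sorted(i for g in groups[n_val:] for i in by_group[g])
    return train, val


ids = ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"]
train_idx, val_idx = group_split(ids)
print(train_idx, val_idx)
```

A plain random split over the ten samples would frequently place one frame of a group in training and its near-duplicate in validation, which is the inflation effect the abstract warns about.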
61. 【2607.02051】Embracing Intra-Class Heterogeneity for Semi-Supervised Medical Image Segmentation: From Diversity to Precision
Link: https://arxiv.org/abs/2607.02051
Authors: Yuqi Liu, Yufei Chen, Wei Fu, Xiaodong Yue, Shuo Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Semi-Supervised Medical Image, Medical Image, promising approach, medical images exhibit, Semi-Supervised Medical
Comments: Accepted by Medical Image Analysis
Abstract:Due to the scarcity of expert-annotated data, Semi-Supervised Medical Image Segmentation (SSMIS) has emerged as a promising approach. Many anatomical structures in medical images exhibit significant intra-class heterogeneity, with different regions showing heterogeneous intensity patterns within the same structure. However, existing methods inadequately exploit this intensity-manifested intra-class heterogeneity, resulting in uniform structural representations and imprecise segmentation. Furthermore, the scarcity of labeled data makes it more difficult to effectively capture such complex heterogeneity. To address this, we propose Multiple Prototype Contrastive Learning (MPCL), an SSMIS framework that possesses better diversity and better precision. It consists of three novel designs: First, we provide structural representations with better diversity and propose Intensity-aligned Heterogeneous Prototype Generation (IHPG) that effectively models intra-class heterogeneity by generating multiple prototypes aligned with intensity characteristics. Second, we further enhance more diverse structural representations and build a solid foundation for more precise segmentation through Prototypical Space Optimization (PSO) that systematically optimizes a more discriminative and generalizable prototypical space. Finally, we achieve segmentation results with better precision through Dual-branch Knowledge Alignment (DKA) that efficiently promotes intra-class heterogeneity knowledge transfer from prototypical space to the segmentation network. Extensive experiments on three medical image datasets with significant intra-class heterogeneity demonstrate that MPCL significantly outperforms existing methods, especially under extremely limited labeled data.
62. 【2607.02045】PWM-ArtGen: Part World Model for Articulated Object Generation
Link: https://arxiv.org/abs/2607.02045
Authors: Wentao Zheng, Ancong Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: underlying kinematic structure, key challenge, accurately predicting, predicting the underlying, single image
Abstract:The key challenge in articulated 3D object generation from a single image is accurately predicting the underlying kinematic structure. Existing methods either infer kinematic parameters directly from a static image that lacks dynamic part-level kinematic relationships, or estimate parameters from visual dynamics generated from a single image, which is prone to accumulated errors of two steps. Moreover, the limited scale and diversity of existing annotated datasets further hinder generalization to complex, real-world objects. To overcome these limitations, we propose to learn the joint distribution of visual dynamics and kinematic parameters. Recognizing that articulated objects can be formulated as dynamic systems, we propose a unified Part World Model called PWM-ArtGen. To leverage unannotated data, this model couples action diffusion and image diffusion with independent diffusion timesteps, which enables visual branch co-training. We further curate a photorealistic dataset of 19.7k part-level image pairs without kinematic annotations, to support co-training. Experiments demonstrate that PWM-ArtGen substantially outperforms existing baselines in the resting state and exhibits strong zero-shot generalization to out-of-distribution objects.
63. 【2607.02038】Hierarchical Anti-Aesthetics: Protecting Facial Privacy against Customized Diffusion Models
Link: https://arxiv.org/abs/2607.02038
Authors: Songping Wang, Yueming Lyu, Shiqi Liu, Chen Zhao, Ziyuan Chen, Ning Li, Jing Dong, Caifeng Shan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: visual content creation, personalized visual content, content creation, malicious misuse, fueled a boom
Abstract:The rise of customized diffusion models has fueled a boom in personalized visual content creation, but it also introduces serious risks of malicious misuse, thereby posing threats to personal privacy. Image aesthetics are strongly correlated with human perception of image quality. Motivated by this observation, we address facial privacy protection from a novel aesthetic perspective by degrading the generation quality of maliciously customized models, thus reducing facial identity leakage. Specifically, we propose a Hierarchical Anti-Aesthetics (HAA) framework that exploits aesthetic cues at multiple perceptual levels. HAA consists of two key branches: (1) Global Anti-Aesthetics, which degrades overall aesthetics and generation quality by constructing a global anti-aesthetic reward mechanism and a corresponding loss; and (2) Local Anti-Aesthetics, which disrupts facial identity by using a local anti-aesthetic reward mechanism and loss to guide adversarial perturbations toward facial regions. By integrating both branches, HAA achieves anti-aesthetic degradation from a global to a local level during customized generation. Extensive experiments show that HAA outperforms existing methods in identity removal, providing an effective tool for protecting facial privacy.
64. 【2607.02034】ComplexMimic: Human-Scene Interaction Imitation in Complex 3D Environments
Link: https://arxiv.org/abs/2607.02034
Authors: Lu Pan, Hongwei Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Physics-based Human-Scene Interaction, Physics-based Human-Scene, gap between kinematic, crucial for embodied, embodied intelligence
Abstract:Physics-based Human-Scene Interaction (HSI) imitation learning is crucial for embodied intelligence as it bridges the gap between kinematic 3D motions and real-world dynamics. However, most existing methods focus on simplified scene settings, leaving complex environments largely unexplored, which limits their applicability in real-world scenarios. In this paper, we focus on HSI mimicry in complex environments. Under this complex setting, we observe an inherent trade-off between successfully performing interaction and maintaining natural, physically plausible motions. To address this challenge, we propose ComplexMimic, a framework that reconstructs diverse HSI by interpreting imperfect MoCap data. First, we introduce a Dual Flow Strategy, which learns two complementary experts: an imitation expert for accurate motion tracking and an interaction expert for collision-aware adaptation in complex scenes. Second, naive multi-expert distillation, which treats all experts equally, often under-samples challenging behaviors, limiting effective learning. To mitigate this issue, we propose a difficulty-aware distillation strategy that adaptively weights supervision and prioritizes hard-yet-learnable trajectories guided by failure statistics and learning progress signals. Extensive experiments on three benchmark datasets demonstrate that our approach outperforms current state-of-the-art methods. Our implementation is available at this https URL.
65. 【2607.02025】Evaluating Vision-Language Models as a Zero-Shot Learning Alternative to You Only Look Once and Optical Character Recognition for Nigerian License Plate Recognition
Link: https://arxiv.org/abs/2607.02025
Authors: Ismail Ismail Tijjani, Ahmad Abubakar Mustapaha, Sunusi Ibrahim Muhammad, Muhammad Bashir Aliyu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: License Plate Recognition, urban mobility management, Optical Character Recognition, Plate Recognition, Nigerian license plate
Abstract:License Plate Recognition (LPR) systems are critical tools in traffic monitoring, security enforcement, and urban mobility management. Traditional LPR systems often rely on a multi-stage pipeline involving object detection using You Only Look Once (YOLO) and Optical Character Recognition (OCR), which suffer from limitations such as high resource demands, poor performance in unstructured environments, and the need for large annotated datasets. This study explores the potential of Vision-Language Models (VLMs) as a unified, zero-shot learning solution for Nigerian license plate recognition. Using a curated dataset of 88 challenging real-world images collected in Nigeria, we evaluate five selected VLMs: Gemini 2.0 Flash Exp (Google DeepMind), Qwen2.5-VL-7B-Instruct (Alibaba), GPT-4o (OpenAI), Claude 4 Sonnet (Anthropic), and Llama 3.2 Vision 90b (Meta). Results based on Character Error Rate (CER) reveal that Gemini and Qwen significantly outperform other models in both accuracy and robustness on the challenging image scenarios. This work highlights the practical advantages of VLMs over YOLO+OCR, questions the claims by model providers, and compares the performances of the VLMs.
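The Character Error Rate used in the abstract above is commonly defined as the Levenshtein edit distance between the predicted and reference strings divided by the reference length. The sketch below assumes that definition; the abstract does not say whether the authors normalize case or whitespace first, and the plate strings are made up for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length (may exceed 1.0)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical plate: one wrong character out of eight.
print(character_error_rate("ABC123DE", "ABC128DE"))  # 0.125
```

A lower CER is better, and a value of 0 means the model read the plate exactly.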
66. 【2607.02024】Spatio-Temporal and Clinical Conditioning for Fine-Grained Radiology Report Retrieval
Link: https://arxiv.org/abs/2607.02024
Authors: P. Sloan, E. Simpson, M. Mirmehdi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: rising imaging demand, persistent workforce shortages, workforce shortages strain, radiology report generation, shortages strain reporting
Comments: 14 pages, 2 figures, 6 tables
Abstract:Radiology is vital to modern healthcare, but rising imaging demand and persistent workforce shortages strain reporting capacity and clinical workflows. Automated radiology report generation has the potential to support radiologists and help alleviate this burden; however, existing retrieval-based methods remain rigid, lack explicit anatomical grounding, and do not account for longitudinal disease progression or available clinical context. In this work, we introduce STAR3, a multimodal, spatio-temporal, attentive retrieval framework for radiology report generation that aligns region-level anatomical information with clinical indications and longitudinal changes across chest X-ray studies. Our framework employs an object detector to identify anatomically meaningful regions and retrieves semantically relevant report sentences conditioned on both current clinical context and changes observed between prior and current examinations. This design enables anatomically and temporally grounded report generation that better reflects clinical reporting practice. Experiments on the MIMIC-CXR dataset demonstrate that STAR3 outperforms current retrieval-based approaches on retrieval, NLP and clinical metrics, highlighting the value of conditioning retrieval anatomically, temporally and clinically for advancing automated radiology report generation.
67. 【2607.02018】UnderOneFacade: Worldwide Facade Semantic Segmentation Benchmark Dataset
Link: https://arxiv.org/abs/2607.02018
Authors: Yi Wang, Fan Wang, Prabin Gyawali, Ziyang Xu, Anna Klimkowska, Yixiong Jing, Wanru Yang, Filip Biljecki, Christoph Holst, Benjamin Busam, Brian Sheil, Olaf Wysocki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Globally consistent semantic, digital twins require, Globally consistent, twins require centimeter-accurate, consistent semantic digital
Comments: accepted by ECCV 2026
Abstract:Globally consistent semantic digital twins require centimeter-accurate and geographically transferable 3D facade segmentation. However, progress in facade parsing is limited by the lack of large-scale, standardized benchmarks for evaluating cross-domain generalization. Existing datasets are geographically narrow, semantically inconsistent, or insufficiently precise. We introduce UnderOneFacade, the largest cross-country and cross-continent 3D facade benchmark to date, comprising centimeter-accurate point clouds with hierarchical, harmonized, and architecturally grounded semantic labels totaling 2.7 billion annotated points. Through a systematic evaluation of representative point-, graph- and transformer-based architectures, we show that current methods struggle to recognize fine-grained architectural elements and degrade significantly across geographic domains, with the best models achieving only up to 33 IoU on the fine-grained LoFG3 benchmark. By combining geometric precision with standardized semantics at unprecedented scale, UnderOneFacade establishes a rigorous benchmark for developing robust and transferable 3D segmentation models. The dataset, evaluation scripts, and pretrained models will be released upon publication.
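The IoU figure quoted in the abstract above is a standard segmentation metric. As a reminder of how it is usually computed over per-point labels, here is a minimal sketch. The abstract does not state how scores are aggregated across classes or scenes, so the averaging step and the function names are assumptions.

```python
def per_class_iou(predicted, target, num_classes):
    """Intersection over union for each class that occurs in either labeling.

    predicted and target are equal-length sequences of integer class labels,
    one per point. Classes absent from both are skipped.
    """
    if len(predicted) != len(target):
        raise ValueError("label sequences must have the same length")
    scores = {}
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(predicted, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(predicted, target) if p == cls or t == cls)
        if union:
            scores[cls] = intersection / union
    return scores


def mean_iou(predicted, target, num_classes):
    """Unweighted mean of the per-class IoU values (assumed aggregation)."""
    scores = per_class_iou(predicted, target, num_classes)
    return sum(scores.values()) / len(scores)


# Four points, two classes: class 0 scores 1/2 and class 1 scores 2/3.
print(per_class_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Because every class counts equally in the unweighted mean, rare fine-grained elements can pull the score down sharply even when large surfaces such as walls are labeled well.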
68. 【2607.02015】Mirror Illusion Art
Link: https://arxiv.org/abs/2607.02015
Authors: Xiaopei Zhu, Zeyuan Li, Jun Zhu, Xiaolin Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Mirror Illusion Art, Mirror Illusion, Illusion Art design, Illusion Art, automated Mirror Illusion
Comments: CVPR 2026 Highlight, also got an Efficient CVPR award
Abstract:Mirror Illusion Art is a novel reflection-conditioned 3D illusion where one object yields two target appearances (front and mirror). The task is formulated as inverse design from two target 2D images (front and mirror) to a printable 3D object with geometry and texture. Prior topology-driven and shadow-based approaches demand substantial manual effort, optimize shape only, and often yield non-smooth or incomplete geometry. To address these challenges, we propose AutoMIA, an automated Mirror Illusion Art design pipeline that jointly optimizes shape and color. To stabilize optimization and suppress artifacts, four mechanisms are introduced: (1) projection-alignment component (PAC) selection to reduce surface noise, (2) position-weighted adaptive (PWA) suppression for background noise, (3) internal voxel preservation (IVP) to prevent internal fractures, and (4) shape-color decoupled (SCD) optimization that balances shape and color optimization. AutoMIA successfully generates diverse, smooth Mirror Illusion artworks in both the digital and physical worlds, with only around 76s design time and 2.6 GB memory on average using a single RTX 3090, advancing inverse graphics and computational design. Our code is available at this https URL.
69. 【2607.02007】EduArt: An educational-level benchmark for evaluating art history knowledge in large language models
Link: https://arxiv.org/abs/2607.02007
Authors: Gianmarco Spinaci, Lukas Klic, Giovanni Colavizza
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: aggregate measures reveal, Large language models, Large language, single disciplines, aggregate measures
Abstract:Large language models now score near ceiling on general benchmarks, but these aggregate measures reveal little about how models behave within single disciplines. Existing art-focused evaluations rely on synthetic questions and rarely report item-level properties. This paper introduces EduArt, an educational-level benchmark for art-historical knowledge and visual reasoning in multimodal LLMs. EduArt comprises 871 human-authored questions from Italian secondary-school exercises and US Advanced Placement Art History exams, spanning two languages and seven formats from multiple choice to in-text word placement and error identification. Twelve models from six provider families were evaluated under a default answer-only condition and a motivation condition requiring written justification, and characterized using Classical Test Theory and a logistic regression isolating the effects of format, language, image presence, and model. The benchmark showed strong psychometric properties (mean discrimination 0.514, 82.3 percent good discriminators), while multiple-choice accuracy saturated near ceiling for six models, showing recognition formats alone cannot distinguish frontier models. Format was a strong independent predictor of accuracy: models exceeding 94 percent on multiple choice fell to 23.9 percent on open completion (Claude Opus 4.6) and 6.2 percent on error identification (Claude Sonnet 4.6). The motivation condition changed accuracy in a predominantly negative, family-dependent direction. These dissociations indicate that art-historical knowledge and the ability to deploy it are distinct capabilities, and that single-format benchmarks overestimate what models can reliably do. Mapping this capability profile is a precondition for responsible use of multimodal LLMs in art-historical scholarship, where tasks demand producing and manipulating content rather than selecting from fixed options.
70. 【2607.02005】A Stereo Visual SLAM System Using Object-Level Motion Estimation and Geometric Filtering Based on Cross Disparity
Link: https://arxiv.org/abs/2607.02005
Authors: Sujan Kumar Dhali, Bhaskar Dasgupta
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: presents OCD SLAM, paper presents OCD, visual SLAM framework, OCD SLAM, jointly addressing dynamic
Comments: 10 pages, 12 figures, 6 tables
Abstract:This paper presents OCD SLAM, a dynamic stereo visual SLAM framework that extends ORB-SLAM2 by jointly addressing dynamic objects and dynamic features in the scene. Conventional visual SLAM systems operating in dynamic environments often fail in the presence of moving objects, due to the static-world assumption used in pose estimation and mapping. To address this problem, we introduce a novel geometric approach based on the discrepancy between disparity and a newly proposed notion called "cross disparity", which exploits both temporal and stereo inconsistency to identify dynamic feature points. Complementary to this feature-level motion analysis, OCD SLAM integrates a 3D object detection module (SMOKE) with Kalman filter-based object tracking to perform object-level motion classification, enabling robust separation of static and dynamic scene elements for accurate pose estimation. The proposed approach has been evaluated on various sequences from the KITTI Odometry and KITTI Raw datasets. Results demonstrate that OCD SLAM achieves significant improvement in trajectory accuracy compared to ORB-SLAM2 and several state-of-the-art dynamic SLAM methods. Ablation studies further demonstrate the effectiveness of the cross disparity module on the KITTI Raw dataset and show that this method is able to detect dynamic features that are missed by the 3D object detection scheme alone.
71. 【2607.01990】Training-free Controllable Human Motion Generation under Heterogeneous Constraints
Link: https://arxiv.org/abs/2607.01990
Authors: Xiaofei Hui, Bo Yan, Haoxuan Qu, Hossein Rahmani, Jun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: attracted growing interest, enabling flexible constraint, flexible constraint enforcement, Training-free controllable motion, constraint-specific training
Comments: ECCV 2026
Abstract:Training-free controllable motion generation has attracted growing interest for enabling flexible constraint enforcement without constraint-specific training. However, existing training-free methods require constraints to be continuous objective-based with differentiable losses, while many real-world requirements are criterion-based and provide only discontinuous, sparse, or even black-box feedback. In this paper, we propose Motion-Inference-as-Control (MIC), the first training-free motion generation framework that handles both continuous objective-based and criterion-based motion constraints under a shared mechanism. The key idea is to cast diffusion-based motion generation as a stochastic control problem. This perspective not only provides principled and practically effective step-wise control laws that support criterion-based constraints without requiring differentiability and naturally accommodate objective-based constraints as a special case, but also motivates a control-oriented constraint coordination mechanism that adaptively balances and reconciles motion constraints during generation. Experiments across diverse constraint settings demonstrate the effectiveness of our framework.
72. 【2607.01987】Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention
Link: https://arxiv.org/abs/2607.01987
Authors: Weichen Zhou, Yawen Zou, Chunzhi Gu, Ran Dong, Haoran Xie, Chao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: self-supervised Vision Transformers, Vision Transformers, self-supervised Vision, controlled subspace intervention, subspace intervention framework
Comments: Accepted to ECCV 2026
Abstract:We introduce a controlled subspace intervention framework to investigate how self-supervised Vision Transformers (ViTs) encode dense geometric information. While linear probing is widely used to assess geometric representations, it treats features as a black box, failing to disentangle the underlying topology. To address this issue, we decompose the weights of converged linear probes to isolate the low-rank subspaces containing explicit geometric signals using Singular Value Decomposition (SVD). Our perspective yields three key insights: (1) Pre-training objectives determine how features are encoded. DINOv2 aligns spatial features for efficient linear extraction, while Masked Autoencoders (MAE) tend to disperse these signals, requiring a broader spatial context. (2) Explicit geometric representations are highly compressible, suggesting dense predictive heads could potentially be constrained to low-rank subspaces with minimal performance loss. (3) The layer-wise task affinity suggests that geometric precision peaks at intermediate layers before yielding to semantic abstraction in the final layers. By connecting internal encoding mechanics with downstream performance, these findings provide a basis for effective feature selection and lightweight decoder design. The source code is available at this https URL.
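The probe decomposition described in this abstract can be written out with standard linear algebra. The following is a minimal sketch, not the authors' formulation: the symbols $W$, $x$, $r$ and the two interventions at the end are assumptions introduced for illustration, while the truncation and error identities are textbook SVD facts.

```latex
% Assumed setup: a linear probe maps a d-dimensional ViT feature x to k outputs.
\[
  \hat{y} = W^{\top} x, \qquad W \in \mathbb{R}^{d \times k}, \; x \in \mathbb{R}^{d}.
\]
% Thin SVD of the converged probe weights, singular values sorted in decreasing order.
\[
  W = U \Sigma V^{\top}, \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge 0 .
\]
% Rank-r truncation and its approximation error (Eckart-Young).
\[
  W_r = U_r \Sigma_r V_r^{\top}, \qquad
  \lVert W - W_r \rVert_F^{2} = \sum_{i > r} \sigma_i^{2} .
\]
% Projector onto the r feature-space directions the probe relies on most.
\[
  P_r = U_r U_r^{\top}, \qquad W^{\top} P_r \, x = W_r^{\top} x .
\]
% Two hypothetical interventions: keep only the subspace, or remove it.
\[
  x_{\text{keep}} = P_r \, x, \qquad x_{\text{remove}} = (I - P_r) \, x .
\]
```

The identity $W^{\top} P_r x = W_r^{\top} x$ shows why a low-rank probe and a projected feature give the same prediction, which is one way to read the abstract's claim that geometric signals are highly compressible.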
73. 【2607.01986】Liquid Latent State Dynamics for Interpretable Turbofan Degradation Modeling
Link: https://arxiv.org/abs/2607.01986
Authors: Weizhi Nie, Weijie Wang, Yuting Su
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multivariate time-series models, Multivariate time-series, point prediction accuracy, internal states rarely, states rarely expose
Comments: Preprint. 37 references, 8 figures
Abstract:Multivariate time-series models for prognostics are often evaluated by point prediction accuracy, yet their internal states rarely expose a coherent degradation process. We study liquid neural networks as latent dynamics models for aircraft engine health monitoring on the C-MAPSS benchmark. The proposed model encodes a history window into a latent state, evolves that state with a liquid transition model, and decodes future sensor observations. To separate health evolution from operating-condition variation, the latent state is factorized into degradation and condition components. Remaining useful life, monotonic risk, and latent-consistency losses supervise the degradation component, while condition prediction and decorrelation losses discourage operating-condition leakage. Across FD001-FD004, the full disentangled model improves overall sensor forecasting RMSE from 0.2438 for a GRU baseline to 0.2266, with the largest gains on the multi-condition subsets FD002 and FD004. The learned degradation state also forms a clearer temporal degradation axis, reaching an average state-speed Spearman correlation of 0.5960. Direct remaining-useful-life regression remains stronger for the GRU baseline, indicating that the proposed representation is currently more effective as an interpretable world model for degradation dynamics than as a calibrated lifetime regressor. These results suggest that liquid latent dynamics can bridge predictive maintenance forecasting and inspectable health-state modeling.
74. 【2607.01984】Do Newer Lightweight CNNs Perform Better Under Resource Constraints? A Controlled Multigenerational Study of Architecture, Initialization, Training Budget, and Efficiency
Link: https://arxiv.org/abs/2607.01984
Authors: Tasnim Shahriar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: require controlled evaluation, convolutional neural networks, claims require controlled, lightweight convolutional neural, Newer lightweight convolutional
Comments: 19 pages, 8 figures, 13 tables
Abstract:Newer lightweight convolutional neural networks are often presented as improving predictive performance and deployment efficiency, but such claims require controlled evaluation. This study compares nine lightweight CNN model packages across CIFAR-10, CIFAR-100, and Tiny ImageNet under a shared downstream protocol. We report top-1 accuracy, macro F1, top-5 accuracy, parameter count, FP32 storage, GMACs, batch-size-1 latency on an NVIDIA L4 and AMD Ryzen 5 5500U CPU, peak PyTorch CUDA allocated tensor memory, and point estimate Pareto frontiers. EfficientNetV2-S achieves the highest observed top-1 accuracy on CIFAR-10 and CIFAR-100 at 97.57% and 86.98%, while RepViT-M1.0 leads Tiny ImageNet at 79.87%. EfficientNet-B0 remains within 0.22, 0.85, and 1.79 percentage points of the best result on the three datasets while using approximately 79% fewer parameters and 86% fewer GMACs than EfficientNetV2-S. It also appears on every evaluated accuracy and resource Pareto frontier, making it the most consistently competitive intermediate-budget option. MobileNetV3-Small has the lowest GMAC count, is the fastest model under both CPU thread settings, and records higher observed accuracy than MobileNetV4-Conv-S on all three datasets. Under random initialization, it leads MobileNetV4-Conv-S by 2.55, 1.76, and 0.99 points, with paired test-set intervals excluding zero for the fixed trained models. EfficientNet-B0 remains 3.29, 10.10, and 17.54 points below its pretrained counterpart after 100 epochs of scratch training, despite requiring about five times the recorded training time. SqueezeNet1.1 has the fewest parameters and lowest peak CUDA allocation, but substantially weaker accuracy. Latency rankings differ sharply between the L4 and CPU environments, showing that GMACs alone do not predict measured inference performance. Overall, newer designs provide selective rather than universal gains
75. 【2607.01983】Open-Weather Robust 3D Detection via Dual-Critic Diffusion Alignment
Link: https://arxiv.org/abs/2607.01983
Authors: Shuyao Li, Chuanxing Geng, Heyang Sun, Qiang Zhou, Jingjing Gu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: adverse weather remains, object detection, autonomous driving, detection under adverse, remains a critical
Comments: 18 pages, 6 figures, 8 tables. ECCV 2026 camera-ready
Abstract:Robust 3D object detection under adverse weather remains a critical hurdle for autonomous driving. Despite progress with LiDAR-4D radar fusion, most methods are constrained by a closed-world assumption, implicitly requiring training and test weather to align in both type and severity. This premise fails in practice: the open-ended nature of weather, and even variations within a single type like rain, cause dramatically different LiDAR degradation patterns, leading to significant performance drops in unseen conditions. To address this, we present Dual-Critic Guided Diffusion Alignment (DCDA), a weather-agnostic framework that learns to recover degraded LiDAR features toward a clean manifold. Rather than modeling specific weather types, DCDA employs a 4D radar-conditioned diffusion process to progressively refine features, guided by two complementary critics. (i) A detection-guided critic, anchored by a pre-trained clean-weather model, ensures that the refined features retain object-level discriminability and localization accuracy. (ii) A weather adversarial critic enforces holistic distributional consistency with clean-weather representations. By aligning features through semantic and distributional constraints rather than explicit weather modeling, DCDA generalizes effectively to unseen weather types and severities without requiring paired data or weather labels. We further introduce a structured open-weather benchmark with held-out type-severity combinations and extensive experiments verify DCDA's advantages.
76. 【2607.01982】MolSight: A Graph-Aware Vision-Language Model for Unified Chemical Image Understanding
Link: https://arxiv.org/abs/2607.01982
Authors: Wenda Wang, Yihan Tong, Yuwei Hu, Zhewei Wei
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Keywords: large language models, molecular large language, molecular, drug discovery, large language
Comments:
Abstract:Using molecular large language models (LLMs) as a unified framework for understanding molecular structures and functions is emerging as a new trend in tasks such as molecular design and drug discovery. However, these models struggle to fully capture the visual representation of molecular structures, limiting their potential. While existing molecular vision-language models (VLMs) show promise, they still face challenges in structural alignment and lack the necessary topological modeling for accurate molecular understanding. To address this, we propose MolSight, a graph-aware vision-language model framework designed to enhance the understanding of molecular images by VLMs. MolSight integrates a Molecular Topology Module to inject chemical-bond adjacency information into vision tokens, and a Molecular Grounding Module to align visual features with chemical symbolic semantics. Our experiments demonstrate that MolSight significantly outperforms existing VLMs, molecular LLMs, and specialized tools across multiple chemical visual understanding tasks, achieving a new level of molecular image reasoning.
77. 【2607.01978】Multimodal Knowledge Edit-Scoped Generalization for Online Recursive MLLM Editing
Link: https://arxiv.org/abs/2607.01978
Authors: Siyuan Li, Youyuan Zhang, Ruitong Liu, Junxi Wang, Jing Li
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: multimodal large language, large language models, Online multimodal knowledge, multimodal knowledge editing, knowledge editing requires
Comments:
Abstract:Online multimodal knowledge editing requires injecting a continual stream of visual-textual corrections into multimodal large language models (MLLMs) with bounded overhead and minimal disruption to unrelated behaviors. Existing editors mainly emphasize edit reliability and long-horizon stability, but rarely control the semantic boundary of each edit. Our pilot analyses of post-edit behaviors and internal neuronal activities reveal a scope gap behind reliable edits: instance-level success neither guarantees transfer to valid cross-modal variants nor prevents leakage to unrelated inputs, while edit-related cross-modal responses concentrate in deeper semantic layers. Therefore, we formulate Edit-Scoped Generalization, reframing online MLLM editing from merely correcting an instance to controlling the propagation boundary of each edit. To this end, we propose ScopeEdit, a scope-aware online editor that decomposes each update into a modality-local absorption branch and an evidence-gated shared generalization branch. The local branch supports stable edit absorption, whereas the shared branch enables cross-modal propagation only when visual and textual evidence are sufficiently aligned. Both branches perform scope-separated write geometries in orthogonal low-rank spaces and maintain branch-wise preconditioners via Sherman-Morrison recursions, yielding constant per-edit overhead. Extensive experiments across diverse benchmarks, long-horizon edit streams, MLLM backbones, real-world VLKEB scenarios, and complex vision-language architectures show that ScopeEdit consistently improves the trade-off between in-scope cross-modal transfer and out-of-scope locality, while preserving edit reliability, stability and online efficiency. Our code is available at this https URL.
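The abstract attributes the constant per-edit overhead to Sherman-Morrison recursions. The sketch below shows only the generic rank-one inverse update, (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u). The function names, the identity initialization, and the toy key vectors are assumptions for illustration, not ScopeEdit's actual preconditioner.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sherman_morrison_update(a_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, using O(n^2) work and no re-inversion."""
    n = len(a_inv)
    a_inv_u = mat_vec(a_inv, u)  # column vector A^{-1} u
    # Row vector v^T A^{-1}
    vt_a_inv = [sum(v[i] * a_inv[i][j] for i in range(n)) for j in range(n)]
    denom = 1.0 + sum(vi * xi for vi, xi in zip(v, a_inv_u))
    if abs(denom) < 1e-12:
        raise ValueError("rank-one update would make the matrix singular")
    return [
        [a_inv[i][j] - a_inv_u[i] * vt_a_inv[j] / denom for j in range(n)]
        for i in range(n)
    ]


# Hypothetical edit stream: start from the identity and fold in one
# rank-one term x x^T per incoming edit.
p = [[1.0, 0.0], [0.0, 1.0]]
for x in ([1.0, 2.0], [0.5, -1.0]):
    p = sherman_morrison_update(p, x, x)
```

Each update touches only the current n x n inverse, so the cost depends on the (low-rank) dimension n and not on how many edits have already been absorbed.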
78. 【2607.01973】Assessing VLM Reliability for Medical Image Quality Evaluation Under Corruption and Bias
Link: https://arxiv.org/abs/2607.01973
Authors: Sofiane Ouaari, Kevin Vorwalder, Nico Pfeifer
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: visual question answering, Medical Image Quality, report generation, Image Quality Assessment, pathology description
Comments:
Abstract:Vision-Language Models (VLMs) are increasingly applied in medical tasks such as pathology description, report generation, and visual question answering. Medical Image Quality Assessment (MIQA) supports diagnostic accuracy and patient safety by determining whether images meet the standards required for clinical decision-making. Automating MIQA with VLMs may reduce workload, but their behavior under real-world conditions, where images may be degraded or textual context may affect judgments, should be further explored before deployment. We benchmark VLMs on medical image quality using the MediMeta-C dataset zero-shot across seven corruption types and five severity levels. We evaluate sensitivity to degradation patterns, the effect of corruptions on embedding geometry, and whether textual attributes (demographics, expertise, infrastructure, institution) alter scores. Across 16 VLMs and seven modalities, pixelation produced the largest score reductions (mean -20.58%, up to -34.4% for OCT), whereas brightness had limited effect (-0.81%). Embedding displacement was associated with score changes. Same-family models showed correlations of 0.67-0.83; some produced increases up to +31% for corrupted mammography. Textual attributes affected scores: institutional prestige raised them +17.15%, and equipment age lowered them -14.7%. The largest changes were +95.62% (InternVL-8B) and -37.7% (MedGemma). Current VLMs show limitations for medical image quality assessment. Pixelation, a privacy-preserving transformation, reduces performance, indicating a trade-off between patient privacy and reliability. Sensitivity to contextual metadata indicates limited objectivity and marks metadata as a privacy and bias source. Privacy protection and objective quality assessment are related requirements for use.
79. 【2607.01962】NeoMap: Training-free Novel-View Synthesis from Single Images and Videos
Link: https://arxiv.org/abs/2607.01962
Authors: Jinxi Li, Tianyi Zhang, Yafei Yang, Zihui Zhang, Peng Huang, Koon Wing Macgyver Lin, Bo Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
Keywords: single images, images or monocular, pre-trained video models, view, view video synthesis
Comments: ECCV 2026. Jinxi and Tianyi are co-first authors. Code and data are available at: [this https URL](https://github.com/vLAR-group/NeoMap)
Abstract:We study the challenging problem of novel view video synthesis from single images or monocular videos. Existing methods, which operate under the assumption that pre-trained video models lack native novel view synthesis capability and enforce view alignment via camera conditioning, task-specific fine-tuning, or stepwise hard denoising guidance, often suffer from artifacts and compromised global scene consistency. In this paper, we introduce NeoMap, a novel training-free framework designed to locate high-fidelity, view-consistent novel view solutions from general pre-trained video models. The key to our approach is the core insight that promising novel view solutions are inherently encoded within the natural video data manifold learned by pre-trained models, and the core challenge is simply to locate this optimal solution. We solve this via our core mechanism: convergent manifold alternating projection iterations that optimize the initial noise. Extensive experiments demonstrate that NeoMap significantly outperforms all existing methods across 3 standard novel view synthesis benchmarks, including the challenging Tanks-and-Temples, LLFF and DAVIS datasets, achieving state-of-the-art generation fidelity and top-tier view consistency.
80. 【2607.01952】Personalized 4D Whole-Heart Mesh Reconstruction from Cine MRI via Multi-Scale Temporal Modeling and Differentiable Contour Rendering
Link: https://arxiv.org/abs/2607.01952
Authors: Xiaoyue Liu, Dongcheng Cang, Xiaohan Yuan, Mark YY Chan, Ching-Hui Sia, Lei Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remains challenging due, cardiac digital twins, creating cardiac digital, sparse cine MRI, cine MRI
Comments: 15 pages
Abstract:Accurate 4D whole-heart mesh reconstruction from sparse cine MRI is critical for creating cardiac digital twins, but remains challenging due to limited 2D slice coverage and the complex coupling between cardiac shape and motion. Existing methods often rely on intermediate contour fitting and typically reconstruct static, single-phase, or partial cardiac geometries, limiting their ability to capture full-chamber dynamics. We propose a novel end-to-end framework for reconstructing temporally resolved whole-heart meshes from multi-view 2D cine MRI sequences by learning an image-to-mesh mapping. The framework incorporates a differentiable contour renderer inspired by the Beer-Lambert attenuation principle, enabling anatomy-aware supervision of 3D+t mesh deformation through contour-based projection losses. To improve temporal consistency across the cardiac cycle, we further introduce a multi-scale temporal modeling module that integrates global cycle-level dynamics with local inter-frame coherence to generate smooth and physiologically plausible mesh trajectories. The proposed method achieved a whole-heart mean absolute error of 1.68 $\pm$ 0.31 mm and a motion jitter of 0.77 $\pm$ 0.17 $\mathrm{mm}/\mathrm{frame}^{3}$, outperforming existing methods with lower reconstruction error and substantially improved motion smoothness. It also improved 2D contour alignment across multiple cine MRI views and supported downstream proof-of-concept electrophysiological simulation. The code will be released publicly upon acceptance of the manuscript for publication.
81. 【2607.01949】LiZAD: A Lightweight Zero-Shot Anomaly Detection Framework for Industrial Manufacturing
Link: https://arxiv.org/abs/2607.01949
Authors: Uzair Khan, Luigi Capogrosso, Muhammad Aqeel, Francesco Setti, Michele Magno, Marco Cristani
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: characteristics frequently change, modern high-throughput industrial, visual characteristics frequently, product configurations, frequently change
Comments: Accepted at the IEEE International Conference on Omni-Layer Intelligent Systems (COINS) 2026
Abstract:In modern high-throughput industrial production lines, product configurations and visual characteristics frequently change, making it impractical to collect and annotate data for every new scenario. This dynamic setting makes Zero-Shot Anomaly Detection (ZSAD) particularly suitable, as it enables defect detection without requiring training on target-specific samples. Although recent ZSAD approaches show promising results, they are computationally intensive and thus unsuitable for deployment on resource-constrained devices. We propose LiZAD: a lightweight framework designed for real-time ZSAD specifically tailored for use on edge devices. The proposed approach pairs the dense and spatially aware visual features of DINOv3, crucial for precise pixel-level localization, with the highly computationally efficient text embeddings of MobileCLIP2. These features are then mapped into a shared latent space via low-memory trainable projection heads. Compared to six state-of-the-art ZSAD models, LiZAD achieves an average memory reduction of 61.5%, a parameter reduction of 74.6%, and a speedup of 3.02x in terms of latency. Despite substantial reductions in computational and memory costs, our approach maintains competitive anomaly detection performance, dropping the average P-AUROC by just 6.4% relative to the best state-of-the-art model across the VisA, BTAD, MPDD, and MVTec-AD datasets. Finally, it is successfully deployed on the NVIDIA Jetson NX and Jetson AGX edge devices and tested on the real production line of the Industrial Computer Engineering Laboratory (ICE Lab) at the University of Verona. The code is available at this https URL.
82. 【2607.01938】PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation
Link: https://arxiv.org/abs/2607.01938
Authors: Peng Yun, Shouwang Huang, Hao Li, Jinxi Li, Jianan Wang, Bo Yang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: environments remains challenging, dynamically moving targets, Manipulating fast, targets in unstructured, environments remains
Comments: ECCV 2026. Code and data are available at: [this https URL](https://github.com/vLAR-group/PhysMani)
Abstract:Manipulating fast and dynamically moving targets in unstructured 3D environments remains challenging for embodied AI. Existing visual-language-action models and world models struggle with accurate 3D geometry and physically meaningful forecasting. We propose PhysMani, a framework that couples a physics-principled 3D Gaussian world model with a future-aware action policy model. The world model learns a divergence-free Gaussian velocity field via online optimization for fast and physically grounded future dynamics prediction. The policy model integrates the predicted 3D scene future dynamics through a learnable token based cross-attention module. We introduce PhysMani-Bench, a dynamic manipulation benchmark with 16 tasks, and demonstrate a superior success rate over strong baselines in both simulation and real-world robot experiments.
83. 【2607.01928】Sparse-Aware Vector Quantization for Bandwidth-Efficient Collaborative 3D Semantic Occupancy Prediction
Link: https://arxiv.org/abs/2607.01928
Authors: Feng Li, Chaokun Zhang, Gong Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enabling multiple vehicles, exchange complementary perceptual, perception extends single-agent, extends single-agent perception, complementary perceptual information
Comments: Accepted by ECCV26
Abstract:Collaborative perception extends single-agent perception by enabling multiple vehicles to exchange complementary perceptual information. However, it introduces an inherent trade-off between perception gain and communication overhead, which is particularly severe for 3D semantic occupancy prediction that relies on fine-grained spatial structures. Existing methods typically compress 3D features into 2D, causing severe spatial information loss, or transmit dense 3D representations, hindering real-world deployment. To overcome these limitations, we propose a bandwidth-efficient collaborative Vector Quantization Semantic Occupancy Prediction (VQSOP) framework. VQSOP employs a Sparse-Aware Vector Quantization (SAVQ) mechanism that exploits 3D scene sparsity to compactly encode informative regions, drastically reducing communication overhead while preserving complete geometric context. Furthermore, to enhance structural consistency and feature continuity, we design a Dual-Branch Adaptive Spatial Refinement (ASR) module that dynamically fuses local high-frequency details with broad contextual semantics. Extensive experiments demonstrate that our approach achieves state-of-the-art performance while reducing communication volume by up to 82x.
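To make the bandwidth argument concrete, here is a minimal sketch of plain nearest-neighbour vector quantization applied only to occupied voxels. It is an illustration under assumptions, not the SAVQ mechanism itself: the codebook, the voxel dictionary layout, and all function names are hypothetical.

```python
def nearest_code(feature, codebook):
    """Index of the codebook entry closest to `feature` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(feature, codebook[k])),
    )


def encode_sparse(voxels, codebook):
    """Map each occupied voxel to a codebook index; empty voxels are never stored or sent."""
    return {pos: nearest_code(feat, codebook) for pos, feat in voxels.items()}


def decode_sparse(message, codebook):
    """Receiver side: look up the shared codebook to recover approximate features."""
    return {pos: list(codebook[idx]) for pos, idx in message.items()}


# Hypothetical shared codebook and a scene with two occupied voxels.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
voxels = {(0, 0, 0): [0.9, 0.1], (3, 1, 2): [0.2, 0.8]}

message = encode_sparse(voxels, codebook)    # what would be transmitted
restored = decode_sparse(message, codebook)  # what the other agent reconstructs
```

Only a coordinate and one small integer per occupied voxel cross the link, while the 3D positions are kept instead of being flattened to 2D.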
84. 【2607.01915】Robust Image Processing Techniques for Construction Environment Monitoring Using Underwater Robots
Link: https://arxiv.org/abs/2607.01915
Authors: Seunghee Yun, Geonmo Yang, Juhui Lee, Changbeom Park, Jeahyung Choi, Younggun Cho
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: targeting complex degradations, complex degradations observed, construction environment monitoring, underwater robot-based construction, robot-based construction environment
Comments: 8 pages, 9 figures
Abstract:This paper proposes a robust image processing framework for underwater robot-based construction environment monitoring, targeting complex degradations observed in real marine environments. Unlike conventional approaches that mainly consider absorption and backscattering, real underwater imagery is strongly affected by depth-dependent forward scattering blur and particle-induced degradations such as marine snow. To address this, we introduce a staged processing pipeline that sequentially models background degradation via depth-aware forward scattering and foreground degradation using realistic marine snow patterns extracted from real images. The resulting synthetic data are used to retrain an existing Joint-ID network without modifying its architecture, enabling an isolated evaluation of dataset realism. In addition, a lightweight post-processing scheme is applied to enhance contrast and structural clarity. Experiments on real underwater datasets collected in Korean coastal environments demonstrate consistent improvements in visual quality and UIQM scores. The results indicate that explicitly modeling forward scattering and realistic particle effects effectively reduces the synthetic-to-real gap and improves practical applicability in real-world underwater robotic operations.
85. 【2607.01908】Towards Real-World Ultrasound Understanding: Large Vision-Language Models from Multi-Image Examinations with Long-Form Reports
Link: https://arxiv.org/abs/2607.01908
Authors: Bingcong Yan, Chunlei Li, Jingliang Hu, Yilei Shi, Xiao Xiang Zhu, Lichao Mou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large vision-language models, remains limited due, Large vision-language, medical imaging tasks, ultrasound remains limited
Comments: Project Page: [this https URL](https://medai-t.github.io/LUMI/)
Abstract:Large vision-language models (LVLMs) have achieved strong performance across many medical imaging tasks, yet their application to ultrasound remains limited due to its inherent complexity and variability. In this work, we revisit what is truly needed to enable real-world ultrasound understanding. Instead of introducing complex architectures or elaborate training strategies, we show that data scale and clinically faithful data alignment are the key factors. We construct a large-scale dataset of 1.5M real-world ultrasound examinations, containing 17.7M images, multi-organ coverage, and paired uncurated clinical reports. Crucially, we organize the data at the examination level, aligning multiple images with their corresponding reports to reflect real clinical workflows. We then fine-tune a standard LVLM using low-rank adaptation (LoRA) on this dataset without task-specific modifications. Surprisingly, this simple recipe already leads to strong performance across diverse ultrasound understanding tasks, outperforming prior methods designed with more complex pipelines. Beyond these results, we present model and data scaling analyses that provide insights into the role of scale in ultrasound LVLMs.
86. 【2607.01907】Population-Based Multi-Objective Training of Discriminators for Semi-Supervised GANs
Link: https://arxiv.org/abs/2607.01907
Authors: Francisco Sedeño, Francisco Chicano, Jamal Toutouh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Semi-supervised generative adversarial, generative adversarial networks, exploit large unlabeled, large unlabeled datasets, Semi-supervised generative
Comments: The 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS2026)
Abstract:Semi-supervised generative adversarial networks (SSL-GANs) can exploit large unlabeled datasets while retaining a classifier in the discriminator, but their training is often unstable. This paper proposes a population-based evolutionary training strategy in which discriminator learning is formulated as a multi-objective optimization problem. Instead of aggregating the supervised and unsupervised components of the SSL objective into a single scalar loss, the method maintains a population of discriminators ranked by Pareto dominance, enabling the exploration of different trade-offs between classification accuracy and real/fake discrimination. This formulation aims to improve both roles of SSL-GANs: learning accurate classifiers and training generators capable of producing realistic samples. We analyze several variants, including an elitist strategy and a mono-objective ablation, to assess the role of multi-objective selection. Experiments on MNIST with limited labels show improved training robustness compared to SSL-GAN and CE-SSL-GAN state-of-the-art baselines, while the elitist variant consistently achieves the highest classification accuracy.
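Ranking a population by Pareto dominance, as this abstract describes, can be sketched in a few lines. The two objectives, the score tuples, and the function names below are illustrative assumptions; the paper's actual selection operator may differ (for example, it may use several ranked fronts or an elitist archive).

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly better on at least one.

    Objectives are assumed to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical discriminators scored as
# (classification accuracy, real/fake discrimination accuracy).
population = [(0.90, 0.60), (0.85, 0.75), (0.80, 0.70), (0.92, 0.55)]
front = pareto_front(population)  # indices 0, 1 and 3 are non-dominated
```

Candidate 2 is dropped because candidate 1 beats it on both objectives, whereas the other three each represent a different trade-off that a single scalar loss would collapse.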
87. 【2607.01906】SFKD: Spatial-Frequency Joint-Aware Heterogeneous Knowledge Distillation via Multi-Level Wavelet Spectral Interaction
Link: https://arxiv.org/abs/2607.01906
Authors: Cuipeng Wang, Haipeng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: existing knowledge distillation, distillation methods focus, knowledge distillation methods, knowledge distillation, Heterogeneous Knowledge Distillation
Comments: Accepted by ECCV 2026
Abstract:Most existing knowledge distillation methods focus on homogeneous models (e.g., CNN-to-CNN), thereby overlooking the flexibility and potential of knowledge transfer across heterogeneous models. Due to intrinsic inductive bias discrepancies between heterogeneous models that cause spatial distribution inconsistencies, prior heterogeneous distillation methods often weaken or discard spatial information in heterogeneous representations. However, the spatial information in representations often encodes transferable global structural semantics as well as architecture-specific local details, and therefore should not be directly ignored. To better leverage the spatial information encoded in heterogeneous representations, we propose a Spatial-Frequency Joint-Aware Heterogeneous Knowledge Distillation framework (SFKD). By leveraging the complementary properties of wavelet transform spatial locality and Fourier representations in characterizing global energy distributions, we first apply multi-level discrete wavelet transform to explicitly decouple spatial information. The resulting wavelet sub-bands are further refined by a dual-stream dual-stage refinement module, and finally combined with a Gaussian-filtered frequency loss to selectively capture informative global information. Extensive experiments on multiple benchmark datasets under both homogeneous and heterogeneous models demonstrate the superiority of our method.
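For readers unfamiliar with the multi-level discrete wavelet transform mentioned in this abstract, the sketch below decomposes a 1D signal into a coarse approximation plus per-level detail bands and then reconstructs it. The Haar wavelet, the 1D setting, and the function names are assumptions chosen for brevity; the abstract does not state which wavelet family is used, and the method operates on feature maps rather than toy signals.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_step(signal):
    """One decomposition level: pairwise sums (approximation) and differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, levels):
    """Multi-level transform; len(signal) must be divisible by 2**levels.

    Returns the coarsest approximation and the detail bands, finest first.
    """
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def haar_idwt(approx, details):
    """Invert haar_dwt by undoing the levels from coarsest to finest."""
    current = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for a, d in zip(current, detail):
            rebuilt.extend([(a + d) * _S, (a - d) * _S])
        current = rebuilt
    return current


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = haar_dwt(signal, levels=2)
reconstructed = haar_idwt(approx, details)
```

The detail bands keep their position along the signal, which is the spatial locality the abstract contrasts with the global view of a Fourier representation.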
88. 【2607.01902】Rethinking Post-Hoc Calibration in Semantic Segmentation
Link: https://arxiv.org/abs/2607.01902
Authors: Tristan Kirscher (ICube), Kim-Celine Kahl (DKFZ), Balint Kovacs (DKFZ), Maximilian R. Rokuss (DKFZ), Klaus Maier-Hein (DKFZ), Xavier Coubez, Philippe Meyer (ICube), Sylvain Faisan (ICube)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Reliable confidence estimates, Reliable confidence, mislead downstream decisions, safety-critical settings, settings where overconfident
Comments:
Abstract:Reliable confidence estimates are essential in semantic segmentation, especially in safety-critical settings where overconfident errors can mislead downstream decisions. Yet modern segmentation models often remain miscalibrated. Post-hoc calibration offers a practical way to correct confidence estimates without retraining the segmentation model, but its use in dense prediction raises structural issues that are often overlooked. We study two such issues. First, adding a constant to all logits leaves the softmax probabilities unchanged, but several standard calibrators can still depend on this arbitrary offset. As a result, two logit representations encoding the same predictive distribution may yield different calibrated probabilities. We define translation-invariant (TI) calibrators as those whose outputs are unchanged under such shifts, characterize which common calibrators satisfy this property, and construct TI counterparts of shift-sensitive calibrators to isolate the effect of removing representation dependence. Second, post-hoc calibration is typically fitted by minimizing a likelihood-based objective, whereas segmentation models are trained with task-specific metrics such as Dice. This mismatch can cause calibration to alter class orderings and degrade the deployed segmentation map. We study decision-preserving calibration under argmax- and order-preservation constraints. Since enforcing these constraints collapses affine softmax calibrators to temperature scaling, we introduce class-conditional affine calibrators that can be made argmax- or order-preserving while retaining greater expressivity, allowing us to quantify the calibration-segmentation trade-off induced by decision preservation. 
Across natural-image and medical segmentation benchmarks, and under corruption-based covariate shift, matched comparisons show that TI variants generally improve calibration metrics, while decision-preserving variants prevent segmentation degradation and retain strong calibration performance. These results provide practical design principles for well-defined post-hoc calibration pipelines in semantic segmentation.
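The shift argument in this abstract can be checked numerically. The sketch below is an illustration only, not the authors' code: it assumes temperature scaling and vector scaling as two example calibrators (the abstract does not name which calibrators it studies), and the logits, weights, and biases are made-up values.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scaling(z, T):
    # One shared temperature: a shift c becomes c / T for every class.
    return softmax(z / T)

def vector_scaling(z, w, b):
    # Per-class weights: a shift c becomes w_k * c, which differs across classes.
    return softmax(w * z + b)

# Hypothetical logits for one pixel and the same logits shifted by a constant.
logits = np.array([2.0, 0.5, -1.0])
shifted = logits + 10.0

# Hypothetical calibrator parameters.
w = np.array([1.5, 1.0, 0.5])
b = np.array([0.1, 0.0, -0.1])

same_softmax = np.allclose(softmax(logits), softmax(shifted))
ts_invariant = np.allclose(temperature_scaling(logits, 2.0),
                           temperature_scaling(shifted, 2.0))
vs_invariant = np.allclose(vector_scaling(logits, w, b),
                           vector_scaling(shifted, w, b))

print(same_softmax, ts_invariant, vs_invariant)
```

Both logit vectors encode the same softmax distribution, and temperature scaling maps them to the same calibrated output, while the per-class weights in vector scaling turn the constant offset into a class-dependent one.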
89. 【2607.01900】FoundDP: Revisiting Weak Disparity Observability in Dual-Pixel Depth Estimation
Link: https://arxiv.org/abs/2607.01900
Authors: Fengchen He,Hao Xu,Dayang Zhao,Tingwei Quan,Shaoqun Zeng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: single camera, camera using sub-aperture, Dual-pixel, disparity, depth
Comments:
Abstract:Dual-pixel (DP) imaging enables metric depth estimation from a single camera using sub-aperture disparity. However, the extremely small effective baseline limits disparity observability, leading to structural degradation and depth failure in textureless, low-contrast, or downsampled regions. Existing DP-based methods rely primarily on local disparity cues and therefore become unreliable when disparity signals are weak or ambiguous. To address this limitation, we propose FoundDP, a unified framework that integrates metric DP depth with global structural priors from a monocular depth foundation model. Our method preserves metric scale through DP-derived depth and leverages Vision Transformer (ViT) features to restore structural consistency in weak-disparity regions. To ensure reliable metric guidance under DP imaging conditions, we identify and mitigate ViT representation degradation induced by DP defocus blur via ViT feature alignment, enabling stable metric-guided depth estimation. Extensive experiments on synthetic and real-world DP benchmarks show that FoundDP delivers superior performance, with consistent gains in structural fidelity and metric accuracy, especially under reduced disparity observability. Code will be available at: this https URL
90. 【2607.01885】Diversity-aware View Partitioning for Scalable VGGT
Link: https://arxiv.org/abs/2607.01885
Authors: Jinsoo Park,Donggyu Choi,Ahyun Seo,Minsu Cho,Jeany Son
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: VGGT achieve strong, achieve strong performance, Geometry transformers, achieve strong, jointly reasoning
Comments: 34 pages, 11 figures, Accepted to ECCV 2026
Abstract:Geometry transformers such as VGGT achieve strong performance by jointly reasoning over multiple views with global attention. However, scaling them to large view collections remains challenging due to the quadratic cost of attention. Moreover, our empirical analysis reveals that the reconstruction quality in VGGT is sensitive to the distribution of viewpoints. Simply increasing the number of views without sufficient viewpoint diversity can even degrade performance, as redundant views introduce highly similar tokens that dilute informative geometric signals in the attention mechanism. Motivated by this observation, we propose a training-free and plug-and-play VGGT inference framework that organizes views into diversity-aware balanced chunks. The chunks are constructed through combinatorial graph partitioning over visual dissimilarity and spatial dispersion. This view organization allows the transformer to focus attention on geometrically informative views while reducing redundant attention interactions. To estimate spatial dispersion without full pose estimation, we approximate spatial relationships via a soft pose propagation strategy based on visual similarity from a small set of seed frames. Extensive experiments demonstrate improved performance in camera pose estimation, multi-view depth prediction, and 3D reconstruction while reducing memory usage and inference latency. Our framework also complements existing VGGT variants, enabling scalable multi-view reconstruction without sacrificing geometric fidelity.
91. 【2607.01876】SAB-LVLM: Significance-Aware Binarization for Large Vision-Language Models
Link: https://arxiv.org/abs/2607.01876
Authors: Qi Lyu,Jiahua Dong,Baichen Liu,Xudong Wang,Mingfei Han,Yulun Zhang,Fahad Shahbaz Khan,Salman Khan,Lianqing Liu,Zhi Han
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Vision-Language Models, severely limiting real-world, achieved remarkable progress, cross-modal computation incur, computation incur substantial
Comments:
Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal understanding, yet their enormous parameter scale and cross-modal computation incur substantial memory and latency overhead, severely limiting real-world deployment on resource-constrained devices. Binarization offers an attractive solution by drastically reducing storage and computational costs. However, existing binarization methods neglect the varying importance of weights across different layers and modalities. This causes parameters irrelevant to downstream tasks to be unnecessarily retained, whereas modality-critical weights may not be adequately optimized, resulting in significant performance degradation. To address these challenges, we develop a novel Significance-Aware Binarization for Large Vision-Language Models (SAB-LVLM). Specifically, after constructing Hessian matrices for textual and visual inputs, we propose a spatial significance map to distinguish full-precision weights activated under a single modality from those activated across modalities. We then devise a modality-guided integration strategy to obtain the significance-aware binarization map, which measures weight significance across layers and modalities. Subsequently, this binarization map is incorporated into the binarization objective as an error reweighting term, and binarization fitting is performed through an alternating significance-weighted update scheme. Extensive experiments illustrate the superiority of our SAB-LVLM over existing binary PTQ methods under an approximately 1-bit compression constraint. Our code is accessible at this https URL.
92. 【2607.01871】Descriptor: LYNRED Mobility Dataset Multimodal Detection Subset (LYNRED-MDS)
Link: https://arxiv.org/abs/2607.01871
Authors: Loïc Arbez(Thoth),Jessy Matias,Xavier Brenière,Jocelyn Chanussot(Thoth),Ronald Phlypo(GIPSA-VIBS)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: minimizing post-collision damage, Current road safety, safety systems primarily, Current road, post-collision damage
Comments:
Abstract:Current road safety systems primarily focus on minimizing post-collision damage. However, advances in algorithmic perception are shifting focus toward early collision prediction, especially in low-visibility conditions like nighttime or fog, where thermal infrared sensing outperforms both human vision and RGB imaging. While available RGB-infrared datasets such as FLIR ADAS and LLVIP are good benchmarks, they mostly consist of clear weather and overly simple scenarios. In this article, we introduce the LYNRED-MDS: Multimodal Detection Subset, a subset of the LYNRED Mobility Dataset, comprised of 4000 RGB-infrared image pairs captured under diverse weather, lighting, and road conditions around Grenoble, France. Our dataset spans varied driving contexts (urban, rural, mountainous, etc.) and a vehicle fleet compliant with Western European standards. Thermal cross-dataset evaluation using a YOLOv8n baseline suggests that our dataset offers strong generalization potential for pedestrian detection in driving scenarios. By covering critical edge cases, our dataset supports the development of more reliable and deployable vision systems for advanced driver-assistance systems.
93. 【2607.01869】QWERTY: Training-Free Motion Control via Query-Warped Video Diffusion Transformers
Link: https://arxiv.org/abs/2607.01869
Authors: Kyobin Choo,Youngmin Kim,Hyunkyung Han,Geunrip Park,Chanyoung Kim,Sunyoung Jung,Seong Jae Hwang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: temporally coherent videos, Video diffusion transformers, generate high-fidelity, primarily relying, high-fidelity and temporally
Comments: 37 pages, 18 figures, accepted at the European Conference on Computer Vision (ECCV) 2026
Abstract:Video diffusion transformers (DiTs) generate high-fidelity and temporally coherent videos, yet motion control remains implicit, primarily relying on text prompts. As a result, achieving desired motion often requires extensive prompt engineering and repeated resampling. While fine-tuning models with additional spatial prompts (e.g., bounding boxes or point trajectories) enables explicit control, it demands substantial data curation and computation, and may compromise the generative capabilities of pretrained models. Consequently, training-free motion control using such spatial prompts has been explored in U-Net-based video diffusion models, but remains largely unexplored for DiTs. We introduce QWERTY, a training-free framework that enables flexible motion control in pretrained image-to-video DiTs via user-defined object warping and optical flow. We carefully manipulate the 3D full attention of DiTs by warping the frame-invariant semantic subspace of queries. We find that the noise predicted by the query-warped DiT naturally guides the diffusion trajectory toward the desired motion, and further show that leveraging this noise as self-guidance for latent optimization improves control stability and visual quality. Experiments show that QWERTY achieves the most effective motion control among existing training-free approaches on a recent image-to-video DiT, with performance comparable to fine-tuning-based methods.
94. 【2607.01860】DL-SLAM: Enabling High-Fidelity Gaussian Splatting SLAM in Dynamic Environments based on Dual-Level Probability
Link: https://arxiv.org/abs/2607.01860
Authors: Ziheng Xu,Qingfeng Li,Xuefeng Liu,Chen Chen,Jianwei Niu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: dynamic Simultaneous Localization, Localization And Mapping, Simultaneous Localization, dense dynamic Simultaneous, enabled significant progress
Comments:
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in dense dynamic Simultaneous Localization And Mapping (SLAM). Prevailing methods typically discard predefined dynamic objects, ignoring that transiently static objects offer valuable geometric constraints for pose estimation. A recent work attempts to leverage this potential by employing per-pixel uncertainty maps to quantify the magnitude of motion. While this approach enables transiently static objects to enhance pose estimation, it erroneously integrates these objects into the static map, resulting in persistent artifacts. Moreover, its reliance on purely geometric information leads to ambiguous object boundaries in the uncertainty maps. To overcome these limitations, we present DL-SLAM, a monocular Gaussian Splatting SLAM system built upon a novel dual-level probabilistic framework. Our method computes dynamic probability maps by combining semantic and geometric information. These pixel-level probabilities are lifted to 3D and aggregated to derive an object-level dynamic probability for each instance. Object-level probability enables the categorical pruning of dynamic Gaussians, resulting in an artifact-free static map. The static map, in turn, provides a geometrically consistent guidance to refine the pixel-wise probabilities, enhancing their reliability. Experimental results demonstrate that DL-SLAM outperforms existing approaches, improving tracking accuracy by up to 13% while generating high-fidelity semantic maps.
95. 【2607.01851】Geometric Foundation Model Distillation for Efficient Lunar 3D Reconstruction
Link: https://arxiv.org/abs/2607.01851
Authors: Clémentine Grethen,Florient Chouteau,Géraldine Morin,Simone Gasparini
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: strict hardware constraints, hardware constraints, planetary exploration, severely restricted, computationally demanding
Comments: Accepted to ECCV 2026, code can be accessed via [this https URL](https://clementinegrethen.github.io/publications/ECCV.html)
Abstract:Large 3D foundation models such as MASt3R achieve state-of-the-art stereo reconstruction but are computationally demanding for deployment under strict hardware constraints -- a critical limitation in domains such as planetary exploration, where onboard computing is severely restricted. We study how far such models can be compressed through knowledge distillation, using lunar stereo reconstruction as a challenging and practically relevant case study. Starting from a 688M-parameter MASt3R teacher fine-tuned on lunar imagery, we distill its dense geometric predictions into a family of lightweight students spanning different encoder types (CNN vs ViT), decoder widths and depths, and training strategies. To bridge the dimensional mismatch between teacher and student, we propose a structured SVD-based initialization that projects the teacher's decoder weights into the student's smaller latent space, yielding a warm start that significantly improves convergence and final performance. Based on our results on lunar data, we can obtain a distilled student that retains most of the teacher's reconstruction accuracy while reducing the model size by up to 7 times, and even outperforms a baseline trained directly with sparse ground-truth annotations. Beyond compression, our study highlights both principles and practical insights for distilling geometric foundation models: a convolutional encoder underperforms transformer-based alternatives (though pretraining availability remains a confounding factor), preserving encoder capacity is more critical than maintaining a large decoder, feature-level distillation consistently outperforms output-only supervision, and SVD-based initialization improves optimisation stability. These findings provide practical guidelines for deploying 3D reconstruction models in resource-constrained environments.
96. 【2607.01827】C2E: Boosting Ego-Only 3D Object Detection via Multi-Teacher Contrastive Knowledge Distillation
Link: https://arxiv.org/abs/2607.01827
Authors: Jinlong Wang,Xun Huang,Qiming Xia,Shijia Zhao,Chenglu Wen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: autonomous driving systems, traditional Ego-only Perception, object detection, driving systems, detection is essential
Comments: 18 pages, 8 figures
Abstract:LiDAR-based 3D object detection is essential for autonomous driving systems. However, traditional Ego-only Perception (Eo-Perception) suffers from limited perspective and occlusions in a complex outdoor environment, leading to performance bottlenecks. Recently, research on multi-agent Collaborative Perception (Co-Perception) has demonstrated excellent performance, but high communication costs and accumulated pose error hinder its application. To address this, we explore a novel C2E (Co-Perception to Eo-Perception) paradigm through the Multi-to-Single (M2S) agent contrastive knowledge distillation framework. Our M2S framework first designs a Multi-Level Feature Enhancement module to provide more stable features, and introduces Auxiliary Point Cloud Reconstruction and Multi-Teacher Contrastive Distillation mechanisms to mitigate domain gaps in point cloud and feature distributions within the C2E paradigm. Benefiting from this, our M2S can retain the excellent performance of collaborative perception while effectively avoiding its drawbacks, such as communication delays and positioning errors. Extensive experiments on the V2XSet, V2V4Real and DAIR-V2X datasets show the effectiveness and generalizability of our M2S framework when combined with the state-of-the-art CoSDH model and other excellent 3D detectors. Our M2S framework can deliver up to an 8.64% improvement in 3D mAP performance without introducing any communication costs.
97. 【2607.01825】Rethinking Conditional Generation for Underwater Salient Object Detection
Link: https://arxiv.org/abs/2607.01825
Authors: Hua Li,Yongjie Weng,Yutong Li,Zhiyuan Li,Runmin Cong,Sam Kwong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Salient Object Detection, images remains challenging, remains challenging due, conventional SOD methods, conventional SOD
Comments:
Abstract:Salient Object Detection in underwater images remains challenging due to low contrast, uneven illumination, and color distortion caused by scattering and absorption effects, which limit the effectiveness of conventional SOD methods in underwater environments. To address these challenges, we propose a Degradation-aware Conditional Generation Network (DCGNet), specifically designed to construct reliable conditional features for underwater saliency generation. First, we design a Dynamic Multi-Granularity module (DMG) grounded in the human visual system to robustly detect salient objects of varying scales with blurred boundaries. Then, we develop an Underwater Physics-Prior module (UPP), which utilizes pseudo-depth guidance to estimate underwater light attenuation and backscatter, thereby restoring degradation-aware RGB features and mitigating color distortion and boundary ambiguity. Based on the physics-guided representation, we introduce an Underwater Spatial Gaussian module (USG), which constructs a spatial Gaussian saliency prior from the strongest guided response to enhance object-centered salient regions and suppress cluttered underwater backgrounds. In addition, a lightweight timestep-adaptive Diffusion Transformer (DiT) bottleneck is inserted into the denoising decoder to refine fused features at different diffusion timesteps. Comprehensive experiments on USOD10K, USOD, CSOD10K, MAS3K, and RMAS demonstrate that DCGNet significantly outperforms existing state-of-the-art methods, verifying its potential for complex underwater visual applications.
98. 【2607.01813】MMBench-Live: A Continuously Evolving Benchmark for Multimodal Models
Link: https://arxiv.org/abs/2607.01813
Authors: Yuanzhi Liu,Shousheng Zhao,Bo Zhou,Kongming Liang,Zhanyu Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: assessing vision-language models, making them vulnerable, temporal staleness, costly maintenance, essential for assessing
Comments:
Abstract:Evaluation benchmarks are essential for assessing vision-language models (VLMs), but most multimodal benchmarks are static, making them vulnerable to temporal staleness, data contamination, and costly maintenance. We present MMBench-Live, a continuously evolving multimodal benchmark built by a multi-agent-driven automated pipeline. Our framework treats benchmark evolution as task-guided dataset construction, integrating structured benchmark specification, feedback-controlled real-time data acquisition, and verifiable QA generation with executable reasoning. To maintain cross-version comparability, we introduce a distribution-consistent update strategy that extracts task-related visual patterns from the original benchmark to guide data collection and filtering. Instantiated from MMBench, MMBench-Live contains 5.9K newly generated evaluation instances with a high answer correctness rate, while each update costs about USD 30 and takes 1-2 hours. Extensive evaluations show that MMBench-Live preserves stable model rankings, maintains semantic alignment with the original benchmark, and exhibits weaker contamination-related memorization signals, suggesting a practical and scalable paradigm for sustainable multimodal benchmark evolution. The project is available at this https URL.
99. 【2607.01803】PixGS: Pixel-Space Diffusion for Direct 3D Gaussian Splat Generation
Link: https://arxiv.org/abs/2607.01803
Authors: Duy Cao,Phong Nguyen-Ha
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Keywords: data remain significant, achieved impressive results, remain significant bottlenecks, impressive results, data remain
Comments: Accepted at ECCV 2026
Abstract:Recent advances in 3D content generation from text or images have achieved impressive results, yet view inconsistency from 2D generators and the scarcity of high-quality 3D data remain significant bottlenecks. Existing solutions typically adapt large-scale pre-trained text-to-image latent diffusion models to generate 3D Gaussian Splats (3DGS). However, these approaches often rely on training complex cascade pipelines that are computationally expensive and scalability-limited. Most critically, the quality of generated 3D assets is inherently constrained by the capacity of each component and by the compressed latent space, leading to decoding artifacts and accumulated errors. To address these limitations, we propose PixGS, a single-stage pipeline for direct high-quality 3DGS generation, which leverages recent advances in pixel-space diffusion to bypass lossy latent compression while still benefiting from the vast 2D generative priors. By directly denoising 3D Gaussian attributes at each timestep, our method enables precise, splat-level regularization of both appearance and geometry. Furthermore, we introduce a comprehensive supervision strategy that incorporates surface normals, depth, and high-frequency structural information, which is often overlooked in prior works. Experiments demonstrate that PixGS outperforms current state-of-the-art methods while maintaining a fast inference speed (1s on a single A100 GPU), offering a robust and efficient alternative to multi-stage generation pipelines.
100. 【2607.01784】SpaceEra++: A Unified Framework Towards 3D Spatial Reasoning in Video
Link: https://arxiv.org/abs/2607.01784
Authors: Weili Guan,Haoyu Zhang,Meng Liu,Qianlong Xiang,Yaowei Wang,Liqiang Nie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visual-spatial understanding, infer object relationships, embodied interaction, ability to infer, layouts from visual
Comments: Accepted by IEEE TPAMI 2026
Abstract:Visual-spatial understanding, defined as the ability to infer object relationships and scene layouts from visual inputs, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, pre-trained vision-language models (VLMs) remain constrained by spatial uncertainty stemming from inherently 2D observations and by the scarcity of data for 3D spatial understanding. To address these limitations, we proposed a novel framework, SpaceEra, in the NeurIPS 2025 Spotlight paper. Although it achieved significant performance gains, we further observed that its effectiveness is hindered by insufficient input from scanning videos and weak reasoning constraints. To tackle these newly emerged challenges, we extend the original framework into a comprehensive system, termed SpaceEra++, which spans data construction, model design, training optimization, and prompting inference. Specifically, to alleviate input insufficiency, we introduce ScenePick, a frame sampling strategy that balances spatial coverage with object semantics to produce compact yet comprehensive scene representations. In addition, to enhance spatial reasoning, we develop SpaceAlign, which enforces pairwise object constraints by jointly exploiting absolute coordinates and relative spatial relations, thereby aligning optimization with spatial accuracy. Extensive experiments across multiple benchmarks demonstrate consistent improvements over strong baselines, while ablation studies validate both the individual and joint contributions of each component, and further analyses provide guidance for future research.
101. 【2607.01772】LLM-Empowered Multimodal Fusion Framework for Autonomous Driving: Semantic Enhancement and Channel-Adaptive Design
Link: https://arxiv.org/abs/2607.01772
Authors: Wen Wang,Yaping Sun,Yejun He,Hao Chen,Zhiyong Chen,Xiaodong Xu,Nan Ma,Shuguang Cui
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Keywords: combining dense visual, Large Language Model-centric, robust autonomous driving, Vision-radar fusion, Language Model-centric Semantic-layer
Comments: 6 pages, 4 figures. Accepted by 2026 IEEE 37th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC)
Abstract:Vision-radar fusion is central to robust autonomous driving, combining dense visual semantics with precise range and velocity measurements from radar. However, real-world fusion quality is fundamentally challenged by dynamically varying input quality, stemming from occlusion, adverse weather, and channel noise. To address this, we re-frame the problem from static data fusion to channel-aware semantic reasoning and propose a Large Language Model-centric Semantic-layer Channel-aware Integrated Perception (LM-SCIP) framework. It places a Large Language Model (LLM) as a central reasoning core to fuse a local visual stream with a quality-varying external radar stream used to cover perception-blind spots. Concretely, LM-SCIP couples a hierarchical radar-vision encoder with a Channel-Adaptive Semantic Module (CASM) that maps link indicators into a "Channel Prompt" to dynamically gate external radar features. A parameter-efficient, LoRA-tuned LLM, in conjunction with a heterogeneous Mixture-of-Experts (H-MoE), then arbitrates between local visual cues and the channel-conditioned radar context. Finally, a decoupled multi-task decoder outputs localization, trajectory forecasting, and image reconstruction. Experiments on nuScenes and VIRAT validate our approach. On nuScenes, under a controlled toggle of radar input, LM-SCIP reduces localization RMSE by 40.0% versus a vision-only baseline. On VIRAT, the model attains a 0.214m localization RMSE and 0.179m minFDE (k=1). These results reveal that the proposed LM-SCIP enables a robust vision-dominant fallback at low SNR and synergistic fusion at high SNR.
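For readers unfamiliar with the two metrics quoted in this abstract, the following is a minimal sketch of how localization RMSE and minFDE are commonly computed. It assumes 2D positions in metres and the usual textbook definitions; the paper's exact evaluation protocol may differ, and the function names and toy arrays are hypothetical.

```python
import numpy as np

def localization_rmse(pred, gt):
    # pred, gt: arrays of shape (N, 2) holding predicted and true positions.
    err = np.linalg.norm(pred - gt, axis=-1)   # Euclidean error per sample
    return float(np.sqrt(np.mean(err ** 2)))

def min_fde(pred_trajs, gt_traj):
    # pred_trajs: (k, T, 2) candidate trajectories; gt_traj: (T, 2).
    # Final displacement error of each candidate, then the best of the k.
    final_err = np.linalg.norm(pred_trajs[:, -1] - gt_traj[-1], axis=-1)
    return float(final_err.min())

# Toy values, for illustration only.
pred = np.array([[0.0, 0.0], [3.0, 4.0]])
gt = np.array([[0.0, 0.0], [0.0, 0.0]])
rmse = localization_rmse(pred, gt)

gt_traj = np.array([[0.0, 0.0], [1.0, 0.0]])
pred_trajs = np.array([
    [[0.0, 0.0], [1.0, 1.0]],   # ends 1 m from the true endpoint
    [[0.0, 0.0], [3.0, 0.0]],   # ends 2 m from the true endpoint
])
fde_k2 = min_fde(pred_trajs, gt_traj)
fde_k1 = min_fde(pred_trajs[:1], gt_traj)

print(rmse, fde_k2, fde_k1)
```

With k=1, as in the reported VIRAT result, minFDE reduces to the final displacement error of the single predicted trajectory.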
102. 【2607.01768】JointHOI: Jointly Generating Contact Maps Enhances Hand Object Interaction Generation
Link: https://arxiv.org/abs/2607.01768
Authors: Mingyeong Song,Jungbin Cho,Jisoo Kim,Ananya Bal,Kartik Sharma,Youngjae Yu,Laszlo A. Jeni,Junhyug Noh
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: interactions remains challenging, physically plausible interactions, plausible interactions remains, producing physically plausible, hand object interaction
Comments: 18 pages
Abstract:Text driven hand object interaction (HOI) generation is gaining attention for immersive applications and robotics, yet producing physically plausible interactions remains challenging. Even when individual motions appear natural, small contact errors can cause conspicuous artifacts such as floating and interpenetration. Prior methods mitigate these issues using explicit contact cues or implicit grasp priors, but typically rely on multi stage pipelines and fail to model temporally evolving contact. We present JointHOI, a single stage diffusion framework that jointly generates 3D hand object motion and dynamic, distance based contact maps from text. By treating contact as an auxiliary inner modality, joint generation enables the model to learn contact motion coupling during training. At inference, contact guided sampling enforces consistency between generated contact maps and motion implied geometry, improving temporal stability and reducing penetration and floating. Experiments on GRAB and ARCTIC demonstrate consistent improvements in text adherence and physical plausibility over prior methods.
103. 【2607.01759】ProCal: Inference-Time Proposal Calibration for Open-Vocabulary Object Detection
Link: https://arxiv.org/abs/2607.01759
Authors: Jae-Ryung Hong,Ho-Joong Kim,Seong-Whan Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: during training, Open-vocabulary object detection, Open-vocabulary object, object detection aims
Comments:
Abstract:Open-vocabulary object detection aims to localize and classify objects beyond the fixed set of categories seen during training. Recent open-vocabulary object detection methods improve localization and classification for unseen categories by leveraging a frozen VLM as a detector backbone. However, the VLM classification score does not account for the position and scale of the object in an image. We observe that pretrained VLMs are able to classify foreground and background regions. Based on this observation, we propose a simple inference-time Proposal Calibration (ProCal) that improves the localization quality of the classification score. ProCal computes a proposal prior by combining two scores: a localization-aware foreground score and a background-aware suppression score. The localization-aware foreground score captures whether a proposal contains an object area. The background-aware suppression score measures the extent to which the proposal resembles background. Our analysis shows that ProCal suppresses false novel activation on background proposals and consistently ranks true novel proposals above background and partial novel proposals. Applied to CLIPSelf ViT-L/14, ProCal improves APr by +2.5 on OV-LVIS. The analyses show that proposal-level localization-aware reranking is effective in mitigating ranking miscalibration for novel objects.
104. 【2607.01757】DL-VINS-Factory: A Modular Framework for Learned Visual Front-Ends in Visual-Inertial SLAM
Link: https://arxiv.org/abs/2607.01757
Authors: Shoon Kit Lim,Melissa Jia Ying Chong,Ting Yang Ling
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: coupled visual-inertial SLAM, Deep-learning features excel, tightly coupled visual-inertial, visual-inertial SLAM, Deep-learning features
Comments:
Abstract:Deep-learning features excel in visual matching, yet their practical value in tightly coupled visual-inertial SLAM (VI-SLAM) remains insufficiently characterized. We present DL-VINS-Factory, a unified framework that integrates learned feature extractors (ALIKED, RaCo, SuperPoint, XFeat) with either Lucas-Kanade (LK) optical-flow tracking or LightGlue (LG) descriptor matching. All front-ends share a sliding-window Ceres back-end, with optional AnyLoc DINOv2-VLAD loop closure and 4-DoF pose-graph optimization. We benchmark the system across four datasets covering indoor, unstructured outdoor, aggressive-motion, and visually degraded conditions. Results show that learned front-ends are viable for real-time embedded VI-SLAM, but are not universally superior to classical tracking. Relative to the corresponding GFTT+LK baseline, ALIKED+LG reduces EuRoC ATE by 5% in monocular odometry and by 7% in stereo with loop closure. On NTU-VIRAL, where aggressive aerial motion increases inter-frame viewpoint change, ALIKED+LG stereo reduces loop-closed ATE by 12%. On the Botanic Garden dataset, optical-flow tracking remains preferable, but learned keypoints still improve over the GFTT baseline: SuperPoint+LK reduces grayscale camera ATE by 29%, while RaCo+LK reduces RGB camera ATE by 38%. On SubT-MRS, learned front-ends show varying degrees of improvement depending on the individual case. With TensorRT acceleration on a Jetson AGX Orin, all valid configurations run in real time at 29-47 FPS in monocular mode and 18-33 FPS in stereo mode for the EuRoC and NTU-VIRAL datasets. AnyLoc further confirms roughly 2-7× more valid loops than BRIEF+DBoW2. The implementation is open-sourced at this https URL.
105. 【2607.01756】ProSAC-CT: Progressive Spectral-Anatomical Co-Guided Multi-Stage Diffusion Model for Low-Dose CT Denoising
Link: https://arxiv.org/abs/2607.01756
Authors: Xuepeng Liu,Zetong Liu,Renyiming Li,Yan Li,Ruiyu Li,Ruili Li,Jiayi Ding,Eichi Takaya
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: stronger quantum noise, weaken low-contrast structures, reduces radiation exposure, introduces stronger quantum, Low-dose computed tomography
Comments: 14 pages, 8 figures, 3 tables
Abstract:Low-dose computed tomography (LDCT) reduces radiation exposure but introduces stronger quantum noise, streak artifacts, and local texture degradation, which can obscure anatomical boundaries and weaken low-contrast structures. Diffusion models are promising for LDCT denoising by progressively recovering normal-dose CT (NDCT) images from degraded LDCT inputs, but existing methods often suffer from insufficient anatomical guidance, uncertain frequency-dependent recovery, and uniform reverse-process modeling. We propose ProSAC-CT, a progressive spectral-anatomical co-guided multi-stage diffusion model for image-domain LDCT denoising. ProSAC-CT integrates an anatomical-prior-guided conditioning (APGC) module, a residual frequency-domain decoupling stage (RFDDS), and a time-step-decoupling denoising decoder (TD3). APGC extracts LDCT-derived structural guidance, RFDDS enhances frequency-aware representations, and TD3 assigns them to different reverse-diffusion stages for anatomical stabilization, boundary refinement, and fine-detail recovery. Experiments on four LDCT degradation benchmarks show that ProSAC-CT improves image fidelity, structural similarity, perceptual quality, and information preservation over representative methods while better preserving boundary-sensitive anatomical details. Downstream anatomical-region classification on Mayo-2020 further indicates that ProSAC-CT retains task-relevant anatomical information, supporting its practical use for low-dose CT denoising.
106. 【2607.01753】The Turning Point of 3D Plant Phenotyping: 3D Foundation Models Enable Minute-to-Second Cross-Crop Reconstruction and Beyond
Link: https://arxiv.org/abs/2607.01753
Authors: Hanyue Jia,Wei Zhou,Wenbo Zhou,Yanan Li,Hao Lu,Tingting Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Keywords: low throughput due, extensive multi-view imaging, low throughput, throughput due, additional cost
Comments: 39 pages, 6 figures, 3 tables
Abstract:3D plant phenotyping is notoriously complicated and low-throughput, owing to extensive multi-view imaging, a fragile 3D reconstruction pipeline, and the additional cost of going from reconstructed geometry to phenotypic extraction. These limitations are further amplified in low-cost data acquisition, where smartphone videos or sparsely sampled multi-view images provide limited view overlap and self-occlusion. In this work, we show that the conventional 3D plant phenotyping pipeline can be streamlined and significantly accelerated with 3D Foundation Models (3DFMs), and, in particular, present one of the first cross-crop 3D phenotyping frameworks powered by 3DFMs. The framework replaces COLMAP-style sparse initialization with 3DFM-based feed-forward geometric recovery, combines geometry-constrained 3D Gaussian Splatting for dense reconstruction, enables few-view reconstruction through iterative view synthesis and refinement, and converts reconstructed geometry into measurable organs through 2D-to-3D semantic transfer, metric scale recovery, and organ instance separation. We further construct a cross-crop dataset with smartphone-based image acquisition, diverse plant morphologies, and manual annotations for segmentation and phenotypic evaluation. Experiments across 26 plant sequences show that 3D Foundation Models reduce the average reconstruction time from 6.52 minutes to 1.58 seconds while maintaining high reconstruction quality and phenotyping accuracy. These results suggest a fresh technical route for high-throughput 3D plant phenotyping, from low-cost image acquisition to fast reconstruction, perception, scale recovery, and phenotypic measurement.
107. 【2607.01751】MedStreamBench: A Time-Aware Benchmark for Streaming and Proactive Medical Video Understanding
Link: https://arxiv.org/abs/2607.01751
Authors: Yuan Wang,Shujian Gao,Songtao Jiang,Zhengyu Hu,Zuozhu Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Existing medical video, Existing medical, produces the correct, rarely assess, video benchmarks primarily
Comments: 10 Pages, 5 Figures
Abstract:Existing medical video benchmarks primarily evaluate whether a model produces the correct answer, but rarely assess whether it answers at the right time. In real clinical settings, AI systems must decide not only what to predict, but also when to answer, defer judgment, or proactively raise alerts. This creates a critical gap between benchmark evaluation and deployment requirements. We present MedStreamBench, a benchmark for time-aware medical video understanding. MedStreamBench integrates 22 medical datasets and 5,419 QA instances across four temporal settings: retrospective, present, future, and proactive. Unlike conventional benchmarks that assume full-video access, MedStreamBench restricts models to temporally bounded evidence windows and supports both single-turn and streaming evaluation. We further introduce a proactive monitoring setting that requires models to determine whether and when clinically relevant alerts should be triggered. Beyond answer correctness, MedStreamBench evaluates temporal behavior through responsiveness and post-evidence stability. Experiments on leading general-purpose and medical vision-language models reveal a substantial gap between offline recognition and temporally grounded decision-making, with performance dropping markedly in streaming and proactive settings. Our benchmark is available at this https URL.
108. 【2607.01748】RTE-FM-Dehazer: Radiative Transfer Equation Inspired Flow Matching for Real-World Image Dehazing
Link: https://arxiv.org/abs/2607.01748
Authors: Chenfeng Wei,Chun Wang,Boyang Zhao,Si Zuo,Shenhong Wang,Chenguang Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Single-image dehazing aims, translation task, faces two limitations, Atmospheric Scattering Model, Single-image dehazing
Comments:
Abstract:Single-image dehazing aims to recover a clear scene from a hazy image and is generally formulated as an image-to-image translation task; however, it faces two limitations. Its performance depends heavily on the haze-formation priors embedded in the model. Prevailing methods adopt the Atmospheric Scattering Model (ASM), whose assumptions of single scattering and homogeneous media are often violated, leading to residual haze and color drift. Moreover, large-scale real hazy/clear pairs are impractical to collect, and existing synthesis approaches fail to reproduce the full complexity of natural haze. To address these issues, we present RTE-FM-Dehazer, a novel dehazing approach, together with a scalable data pipeline. Unlike the ASM, the Radiative Transfer Equation (RTE) jointly accounts for both scattering and absorption, naturally accommodating the non-homogeneous, multiple-scattering media that characterize real hazy scenes. Motivated by the structural similarity between the RTE diffusion-absorption term and the ODE in flow matching, we introduce a diffusion-absorption regularizer derived from a reduced RTE, to steer the flow matching trajectory at each step. Next, leveraging modern vision-language models, we build an automated pipeline and release P-HAZE, a dataset of 50000 realistic hazy/clear pairs. Extensive evaluations demonstrate that RTE-FM-Dehazer, trained solely on P-HAZE, effectively eliminates artifacts like residual haze and color drift, exhibits strong cross-domain generalization, and achieves leading results on five real-world dehazing benchmarks.
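For context, the Atmospheric Scattering Model (ASM) that the abstract contrasts with the RTE is commonly written as I = J·t + A·(1 − t), with transmission t = exp(−β·d). The sketch below is not the paper's method. It only illustrates the ASM baseline under its homogeneous-medium assumption (a single constant β); the function names, parameter names, and default values are illustrative assumptions.

```python
import math

def asm_haze(clear, depth, beta=1.0, airlight=0.9):
    """Apply the Atmospheric Scattering Model to one pixel intensity in [0, 1].

    clear    : haze-free intensity J
    depth    : scene depth d (arbitrary units)
    beta     : scattering coefficient, one constant (homogeneous medium)
    airlight : global atmospheric light A
    """
    t = math.exp(-beta * depth)              # transmission t = exp(-beta * d)
    return clear * t + airlight * (1.0 - t)  # I = J*t + A*(1 - t)

def asm_dehaze(hazy, depth, beta=1.0, airlight=0.9, t_min=0.05):
    """Invert the ASM; t is clamped so distant pixels do not divide by ~0."""
    t = max(math.exp(-beta * depth), t_min)
    return (hazy - airlight) / t + airlight  # J = (I - A)/t + A

hazy = asm_haze(0.2, depth=1.5)
restored = asm_dehaze(hazy, depth=1.5)
print(round(restored, 6))
```

The inversion is exact only when the scene really follows a single β and a single A, which is the assumption the abstract says is often violated in real haze.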
109. 【2607.01743】InterCMDM: Block-Causal Diffusion for Autoregressive Human Interaction Generation
Link: https://arxiv.org/abs/2607.01743
Authors: Qing Yu,Kent Fujiwara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Text-conditioned human interaction, Text-conditioned human, tightly coupled coordination, long-range temporal causality, capture both long-range
Comments: Accepted to ECCV 2026, Project website: [this https URL](https://yu1ut.com/InterCMDM-HP/)
Abstract:Text-conditioned human interaction generation must capture both long-range temporal causality within each individual and tightly coupled coordination between partners. Existing interaction diffusion models typically denoise full sequences using bidirectional attention, which obscures causality and hinders streaming and long-horizon generation. Autoregressive alternatives enforce causality but often suffer from temporal drift, leading to coordination degradation and unstable interaction dynamics over time. We propose InterCMDM, a block-causal latent diffusion framework for autoregressive two-person interaction generation. InterCMDM introduces a Dual-Stream Causal Diffusion Transformer that maintains separate causal streams for each person while modeling inter-person dependencies via unified dual-stream attention with multi-task attention masks. These masks unify interaction modeling within a single attention mechanism and support diverse coordination behaviors, including simultaneous actions, reactive responses, leader-follower dynamics, and independent motion. By training a single model across these mask configurations as a form of data augmentation, InterCMDM enables controllable interaction generation by simply selecting the desired attention mask at inference time. Finally, a block-wise diffusion objective enables stable latent rollout over long sequences without repeated decode-encode cycles. InterCMDM achieves state-of-the-art performance on InterHuman and Inter-X, improving text-motion alignment, realism, and long-horizon continuity.
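The abstract does not spell out the mask layout, so the following is only a sketch of the generic block-causal attention pattern for a single stream: frames attend freely within their own block and to all earlier blocks, never to later ones. The function name and the block size are hypothetical, and the paper's dual-stream, multi-task masks are not reproduced here.

```python
def block_causal_mask(num_frames, block_size):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Frames are grouped into consecutive blocks. A frame sees every frame in
    its own block (bidirectional inside the block) and in all earlier blocks.
    """
    mask = []
    for i in range(num_frames):
        row = [(j // block_size) <= (i // block_size) for j in range(num_frames)]
        mask.append(row)
    return mask

for row in block_causal_mask(num_frames=4, block_size=2):
    print("".join("1" if allowed else "0" for allowed in row))
# prints:
# 1100
# 1100
# 1111
# 1111
```

With a block size of 1 this reduces to an ordinary causal (lower-triangular) mask.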
110. 【2607.01737】ReQuest: Rethinking-based Question-Aware Frame Selection for Long-Form Video QA
Link: https://arxiv.org/abs/2607.01737
Authors: Minkuk Kim,Suyong Yun,Young Tae Kim,Jinyoung Moon,Jinwoo Choi,Seong Tae Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent multimodal large, input token budgets, multimodal large language, fixed input token, Recent multimodal
Comments: Accepted at ECCV 2026
Abstract:Recent multimodal large language models (MLLMs) have substantially advanced video understanding, yet long-form video QA remains challenging under fixed input token budgets, where uniform sampling can be inefficient for evidence localization. We propose ReQuest, an uncertainty-driven, question-adaptive keyframe selection pipeline that aligns question intent with relevant video content through selective computation. ReQuest integrates (i) a lightweight question-aware selector distilled from MLLM-generated supervision, (ii) Re-thinking Routing that triggers additional inference only when the model is uncertain with a length-adaptive criterion, and (iii) uncertainty-guided adaptive non-maximum suppression that selects temporally diverse frames while adjusting spacing based on question difficulty. As a plug-and-play method, ReQuest improves long-video QA without modifying or fine-tuning the underlying MLLM. Experiments on Video-MME, MLVU, and LongVideoBench demonstrate consistent accuracy gains with competitive computational cost, with particularly strong improvements in medium and long video regimes.
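As a rough illustration of temporal non-maximum suppression for frame selection, the sketch below greedily keeps the highest-scoring frames while enforcing a minimum spacing between them. It is a generic version under stated assumptions, not ReQuest's component: the abstract says the spacing is adapted from uncertainty, whereas here `min_gap` is simply passed in, and all names are hypothetical.

```python
def temporal_nms(scores, num_keep, min_gap):
    """Greedily pick up to num_keep frame indices by descending score,
    skipping any frame closer than min_gap to an already selected frame."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = []
    for idx in order:
        if all(abs(idx - s) >= min_gap for s in selected):
            selected.append(idx)
        if len(selected) == num_keep:
            break
    return sorted(selected)

relevance = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3]   # per-frame relevance to the question
print(temporal_nms(relevance, num_keep=2, min_gap=2))  # → [1, 4]
print(temporal_nms(relevance, num_keep=2, min_gap=1))  # → [1, 2]
```

A larger gap trades peak relevance for temporal coverage, which is the knob the abstract describes adjusting by question difficulty.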
111. 【2607.01728】Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing
Link: https://arxiv.org/abs/2607.01728
Authors: Licheng Zhang,Bach Le,Pengtao Zhao,Naveed Akhtar
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Software Engineering (cs.SE)
Keywords: standard quality assurance, quality assurance step, modern software release, Visual regression testing, software release pipelines
Comments:
Abstract:Visual regression testing (VRT) is a standard quality assurance step in modern software release pipelines. On every change, it re-renders user interface (UI) screenshots, compares each one against an approved baseline image, and routes any detected difference to a human reviewer who decides whether it is an intended update or an unintended regression. A widely used approach, especially in open-source and continuous-integration pipelines, is pixel-level comparison, which is semantically blind and treats rendering noise and genuine defects identically, producing large volumes of false positives that force developers and testers to spend substantial time and effort manually reviewing flagged differences at every release cycle. Industry tools apply machine learning to VRT, but lack public evaluation. More critically, no dataset or benchmark exists to support natural language descriptions of UI changes, a capability that tells testers what changed in words instead of leaving them to interpret a binary flag or a highlighted region. To address the gap, we propose a new task, Web UI Image Change Captioning (WUICC), which sits at the intersection of VRT and image difference captioning (IDC), and release WUICC-bench, the first dataset and benchmark for this task. We evaluate eleven representative IDC methods, together with two zero-shot general-purpose LLMs. We find that: (1) these methods tend to struggle in the Web UI domain due to its layout diversity, dense text, and fine-grained changes, and (2) the trained methods nevertheless already suppress non-meaningful visual noise far more selectively than the pixel-level comparison VRT relies on, providing a solid foundation for future domain-specific research.
112. 【2607.01708】Consistent Scene Understanding in 3D Gaussian Splatting via Multi-Cue Mask Refinement
Link: https://arxiv.org/abs/2607.01708
Authors: Hyunjoon Park,Donghyeon Cho
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Reliable instance-level scene, instance-level scene understanding, Reliable instance-level, instance-level scene, scene understanding
Comments: Accepted at ICPR 2026
Abstract:Reliable instance-level scene understanding is a fundamental prerequisite for object-level interactions and high-fidelity 3D representations. While current methods often leverage 2D foundation segmentation models to obtain these priors, their 2D-centric design typically yields fragmented masks and inconsistent predictions across different views. To address these issues, we propose a novel framework that produces consistent 2D instance masks to guide the optimization of 3D Gaussian Splatting (3DGS) feature fields. Our framework consists of three main stages. (1) Multi-Cue Extraction that generates synergistic semantic, geometric, and structural priors from input images. (2) Multi-Cue-Guided Mask Merging process that consolidates fragmented masks using a composite merge score derived from semantic, depth, and edge cues. (3) Cross-View Mask Matching that establishes globally consistent identity assignments across all viewpoints. By transforming viewpoint-specific segments into coherent 3D primitives, our approach enables stable 3D instance segmentation and effective downstream editing tasks. Experiments demonstrate that our method significantly improves cross-view consistency and segmentation stability over existing baselines while maintaining high-fidelity photometric reconstruction.
113. 【2607.01707】LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression
Link: https://arxiv.org/abs/2607.01707
Authors: Bowen Yuan,Zijian Wang,Yadan Luo,Shijie Wang,Zi Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large vision-language models, Large vision-language, attention progressively drifts, visual, ability but suffer
Comments: The 19th European Conference on Computer Vision (ECCV 2026)
Abstract:Large vision-language models (LVLMs) exhibit strong reasoning ability but suffer from visual forgetting during long-horizon decoding, where attention progressively drifts away from visual evidence. Existing methods largely treat this issue as a late-stage attention decay problem or attempt to mitigate it through heuristic reminders or post-hoc attention lifting. Through systematic empirical analysis, we find that performance degradation under visual forgetting is largely driven by two overlooked factors: early-stage attention decay, which disrupts evidence acquisition, and attention concentration on a subset of task-irrelevant visual sink tokens. Motivated by these insights, we propose LASER, a post-training framework that regulates both the visual attention trajectory and intra-visual token attention distribution during reasoning. Technically, LASER introduces two complementary rewards: a Visual Grounding Reward, which encourages the model to maintain attention on semantically salient visual tokens throughout decoding, and a Sink Suppression Reward, which penalizes excessive attention concentration on visual sink tokens. Together, these rewards preserve early-stage grounding while preventing attention collapse onto uninformative regions. Extensive experiments on eight benchmark datasets demonstrate that LASER consistently outperforms strong baselines, validating attention-aware training as an effective remedy for visual forgetting.
114. 【2607.01698】Structure-Aware Gaussian Splatting for Large-Scale Scene Reconstruction
Link: https://arxiv.org/abs/2607.01698
Authors: Weiyi Xue,Fan Lu,Chi Zhang,Tianhang Wang,Sanqing Qu,Zehan Zheng,Boyuan Zheng,Junqiao Zhao,Guang Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated remarkable potential, Splatting has demonstrated, Gaussian Splatting, view synthesis, demonstrated remarkable
Comments:
Abstract:3D Gaussian Splatting has demonstrated remarkable potential in novel view synthesis. In contrast to small-scale scenes, large-scale scenes inevitably contain sparsely observed regions with excessively sparse initial points. In this case, supervising Gaussians initialized from low-frequency sparse points with high-frequency images often induces uncontrolled densification and redundant primitives, degrading both efficiency and quality. Intuitively, this issue can be mitigated with scheduling strategies, which can be categorized into two paradigms: modulating target signal frequency via densification and modulating sampling frequency via image resolution. However, previous scheduling strategies are primarily hardcoded, failing to perceive the convergence behavior of scene frequency. To address this, we reframe the scene reconstruction problem from the perspective of signal structure recovery and propose SIG, a novel scheduler that synchronizes image supervision with Gaussian frequencies. Specifically, we derive the average sampling frequency and bandwidth of 3D representations, and then regulate the training image resolution and the Gaussian densification process based on scene frequency convergence. Furthermore, we introduce Sphere-Constrained Gaussians, which leverage the spatial prior of initialized point clouds to control Gaussian optimization. Our framework enables frequency-consistent, geometry-aware, and floater-free training, achieving state-of-the-art performance by a substantial margin in both efficiency and rendering quality in large-scale scenes. The code is available at: this https URL
115. 【2607.01677】ICDepth: Taming Video Diffusion Models for Video Depth Estimation via In-Context Conditioning
Link: https://arxiv.org/abs/2607.01677
Authors: Xuanhua He,Jiaxin Xie,Mingzhe Zheng,Qifeng Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Monocular video depth, existing methods struggle, requires temporal consistency, video depth estimation, estimation requires temporal
Comments: Accepted to ECCV 2026. Project page: [this https URL](https://xuanhuahe.github.io/ICDepth/)
Abstract:Monocular video depth estimation requires temporal consistency, geometric accuracy, and generalization across diverse scenarios, yet existing methods struggle to achieve all three simultaneously. Discriminative models excel at per-frame accuracy but suffer from temporal drift due to limited context windows, while generative methods improve consistency and generalization at the cost of extensive training data (10M+ samples) and lack of geometric precision. In response to these issues, we introduce ICDepth, a framework that adapts pre-trained text-to-video diffusion transformers for video depth estimation via In-Context Conditioning (ICC), leveraging their rich spatial-temporal priors. To address key challenges in transferring ICC from generation to dense prediction, we propose: (1) SAND-Attention, which ensures precise spatial-temporal alignment via shared RoPE and enforces unidirectional attention to prevent noise contamination; (2) SRFM, which injects DINOv2 semantic and resolution priors to enhance geometric precision. ICDepth achieves state-of-the-art results on multiple benchmarks with remarkable data efficiency, trained on only 0.8M frames ($6$--$13\times$ less than competing generative methods), while demonstrating strong zero-shot generalization to diverse domains.
116. 【2607.01675】HistoSeg++: Delving deeper with attention and multiscale feature fusion for biomarker segmentation
Link: https://arxiv.org/abs/2607.01675
Authors: Saad Wazir,Rao Faizan,Daeyoung Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: medical image analysis, medical images, medical image, image analysis, biomedical application
Comments: Published in the Proceedings of ICBBE 2025. The Version of Record is available at [this https URL](https://doi.org/10.1145/3794209.3794211)
Abstract:Segmentation of biomarkers in medical images is frequently viewed as a first step towards medical image analysis in any bioinformatics or biomedical application. Despite progress, existing methods still struggle to capture information at multiple scales and to perform upsampling effectively across different datasets. These shortcomings often result in suboptimal generalization capabilities. Recently, architectures belonging to the Nested-UNet family have excelled at capturing multiscale contextual information and upsampling it effectively. In this work, we propose a novel Nested-UNet architecture that effectively captures multi-scale contextual information. It includes inner and outer attention units to enhance focus during upsampling, along with channel-wise feature recalibration using squeeze-and-excitation modules, leading to improved segmentation performance. Additionally, the architecture integrates an edge-aware loss to emphasize boundary accuracy by assigning greater importance to edge regions. Tested extensively on three publicly available benchmark datasets, our method demonstrates a generalization performance superior to existing Nested-UNet methods. Code: this https URL
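To make "channel-wise feature recalibration using squeeze-and-excitation modules" concrete, here is a minimal dependency-free sketch of a standard squeeze-and-excitation step: average each channel, pass the averages through a small two-layer gate, and rescale the channels. It is not taken from the paper's code. The weight shapes, the omission of biases, and the flat-list feature layout are simplifying assumptions.

```python
import math

def squeeze_excite(feature_maps, w1, w2):
    """Recalibrate channels with a squeeze-and-excitation gate.

    feature_maps : list of C channels, each a flat list of floats
    w1           : R x C weights of the reduction layer (biases omitted)
    w2           : C x R weights of the expansion layer (biases omitted)
    """
    # Squeeze: global average pooling per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: linear -> ReLU -> linear -> sigmoid, giving one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(gates, feature_maps)]

features = [[1.0, 3.0], [2.0, 2.0]]          # C = 2 channels, 2 values each
out = squeeze_excite(features, w1=[[0.0, 0.0]], w2=[[0.0], [0.0]])
print(out)  # all-zero weights give sigmoid(0) = 0.5 for every gate
```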
117. 【2607.01667】Temporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning
Link: https://arxiv.org/abs/2607.01667
Authors: Chen Zhao,Jiajun Ma,Qilong Huang,Tiehan Fan,Hongyu Li,Zhuoliang Kang,Xiaoming Wei,Jian Yang,Ying Tai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, achieving precise temporal
Comments: ECCV 2026
Abstract:While Multimodal Large Language Models (MLLMs) have advanced video understanding, achieving precise temporal and cross-modal alignment in audiovisual video captioning remains a formidable challenge. Most existing approaches suffer from modality detachment and temporal incoherence, failing to accurately bind auditory events to visual entities or capture complex causal dynamics. To address these deficiencies, we propose TCA-Captioner, a framework specifically engineered to enhance Temporal and Cross-Modal Alignment for audiovisual video captioning. We first introduce the Observer-Checker-Corrector (OCC) framework, an iterative refinement strategy that generates high-fidelity, meticulously grounded training data. Leveraging a curated high-density human interaction dataset, TCA-Captioner is optimized to model sophisticated audiovisual interactions. Furthermore, we present TCA-Bench, a diagnostic benchmark utilizing a Decoupled Evaluation Protocol to isolate and quantify model proficiency in audiovisual binding and temporal relational reasoning. Extensive experiments demonstrate that TCA-Captioner sets a new standard for temporally-coherent and synchronized audiovisual narratives.
118. 【2607.01663】Unified Panoramic-Gaussian Representation for Monocular 4D Scene Synthesis
Link: https://arxiv.org/abs/2607.01663
Authors: Yuankun Yang,Yi Wei,Wenyang Zhou,Li Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: made significant progress, recent years, made significant, significant progress, progress in recent
Comments: Accepted at ECCV 2026
Abstract:4D scene synthesis from monocular videos has made significant progress in recent years. However, existing methods are typically constrained by view interpolation. As a result, they struggle to infer unseen regions beyond the observed views. In this paper, we reformulate the task as 4D scene synthesis with unseen regions, which extends beyond traditional interpolation settings. Camera-conditioned video generation enables unseen region synthesis by guiding generation along specified cameras. However, these methods lack explicit 3D priors and are optimized with random camera trajectories. This design leads to severe inconsistencies under large trajectory deviations. To address this limitation, we build a unified training and inference framework with panoramic trajectory guidance. While this design improves cross-view consistency, the panoramic representation alone fails to model dynamic content effectively. Object motion in panoramic space introduces scale and shape distortions. To address this, we propose PanoGaussian, a unified Panoramic-Gaussian representation that distills the panoramic representation into an explicit dynamic Gaussian representation to capture dynamic physical priors of the 4D scene. Experiments demonstrate that PanoGaussian achieves consistent 4D scene synthesis even under large viewpoint variations.
119. 【2607.01658】Teaching Vision-Language-Action Models What to See and Where to Look
Link: https://arxiv.org/abs/2607.01658
Authors: Yuguang Yang,Canyu Chen,Zhewen Tan,Yizhi Wang,Zichao Feng,Chunyang Liu,Kehua Sheng,Juan Zhang,Linlin Yang,Baochang Zhang,Yan Wang,Bo Zhang,Xianbin Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: models have emerged, promising paradigm, autonomous driving, Abstract, existing VLAs' training
Comments: The paper has been accepted by ECCV 2026
Abstract:Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing VLAs' training relies heavily on text-centric visual question answering and chain-of-thought reasoning data, which emphasizes linguistic reasoning rather than action-grounded planning. As a result, the learned representations capture semantic knowledge but lack spatial dependencies crucial for reliable trajectory prediction. We propose DriveTeach-VLA, a framework that explicitly teaches VLAs what to see and where to look. Driving-aware Vision Distillation (DVD) injects driving-specific perceptual priors into the vision encoder, while 2D Trajectory-Guided Prompts (2D-TGP) provide spatial conditioning aligned with feasible driving trajectories. Together, they form a vision-guided learning pipeline: what to see (DVD pretraining) - where to look (TGP-guided SFT) - how to act (TGP-guided GRPO). DriveTeach-VLA achieves state-of-the-art performance on NAVSIM and nuScenes. Our code is available at: this https URL.
120. 【2607.01657】Domain Generalization via Text-Anchored Information Bottleneck
Link: https://arxiv.org/abs/2607.01657
Authors: Eunyi Lyou,Yunjeong Choi,Junho Lee,Joonseok Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fail when deployed, Visual recognition models, recognition models, Visual recognition, Visual
Comments: Accepted to ECCV 2026
Abstract:Visual recognition models often fail when deployed in new environments. Domain Generalization (DG) addresses this by learning representations that remain invariant to environment-specific variations. Recent approaches increasingly rely on large vision-language models, assuming that preserving their expressive visual representations improves robustness. However, we show that such visual expressiveness can instead propagate spurious cues that tie representations to the training environments, hindering invariant learning. We therefore discard visual guidance and instead treat the language embedding space as the primary source of domain invariance, naturally acting as an information bottleneck that preserves core semantics while suppressing domain-specific variations. Extensive experiments across diverse backbones show state-of-the-art performance, and further analysis examines what makes guidance effective for robust generalization. These findings shift the focus of DG from improving representations to designing supervision that enforces invariance.
121. 【2607.01654】Plug-and-Play Volumetric Reconstruction for Compressive Sensing Light-Sheet Microscopy
Link: https://arxiv.org/abs/2607.01654
Authors: Jianqing Jia,Yi Gong,Xinyuan Zhang,Jichen Chai,Yichen Ding,Yifei Lou
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Keywords: sensing light-sheet microscopy, compressive sensing light-sheet, encoding multiple axial, multiple axial planes, fast volumetric imaging
Comments:
Abstract:We investigate volumetric reconstruction for compressive sensing light-sheet microscopy (CS-LSM), where fast volumetric imaging is achieved by encoding multiple axial planes into each camera exposure. To recover the underlying volume from highly multiplexed measurements, we propose a plug-and-play (PnP) framework that flexibly incorporates any user-specified denoiser into the reconstruction process. Building on a slice-based formulation, we further introduce an axial-coupled model that exploits correlations between adjacent slices to improve volumetric continuity. For efficient computation, we derive a Woodbury-based update for the data-consistency step in both the slice-based and axial-coupled formulations, and employ a Gauss-Seidel sweep for the denoising step in the axial-coupled model. Under a weakly convex regularization assumption, we establish subsequential convergence of the proposed algorithm. Experiments on synthetic and real zebrafish-heart data demonstrate that the proposed framework successfully recovers cellular structures from compressed measurements, and provide practical insights into the comparative performance of commonly used denoisers within the PnP framework under the CS-LSM setup.
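The abstract gives no formulas, so the following is only a sketch of why a Woodbury-style update makes a data-consistency step cheap. It assumes the simplest possible setting: one measurement y formed as a weighted sum a·x of the unknown slice values, and a quadratic pull toward a denoised estimate z with weight rho. The update then solves (a aᵀ + rho·I) x = a·y + rho·z, and the rank-one case of the Woodbury identity (Sherman–Morrison) gives the solution without building or inverting a matrix. The operator, the names, and the numbers are hypothetical and are not the paper's formulation.

```python
def data_consistency_rank_one(a, y, z, rho):
    """Solve (a a^T + rho I) x = a*y + rho*z without forming the matrix.

    Uses (rho I + a a^T)^{-1} = (1/rho) * (I - a a^T / (rho + a^T a)).

    a   : list of floats, the single measurement row (hypothetical operator)
    y   : float, the measurement
    z   : list of floats, the denoised estimate the solution is pulled toward
    rho : positive penalty weight
    """
    b = [ai * y + rho * zi for ai, zi in zip(a, z)]      # right-hand side
    a_dot_b = sum(ai * bi for ai, bi in zip(a, b))
    a_dot_a = sum(ai * ai for ai in a)
    scale = a_dot_b / (rho + a_dot_a)
    return [(bi - ai * scale) / rho for ai, bi in zip(a, b)]

a, y, z, rho = [1.0, 2.0, 0.5], 3.0, [0.2, 0.1, 0.4], 0.7
x = data_consistency_rank_one(a, y, z, rho)

# Check the solution against the original linear system.
a_dot_x = sum(ai * xi for ai, xi in zip(a, x))
residual = max(abs(ai * a_dot_x + rho * xi - (ai * y + rho * zi))
               for ai, xi, zi in zip(a, x, z))
print(residual < 1e-9)  # → True
```

The cost is linear in the number of unknowns, whereas a dense solve would be cubic; with several measurements the same idea replaces the large inverse by a small one whose size is the number of measurements.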
122. 【2607.01648】Boosting Ultrasound Image Classification via Attribute-Guided Dual-Branch Framework
Link: https://arxiv.org/abs/2607.01648
Authors: Bo Zhao,Yapeng Li,Juhua Liu,Bo Du
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: computer-aided diagnosis, essential for computer-aided, Abstract, ultrasound classification, limits clinical adoption
Comments: accepted by MICCAI 2026
Abstract:Ultrasound image classification is essential for computer-aided diagnosis. However, current methods often neglect clinical priors, leading to poor generalization in challenging scenarios and a lack of interpretability that limits clinical adoption. To address these issues, we aim to develop a medical-prior module that can be seamlessly integrated into existing pipelines to enhance both diagnostic performance and interpretability. In this paper, we propose an attribute-guided dual-branch framework for ultrasound classification that introduces domain-agnostic medical attribute priors, improving generalization while offering interpretable evidence. Specifically, a baseline branch follows conventional architectures and predicts image categories via a fully connected classifier. An attribute-guided branch injects domain-agnostic attributes as priors and produces human-interpretable decision cues. Finally, an adaptive decision module fuses the two branches in a data-dependent manner to yield the final prediction. Experiments across diverse ultrasound classification tasks demonstrate that our approach can be integrated into multiple backbones and state-of-the-art methods with low overhead, consistently improving accuracy and interpretability. Code is available at: this https URL.
123. 【2607.01642】Multi-Resolution Flow Matching: Training-Free Diffusion Acceleration via Staged Sampling
Link: https://arxiv.org/abs/2607.01642
Authors: Xingyu Zheng,Xianglong Liu,Yifu Ding,Weilun Feng,Junqing Lin,Jinyang Guo,Haotong Qin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reduce inference time, feature caching, system-level optimization, Hardware-agnostic strategies, reduce inference
Comments: The code is available at [this https URL](https://github.com/Xingyu-Zheng/MrFlow)
Abstract:Hardware-agnostic strategies for accelerating text-to-image diffusion, such as timestep distillation and feature caching, can reduce inference time without custom kernels or system-level optimization. Among them, multi-resolution generation strategies have recently received broad attention, attaining more than 5x speedup without any training. However, the design of performing upsampling in the latent space, together with the selective modification of partial regions, causes these methods to exhibit noticeable blurring or artifacts. To this end, we propose MrFlow, a training-free multi-resolution acceleration strategy for pretrained flow-matching models built upon a staged low-to-high-resolution pipeline. MrFlow first rapidly generates the main structure at low resolution, then performs super-resolution in the pixel space using a lightweight pretrained GAN-based model, subsequently injects low-strength noise to enable high-frequency resampling, and finally refines the details at high resolution. Quantitative and qualitative results on FLUX.1-dev and Qwen-Image show that MrFlow exploits the quadratic token reduction and reduced step requirement of low-resolution sampling to achieve 10x end-to-end acceleration while keeping OneIG within a 1% gap relative to that before acceleration, significantly surpassing other training-free acceleration strategies, and requiring no training or runtime dynamic identification whatsoever. MrFlow can further be directly combined orthogonally with pre-trained timestep distillation strategies, achieving even higher generation acceleration of up to 25x.
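The "quadratic token reduction" mentioned above can be checked with simple arithmetic: halving the side length of an image quarters its token count, and because self-attention compares every token with every other token, the number of attention pairs drops sixteenfold. The sketch below assumes each token covers a 16×16 pixel patch and uses 1024 and 512 pixel resolutions; these are illustrative values, not settings reported for MrFlow.

```python
def token_count(height, width, pixels_per_token=16):
    """Number of transformer tokens for an image, assuming each token covers
    a pixels_per_token x pixels_per_token patch (an illustrative value)."""
    return (height // pixels_per_token) * (width // pixels_per_token)

def attention_pairs(tokens):
    """Self-attention compares every token with every token."""
    return tokens * tokens

high = token_count(1024, 1024)
low = token_count(512, 512)
print(high // low)                                    # → 4
print(attention_pairs(high) // attention_pairs(low))  # → 16
```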
124. 【2607.01633】Bridging 3D Gaussians and Semantic Occupancy for Comprehensive Open-Vocabulary Scene Understanding from Unposed Images
Link: https://arxiv.org/abs/2607.01633
Authors: Hu Zhu, Bohan Li, Xianda Guo, Yanlun Peng, Zheng Zhu, Xin Jin, Wenjun Zeng, Chang Wen Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: unposed images requires, external camera calibration, scene understanding, understanding from sparse, unposed images
Comments: Hu Zhu, Bohan Li, and Xianda Guo contributed equally. Corresponding author: Wenjun Zeng
Abstract:Comprehensive 3D scene understanding from sparse, unposed images requires a model to recover renderable geometry, open-vocabulary semantics, and free/occupied 3D space without relying on external camera calibration. Recent feed-forward Gaussian methods improve pose-free reconstruction and semantic rendering, but their Gaussian primitives are mainly optimized through image-space objectives and remain weakly constrained in unobserved regions. We propose COVScene, a pose-free semantic Gaussian framework that couples renderable Gaussian primitives with a dense semantic occupancy field through differentiable volumetric lifting. Instead of converting Gaussians to voxels only at evaluation time, COVScene lifts the predicted semantic Gaussians inside the training computation graph, so volumetric regularization provides gradients to Gaussian opacity, geometry, and semantic features. The framework combines a semantic-aware Geometry Transformer, multi-task Gaussian decoding, geometric foundation distillation, and occupancy entropy regularization to support novel view synthesis, open-vocabulary semantic querying, and semantic occupancy prediction within a single representation. Experiments on ScanNet and ScanNet++ show that COVScene maintains competitive rendering quality, improves open-vocabulary segmentation, and achieves stronger semantic occupancy prediction than the self-supervised baseline without direct voxel-level supervision.
125. 【2607.01630】DRDN: Decoupled Representation Dynamic Network for From-Scratch ViT Class-Incremental Learning
Link: https://arxiv.org/abs/2607.01630
Authors: Bingchen Huang, Yifu Chen, Zhiling Wang, Yuanchao Du
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: sufficiently preserve task-agnostic, preserve task-agnostic shared, protect task-specific knowledge, Representation Dynamic Network, growing dedicated tokens
Comments: 10 pages, IEEEtran journal format. Preprint submitted to IEEE Transactions on Multimedia
Abstract:Dynamic expansion methods for class-incremental learning (CIL) protect task-specific knowledge by growing dedicated tokens or subnetworks, yet our analyses suggest that classification supervision alone does not sufficiently preserve task-agnostic shared backbone representations over long incremental sequences. We identify two intertwined challenges: cross-task confusion from sequential training on predominantly current-task data, which biases decision boundaries toward recent tasks; and under-optimized shared representations in the backbone that cap long-term discriminability as tasks accumulate. We propose the Decoupled Representation Dynamic Network (DRDN), which addresses these challenges via two orthogonal mechanisms. For shared backbone representations, DRDN continuously applies masked image modeling (MIM) at every incremental step, with reconstruction gradients routed exclusively through the backbone, encouraging it to retain general visual structure beyond class-discriminative cues. For task-specific discrimination, DRDN employs hierarchical task token expansion across all transformer layers, with a modified per-task attention rule that reduces inter-task interference. We support this design with accuracy degradation analysis and cross-task confusion rate measurements. In the from-scratch ViT CIL setting (no external pretraining), DRDN consistently improves over strong token-expansion baselines with comparable backbone scale. On CIFAR100-B0 (10 steps), DRDN achieves 77.19% average accuracy, outperforming DKT by 1.36 points and DyTox by 3.53 points, with an advantage that grows at longer incremental sequences. Multi-seed validation confirms stability (+/-0.31%). The MIM decoder is active only during training, adding no inference-time parameters or computation.
126. 【2607.01628】Online Segment 3D Gaussians via Launching Virtual Drones
Link: https://arxiv.org/abs/2607.01628
Authors: Liwei Liao, Rongjie Wang, Ronggang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: real-time rendering capability, Gaussian Splatting, real-time rendering, offers a compelling, compelling opportunity
Comments:
Abstract:Interactive segmentation of 3D Gaussians offers a compelling opportunity for real-time manipulation of 3D scenes, thanks to the real-time rendering capability of 3D Gaussian Splatting (3DGS). However, existing methods require a time-consuming per-scene setup - typically tens of seconds or even minutes - before interactive segmentation can begin on a raw 3DGS scene. This setup involves multi-view mask preparation, mask lifting, and feature distillation, creating a major bottleneck for online applications. To address this limitation, we aim to completely eliminate the setup stage for interactive 3DGS segmentation while keeping the segmentation time practical (under 1 second). In this work, we present SAGO (Segment Any Gaussians Online), a novel setup-free framework for interactive 3DGS segmentation. By introducing virtual drones, our method reframes the 3D segmentation problem as an online Next-Best-View (NBV) planning task formulated within a Markov process. Extensive experiments demonstrate that SAGO can extract clean 3D assets directly from 3D Gaussians with sub-second latency, thereby enabling a broad range of downstream applications such as object manipulation and scene editing. Moreover, our method achieves over a 50x speedup compared to the previous setup-free 3DGS segmentation frameworks.
127. 【2607.01626】Multi-THuMBS: Multi-person Tracking of 3D Human Meshes Beyond Video Shots
Link: https://arxiv.org/abs/2607.01626
Authors: Jeongwan On, Muhammad Salman Ali, Muneeb A. Khan, Sunwoo Park, Inwoong Moon, Hyung Jin Chang, Jaekwang Kim, Seong Jong Ha, Seungryul Baek
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: highly challenging problem, challenging problem due, severe truncation inherent, complex interactions, unconstrained environments
Comments: Project page: [this https URL](https://on-jungwoan.github.io/projects/multi-thumbs/)
Abstract:Tracking multi-person 3D human meshes from in-the-wild videos is a highly challenging problem due to complex interactions, frequent occlusions, and severe truncation inherent in unconstrained environments. While recent approaches have improved robustness against these issues, they largely overlook the critical challenge prevalent in real-world footage: frequent shot changes. These abrupt transitions in camera viewpoints often cause existing methods to lose track of human identities and fail in reconstructing temporally coherent trajectories. Although several recent works have explored 3D human mesh tracking under shot changes, they are still limited to single-person scenarios, making them inadequate for real-world videos where multiple people interact and appear simultaneously. To address this limitation, we propose Multi-THuMBS (Multi-person Tracking of 3D Human Meshes Beyond Video Shots) that leverages a state-of-the-art 3D scene prior to reconstruct the two boundary frames in a single shared 3D space. Human meshes are then registered within the shared 3D space, maintaining per-person identity and motion consistency across shot changes. Extensive experiments demonstrate that our approach yields significant improvements in 3D human mesh recovery, camera pose estimation, and identity tracking, thereby ensuring high-fidelity motion reconstruction with consistent identity preservation across shots compared to previous state-of-the-art methods.
128. 【2607.01586】VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment
Link: https://arxiv.org/abs/2607.01586
Authors: Guoyang Xia, Fengfa Li, Hongjin Ji, Lei Ren, Fangxiang Feng, Kun Zhan, Yan Xie
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: advanced robotic manipulation, recently advanced robotic, paradigms remain difficult, existing models, robotic manipulation
Comments:
Abstract:Vision-language-action models (VLAs) have recently advanced robotic manipulation, yet the effects of different robot-data pre-training paradigms remain difficult to compare because existing models often differ in architecture, data, action space, and evaluation protocol. We present VLAFlow (Vision-Language-Action Flow), a unified flow-matching framework for controlled comparison of VLA training objectives. Using a heterogeneous robot corpus, OXEMix, containing approximately 5,000 hours of data from DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN, we evaluate four paradigms under the same pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space: action-only modeling (MindPI), language-supervised co-training (MindLPI), future latent alignment (MindWPI), and their combination (MindLWPI). Experiments on LIBERO, LIBERO-Plus, and SimplerEnv show that action-only pre-training is sensitive to heterogeneous data. In contrast, language supervision helps preserve vision-language generalization, while future latent alignment improves state-transition and action-outcome modeling. By combining both signals, MindLWPI achieves the most stable overall transfer performance across benchmarks. These results suggest a meta-action space view: language and future latent representations provide complementary intermediate constraints that make heterogeneous action supervision smoother and more transferable.
129. 【2607.01578】MVFusion-GS: Motion-Variance Guided Temporal Attention for High-Quality Dynamic Gaussian Splatting
Link: https://arxiv.org/abs/2607.01578
Authors: Jianwei Hu, Tingxuan Huang, Hengyu Zhou, Ningna Wang, Xiaohu Guo, Jinshan Lai, Bin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, enables real-time, real-time novel view, view synthesis, Gaussian
Comments:
Abstract:3D Gaussian Splatting (3DGS) enables real-time novel view synthesis for static scenes. Extending it to dynamic scenes via deformation fields has recently attracted significant attention, particularly for dynamic scene reconstruction and distractor-free reconstruction. However, existing deformation networks lack explicit motion awareness: they neither capture long-term motion intensity nor exploit short-term temporal coherence, leading to inaccurate foreground deformation and pseudo-static residuals in the background. We present MVFusion-GS, a method that enhances deformation networks with two complementary motion-aware mechanisms. The Motion-Variance Guided Refinement aggregates per-Gaussian deformation statistics across time to estimate motion variance and uses it to guide dynamic-static separation during deformation prediction. The MotionFormer Temporal Attention module applies Transformer self-attention over neighboring timesteps to model local motion dependencies and improve temporal consistency. Extensive experiments on both dynamic scene reconstruction and distractor-free reconstruction benchmarks demonstrate state-of-the-art performance, showing that explicit motion awareness improves both foreground motion modeling and static background reconstruction.
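The idea of aggregating per-Gaussian deformation statistics over time can be sketched in a few lines. This is a minimal illustration rather than the MVFusion-GS implementation: the function names, the use of summed per-axis variance, and the hard threshold are all assumptions made here for clarity.

```python
from statistics import pvariance


def motion_variance(offsets):
    """Sum of per-axis population variances of one Gaussian's offsets over time.

    `offsets` is a list of (dx, dy, dz) tuples, one per timestep (hypothetical layout).
    """
    return sum(pvariance(axis) for axis in zip(*offsets))


def split_dynamic_static(all_offsets, threshold=1e-3):
    """Label each Gaussian as dynamic or static by thresholding its motion variance.

    The threshold value is an arbitrary placeholder, not a value from the paper.
    """
    dynamic, static = [], []
    for index, offsets in enumerate(all_offsets):
        if motion_variance(offsets) > threshold:
            dynamic.append(index)
        else:
            static.append(index)
    return dynamic, static


# Gaussian 0 never moves; Gaussian 1 drifts along x over three timesteps.
gaussians = [
    [(0, 0, 0), (0, 0, 0), (0, 0, 0)],
    [(0, 0, 0), (1, 0, 0), (2, 0, 0)],
]
dynamic_ids, static_ids = split_dynamic_static(gaussians)
print(dynamic_ids, static_ids)  # [1] [0]
```

A learned method would more plausibly use the variance as a soft guidance signal instead of a hard split; the threshold here only makes the separation easy to see.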
130. 【2607.01556】Mind the Gap: Standard 3DGS Evaluation Primarily Measures Near-Trajectory Interpolation
Link: https://arxiv.org/abs/2607.01556
Authors: Gaoxiang Jia, Vikram Appia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, N-th frame, metric measures near-trajectory, measures near-trajectory interpolation, evaluation holds
Comments:
Abstract:Standard MipNeRF360-style 3D Gaussian Splatting (3DGS) evaluation holds out every N-th frame -- but these frames have trained neighbors on both sides, so the metric measures near-trajectory interpolation rather than spatial generalization. We introduce a fair matched-count protocol that isolates this effect: both arms train on the same number of images and differ only in whether the holdout is spread evenly (interpolation) or forms a contiguous spatial sector (extrapolation). Our primary finding is a large, consistent interpolation-extrapolation gap of 3-12 dB -- several times the differences typically reported between competing methods. The gap is robust to training noise, is in two cases large enough to flip a method ranking under multi-seed confirmation, and -- crucially -- persists across three representation families, including a non-Gaussian volumetric neural radiance field (NeRF), so it reflects spatial coverage rather than any one representation. Diagnostically, it is dominated by a diffuse/geometry-proxy component and tracks each view's angular distance to its nearest training view, a zero-cost signal that also guides capture planning; loss-side regularization yields only marginal gains. Standard holdouts remain useful for near-trajectory rendering but should not, alone, be read as evidence of spatial generalization. Prior work notes protocol sensitivity; ours is, to our knowledge, the first to combine matched-count paired holdout, cross-representation quantification, and a diagnostic analysis (Table 1). We describe a spatial-holdout benchmark toolkit with standardized splits and baselines for 16 scenes, which we are preparing for public release.
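The difference between the two holdout arms is easy to see on frame indices. The sketch below is an illustration under stated assumptions, not the authors' toolkit: it assumes frames are ordered along the capture trajectory so that a run of consecutive indices stands in for a contiguous spatial sector, and the function names and the default of every 8th frame are choices made here.

```python
def interleaved_holdout(n_views: int, every: int) -> list[int]:
    """Interpolation arm: hold out every `every`-th frame along the trajectory."""
    return list(range(0, n_views, every))


def contiguous_holdout(n_views: int, count: int, start: int = 0) -> list[int]:
    """Extrapolation arm: hold out `count` consecutive frames, wrapping around."""
    return sorted((start + i) % n_views for i in range(count))


def matched_count_splits(n_views: int, every: int = 8, start: int = 0):
    """Build both arms so that train and test sizes are identical across them."""
    interp_test = interleaved_holdout(n_views, every)
    extrap_test = contiguous_holdout(n_views, len(interp_test), start)
    splits = {}
    for name, test in (("interpolation", interp_test), ("extrapolation", extrap_test)):
        held_out = set(test)
        train = [i for i in range(n_views) if i not in held_out]
        splits[name] = {"train": train, "test": test}
    return splits


splits = matched_count_splits(n_views=24, every=8)
print(splits["interpolation"]["test"])  # [0, 8, 16]
print(splits["extrapolation"]["test"])  # [0, 1, 2]
```

In the interpolation arm every test frame has training neighbors on both sides, while in the extrapolation arm only the frames at the edge of the held-out run do, even though both arms train on the same number of images.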
131. 【2607.01555】Boosting Infrared Small Target Detection via Logit-Domain Contrast and Adaptive Shape Refinement
Link: https://arxiv.org/abs/2607.01555
Authors: Handong Zeng, Zhengeng Yang, Shuai Zhang, Shikai Chen, Hongshan Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: severe foreground-background imbalance, remains challenging due, tiny target size, remains challenging, severe foreground-background
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Infrared small target detection (IRSTD) remains challenging due to tiny target size, low signal-to-noise ratio, severe foreground-background imbalance, and blurred boundaries in complex scenes. Existing methods usually rely on post-activation probability-domain supervision for discrimination, where weak targets and strong clutter may produce saturated and close probabilities, limiting weak-target discrimination. Meanwhile, blurred boundaries and halo-like predictions mainly stem from thermal diffusion, tiny target scale, boundary uncertainty, and insufficient explicit contour constraints. To address these issues, we propose Adaptive-Contrastive SLSIoU (AC-SLSIoU), a plug-and-play discriminative and shape-aware loss for IRSTD. Specifically, a Logit-Domain Margin Constraint (LDMC) is introduced to enlarge the response gap between targets and informative hard negatives in the logit space, thereby enhancing weak-target discrimination. Adaptive Boundary Suppression (ABS) applies scale-aware annular penalties to refine target contours and suppress halo-like overflow responses. In addition, False-Alarm Focal Loss assigns larger weights to high-probability negative samples, further penalizing persistent high-confidence false alarms. Without introducing extra inference overhead, the proposed method can be seamlessly integrated into existing detectors and consistently improves both detection accuracy and shape quality. Extensive experiments and cross-backbone evaluations demonstrate the effectiveness, robustness, and generalization ability of the proposed method for infrared small target detection.
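The claim that weak targets and strong clutter can end up with "saturated and close probabilities" follows from the shape of the sigmoid. The sketch below shows the effect with a generic hinge-style margin on logits; it is not the paper's LDMC formulation, and the function names, example logits, and margin value are assumptions chosen for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit_margin_penalty(target_logit: float, clutter_logit: float, margin: float = 4.0) -> float:
    """Generic hinge penalty: zero once the target beats the clutter by `margin` logits."""
    return max(0.0, margin - (target_logit - clutter_logit))


target_logit, clutter_logit = 6.0, 4.0

# After the sigmoid both responses sit near 1.0, so their gap is tiny.
prob_gap = sigmoid(target_logit) - sigmoid(clutter_logit)

# In logit space the same pair is still clearly separated.
logit_gap = target_logit - clutter_logit

print(round(prob_gap, 4), logit_gap)  # 0.0155 2.0
print(logit_margin_penalty(target_logit, clutter_logit))  # 2.0
```

A loss defined on probabilities sees a gap of about 0.016 here and has little left to push on, whereas a margin defined on logits still produces a usable penalty.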
132. 【2607.01535】Hidden-Shot: Towards One-Shot Task Generalization for Low-Level Vision Generalist Models
Link: https://arxiv.org/abs/2607.01535
Authors: Shao-Jun Xia, Xianzheng Ma, Zichong Meng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: intense engagement surrounding, tasks remains unverified, engagement surrounding low-level, learned tasks remains, remains unverified
Comments: 34 pages, 5 figures, under submission
Abstract:Despite the intense engagement surrounding low-level vision generalist models, their effectiveness in zero/few-shot scenarios beyond learned tasks remains unverified. The primary challenge of developing an ideal generalist lies in achieving the ability to generalize to new, unseen tasks, which can also be assessed with matched quantitative criteria. Existing methods have made some progress in prompt engineering but have not systematically explored this gap across a wide range of low-level visual tasks. Motivated by this problem, we propose Hidden-Shot, an implicit prompt mechanism aimed at exploring low-level task adaptation in a vision generalist model. Specifically, the method extracts implicit visual task-based information, utilizes a global task-aware textural prompt, and selectively merges implicit information with in-task processing information to enhance one-shot capabilities in new tasks. The overall design performs direct injection in a cost-effective manner, while minimally altering the architecture of the original generalist model. Additionally, we introduce a data-driven evaluation framework termed C/U assessment to cover two basic scenarios, 3C4U (3 conventional and 4 unconventional tasks) for retraining existing models and 3C7U (3 conventional and 7 unconventional tasks) for training from scratch, as a comprehensive assessment to systematically test the generalization ability of low-level generalist models. In experiments on seven and ten datasets, evaluated under the 3C4U and 3C7U frameworks respectively, Hidden-Shot outperforms the state-of-the-art vision generalist model. The Hidden-Shot approach demonstrates superior performance on one-shot new tasks while maintaining consistent performance on existing tasks.
133. 【2607.01503】Disentangling Pictorial Cue Understanding from Language Bias in VLMs via Depth Ordering Task
Link: https://arxiv.org/abs/2607.01503
Authors: Yiqian Liu, Iuliia Kotseruba, John K. Tsotsos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: study depth perception, pictorial depth cues, perception of vision-language, disentangle vision, depth cues
Comments: 15 pages, 7 figures, accepted to ECCV 2026 (30 pages, 13 figures, supplementary materials included)
Abstract:In this paper, we study depth perception of vision-language models (VLMs) to isolate the effects of pictorial depth cues and disentangle vision and language influences on model performance. To this end, we combine depth-ordering and odd-one-out psychophysical tasks: the VLMs are presented with images where one object is at a different depth relative to the other, otherwise identical, objects, and must determine whether the odd-one-out target is closer to or farther from the observer. To create stimuli, we generate 2D views from simulated and real 3D scenes while controlling the presence of individual pictorial depth cues, enabling a fine-grained analysis of cue-level contributions. Language effects are examined by varying referring expression clarity. We also introduce a novel metric to quantify vision-vs-language sensitivities. Applying this methodology, we create the Odd-One-Out Depth (O3-D) dataset with 37K real and synthetic images and 147K image-question pairs. Evaluation of 12 open-source and commercial models on O3-D shows under-utilization of depth cues and depth-ordering accuracies between 47% and 56%, with no model above chance level. At the same time, our metric reveals strong linguistic bias in the answers. Neither chain-of-thought (CoT) nor in-context learning (ICL) significantly improves performance, suggesting that static image data alone may be insufficient for depth understanding. All code, the image generation pipeline, and the O3-D dataset are publicly released at this https URL.
134. 【2607.01499】Anti-Prompt: Image Protection against Text-Guided Image-to-Video Generation
Link: https://arxiv.org/abs/2607.01499
Authors: Yeonghwan Song, Chanhui Lee, Jinsoo Park, Jeany Son
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, raising serious copyright, privacy risks, convincing video, copyright and privacy
Comments: Accepted to ECCV 2026
Abstract:Recent advances in Image-to-Video (I2V) generation allow a single image to be animated into a convincing video under text guidance, raising serious copyright and privacy risks. We propose Anti-Prompt, an image protection approach that injects imperceptible perturbations into an image, inducing visible inconsistencies and structural failures in text-guided I2V generation. Our method is motivated by a simple empirical observation. When text guidance is removed from modern I2V models, generation quality degrades markedly, not only in motion realism but also in subject preservation, structural coherence, and temporal consistency. Building on this insight, Anti-Prompt exploits the model's reliance on textual guidance by attenuating text-conditioned interactions during denoising while strengthening visual-only pathways. To systematically evaluate protection effectiveness, we introduce a Video-LLM-assisted evaluation protocol that provides interpretable, frame-grounded analyses of generation artifacts and inconsistencies. Experiments on two representative I2V architectures demonstrate that our method achieves strong protection performance while improving efficiency and cross-model transferability.
135. 【2607.01469】A Cost-Aware, Paired Protocol for Auditing Dynamic Tool Synthesis in Agentic Video Question Answering
Link: https://arxiv.org/abs/2607.01469
Authors: Aseel Mohamed, Rama AlHamidi, Mohamed Rayan Barhdadi, Rasul Khanbayov, Erchin Serpedin, Hasan Kurban
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Video Question Answering, Question Answering, systems invoke tools, Agentic Video Question, libraries are fixed
Comments:
Abstract:Agentic Video Question Answering (VideoQA) systems invoke tools during inference, but their tool libraries are fixed, so recurring procedures are rebuilt from primitives on every question. Synthesizing composite tools could remove this overhead, but whether such expansion helps is hard to assess: final-answer accuracy, the standard metric, ignores inference effort, so it cannot reveal how a system shifts cost. We propose a cost-aware, paired protocol for auditing tool-augmented video agents. The protocol pairs two complete systems on the same input for each question and reports their net difference across accuracy and cost jointly. For each question, it sorts the paired outcome into one of six groups defined by joint correctness and by the change in visible tool calls, separating accuracy-preserving efficiency gains from harmful regressions. Significance is reported with McNemar's test and paired bootstrap confidence intervals. We instantiate the protocol on Dynamic-SAGE, an agentic VideoQA framework that synthesizes, validates, and persistently registers executable composite tools for reuse on unseen questions, and evaluate it against the SAGE baseline on SAGE-Bench. The audit reveals a multi-axis profile that a scalar accuracy comparison would miss: Dynamic-SAGE improves accuracy by 7.5 points (p < 0.001) and reduces reasoning turns and visible tool calls by roughly 28%, while shifting rather than reducing inference cost, as token usage rises 34% and cost 26%. Gains are largest on visual and open-ended questions and neutral on verbal and multimodal ones, and residual failures concentrate on hard, open-ended questions where the pipeline does the most work. By measuring accuracy and cost jointly, the protocol shows where the pipeline-level difference is reliable and where it is not. The code is available at this https URL.
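For readers unfamiliar with McNemar's test, the sketch below shows its exact (binomial) form on paired per-question correctness. The abstract does not say which variant the authors use, so this is one common choice, and the function name and toy data are assumptions made here for illustration.

```python
from math import comb


def mcnemar_exact(baseline_correct, variant_correct):
    """Exact two-sided McNemar test on paired boolean outcomes.

    Returns (b, c, p_value), where b counts questions only the baseline got right
    and c counts questions only the variant got right. Concordant pairs are ignored.
    """
    pairs = list(zip(baseline_correct, variant_correct))
    b = sum(1 for base, var in pairs if base and not var)
    c = sum(1 for base, var in pairs if var and not base)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data: 1 regression, 9 fixes, and 5 questions both systems answer correctly.
baseline = [True] + [False] * 9 + [True] * 5
variant = [False] + [True] * 9 + [True] * 5
b, c, p_value = mcnemar_exact(baseline, variant)
print(b, c, p_value)  # 1 9 0.021484375
```

Only the discordant pairs carry information about which system is better, which is why a paired design can detect differences that two independent accuracy numbers would blur.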
136. 【2607.01442】From Forgeries to Foundation Models: A Systematic Survey of Identity Document Attack and Detection
Link: https://arxiv.org/abs/2607.01442
Authors: Gourab Das, Pavan Kumar C, Raghavendra Ramachandra
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: fundamental capability shift, minimal technical expertise, methods remain constrained, detection methods remain, Digital Injection Attacks
Comments:
Abstract:Identity document forgery has undergone a fundamental capability shift: generative AI tools now enable high-fidelity document synthesis and field-level manipulation with minimal technical expertise, while detection methods remain constrained by benchmarks that do not reflect this threat. The resulting attack surface spans physical presentation, digital injection, and fully generative synthesis, introducing distinct forensic failure modes that require a unified threat model and evaluation framework. This survey provides, to our knowledge, the first unified treatment of Presentation Attacks, Digital Injection Attacks, and GenAI-driven synthesis within a single identity verification threat model. We trace detection methodologies from rule-based heuristics through forensic localisation, injection-aware pipelines, foundation models, and few-shot frameworks. A systematic audit of public datasets from 2019--2025 exposes a persistent Reality Gap between benchmark conditions and operational deployment. We further analyse large multimodal models for identity document manipulation, identifying Script-Dependent Generative Instability (SDGI) as a recurring typographic failure mode in non-Latin script inpainting. Finally, zero-shot benchmarking on unseen synthesised ID cards shows that even the strongest publicly available models achieve APCER values above 25% under security-oriented operating conditions, highlighting substantial limits in cross-domain generalisation. We conclude by outlining future directions toward forensically grounded, privacy-preserving, and legally accountable identity verification systems.
137. 【2607.01437】How Much Future Helps? A Controlled Study of Future-Privileged Supervision for Causal Egocentric Gaze Estimation
Link: https://arxiv.org/abs/2607.01437
Authors: Jia Li, Wenjie Zhao, Fnu Atisri, Sanskriti Aripineni, Shijian Deng, Jon E. Froehlich, Yuhang Zhao, Yapeng Tian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: real-world applications require, applications require strictly, commonly studied, process the full, full video
Comments: Accepted to the 7th International Workshop on Eye and Gaze in Computer Vision (GAZE 2026), CVPR 2026. Best Paper Award
Abstract:Egocentric gaze estimation is commonly studied using models that process the full video with access to future frames, while real-world applications require strictly causal, online prediction. This discrepancy raises key questions: Does future context inherently provide valuable signals for gaze estimation? If so, how much future look-ahead optimally supervises a causal model during training? To investigate, we propose a controlled framework featuring a future-aware branch that accesses a tunable look-ahead horizon during training but is discarded at inference. This design isolates the impact of future context while keeping the inference architecture fixed and strictly causal. Across EGTEA Gaze+ and Ego4D, we find that future-privileged supervision consistently improves causal gaze prediction, confirming its utility. However, performance gains do not increase monotonically with longer look-ahead, but rather peak within a bounded temporal regime. Specifically, optimal performance corresponds to roughly 1.7--3.3 seconds of future context ($H{\in}[5, 10]$) on EGTEA Gaze+ and 2.7 seconds ($H{=}10$) on Ego4D. Our results demonstrate that lightweight causal models can effectively absorb future-aware signals, providing practical guidance for real-time egocentric gaze modeling.
138. 【2607.01435】Sign in the Air to Unlock: An Interface for authentication in Virtual and Augmented Reality Powered by Point-Voxel Cross-Attention Network
Link: https://arxiv.org/abs/2607.01435
Authors: Neda Abdolrahimi, Thiru Siddharth, Frank Sicongchen, Vir V Phoha
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Keywords: Augmented Reality, Virtual and Augmented, Significant advancement, integration into diverse, diverse aspects
Comments:
Abstract:The significant advancement of immersive technologies such as Virtual and Augmented Reality (VR/AR) and their integration into diverse aspects of modern life call for authentication interfaces that are secure, intuitive, and compatible with embodied interaction. Traditional methods such as passwords, PINs, and device-based logins break immersion and rely on external hardware. Recent 3D-specific behavioral approaches, such as hand-gesture, eye-tracking, and electroencephalography (EEG)-based methods, offer promising alternatives but often require specialized sensors or constrain natural movement, limiting usability in dynamic environments. We present Sign in the Air to Unlock, an in-air signature interface that enables users to authenticate by signing naturally in 3D space, a familiar, personal, and reproducible gesture. To realize this interface, we design a point-voxel Cross-Attention Network (PV-Net) that jointly models local motion dynamics and global spatial structure from 3D trajectories. The model is evaluated on two datasets: the public DeepAirSig dataset (1,800 signatures from 40 users) and ImmAirSig, a new dataset collected using Meta Quest 2 in immersive VR (880 samples from 22 users). PV-Net achieves an Equal Error Rate of 2.5% on DeepAirSig and 76% classification accuracy on ImmAirSig. These findings highlight the potential of 3D behavioral interfaces for seamless, user-centric authentication that merges security with natural interaction in immersive environments.
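The Equal Error Rate (EER) quoted above is the operating point where the false acceptance rate equals the false rejection rate. The sketch below computes it by scanning candidate thresholds; it assumes that higher scores mean "more likely genuine", and the function name and toy scores are illustrative, not drawn from the paper.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (eer, threshold) at the threshold where FAR and FRR are closest.

    A sample is accepted when its score is >= threshold (assumed convention).
    """
    best = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False rejection rate: genuine attempts scored below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # False acceptance rate: impostor attempts scored at or above it.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.6]
eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)  # 0.25 0.6
```

With a finite set of scores the two error curves rarely cross exactly, so reporting the midpoint of FAR and FRR at the closest threshold is a common approximation.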
139. 【2607.01420】MultAttnAttrib: Training-Free Multimodal Attribution in Long Document Question Answering
Link: https://arxiv.org/abs/2607.01420
Authors: Dang Quang Thien Tran, Quang V. Dang, Vinamra Tyagi, Sai Soorya Rao Veeravalli, Trang Nguyen, Ryan A. Rossi, Franck Dernoncourt, Nedim Lipka, Koustava Goswami, Samyadeep Basu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: accurately attributing generated, attributing generated answers, accurately attributing, systems are increasingly, increasingly deployed
Comments: 25 pages (8 main, 17 references + appendix), 15 figures, Submitted to EMNLP 2026 Conference (Long Paper)
Abstract:As grounded QA systems are increasingly deployed in AI assistants, accurately attributing generated answers to evidence is critical for user trust and model safety. While unimodal attributions have been explored in depth, the multimodal setting remains relatively under-researched. To address this gap, we introduce MultAttnAttrib, a training-free attribution-generation method that leverages a model's prefill pass, selected attention heads, and calibrated thresholds to locate source evidence within a document. To establish baseline results for the method, we introduce MultAttrEval, a complementary benchmark dataset annotated with fine-grained, ground-truth attributions for answer components grounded in multimodal source documents. To our knowledge, this is the first evaluation dataset designed specifically for multimodal attribution in long-form documents. Experimental results show that MultAttnAttrib consistently outperforms a variety of attribution-generation methods, including several strong prompting-based approaches, and matches the latest frontier models such as GPT 5.4. Our method not only substantially improves attribution accuracy for both unimodal and multimodal attribution types, but also produces attributions at up to one-seventh of the direct inference latency compared to prompting on the same base model.
140. 【2607.01416】Beyond Heatmaps: Unsupervised Concept-Graph Reasoning for Interpretable Visual Explanation
Link: https://arxiv.org/abs/2607.01416
Authors: Md Mohasin Hossain (1 and 2), Anar Amirli (4), Robert Leist (1), Md Abdul Kadir (1 and 3), Daniel Sonntag (1 and 3) ((1) German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany, (2) Saarland University, Saarbrücken, Germany, (3) Oldenburg University, Oldenburg, Germany, (4) BEGO GmbH & Co. KG, Bremen, Germany)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Concept Bottleneck Models, Concept Bottleneck Model, Graph-based Concept Bottleneck, Non-negative Matrix Factorization, intrinsically interpretable alternative
Comments: Accepted at the IJCAI-ECAI 2026 Workshop on Explainable Artificial Intelligence (XAI), Bremen, Germany. 7 pages, 4 figures
Abstract:Concept Bottleneck Models (CBMs) provide an intrinsically interpretable alternative to post-hoc explanations. However, existing CBMs often rely on predefined concept vocabularies or supervised annotations, lack explicit concept grounding, and summarize each concept with a single image-level score, discarding spatial recurrence and inter-concept dependencies. We propose a Graph-based Concept Bottleneck Model (G-CBM), an intrinsically interpretable framework that performs unsupervised concept discovery via Non-negative Matrix Factorization (NMF) and represents the discovered concepts as nodes in a per-image concept-graph representation. G-CBM matches region-level features to these concept nodes (providing concept grounding and capturing concept recurrence across the image) and applies a tunable concept filtering threshold $\tau$ to suppress weak region-level features. A Graph Attention Network (GAT) then performs concept-level reasoning by modeling nonlinear dependencies across nodes. Across ImageNet, HAM10000, PH2, and Derm7pt, G-CBM achieves an average relative AUC improvement of 3.7% over a ResNet-50 baseline. Concept filtering frequently improves predictive performance while inducing selective concept use, achieving peak AUC of 0.96 on PH2 with only 2 of 10 concepts and 0.92 on HAM10000 with 3.8 of 9 concepts. On dermoscopy benchmarks, G-CBM is competitive with supervised approaches requiring external annotations. Deletion/insertion analyses with random ablation controls show that the learned concept ranking faithfully reflects model predictions.
141. 【2607.01401】NeuroBridge: Bridging Multi-Task MRI Knowledge for Neurodegenerative Disease Diagnosis
Link: https://arxiv.org/abs/2607.01401
Authors: Mengyu Li, Guoyao Shen, Chad W. Farris, Xin Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: mild cognitive impairment, Accurate MRI-based identification, identification of Alzheimer, Alzheimer disease, related dementias remains
Comments: 5 figures, 3 tables
Abstract:INTRODUCTION: Accurate MRI-based identification of Alzheimer's disease (AD), mild cognitive impairment (MCI), and related dementias remains challenging because disease-related structural changes are often subtle and heterogeneous. We developed NeuroBridge, a clinically guided multi-task MRI framework for neurodegenerative disease diagnosis. METHODS: NeuroBridge integrates large-scale self-supervised MRI pretraining with hippocampal segmentation, hippocampal atrophy classification, and reconstruction objectives, followed by gated fusion fine-tuning. Performance was evaluated across ADNI and OASIS cohorts, including cross-cohort transfer, probability-based analysis, and opportunistic screening. RESULTS: NeuroBridge achieved the highest performance across evaluated classification tasks, reaching 88.17% accuracy for AD versus cognitively normal controls in ADNI and 82.78% in OASIS. The largest gains occurred in MCI-related and mixed-diagnosis settings. The framework demonstrated strong cross-cohort generalization, systematic associations between predicted-class probability and accuracy, and the feasibility of probability-based opportunistic screening. DISCUSSION: Clinically guided multi-task representation learning improves neurodegenerative MRI diagnosis beyond conventional single-task approaches. NeuroBridge provides a robust and scalable framework for dementia assessment and MRI-based opportunistic screening.
142. 【2607.01396】Computer Vision for Wildlife Monitoring: Detecting Brown Howler Monkeys using YOLO
Link: https://arxiv.org/abs/2607.01396
Authors: Gabriel Ferri Schneider, Guido Luis Glufke Mainardi, Paulo Ricardo Knob, Patrícia Dias, Márcia Jardim, Júlio César Bicca-Marques, Soraia Raupp Musse
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: Urban expansion threatens, threatens global biodiversity, expansion threatens global, affecting arboreal species, arboreal species due
Comments: Accepted at the International Conference on Computer Animation, Social Agents, and Extended Reality '26 (CASAXR 26)
Abstract:Urban expansion threatens global biodiversity, especially affecting arboreal species due to the fragmentation of forest habitats. The movement of arboreal species across disjointed forest patches increases mortality risk and, thus, compromises their conservation. In this context, the installation of canopy bridges can be a viable strategy; yet continuous monitoring of their use by arboreal species, typically carried out with the aid of camera traps, is essential for ensuring their effectiveness. However, this method often produces false-positive images that demand time from conservationists for review. Computer vision algorithms can streamline the task of detecting target species using the canopy bridges. In this study, we explored the automatic detection of brown howler monkeys (Alouatta guariba) in videos obtained by camera traps. Given the need for a large number of annotated images of the target animals to train the algorithms, we tested the incorporation of auxiliary data to improve detection models, fine-tuning the YOLOv10 framework using varying proportions of these data. The improvement of these automatic detection techniques contributes to conservation efforts by providing automatic tools to monitor solutions that minimize the impact of human interference in animals' habitats.
143. 【2607.01395】Rethinking Generic Object Tracking Toward Human-Level Perceptual Intelligence
Link: https://arxiv.org/abs/2607.01395
Authors: Shih-Fang Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Keywords: human visual perception, human visual, visual perception lies, external world, visual
Comments: Ph.D. dissertation, National Yang Ming Chiao Tung University, 2026. arXiv admin note: substantial text overlap with [arXiv:2602.14771](https://arxiv.org/abs/2602.14771)
Abstract:At the heart of human visual perception lies the ability to maintain a continuous and coherent understanding of the external world. By integrating observations with accumulated experience, the human visual system can continuously adapt to variations in both the target and its surrounding environment, while preserving robust visual continuity as scene dynamics evolve. Human vision can therefore integrate prior knowledge, spatial geometry, and semantic context to understand complex scenes and their changes. As a core problem in computer vision, visual object tracking aims to bring machine perception closer to human visual perception. These capabilities are central to the task of Generic Object Tracking (GOT). In this task, a visual tracker is initialized only with the bounding box of an arbitrarily specified target in the first frame, and must continuously localize the target in subsequent dynamic visual streams. However, future events, observations, and real-world variations are inherently unpredictable; therefore, the model's generalization and online adaptation capabilities remain bottlenecks. Tracking reliability can deteriorate when the target undergoes severe deformation, is affected by complex distractors, encounters significant environmental changes, or belongs to a category unseen during training. This dissertation aims to narrow the gap between machine visual tracking systems and human visual perception by proposing a series of methods that systematically enhance the target discrimination, robust adaptation, and geometric reasoning capabilities of tracking models.
144. 【2607.01383】MIBE: Multi-subject Interaction Benchmark and Evaluator for Personalized Image Generation
Link: https://arxiv.org/abs/2607.01383
Authors: Zhihan Chen, Yuhuan Zhao, Yijie Zhu, Xinyu Yao, Mengcong Ren, Suwen Wang, Qiuyang Yin, Yuchen Sun, Qin Wang, Lu Xin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multi-subject Interaction Benchmark, Multi-subject personalized image, personalized image generation, image generation requires, Multi-subject Interaction
Comments:
Abstract:Multi-subject personalized image generation requires the precise rendering of all requested reference identities and their specified interactions based on a guiding prompt. However, state-of-the-art models still struggle with this process, frequently omitting subjects, failing to preserve reference appearances, or misattributing interactions. Furthermore, existing metrics designed primarily for single-subject fidelity cannot reliably capture these errors, suffering severe degradation in ranking separability and failing to align with human preference as the subject count increases. To address this gap, we introduce Multi-subject Interaction Benchmark and Evaluator (MIBE), a unified framework comprising a Multi-subject Interaction Benchmark (MIB) and a Multi-subject Interaction Evaluator (MIE). MIB systematically covers diverse relation types and scene complexities through a decoupled data regime. This consists of a 60K-pair VLM-labeled Silver Set for scalable metric training and a 4K-pair double-blind Human Evaluation Gold Set covering a diverse range of state-of-the-art generators, with the Silver Set reaching 95.1% cross-VLM preference agreement. To demonstrate the utility of this benchmark, we present MIE, a lightweight, reference-conditioned evaluator trained exclusively on the Silver Set with a dual-head ranking and diagnosis objective. MIE exhibits strong cross-generator generalization on the Gold Set, achieving 0.922 overall pairwise accuracy against human preference, including 0.982 on seen generators and 0.884 on unseen generators. By outperforming a broad spectrum of baseline metrics, including CLIP and DINO variants, MIE demonstrates that diagnostic supervision can preserve ranking separability and human alignment where traditional evaluators collapse.
145. 【2607.01370】MapDreamer: Aerial Imagery Conditioned Latent Diffusion for Lane-Level Map Generation
Link: https://arxiv.org/abs/2607.01370
Authors: Julian Brandes, Philipp Crocoll, Wolfram Burgard
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: High definition map, High definition, definition map generation, autonomous driving, process at scale
Comments: Accepted at ECCV 2026
Abstract:High definition map generation is essential for autonomous driving, yet remains a labor-intensive process at scale. We present MapDreamer, a generative diffusion model that synthesizes lane-level vector maps with explicit topology directly from a single aerial image. MapDreamer learns a compact latent representation of lane centerlines and their topological relations using a variational autoencoder and predicts graphs with a transformer-based latent diffusion model. To align generated maps with the observed scene, we condition each denoising step on dense aerial features injected through cross-attention. To handle the varying number of lanes across scenes, we propose a lane cardinality module paired with background ghost lane latents, a learned buffer that prevents slot collapse during diffusion. Furthermore, we introduce a sliding-window global graph aggregation strategy that stitches local tiles into city-scale maps while preserving connectivity through encoded lane boundaries. Experiments on UrbanLaneGraph derived from Argoverse 2 show improved geometric and topological fidelity over non-generative baselines.
146. 【2607.01365】Multi-modal Rail Crossing Safety Analysis
Link: https://arxiv.org/abs/2607.01365
Authors: Paimon Goulart, Chansong Lim, Nícolas Roque dos Santos, Yue Dong, Sheldon Peterson, Jia Chen, Evangelos E. Papalexakis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: leverage visual cues, Federal Railroad Administration, leverage visual, visual cues, official accident reports
Comments:
Abstract:Given one or more images of a railway crossing, can we leverage visual cues that allow us to robustly estimate how safe it is? Can we improve our ability to do so by introducing structured data (such as official accident reports) about the accident history of that crossing into our models? In this work, we explore how to best answer those questions towards building an AI system that can ingest multi-modal data for railway crossings and provide safety assessment and scores that align with expert opinion and with safety scoring used by the Federal Railroad Administration (FRA). To that end, we propose a proof-of-concept pipeline that delivers on that goal, while at the same time exploring and tackling a number of critical research challenges that pertain to different parts of the pipeline, from data preparation to different learning paradigms that can allow us to realize such a system. Indicatively, our proposed system identifies HIGH-RISK and LOW-RISK crossings with a macro F1 score of 0.757 and estimates FRA-based safety scores with an RMSE of 0.078 and correlation of 0.492 using a routed fine-tuned compact VLM pipeline, while producing qualitative results that align with domain-expert assessment.
147. 【2607.01353】Spatial-Temporal Expert Learning for Video-based Person Re-identification
Link: https://arxiv.org/abs/2607.01353
Authors: Xiaofei Hui, Pengfei Wang, Evan Ling, Dezhao Huang, Keng Teck Ma, Minhoe Hur, Jun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Video-based person re-identification, query video clips, gallery video clips, video clips, Video-based person
Comments: Accepted to V3SC 2026 @ ICPR
Abstract:Video-based person re-identification (Re-ID) aims to retrieve the same identity in the query video clips from the gallery video clips. To solve this problem, exploiting fine-grained features is of great importance, especially when discriminating identities that are similar in appearance. In this paper, we propose to enhance the ability to explore fine-grained information with a novel input-aware extendable expert module. Instead of updating the network parameters with every sample in the dataset, we aim to train the experts within specific subsets that only contain similar samples and promote their ability to exploit fine-grained information within these similar samples. To achieve this goal, we incorporate two mechanisms in this module: input-aware expert selection mechanism and spatial-temporal selection mechanism. The first mechanism dynamically activates a set of experts on subsets of similar samples, pushing the experts to exploit subtle differences between these similar samples, while the second one further increases their sensitivity to the fine-grained differences in spatial and temporal aspects and allows the experts to dynamically utilize them for different input samples. In addition, to facilitate the expert module, we design an extendable scheme that allows the module to flexibly add new experts when necessary. As a result, our method achieves outstanding performance on two large-scale datasets.
148. 【2607.01312】KathaTrace: Diagnosing Semantic Trajectory Collapse in Generated Visual Narratives
Link: https://arxiv.org/abs/2607.01312
Authors: Jamuna S. Murthy, Amin Karimi Monsefi, Rajiv Ramnath
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: viewers understand stories, children media, film previsualization, stories from images, transition meaning
Comments:
Abstract:Visual narratives are central to storyboards, comics, children's media, and film previsualization, where viewers understand stories from images alone. Recent generators such as StoryDiffusion produce coherent sequences, but visual coherence does not guarantee that source-story transition meaning remains recoverable. Existing benchmarks assess visual quality, content faithfulness, and scene coherence, but miss a critical failure mode: storyboards where scenes appear visually coherent while the semantic link between scenes disappears. We introduce KathaTrace, a generator-agnostic protocol for diagnosing semantic trajectory collapse, defined as the loss of transition meaning needed to understand how one scene follows another. KathaTrace evaluates transitions under three evidence conditions: text-only, image-only, and text-plus-image, and filters ambiguous items. We contribute KathaBench-25K, with 5,000 narratives from classical collections including Aesop, Panchatantra, and Kathasaritasagara, 20,000 transitions, and 28,712 recoverability questions. We define Semantic Trajectory Gap, or STG, as text-only minus image-only recoverability, measuring transition meaning lost during visualization. Human validation yields Fleiss' kappa = 0.845. Experiments across state-of-the-art generators show substantial STG of 23.5 +/- 1.3. Semantic Compass, an actionability probe, uses KathaTrace signals for post-generation repair and improves storyboard selection.
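The abstract above defines the Semantic Trajectory Gap (STG) as text-only minus image-only recoverability. A minimal sketch of that arithmetic follows. It assumes recoverability is the percentage of recoverability questions answered correctly, which the abstract does not state explicitly; the function names and the toy answers are hypothetical and are not taken from KathaTrace.

```python
def recoverability(correct_flags):
    """Percentage of recoverability questions answered correctly (assumed definition)."""
    if not correct_flags:
        raise ValueError("need at least one answered question")
    return 100.0 * sum(correct_flags) / len(correct_flags)


def semantic_trajectory_gap(text_only_flags, image_only_flags):
    """STG = text-only recoverability minus image-only recoverability."""
    return recoverability(text_only_flags) - recoverability(image_only_flags)


# Made-up answers for eight questions (True = transition meaning recovered).
text_only = [True, True, True, False, True, True, True, True]      # 7 of 8
image_only = [True, False, True, False, True, False, True, True]   # 5 of 8

stg = semantic_trajectory_gap(text_only, image_only)
print(f"STG = {stg:.1f} points")
```

A positive value means transition meaning that was recoverable from the source text was lost in the generated images.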
149. 【2607.01303】CPG-PAD: Concept-Informed Prompts Guided Presentation Attack Detection
Link: https://arxiv.org/abs/2607.01303
Authors: Haoyuan Zhang, Xiangyu Zhu, Li Gao, Ajian Liu, Siran Peng, Zhen Lei
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: face recognition systems, Presentation Attack Detection, replayed videos, printed photos, Attack Detection
Comments: Accepted by IEEE Transactions on Information Forensics and Security (TIFS)
Abstract:Presentation Attack Detection (PAD) serves as a crucial safeguard for face recognition systems against presentation attacks such as printed photos, replayed videos, and 3D masks. Despite significant progress, existing PAD models still struggle to generalize across unseen domains due to variations in sensors, lighting, and attack materials. Recent Vision-Language Models (VLMs) have shown strong generalization ability, yet their applications in PAD remain limited because learned prompts, typically optimized under class-label supervision, fail to explicitly align with fine-grained attack-relevant visual semantics. As a result, the learned representations often overfit domain-specific artifacts instead of capturing transferable attack cues. To address this, we propose Concept-Informed Prompts Guided Presentation Attack Detection (CPG-PAD), a framework that introduces model-level concept guidance into the prompt learning process. Specifically, we design a Visual Concept-driven Enhancement (VCE) module that employs eXplainable AI (XAI) techniques to automatically discover PAD-relevant visual concepts and generate concept-associated heatmaps providing localized fine-grained guidance. Guided by these heatmaps, a Prompt-based Concept Injection (PCI) mechanism integrates these concepts into the prompt space through a Visual-Prompt Decoder (VPD) and a concept-mapping loss, enabling prompts to align with the model's internal concept space. This design enables CPG-PAD to capture generalizable and domain-invariant attack cues while effectively suppressing dataset-specific biases. Extensive experiments across nine benchmark datasets demonstrate that CPG-PAD consistently achieves state-of-the-art cross-domain performance under multi-source, limited-source, and single-source settings.
150. 【2607.01290】AnchorSplat: Fast and Structure Consistent Detail Synthesis for Gaussian Splatting
Link: https://arxiv.org/abs/2607.01290
Authors: Dexu Zhu, Jiangnan Shao, Xiaofeng Wang, Junxian Duan, Jie Cao, Zheng Zhu, Huaibo Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Gaussian Splatting, high-fidelity rendering, powerful representation, representation for high-fidelity, Gaussian
Comments: Accepted by ECCV 2026
Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful representation for high-fidelity rendering. However, existing assets often suffer from quality bottlenecks such as missing details and texture noise. Prior attempts to enhance these assets via 2D image processing introduce multi-view inconsistencies and high computational costs. In this paper, we propose a novel 3D-native refinement paradigm named AnchorSplat. AnchorSplat is an end-to-end deep network operating directly on 3D structures, avoiding the expensive optimization overhead of traditional 3D-2D-3D pipelines. Crucially, AnchorSplat is a strictly source-free solution requiring no original multi-view images. Central to the proposed method is the Point Anchor Mechanism, which enforces geometric consistency via local offset constraints, mitigating ill-posed mapping and gradient confounding. Furthermore, AnchorSplat replaces iterative densification with a single-pass multiplication mechanism. To facilitate research, we construct 3DGS-SR, the first large-scale benchmark for this task. Experiments demonstrate state-of-the-art results on the 3DGS-SR dataset, with throughput up to $10^5$ times faster than optimization methods. Notably, AnchorSplat exhibits robust zero-shot generalization across diverse data distributions, including generative model outputs and real-world scans.
151. 【2607.01272】Benchmarking Federated Learning and Knowledge Distillation for Point Cloud Classification
Link: https://arxiv.org/abs/2607.01272
Authors: Aizierjiang Aiersilan
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Keywords: resource-constrained settings faces, limited edge hardware, point cloud analysis, point cloud classification, point cloud
Comments: Accepted by the 19th European Conference on Computer Vision (ECCV 2026)
Abstract:Deploying 3D point cloud analysis in privacy-sensitive, resource-constrained settings faces two barriers: data cannot be centralized, and models must run on limited edge hardware. We present a multi-seed benchmark jointly evaluating federated learning (FL) and knowledge distillation (KD) for 3D point cloud classification. It spans 13 FL algorithms and 10 KD objectives (a 130-pair cross-product) across 504 training runs, evaluated on ModelNet40 and a clinical craniosynostosis dataset. We report three findings. First, under extreme non-IID label skew, standalone FL degrades sharply: on ModelNet40, the strongest method reaches 76.32% against a 92.26% centralized reference; on clinical data, the best reaches 75.83% against 100%. Second, distillation successfully compresses the teacher into a student 74.51% smaller and roughly twice as fast at inference, often matching or surpassing the teacher. Third, the combined pipeline exposes an evaluation pitfall: when distillation keeps a hard-label cross-entropy term on a labeled proxy split, a collapsed federated teacher (8.50%) paired with Logit-MSE still yields a 92.94% student. This 84.4-point gap reflects the proxy labels rather than the federated model, reusing the very labels whose privacy motivated federation. Objectives without hard labels instead track teacher quality ($r \approx 0.99$) and collapse when the teacher does. We therefore recommend evaluating FL-KD pipelines with label-free distillation so reported accuracy reflects the federated teacher, not the proxy.
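The evaluation pitfall described above can be illustrated with a toy loss. The sketch below assumes a distillation objective of the form `alpha * cross_entropy + (1 - alpha) * logit_MSE`; the weighting `alpha`, the helper names, and the logits are hypothetical and are not taken from the paper. It shows that once a hard-label term is present, a student that fits the proxy labels can score a lower loss than one that copies a collapsed (uninformative) teacher, whereas the label-free objective rewards tracking the teacher.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Hard-label cross-entropy for a single example."""
    return -math.log(softmax(student_logits)[label])


def logit_mse(student_logits, teacher_logits):
    """Mean squared error between student and teacher logits."""
    n = len(student_logits)
    return sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / n


def kd_loss(student_logits, teacher_logits, label=None, alpha=0.9):
    """Label-free when label is None; otherwise mixes in a hard-label term.

    alpha is an assumed mixing weight, chosen only for illustration.
    """
    distill = logit_mse(student_logits, teacher_logits)
    if label is None:
        return distill
    return alpha * cross_entropy(student_logits, label) + (1 - alpha) * distill


collapsed_teacher = [0.0, 0.0, 0.0]   # uninformative teacher output
copies_teacher = [0.0, 0.0, 0.0]      # student that mimics the teacher
fits_labels = [4.0, 0.0, 0.0]         # student that fits the proxy label 0

print("label-free:", kd_loss(copies_teacher, collapsed_teacher),
      kd_loss(fits_labels, collapsed_teacher))
print("with labels:", kd_loss(copies_teacher, collapsed_teacher, label=0),
      kd_loss(fits_labels, collapsed_teacher, label=0))
```

Under the label-free objective the teacher-copying student has the lower loss, so student quality follows teacher quality. With the hard-label term, the label-fitting student wins even though the teacher carries no information.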
152. 【2607.02428】Self-Auditing Residual Drifting for Pathology-Preserving Accelerated Knee MRI
Link: https://arxiv.org/abs/2607.02428
Authors: Qing Lyu, Jianxu Wang, Mohammad Kawas, Ge Wang, Christopher T. Whitlow
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Keywords: magnetic resonance imaging, blur diagnostically relevant, diagnostically relevant structures, Accelerated magnetic resonance, global image metrics
Comments:
Abstract:Accelerated magnetic resonance imaging reduces acquisition time, but reconstruction from undersampled k-space can blur diagnostically relevant structures or introduce failures that are not captured by global image metrics. We propose SA-RDM-DC, a Self-Auditing Residual generative Drifting Model with Data Consistency for accelerated knee MRI. The method adapts the newly proposed generative drifting paradigm to accelerated MRI by training a physics-conditioned drift field from the zero-filled reconstruction toward the fully sampled residual correction. It predicts image- and missing-k-space residual corrections, enforces data consistency with acquired k-space, uses frequency-aware and residual drifting supervision to recover fine detail, and produces dense error maps and slice-level risk scores in the same inference pass. We evaluate SA-RDM-DC on multi-coil fastMRI knee data at acceleration factors of 4, 8, and 12, with fastMRI+ pathology annotations for region-level and classifier-based task preservation, and on SKM-TEA for zero-shot and fine-tuned protocol-shift evaluation. Compared with zero-filled reconstruction, UNet-image-SENSE, DC-UNet, Score-Diffusion, ELF-Diff, SENSE-VarNet, and MoDL baselines, SA-RDM-DC achieves the highest SSIM across fastMRI acceleration factors while retaining subsecond per-slice inference and avoiding the long sampling time of iterative diffusion baselines. In pathology-aware analysis, SA-RDM-DC preserves lesion-region structural fidelity and reduces meniscus prediction instability. Its self-auditing scores strongly identify high-error reconstructions on fastMRI and partially transfer as a selective-review signal under SKM-TEA protocol shift. These results support reconstruction evaluation that jointly considers image fidelity, pathology preservation, runtime, and case-specific reliability.
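The abstract mentions enforcing data consistency with acquired k-space. One common way to do this is a hard replacement: wherever a k-space sample was actually measured, keep the measurement, and use the model's prediction only at unsampled locations. The sketch below shows that generic operation on a 1D toy signal. It is an assumption that SA-RDM-DC uses this exact form; the function name and values are hypothetical.

```python
def enforce_data_consistency(predicted_kspace, acquired_kspace, mask):
    """Keep acquired samples where mask is True; use predictions elsewhere."""
    if not (len(predicted_kspace) == len(acquired_kspace) == len(mask)):
        raise ValueError("all inputs must have the same length")
    return [
        acquired if sampled else predicted
        for predicted, acquired, sampled in zip(predicted_kspace, acquired_kspace, mask)
    ]


# Toy 1D k-space line with two of four locations sampled.
mask = [True, False, False, True]
acquired = [1 + 2j, 0j, 0j, -3 + 1j]                             # zeros where unsampled
predicted = [0.9 + 2.1j, 0.5 - 0.5j, -0.2 + 0.4j, -2.8 + 0.9j]   # model output

merged = enforce_data_consistency(predicted, acquired, mask)
print(merged)
```

The merged result agrees exactly with the scanner data at sampled locations, so the network can only change what was never measured.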
153. 【2607.02127】Population-Scale Segmentation of Penile Tissue in DIXON MRI using Deep Learning for Quantitative Phenotyping in Male Reproductive Health
Link: https://arxiv.org/abs/2607.02127
Authors: Jan Ernsting, Gunnar Paul Kordes, Nils Johannaber, Lynn Ogoniak, Wolfgang Roll, Tim Hahn, Alexander Siegfried Busch, Benjamin Risse
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: congenital and endocrine, endocrine disorders, urinary dysfunction, clinically relevant, sexual or urinary
Comments:
Abstract:Penile measurement is clinically relevant across male reproductive and urogenital health, including conditions such as micropenis, congenital and endocrine disorders, and sexual or urinary dysfunction. However, quantitative assessment of penile size has relied mainly on external length or circumference measurements, which are difficult to standardize, sensitive to measurement conditions, and unable to capture the internal portion of the penis. MRI enables volumetric assessment of the whole penis in vivo, but automated segmentation has not previously been established at population scale. Automated whole-organ volumetry would enable high-throughput phenotyping for multi-omics and clinical studies of male reproductive disease. Here, we present a deep learning framework for whole-penis segmentation in multi-channel DIXON MRI. Using a newly curated expert-annotated training dataset ($n = 145$ subjects; $13,050$ annotated slices) and a double-annotated independent test benchmark ($n = 24$ subjects; $2,160$ double-annotated slices), we optimized a 3D nnU-Net architecture. The model achieved a 5-fold cross-validation Dice score of $0.90$ and performed at observer-level accuracy on the independent test set (Dice: $0.92$; Hausdorff distance: $3.58$). We deployed the model in $34,412$ UK Biobank participants, enabling automated quantification of total penile tissue, including both external and internal components. Longitudinal evaluation in 2,282 men demonstrated high inter-session reproducibility ($r = 0.87$). This framework establishes a reproducible and population-scalable method for MRI-based assessment of penile anatomy and provides an open technical resource for future studies in urological imaging and male reproductive health. The trained model weights will be publicly released.
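The segmentation results above are reported as Dice scores. For reference, the Dice coefficient between a predicted and a reference binary mask is twice the overlap divided by the sum of the two mask sizes. The sketch below computes it on flattened toy masks; the function name and data are illustrative and are not from the paper's evaluation code.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    size_sum = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if size_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement by convention
    return 2.0 * intersection / size_sum


pred = [1, 1, 1, 0]   # toy prediction: three foreground voxels
truth = [1, 1, 0, 0]  # toy reference: two foreground voxels

print(dice_score(pred, truth))
```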
154. 【2607.01731】Quantum-Inspired Vision: Leveraging Wave-Particle Duality for Low-Illumination Enhancement
Link: https://arxiv.org/abs/2607.01731
Authors: Yiquan Gao
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC); Quantum Physics (quant-ph)
Keywords: recent Data Relativistic, Data Relativistic Uncertainty, Relativistic Uncertainty, Data Relativistic, framework by formalizing
Comments:
Abstract:This study provides a theoretical expansion of the recent Data Relativistic Uncertainty (DRU) framework by formalizing a physics-to-AI paradigm for image enhancement. By modeling images as probabilistic wave functions rather than deterministic states, the paradigm explicitly integrates wave-particle duality to illustrate the system flow of how DRU leverages the intrinsic physical uncertainty of light, a dimension requiring further theoretical discussion. Consequently, this paradigm provides a rigorous Explainable AI (XAI) approach that enhances the interpretability of how DRU mitigates illumination bias and maintains robustness against data noise.
155. 【2607.01478】Boundary-Aware Quantization: Finite-Scale Decision Geometry of Neural Classifiers
Link: https://arxiv.org/abs/2607.01478
Authors: O.M. Kiselev
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: slice-boundary Jaccard distance, measured quantization-induced decision-boundary, boundary Jaccard, local logit-margin radii, first-order boundary displacement
Comments: 7 pages, 2 figures, 6 tables
Abstract:We measured quantization-induced decision-boundary changes using local logit-margin radii, first-order boundary displacement, normal variation, slice-boundary Jaccard distance, grid prediction changes, multiclass junction counts, and low-margin boundary-band flips. On the digits benchmark, 8-bit weight quantization preserved all test labels while producing boundary-mask Jaccard \(0.428\) on the PCA slice; at 4 bits, accuracy remained \(0.9733\), while boundary Jaccard rose to \(0.970\) and median local boundary shift reached \(0.0290\). Interpolation between adjacent quantization levels localized the visible reconfigurations at multiclass junctions, with 12, 34, and 17 triple-junction cells in the selected transitions. Calibration-to-test stopping reduced the digits held-out flip rate from \(0.0094\) to \(0.0022\) and boundary Jaccard from \(0.825\) to \(0.524\); the same stopping rule also reduced flips on MNIST and Fashion-MNIST. On official CIFAR-10 subsets, PTQ-W selected by accuracy gave 6-bit flip \(0.0367\) and boundary Jaccard \(0.184\), whereas boundary-aware stopping selected 8-bit flip \(0.0083\) and boundary Jaccard \(0.048\). On full CIFAR-10 with three seeds, 6-bit PTQ-W lost \(0.0029\) accuracy relative to float, changed \(5.3\%\) of held-out decisions, and changed \(24.5\%\) of low-margin boundary-band decisions. A fixed-bit boundary-gap rounding term changed the trade-off at 4 bits by reducing boundary Jaccard from \(0.457\) to \(0.435\) and boundary-band pair-order flip from \(0.3600\) to \(0.3558\), with an accuracy trade-off; the 3-bit stress test exposed the tuning limit of this surrogate. Calibration boundary Jaccard predicted held-out boundary Jaccard across PTQ-W and optimized rounding variants with \(r=0.947\)--\(0.994\).
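Two of the quantities named above, the boundary Jaccard distance and the flip rate, can be sketched on a toy 1D grid of predictions. The boundary-mask construction here (a cell is on the boundary when its right neighbour has a different predicted class) is an assumption for illustration; the paper works on a 2D PCA slice and its exact masking rule is not given in the abstract. All names and data are hypothetical.

```python
def boundary_mask(grid_preds):
    """Mark position i as boundary when the predicted class changes between i and i+1."""
    return [a != b for a, b in zip(grid_preds, grid_preds[1:])]


def jaccard_distance(mask_a, mask_b):
    """1 - |A ∩ B| / |A ∪ B| for two boolean masks of equal length."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    if union == 0:
        return 0.0  # no boundary in either mask: nothing moved
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    return 1.0 - intersection / union


def flip_rate(float_preds, quant_preds):
    """Fraction of inputs whose predicted label changes after quantization."""
    if len(float_preds) != len(quant_preds):
        raise ValueError("prediction lists must have the same length")
    flips = sum(1 for f, q in zip(float_preds, quant_preds) if f != q)
    return flips / len(float_preds)


# Toy predictions of a float model and its quantized copy on the same grid.
grid_float = [0, 0, 1, 1, 2, 2]
grid_quant = [0, 0, 0, 1, 2, 2]

distance = jaccard_distance(boundary_mask(grid_float), boundary_mask(grid_quant))
print("boundary Jaccard distance:", round(distance, 3))
print("flip rate:", round(flip_rate(grid_float, grid_quant), 3))
```

In this toy case only one of six grid cells changes label, yet the boundary masks overlap in just one of three boundary positions. This mirrors the abstract's observation that label agreement can stay high while the decision boundary moves substantially.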

