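This blog post presents the latest papers retrieved daily from the arXiv website, organized into categories such as natural language processing, information retrieval, and computer vision.

Statistics

641 papers were updated today, including:

Natural language processing: 89 papers
Information retrieval: 10 papers
Computer vision: 150 papers

Natural Language Processing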

1. 【2604.02324】Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Authors: Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Grounded Token Initialization, learnable vocabulary tokens, Language models, domain-specific tasks, increasingly extended

Comments:

Abstract:Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that \emph{token initialization} is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the \emph{Grounded Token Initialization Hypothesis}: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
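To make the collapse described in this abstract concrete, here is a minimal NumPy sketch of plain mean initialization. The table sizes and the random "pretrained" embeddings are illustrative assumptions, not values from the paper, and the sketch does not implement GTI itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained embedding table:
# 1000 existing tokens, 16 dimensions (assumed sizes).
vocab_emb = rng.normal(size=(1000, 16))

# Mean initialization: every new token receives the same vector,
# the average of all existing embeddings.
num_new_tokens = 8
mean_vec = vocab_emb.mean(axis=0)
new_emb = np.tile(mean_vec, (num_new_tokens, 1))

# All new rows are identical, so the block of new embeddings has rank 1
# and every pairwise distance between new tokens is zero.
rank = int(np.linalg.matrix_rank(new_emb))
pairwise = np.linalg.norm(new_emb[:, None, :] - new_emb[None, :, :], axis=-1)
max_pairwise_distance = float(pairwise.max())

print("rank of new-token block:", rank)
print("max pairwise distance:", max_pairwise_distance)
```

Because the new rows start out indistinguishable, any differences between them have to be learned entirely during fine-tuning, which is the bottleneck the abstract points to.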

2. 【2604.02322】Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning

Link: https://arxiv.org/abs/2604.02322

Authors: Bangji Yang, Hongbo Ma, Jiajun Fan, Ge Liu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models employing, achieve strong performance, reasoning achieve strong

Comments: 43 pages, 5 figures, 24 tables

Abstract:Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a "free lunch" phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results prove BCR practical, showing simple structural incentives unlock latent high-density reasoning in LLMs.

3. 【2604.02319】No Single Best Model for Diversity: Learning a Router for Sample Diversity

Link: https://arxiv.org/abs/2604.02319

Authors: Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar, Daphne Ippolito, Eunsol Choi

Categories: Computation and Language (cs.CL)

Keywords: permit a large, step towards satisfying, range of users, answer set, wide range

Comments: under review at COLM 2026

Abstract:When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce \textbf{diversity coverage}, a metric that measures the total quality scores assigned to each \textbf{unique} answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, per each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs 23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays foundation for studying generating comprehensive answers when we have access to a suite of models.
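One possible reading of the diversity coverage metric, sketched in Python. The function name, the dictionary of per-answer quality scores, and the choice to treat duplicates as exact string matches and unknown answers as scoring zero are all assumptions made for illustration. The paper's exact formulation may differ.

```python
def diversity_coverage(predicted, quality):
    """Ratio of quality covered by unique predicted answers to the best achievable.

    predicted: list of answers, possibly with duplicates (hypothetical input format).
    quality:   dict mapping each valid answer to a non-negative quality score.
    """
    k = len(predicted)
    # Duplicates count once; answers missing from `quality` are treated as invalid (score 0).
    achieved = sum(quality.get(answer, 0.0) for answer in set(predicted))
    # Best possible set of the same size: the k highest-scoring valid answers.
    best = sum(sorted(quality.values(), reverse=True)[:k])
    return achieved / best if best > 0 else 0.0


quality = {"red": 1.0, "blue": 0.8, "green": 0.6, "teal": 0.2}
predicted = ["red", "red", "green"]  # one duplicate wastes a slot
score = diversity_coverage(predicted, quality)
print(round(score, 3))
```

In this toy case the duplicate "red" earns nothing, so the set covers 1.6 out of a possible 2.4 quality points for three answers.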

4. 【2604.02309】go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices

Link: https://arxiv.org/abs/2604.02309

Authors: Torque Dandachi, Sophia Diggs-Galligan

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Doubly stochastic matrices, Doubly stochastic, stochastic matrices enable, enable learned mixing, stochastic matrices

Comments: 29 pages, 30 figures, 9 tables. Includes supplementary material

Abstract:Doubly stochastic matrices enable learned mixing across residual streams, but parameterizing the set of doubly stochastic matrices (the Birkhoff polytope) exactly and efficiently remains an open challenge. Existing exact methods scale factorially with the number of streams ($d$), while Kronecker-factorized approaches are efficient but expressivity-limited. We introduce a novel exact parameterization grounded in the theory of generalized orthostochastic matrices, which scales as $\mathcal{O}(d^3)$ and exposes a single hyperparameter $s$ which continuously interpolates between a computationally efficient boundary and the fully expressive Birkhoff polytope. Building on Manifold-Constrained Hyper-Connections ($m$HC), a framework for learned dynamic layer connectivity, we instantiate this parameterization in go-$m$HC. Our method composes naturally with Kronecker-factorized methods, substantially recovering expressivity at similar FLOP costs. Spectral analysis indicates that go-$m$HC fills the Birkhoff polytope far more completely than Kronecker-factorized baselines. On synthetic stream-mixing tasks, go-$m$HC achieves the minimum theoretical loss while converging up to $10\times$ faster. We validate our approach in a 30M parameter GPT-style language model. The expressivity, efficiency, and exactness of go-$m$HC offer a practical avenue for scaling $d$ as a new dimension of model capacity.
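As background for this abstract, the classical orthostochastic construction can be sketched in a few lines of NumPy: squaring the entries of an orthogonal matrix yields a doubly stochastic matrix. This is only the textbook special case, not the paper's generalized parameterization with the hyperparameter $s$, and the stream count used here is an arbitrary toy value.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # number of residual streams (toy value, assumed)

# Obtain an orthogonal matrix from the QR decomposition of a random square matrix.
q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Entrywise squares of an orthogonal matrix form an orthostochastic matrix:
# every row and column of q has unit Euclidean norm, so every row and
# column of q**2 sums to 1, and all entries are non-negative.
mix = q ** 2

row_sums = mix.sum(axis=1)
col_sums = mix.sum(axis=0)
print("row sums:", np.round(row_sums, 6))
print("column sums:", np.round(col_sums, 6))
```

Classical orthostochastic matrices cover only part of the Birkhoff polytope once $d \geq 3$, which gives some intuition for why a generalized family is of interest.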

5. 【2604.02276】De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules

Link: https://arxiv.org/abs/2604.02276

Authors: Keerat Guliani, Deepkamal Gill, David Landsman, Nima Eshraghi, Krishna Kumar, Lovedeep Gondara

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: encode legally binding, legally binding obligations, documents encode legally, systems must respect, Jure

Comments:

Abstract:Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data. De Jure operates through four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure across four models on three regulatory corpora spanning finance, healthcare, and AI governance. On the finance domain, De Jure yields consistent and monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure extracted rules are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.

6. 【2604.02217】VISTA: Visualization of Token Attribution via Efficient Analysis

Link: https://arxiv.org/abs/2604.02217

Authors: Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P, Karthick Selvaraj, Praneeth Talluri, Sanket Hingne, Anubhav Kumar, Anushka Yadav, Pratham Kumar Verma, Kiranmayee Janardhan, Mandanna A N

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Understanding how Large, Large Language, Language Models, significant challenge

Comments: 12 pages, 3 figures

Abstract:Understanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this "black box," attention visualization techniques have been developed to capture neuron-level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model-agnostic approach for attention visualization remains lacking. In this paper, we introduce a model-agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation-based strategies combined with a three-matrix analytical framework to generate relevance maps that illustrate token-level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open-source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at this https URL

7. 【2604.02209】CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech

Link: https://arxiv.org/abs/2604.02209

Authors: Youssef Saidi, Haroun Elleuch, Fethi Bougares

Categories: Computation and Language (cs.CL)

Keywords: directly extract entities, Arabic Common Voice, Named Entity Recognition, aims to directly, directly extract

Comments: Accepted at OSACT 2026

Abstract:End-to-end speech Named Entity Recognition (NER) aims to directly extract entities from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored due to its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, while multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech this https URL.

8. 【2604.02207】Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Link: https://arxiv.org/abs/2604.02207

Authors: Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Accurate translation, LLM, clinical communication, multilingual research, LLM judges

Comments: 25 pages, 4 figures

Abstract:Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear. Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and compare radiologist assessments with LLM-as-a-judge evaluations. Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using QWK and percentage agreement. Results: Agreement between radiologists and LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in 93% of cases. Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially. LLM-as-a-judge showed strong preference for LLM output and negligible agreement with radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.
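The agreement statistic in this abstract, QWK, is read here as quadratic weighted kappa. The sketch below computes it for two raters in NumPy. The three-level coding of pairwise preferences (0 = human-edited better, 1 = equivalent, 2 = LLM better) and the example ratings are hypothetical and are not taken from the study.

```python
import numpy as np


def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """Quadratic weighted kappa for two raters using integer labels 0..num_categories-1."""
    k = num_categories
    # Observed joint counts of (rater A label, rater B label).
    observed = np.zeros((k, k))
    for i, j in zip(rater_a, rater_b):
        observed[i, j] += 1
    # Counts expected under independence, from the two marginal histograms.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic disagreement weights: 0 on the diagonal, 1 at maximum distance.
    idx = np.arange(k)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2
    # Undefined (division by zero) if both raters use a single identical category.
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


# Hypothetical coding: 0 = human-edited better, 1 = equivalent, 2 = LLM better.
radiologist = [0, 1, 2, 2, 1, 0]
llm_judge = [2, 2, 2, 2, 1, 2]
kappa = quadratic_weighted_kappa(radiologist, llm_judge, num_categories=3)
print(round(float(kappa), 3))
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.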

9. 【2604.02200】Towards Position-Robust Talent Recommendation via Large Language Models

Link: https://arxiv.org/abs/2604.02200

Authors: Silin Du, Hongyan Liu

Categories: Computation and Language (cs.CL)

Keywords: high recruitment costs, long hiring cycles, high recruitment, recruitment costs, hiring cycles

Comments:

Abstract:Talent recruitment is a critical, yet costly process for many industries, with high recruitment costs and long hiring cycles. Existing talent recommendation systems increasingly adopt large language models (LLMs) due to their remarkable language understanding capabilities. However, most prior approaches follow a pointwise paradigm, which requires LLMs to repeatedly process some text and fails to capture the relationships among candidates in the list, resulting in higher token consumption and suboptimal recommendations. Besides, LLMs exhibit position bias and the lost-in-the-middle issue when answering multiple-choice questions and processing multiple long documents. To address these issues, we introduce an implicit strategy to utilize LLM's potential output for the recommendation task and propose L3TR, a novel framework for listwise talent recommendation with LLMs. In this framework, we propose a block attention mechanism and a local positional encoding method to enhance inter-document processing and mitigate the position bias and concurrent token bias issue. We also introduce an ID sampling method for resolving the inconsistency between candidate set sizes in the training phase and the inference phase. We design evaluation methods to detect position bias and token bias and training-free debiasing methods. Extensive experiments on two real-world datasets validated the effectiveness of L3TR, showing consistent improvements over existing baselines.

10. 【2604.02194】Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model

Link: https://arxiv.org/abs/2604.02194

Authors: Jaemin Kim, Jae O Lee, Sumyeong Ahn, Seo Yeon Park

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Retrieval-Augmented Language Models, Large Language Models, Language Models, demonstrated significant potential, Retrieval-Augmented Language

Comments:

Abstract:Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.

11. 【2604.02178】The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

Link: https://arxiv.org/abs/2604.02178

Authors: Jeremy Herbst, Jae Hee Lee, Stefan Wermter

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: scaling Large Language, Large Language Models, Large Language, scaling Large, Language Models

Comments:

Abstract:Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: this https URL

12. 【2604.02176】Adam's Law: Textual Frequency Law on Large Language Models

Link: https://arxiv.org/abs/2604.02176

Authors: Hongyuan Adam Lu, Z.L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, Large Language, relatedness to Large, textual frequency

Comments:

Abstract:While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.

13. 【2604.02171】Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions

Link: https://arxiv.org/abs/2604.02171

Authors: Atilla Kaan Alkan, Felix Grezes, Jennifer Lynn Bartlett, Anna Kelbert, Kelly Lockhart, Alberto Accomazzi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Context Aware Representations, mention coreference resolution, coreference resolution, present our participation, cross-document software mention

Comments: 8 pages

Abstract:We present our participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where our systems ranked second across all three subtasks. We compare two fine-tuning-free approaches: Fuzzy Matching (FM), a lexical string-similarity method, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings. Both achieve competitive performance across all subtasks (CoNLL F1 of 0.94-0.96), with CAR consistently outperforming FM by 1 point on the official test set, consistent with the high surface regularity of software names, which reduces the need for complex semantic reasoning. A controlled noise-injection study reveals complementary failure modes: as boundary noise increases, CAR loses only 0.07 F1 points from clean to fully corrupted input, compared to 0.20 for FM, whereas under mention substitution, FM degrades more gracefully (0.52 vs. 0.63). Our inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly, making CAR the more efficient choice at large scale. These findings suggest that system selection should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus. We release our code to support future work on this underexplored task.
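For readers unfamiliar with lexical coreference baselines, the following sketch shows what a string-similarity approach in the spirit of Fuzzy Matching could look like. It uses Python's standard `difflib` and a greedy clustering loop. The threshold, the helper names, and the example mentions are assumptions for illustration and do not reproduce the authors' system.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_mentions(mentions, threshold=0.8):
    """Greedy clustering: attach a mention to the first cluster whose
    representative (its first member) is similar enough, else open a new one."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if similarity(mention, cluster[0]) >= threshold:
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


# Hypothetical software mentions pooled from several documents.
mentions = ["NumPy", "numpy", "SciPy", "Matplotlib", "matplot lib"]
clusters = cluster_mentions(mentions)
for cluster in clusters:
    print(cluster)
```

A purely lexical method like this depends on the mention strings being intact, which fits the abstract's observation that FM loses more than CAR when mention boundaries are noisy.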

14. 【2604.02156】AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics

Link: https://arxiv.org/abs/2604.02156

Authors: Atilla Kaan Alkan, Felix Grezes, Sergi Blanco-Cuaresma, Jennifer Lynn Bartlett, Daniel Chivvis, Anna Kelbert, Kelly Lockhart, Alberto Accomazzi

Categories: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: severe power-law distributions, Unified Astronomy Thesaurus, challenge standard classification, power-law distributions, distributions that challenge

Comments: 9 pages, 2 figures

Abstract:Scientific multi-label text classification suffers from extreme class imbalance, where specialized terminology exhibits severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies, focusing instead on broad categories and limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that provide new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting a potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns that are hidden by aggregate scores, thereby making robustness assessment central to scientific multi-label evaluation. These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.

15. 【2604.02155】Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents

Link: https://arxiv.org/abs/2604.02155

Authors: Xuan Qi

Categories: Computation and Language (cs.CL)

Keywords: taking action, Function Calling Leaderboard, Berkeley Function Calling, language agent, reasoning

Comments: 21 pages

Abstract:How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0--512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8--16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]," forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
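The "45% relative" figure follows directly from the accuracies quoted in the abstract, as this short check shows. The variable names are ours, and the accuracy values are the ones reported above.

```python
# Accuracies (in percent) quoted in the abstract for Qwen2.5-1.5B-Instruct.
no_cot = 44.0      # direct answers, d = 0
brief_cot = 64.0   # 32-token reasoning budget
long_cot = 25.0    # 256-token reasoning budget

# Absolute change is measured in percentage points,
# relative change as a percentage of the no-CoT baseline.
brief_abs_gain = brief_cot - no_cot
brief_rel_gain = brief_abs_gain / no_cot * 100
long_rel_change = (long_cot - no_cot) / no_cot * 100

print(f"brief CoT: +{brief_abs_gain:.1f} points, {brief_rel_gain:+.1f}% relative")
print(f"long CoT:  {long_rel_change:+.1f}% relative")
```

So the 20-point absolute gain corresponds to roughly a 45% relative improvement, while the 256-token budget gives up about 43% of the baseline accuracy in relative terms.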

16. 【2604.02145】MTI: A Behavior-Based Temperament Profiling System for AI Agents

Link: https://arxiv.org/abs/2604.02145

Authors: Jihoon Jeong

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: standardized instrument exists, dispositional differences, exhibit fundamentally, standardized instrument, instrument exists

Comments: 29 pages, 6 figures, 12 tables. Paper #3 in the Model Medicine Series (Paper #1: [arXiv:2603.04722](https://arxiv.org/abs/2603.04722))

Abstract:AI models of equivalent capability can exhibit fundamentally different behavioral patterns, yet no standardized instrument exists to measure these dispositional differences. Existing approaches either borrow human personality dimensions and rely on self-report (which diverges from actual behavior in LLMs) or treat behavioral variation as a defect rather than a trait. We introduce the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament across four axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). Grounded in the Four Shell Model from Model Medicine, MTI measures what agents do, not what they say about themselves, using structured examination protocols with a two-stage design that separates capability from disposition. We profile 10 small language models (1.7B-9B parameters, 6 organizations, 3 training paradigms) and report five principal findings: (1) the four axes are largely independent among instruction-tuned models (all |r| < 0.42); (2) within-axis facet dissociations are empirically confirmed -- Compliance decomposes into fully independent formal and stance facets (r = 0.002), while Resilience decomposes into inversely related cognitive and adversarial facets; (3) a Compliance-Resilience paradox reveals that opinion-yielding and fact-vulnerability operate through independent channels; (4) RLHF reshapes temperament not only by shifting axis scores but by creating within-axis facet differentiation absent in the unaligned base model; and (5) temperament is independent of model size (1.7B-9B), confirming that MTI measures disposition rather than capability.

17. 【2604.02135】GaelEval: Benchmarking LLM Performance for Scottish Gaelic

Link: https://arxiv.org/abs/2604.02135

Authors: Peter Devine,William Lamb,Beatrice Alex,Ignatius Ezeani,Dawn Knight,Mícheál J. Ó Meachair,Paul Rayson,Martin Wynne

Categories: Computation and Language (cs.CL)

Keywords: Multilingual large language, Multilingual large, languages remains uneven, exhibit emergent, official support

Comments: 13 pages, to be published in Proceedings of LLMs4SSH (workshop co-located with LREC 2026; Mallorca, Spain; May 2026)

Abstract:Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark and (iii) a large-scale cultural knowledge QA task. Evaluating 19 LLMs against a fluent-speaker human baseline ($n=30$), we find that Gemini 3 Pro Preview achieves $83.3\%$ accuracy on the linguistic task, surpassing the human baseline ($78.1\%$). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+$2.4\%$). On the cultural task, leading models exceed $90\%$ accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting and shows a consistent performance gap favouring proprietary over open-weight models.

18. 【2604.02118】LLM-as-a-Judge for Time Series Explanations

Link: https://arxiv.org/abs/2604.02118

Authors: Preetham Sivalingam,Murari Mandal,Saurabh Deshpande,Dhruv Kumar

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Evaluating factual correctness, time series, Evaluating factual, LLM generated natural, time series data

Comments: Under Review

Abstract:Evaluating factual correctness of LLM generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free form textual reasoning. Thus, no general purpose method exists to directly verify whether an explanation is faithful to underlying time series data without predefined references or task specific rules. We study large language models as both generators and evaluators of time series explanations in a reference free setting, where given a time series, question, and candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi anomaly detection. Results show a clear asymmetry: generation is highly pattern dependent and exhibits systematic failures on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift, to 0.94 to 0.96 for Structural Break, while evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate feasibility of data grounded LLM based evaluation for time series explanations and highlight their potential as reliable evaluators of data grounded reasoning in the time series domain.

19. 【2604.02113】Reliable Control-Point Selection for Steering Reasoning in Large Language Models

Link: https://arxiv.org/abs/2604.02113

Authors: Haomin Zhuang,Hojun Yoo,Xiaonan Luo,Kehan Guo,Xiangliang Zhang

Categories: Computation and Language (cs.CL)

Keywords: requires identifying genuine, model hidden states, constructing effective vectors, effective vectors requires, vectors requires identifying

Comments:

Abstract:Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors -- such as self-reflection -- emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method achieves 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). Code is available at this https URL.
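The stability-filtering idea in this abstract can be illustrated with a small sketch. This is not the authors' code: the function names, the sample count `n_samples`, and the threshold `min_rate` are all assumptions made for illustration, and the "model" is a canned toy generator. The only thing taken from the abstract is the rule itself: re-generate from the same prefix and keep a boundary only if the target behavior is reproduced consistently.

```python
import itertools
from typing import Callable, List, Sequence


def stability_filter(
    prefixes: Sequence[str],
    regenerate: Callable[[str], str],
    shows_behavior: Callable[[str], bool],
    n_samples: int = 8,      # assumed number of re-generations per boundary
    min_rate: float = 0.75,  # assumed threshold for "consistently reproduces"
) -> List[str]:
    """Keep prefixes whose re-generated continuations show the behavior often enough."""
    kept = []
    for prefix in prefixes:
        hits = sum(shows_behavior(regenerate(prefix)) for _ in range(n_samples))
        if hits / n_samples >= min_rate:
            kept.append(prefix)
    return kept


# Toy stand-ins (hypothetical): canned continuations instead of a real LLM.
_canned = {
    "stable": itertools.cycle(["Wait, let me check.", "Wait, that is wrong."]),
    "unstable": itertools.cycle(
        ["So the answer is 4.", "Wait, hmm.", "Thus x = 2.", "Hence done."]
    ),
}


def toy_regenerate(prefix: str) -> str:
    return next(_canned[prefix])


def toy_detector(text: str) -> bool:
    # Keyword matching, standing in for a real behavior detector.
    return text.startswith("Wait")


kept = stability_filter(
    ["stable", "unstable"], toy_regenerate, toy_detector, n_samples=4, min_rate=0.75
)
print(kept)
```

In this toy run the "unstable" prefix reproduces the behavior in only one of four re-generations, so it is dropped before any steering vector would be computed from it.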

20. 【2604.02102】Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations

Link: https://arxiv.org/abs/2604.02102

Authors: Haitong Sun,Stephen McIntosh,Kwanghee Choi,Eunjung Yeo,Daisuke Saito,Nobuaki Minematsu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: self-supervised speech models, self-supervised speech, directly measured, Speech representations, Speech

Comments: Submitted to Interspeech 2026; 6 pages, 4 figures

Abstract:Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework to evaluate prosodic contrast with only a handful of examples and no explicit labels. Also, we build and release a dataset of English and Japanese minimal pairs and use it along with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making it practical for low-resource settings.
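For readers unfamiliar with the ABX discrimination task mentioned above, here is a minimal sketch of how an ABX score is commonly computed: X belongs to the same category as A, and a trial counts as correct when X is closer to A than to B. The abstract does not say how the paper represents utterances or measures distance; the fixed-size vectors and cosine distance below are simplifying assumptions, and the toy vectors are invented.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_accuracy(triples: Sequence[Tuple[Vector, Vector, Vector]]) -> float:
    """Fraction of (A, B, X) triples where X is closer to A than to B.

    X is assumed to share A's category; exact ties count as half correct.
    """
    score = 0.0
    for a, b, x in triples:
        dist_a, dist_b = cosine_distance(a, x), cosine_distance(b, x)
        if dist_a < dist_b:
            score += 1.0
        elif dist_a == dist_b:
            score += 0.5
    return score / len(triples)


# Invented 2-D "embeddings" for three trials; the third is a deliberate miss.
toy_triples = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),
    ([0.0, 1.0], [1.0, 0.0], [0.2, 0.8]),
    ([1.0, 0.0], [0.0, 1.0], [0.1, 0.9]),
]
accuracy = abx_accuracy(toy_triples)
print(round(accuracy, 3))
```

Because the score needs only minimal pairs and a distance function, no classifier has to be trained, which matches the abstract's point about working with a handful of examples and no explicit labels.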

21. 【2604.02091】Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

Link: https://arxiv.org/abs/2604.02091

Authors: Yuhang Wu,Xiangqing Shen,Fanfan Wang,Cangqi Zhou,Zhen Wu,Xinyu Dai,Rui Xia

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: refining retrieval results, play a pivotal, pivotal role, role in refining, results for Retrieval-Augmented

Comments: 16 pages

Abstract:Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.

22. 【2604.02051】Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation

Link: https://arxiv.org/abs/2604.02051

Authors: Jaber Jaber,Osama Jaber

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: shared weight block, Recursive transformers reuse, multiple depth steps, reuse a shared, shared weight

Comments: 10 pages, 5 tables, 1 figure, 1 algorithm. Code: [this https URL](https://github.com/RightNow-AI/ouroboros)

Abstract:Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent. We combine this with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration. On Qwen2.5-3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17-layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per-step norms) yet outperforms equivalently-sized static per-step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held-out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: this https URL
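The core mechanism described above, a diagonal modulation vector applied to frozen LoRA bases, can be written as y = W x + B diag(m) A x, where m depends on the current hidden state. The sketch below shows only that arithmetic in plain Python. The controller here (a linear map followed by 1 + tanh) is a hypothetical stand-in, since the abstract does not specify the Controller's architecture, and all matrices are tiny invented examples.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def controller(hidden: Vector, c_weights: Matrix) -> Vector:
    """Hypothetical controller: maps the hidden state to one scale per LoRA rank."""
    return [1.0 + math.tanh(z) for z in matvec(c_weights, hidden)]


def modulated_lora_forward(
    x: Vector, w_frozen: Matrix, lora_a: Matrix, lora_b: Matrix, modulation: Vector
) -> Vector:
    """Compute W x + B diag(m) A x without materializing diag(m)."""
    base = matvec(w_frozen, x)
    low_rank = matvec(lora_a, x)                      # project down to rank r
    scaled = [m * z for m, z in zip(modulation, low_rank)]  # diag(m) as elementwise scale
    delta = matvec(lora_b, scaled)                    # project back up
    return [b + d for b, d in zip(base, delta)]


# Invented toy sizes: model width 2, LoRA rank 1.
w_frozen = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 1.0]]
lora_b = [[1.0], [0.0]]
x = [1.0, 2.0]

y_half = modulated_lora_forward(x, w_frozen, lora_a, lora_b, [0.5])
y_off = modulated_lora_forward(x, w_frozen, lora_a, lora_b, [0.0])
m_zero_state = controller([0.0, 0.0], [[1.0, 1.0]])
print(y_half, y_off, m_zero_state)
```

Since only m changes between recurrence steps, the shared block and the LoRA bases can stay frozen while each step still applies a different effective transformation.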

23. 【2604.02047】Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

Link: https://arxiv.org/abs/2604.02047

Authors: Tao Jin,Phuong Minh Nguyen,Naoya Inoue

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Speculative decoding accelerates, Speculative decoding, drafting multiple candidate, language model inference, decoding accelerates large

Comments:

Abstract:Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.

24. 【2604.02045】BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs

Link: https://arxiv.org/abs/2604.02045

Authors: Nicolas Boizard,Théo Deschamps-Berger,Hippolyte Gisserot-Boukhlef,Céline Hudelot,Pierre Colombo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Transforming causal generative, BERT-style architectures, bidirectional encoders offers, Transforming causal, offers a powerful

Comments: 30 pages, 16 figures, 10 tables

Abstract:Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.

25. 【2604.02043】Tracking the emergence of linguistic structure in self-supervised models learning from speech

Link: https://arxiv.org/abs/2604.02043

Authors: Marianne de Heer Kloots,Martijn Bentum,Hosein Mohebbi,Charlotte Pouw,Gaofei Shen,Willem Zuidema

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Keywords: Self-supervised speech models, learn effective representations, Self-supervised speech, speech models learn, models learn effective

Comments:

Abstract:Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).

26. 【2604.02028】Why Gaussian Diffusion Models Fail on Discrete Data?

Link: https://arxiv.org/abs/2604.02028

Authors: Alexander Shabalin,Simon Elistratov,Viacheslav Meshchaninov,Ildus Sadrtdinov,Dmitry Vetrov

Categories: Computation and Language (cs.CL)

Keywords: Gaussian diffusion models, data remains challenging, remains challenging, Diffusion models, Random Hierarchy Model

Comments:

Abstract:Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.

27. 【2604.02008】$k$NNProxy: Efficient Training-Free Proxy Alignment for Black-Box Zero-Shot LLM-Generated Text Detection

Link: https://arxiv.org/abs/2604.02008

Authors: Kahim Wong,Kemou Li,Haiwei Wu,Jiantao Zhou

Categories: Computation and Language (cs.CL)

Keywords: reliable forensic analysis, mitigating LLM misuse, LLM-generated text, essential for reliable, reliable forensic

Comments:

Abstract:LLM-generated text (LGT) detection is essential for reliable forensic analysis and for mitigating LLM misuse. Existing LGT detectors can generally be categorized into two broad classes: learning-based approaches and zero-shot methods. Compared with learning-based detectors, zero-shot methods are particularly promising because they eliminate the need to train task-specific classifiers. However, the reliability of zero-shot methods fundamentally relies on the assumption that an off-the-shelf proxy LLM is well aligned with the often unknown source LLM, a premise that rarely holds in real-world black-box scenarios. To address this discrepancy, existing proxy alignment methods typically rely on supervised fine-tuning of the proxy or repeated interactions with commercial APIs, thereby increasing deployment costs, exposing detectors to silent API changes, and limiting robustness under domain shift. Motivated by these limitations, we propose the $k$-nearest neighbor proxy ($k$NNProxy), a training-free and query-efficient proxy alignment framework that repurposes the $k$NN language model ($k$NN-LM) retrieval mechanism as a domain adapter for a fixed proxy LLM. Specifically, a lightweight datastore is constructed once from a target-reflective LGT corpus, either via fixed-budget querying or from existing datasets. During inference, nearest-neighbor evidence induces a token-level predictive distribution that is interpolated with the proxy output, yielding an aligned prediction without proxy fine-tuning or per-token API outputs. To improve robustness under domain shift, we extend $k$NNProxy into a mixture of proxies (MoP) that routes each input to a domain-specific datastore for domain-consistent retrieval. Extensive experiments demonstrate strong detection performance of our method.
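The interpolation step described above follows the general $k$NN-LM recipe: retrieved neighbors vote for their next tokens with weights that decay with distance, and the resulting distribution is mixed with the proxy model's output. The sketch below shows that recipe on invented numbers. The softmax-over-negative-distance weighting is the standard $k$NN-LM form; the mixing weight `lam`, the `temperature`, and the toy data are assumptions and are not taken from the paper.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def knn_distribution(
    neighbors: Sequence[Tuple[float, str]], temperature: float = 1.0
) -> Dict[str, float]:
    """Turn (distance, next_token) neighbors into a token distribution."""
    weights: Dict[str, float] = defaultdict(float)
    for distance, token in neighbors:
        weights[token] += math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


def interpolate(
    p_proxy: Dict[str, float], p_knn: Dict[str, float], lam: float = 0.25
) -> Dict[str, float]:
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_proxy."""
    tokens = set(p_proxy) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_proxy.get(t, 0.0)
        for t in tokens
    }


# Invented example: three retrieved neighbors at equal distance.
toy_neighbors = [(0.0, "cat"), (0.0, "cat"), (0.0, "dog")]
toy_proxy = {"cat": 0.5, "dog": 0.5}

p_knn = knn_distribution(toy_neighbors)
p_aligned = interpolate(toy_proxy, p_knn, lam=0.5)
print(p_aligned)
```

The proxy's weights are never updated here; only the datastore changes the prediction, which is why the abstract can describe the approach as training-free.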

28. 【2604.01993】SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning

Link: https://arxiv.org/abs/2604.01993

Authors: Daeyong Kwon,Soyoung Yoon,Seung-won Hwang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: reward Large Language, Large Language Models, frequently reward Large, Large Language, reward Large

Comments:

Abstract:Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.

29. 【2604.01977】RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale

Link: https://arxiv.org/abs/2604.01977

Authors: Ayush Garg,Sophia Hager,Jacob Montiel,Aditya Tiwari,Michael Gentile,Zach Reavis,David Magnotti,Wayne Fullen

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)

Keywords: newly disclosed Common, disclosed Common Vulnerabilities, Security teams face, disclosed Common, National Vulnerability Database

Comments: 11 pages, 10 figures. To be submitted to CAMLIS 2026

Abstract:Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules--JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities--from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input for our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules across two dimensions--sensitivity (avoiding false negatives) and specificity (avoiding false positives)--achieving AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.

30. 【2604.01965】Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models

Link: https://arxiv.org/abs/2604.01965

Authors: Florian Kelber,Matthias Jobst,Yuni Susanti,Michael Färber

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)

Keywords: knowledge discovery increasingly, discovery increasingly relies, Scientific knowledge discovery, billions of parameters, knowledge discovery

Comments: Accepted at NSLP@LREC 2026

Abstract:Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.

31. 【2604.01957】Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite

Link: https://arxiv.org/abs/2604.01957

Authors: Klaudia Thellmann,Bernhard Stadler,Michael Färber

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: quality weaken confidence, Machine-translated benchmark datasets, uneven quality weaken, loss of structure, datasets reduce costs

Comments: Accepted at LREC 2026

Abstract:Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review -- complementing, not replacing, human gold standards.

32. 【2604.01938】How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization

Link: https://arxiv.org/abs/2604.01938

Authors: Ramon Ferrer-i-Cancho

Categories: Computation and Language (cs.CL); Statistical Mechanics (cond-mat.stat-mech); Physics and Society (physics.soc-ph)

Keywords: swap distance minimization, swap distance, vertices produces, produces the permutation, adjacent elements

Comments:

Abstract:The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least $77\%$ optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
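The swap distance used above is the shortest-path distance in the permutohedron, that is, the minimum number of adjacent swaps needed to turn one order into another. It equals the number of pairs that appear in opposite relative order in the two sequences. The sketch below computes only this distance; it does not reproduce the paper's optimality score, whose definition is not given in the abstract. The S/O/V labels are the usual subject/object/verb shorthand, chosen here for illustration.

```python
from itertools import combinations
from typing import Hashable, Sequence


def swap_distance(source: Sequence[Hashable], target: Sequence[Hashable]) -> int:
    """Minimum number of adjacent swaps turning `source` into `target`.

    Both sequences must contain the same distinct elements. The result is the
    count of element pairs whose relative order differs (inversions).
    """
    position = {item: index for index, item in enumerate(target)}
    ranks = [position[item] for item in source]
    return sum(
        1 for i, j in combinations(range(len(ranks)), 2) if ranks[i] > ranks[j]
    )


d_sov_svo = swap_distance(("S", "O", "V"), ("S", "V", "O"))  # neighbors in the permutohedron
d_sov_vos = swap_distance(("S", "O", "V"), ("V", "O", "S"))  # full reversal
print(d_sov_svo, d_sov_vos)
```

For three elements the permutohedron is a hexagon, so the largest possible distance is 3, reached here by the fully reversed order.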

33. 【2604.01936】Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification

Link: https://arxiv.org/abs/2604.01936

Authors: Géraud Faye,Benjamin Icard,Morgane Casanova,Guillaume Gadek,Guillaume Gravier,Wassila Ouerdane,Céline Hudelot,Sylvain Gatepaille,Paul Égré

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: mix oriented messages, factual reports intended, tend to mix, mix oriented, oriented messages

Comments:

Abstract:Among news disorders, propagandist news are particularly insidious, because they tend to mix oriented messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on Language Models such as BERT are promising but often overfit their training datasets, due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features. Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness

34. 【2604.01925】ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues

Link: https://arxiv.org/abs/2604.01925

Authors: Bhaskara Hanuma Vedula,Darshan Anghan,Ishita Goyal,Ponnurangam Kumaraguru,Abhijnan Chakraborty

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models increasingly, increasingly suppress biased, suppress biased outputs

Comments:

Abstract:Large Language Models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic based cues, culturally associated attributes that signal implicitly, across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset for model providers and researchers to benchmark potential mitigation techniques.

35. 【2604.01924】Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients

Link: https://arxiv.org/abs/2604.01924

Authors: Oumaima El Khettari,Virgile Barthet,Guillaume Hocquet,Joconde Weller,Emmanuel Morin,Pierre Zweigenbaum

Categories: Computation and Language (cs.CL)

Keywords: electronic health record, Accurate short-term mortality, structured electronic health, heart failure, health record

Comments: Accepted in LREC 2026

Click to view abstract

Abstract:Accurate short-term mortality prediction in heart failure (HF) remains challenging, particularly when relying on structured electronic health record (EHR) data alone. We evaluate transformer-based models on a French HF cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches. Our results show that enriching clinical text with entity-level representations improves prediction over CLS embeddings alone, and that supervised multimodal fusion of text and structured variables achieves the best overall performance. In contrast, large language models perform inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs. These findings highlight that entity-aware multimodal transformers offer the most reliable solution for short-term HF outcome prediction, while current LLM prompting remains limited for clinical decision support.

36. 【2604.01916】SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations

Link: https://arxiv.org/abs/2604.01916

Authors: Yiqiang Cai,Chengyan Wu,Bolei Ma,Bo Chen,Yun Xue,Julia Hirschberg,Ziwei Gong

Categories: Computation and Language (cs.CL)

Keywords: requires integrating multimodal, integrating multimodal signals, requires integrating, reasoning, Synergistic Uncertainty-aware REasoning

Comments: ICASSP 2026

Click to view abstract

Abstract:Multimodal emotion recognition in conversations (MERC) requires integrating multimodal signals while being robust to noise and modeling contextual reasoning. Existing approaches often emphasize fusion but overlook uncertainty in noisy features and fine-grained reasoning. We propose SURE (Synergistic Uncertainty-aware REasoning) for MERC, a framework that improves robustness and contextual modeling. SURE consists of three components: an Uncertainty-Aware Mixture-of-Experts module to handle modality-specific noise, an Iterative Reasoning module for multi-turn reasoning over context, and a Transformer Gate module to capture intra- and inter-modal interactions. Experiments on benchmark MERC datasets show that SURE consistently outperforms state-of-the-art methods, demonstrating its effectiveness in robust multimodal reasoning. These results highlight the importance of uncertainty modeling and iterative reasoning in advancing emotion recognition in conversational settings.

37. 【2604.01881】HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models

Link: https://arxiv.org/abs/2604.01881

Authors: Yansong Guo,Chaoyang Zhu,Jiayi Ji,Jianghang Lin,Liujuan Cao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large Language Models, demonstrated impressive capabilities, significant computational burden, Language Models, Video Large Language

Comments:

Click to view abstract

Abstract:Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at the input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations, that videos possess a segment-frame structure and that LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy gradually shrinks as the LLM layer depth increases, without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.

38. 【2604.01853】Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution

Link: https://arxiv.org/abs/2604.01853

Authors: Samuel Rose,Debarati Chakraborty

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: spelling errors exhibit, errors exhibit systematic, typically developing writers, exhibit systematic phonological, Dyslexic spelling errors

Comments:

Click to view abstract

Abstract:Dyslexic spelling errors exhibit systematic phonological and orthographic patterns that distinguish them from the errors produced by typically developing writers. While this observation has motivated dyslexic-specific spell-checking and assistive writing tools, prior work has focused predominantly on error correction rather than attribution, and has largely neglected the ethical risks. The risks of harmful labelling, covert screening, algorithmic bias, and institutional misuse that automated classification of learners entails require the development of robust ethical and legal frameworks for research in this area. This paper addresses both gaps. We formulate dyslexic error attribution as a binary classification task: given a misspelt word and its correct target form, determine whether the error pattern is characteristic of a dyslexic or non-dyslexic writer. We develop a comprehensive feature set capturing orthographic, phonological, and morphological properties of each error, and propose a twin-input neural model evaluated against traditional machine learning baselines under writer-independent conditions. The neural model achieves 93.01% accuracy and an F1-score of 94.01%, with phonetically plausible errors and vowel confusions emerging as the strongest attribution signals. We situate these technical results within an explicit ethics-first framework, analysing fairness across subgroups, the interpretability requirements of educational deployment, and the conditions (consent, transparency, human oversight, and recourse) under which a system could be responsibly used. We provide concrete guidelines for ethical deployment and an open discussion of the system's limitations and misuse potential. Our results demonstrate that dyslexic error attribution is feasible at high accuracy while underscoring that feasibility alone is insufficient for deployment in high-stakes educational contexts.

39. 【2604.01849】From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion

Link: https://arxiv.org/abs/2604.01849

Authors: Liang Zhu,Haolin Chen,Lidong Zhao,Xian Wu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, amidst insufficient context, demonstrated exceptional proficiency, fully concrete code

Comments:

Click to view abstract

Abstract:While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user's subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill them in directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and design a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B--14B parameter models demonstrate that APC reduces expected editing costs by 19% to 50% while preserving standard HC performance. Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.
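
The idea of emitting a placeholder at high-entropy positions can be illustrated with a fixed-threshold heuristic. This is only a sketch under assumptions: the paper learns the behaviour with reinforcement learning, whereas the threshold value, the `<PH>` marker and the toy distributions below are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def complete_with_placeholders(steps, threshold=1.0, placeholder="<PH>"):
    """Emit the top token at confident positions and a placeholder elsewhere.

    `steps` is a list of {token: probability} dicts, one per position.
    `threshold` and `placeholder` are hypothetical values for illustration.
    """
    output = []
    for dist in steps:
        if token_entropy(dist.values()) > threshold:
            output.append(placeholder)
        else:
            output.append(max(dist, key=dist.get))
    return output

steps = [
    {"return": 0.97, "print": 0.03},                       # confident position
    {"foo": 0.26, "bar": 0.25, "baz": 0.25, "qux": 0.24},  # near-uniform position
]
print(complete_with_placeholders(steps))
```

The first position has low entropy and keeps its top token, while the near-uniform second position exceeds the threshold and becomes a placeholder.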

40. 【2604.01837】PLOT: Enhancing Preference Learning via Optimal Transport

Link: https://arxiv.org/abs/2604.01837

Authors: Liang Zhu,Yuelin Bai,Xiankun Ren,Jiaxi Yang,Lei Zhang,Feiteng Fang,Hamid Alinejad-Rokny,Minghuan Tan,Min Yang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, high computational costs, existing methods remain, methods remain limited

Comments:

Click to view abstract

Abstract:Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global token-level relationships. We introduce PLOT, which enhances Preference Learning in fine-tuning-based alignment through a token-level loss derived from Optimal Transport. By formulating preference learning as an Optimal Transport Problem, PLOT aligns model outputs with human preferences while preserving the original distribution of LLMs, ensuring stability and robustness. Furthermore, PLOT leverages token embeddings to capture semantic relationships, enabling globally informed optimization. Experiments across two preference categories - Human Values and Logic Problem Solving - spanning seven subpreferences demonstrate that PLOT consistently improves alignment performance while maintaining fluency and coherence. These results substantiate optimal transport as a principled methodology for preference learning, establishing a theoretically grounded framework that provides new insights for preference learning of LLMs.

41. 【2604.01833】Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks

Link: https://arxiv.org/abs/2604.01833

Authors: Yaxin Luo,Zhiqiang Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: pre-training models differs, models differs significantly, pre-training models, language pre-training models, differs significantly

Comments:

Click to view abstract

Abstract:The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) adaptation inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.

42. 【2604.01787】DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

Link: https://arxiv.org/abs/2604.01787

Authors: Liang Zhu,Feiteng Fang,Yuelin Bai,Longze Chen,Zhexiang Zhang,Minghuan Tan,Min Yang

Categories: Computation and Language (cs.CL)

Keywords: Proximal Policy Optimization, aligns Large Language, Policy Optimization, Proximal Policy, Human Feedback

Comments:

Click to view abstract

Abstract:Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.

43. 【2604.01779】Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens

Link: https://arxiv.org/abs/2604.01779

Authors: Hanna Hubarava,Yingqiang Gao

Categories: Computation and Language (cs.CL)

Keywords: Controllable Automatic Text, produces user-tailored outputs, Controllable Automatic, Automatic Text Simplification, Automatic Text

Comments:

Click to view abstract

Abstract:Controllable Automatic Text Simplification (CATS) produces user-tailored outputs, yet controllability is often treated as a decoding problem and evaluated with metrics that do not reflect the degree of control. We observe that controllability in ATS is significantly constrained by data and evaluation. To this end, we introduce a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, steering open-source models to target readability levels and compression rates. Across three model families with different model sizes (Llama, Mistral, Qwen; 1-14B) and four domains (medicine, public administration, news, encyclopedic text), we find that smaller models (1-3B) can be competitive, but reliable controllability strongly depends on whether the training data encodes sufficient variation in the target attribute. Readability control (FKGL, ARI, Dale-Chall) is learned consistently, whereas compression control underperforms due to limited signal variability in the existing corpora. We further show that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based measures for target-output alignment. Finally, our sampling and stratification experiments demonstrate that naive splits can introduce distributional mismatch that undermines both training and evaluation.

44. 【2604.01762】FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models

Link: https://arxiv.org/abs/2604.01762

Authors: Juyong Jiang,Fan Wang,Hong Qi,Sunghun Kim,Jing Tang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: adapting large language, constrained computational budgets, large language models, standard PEFT methods, adapting large

Comments: The first two authors contributed equally to this work; listing order is random

Click to view abstract

Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.
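
The claim that conjugate-symmetric coefficients reconstruct into real-valued weights follows from a standard property of the discrete Fourier transform. The one-dimensional sketch below checks it numerically with a naive inverse DFT; the helper names and the coefficient values are hypothetical, and this is not the paper's expert parameterization.

```python
import cmath

def inverse_dft(coeffs):
    """Naive inverse discrete Fourier transform of a list of complex coefficients."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(coeffs)) / n
        for t in range(n)
    ]

def conjugate_symmetric(half):
    """Extend coefficients X[0..n/2] to length n (n even) with X[n-k] = conj(X[k]).

    X[0] and X[n/2] must be real for the symmetry to hold.
    """
    mirrored = [c.conjugate() for c in reversed(half[1:-1])]
    return list(half) + mirrored

# Hypothetical free coefficients for n = 6 (first and last entries are real).
half = [2.0 + 0j, 1.0 - 3.0j, -0.5 + 0.25j, 4.0 + 0j]
spectrum = conjugate_symmetric(half)
weights = inverse_dft(spectrum)
print(max(abs(w.imag) for w in weights))  # imaginary parts vanish up to rounding
```

Only the first half of the spectrum is free, so the symmetry constraint also halves the number of independent complex coefficients.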

45. 【2604.01754】LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches

Link: https://arxiv.org/abs/2604.01754

Authors: Linyang He,Qiyao Yu,Hanze Dong,Baohao Liao,Xinxing Xu,Micah Goldblum,Jiang Bian,Nima Mesgarani

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models, human intelligence, artificial intelligence, cognitive science, hallmark of human

Comments: Project page: https://livemathematicianbench.github.io/

Click to view abstract

Abstract:Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.

46. 【2604.01745】Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text

Link: https://arxiv.org/abs/2604.01745

Authors: Melania Berbatova,Tsvetoslav Vasev

Categories: Computation and Language (cs.CL)

Keywords: inadvertently blocking valuable, blocking valuable information, online communication remains, significant challenge, communication remains

Comments:

Click to view abstract

Abstract:Toxic content detection in online communication remains a significant challenge, with current solutions often inadvertently blocking valuable information, including medical terms and text related to minority groups. This paper presents a more nuanced approach to identifying toxicity in Bulgarian text while preserving access to essential information. The research explores two distinct methodologies for detecting toxic content. The developed methodologies have potential applications across diverse online platforms and content moderation systems. First, we propose an ontology that models the potentially toxic words in the Bulgarian language. Then, we compose a dataset that comprises 4,384 manually annotated sentences from Bulgarian online forums across four categories: toxic language, medical terminology, non-toxic language, and terms related to minority communities. We then train a BERT-based model for toxic language classification, which reaches a 0.89 F1 macro score. The trained model is directly applicable in a real environment and can be integrated as a component of toxic content detection systems.

47. 【2604.01733】From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents

Link: https://arxiv.org/abs/2604.01733

Authors: Meftun Akarsu,Recep Kaan Karaman,Christopher Mierbach

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: systems critically depend, systems critically, tabular data, critically depend, systematic comparison

Comments: 11 pages, 6 figures, 6 tables

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) systems critically depend on retrieval quality, yet no systematic comparison of modern retrieval methods exists for heterogeneous documents containing both text and tabular data. We benchmark ten retrieval strategies spanning sparse, dense, hybrid fusion, cross-encoder reranking, query expansion, index augmentation, and adaptive retrieval on a challenging financial QA benchmark of 23,088 queries over 7,318 documents with mixed text-and-table content. We evaluate retrieval quality via Recall@k, MRR, and nDCG, and end-to-end generation quality via Number Match, with paired bootstrap significance testing. Our results show that (1) a two-stage pipeline combining hybrid retrieval with neural reranking achieves Recall@5 of 0.816 and MRR@3 of 0.605, outperforming all single-stage methods by a large margin; (2) BM25 outperforms state-of-the-art dense retrieval on financial documents, challenging the common assumption that semantic search universally dominates; and (3) query expansion methods (HyDE, multi-query) and adaptive retrieval provide limited benefit for precise numerical queries, while contextual retrieval yields consistent gains. We provide ablation studies on fusion methods and reranker depth, actionable cost-accuracy recommendations, and release our full benchmark code.
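
Recall@k and MRR, two of the retrieval metrics named above, can be computed per query as in the sketch below. The document IDs are made up for illustration, and benchmark scores such as Recall@5 are obtained by averaging these per-query values over all queries.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking and relevance labels for a single query.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = ["d2", "d1"]
print(recall_at_k(ranked, relevant, 5))  # both relevant documents are in the top 5
print(mrr_at_k(ranked, relevant, 3))     # first relevant document is at rank 2
```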

48. 【2604.01711】Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition

Link: https://arxiv.org/abs/2604.01711

Authors: Truc Nguyen,Then Tran,Binh Truong,Phuoc Nguyen T. H

Categories: Computation and Language (cs.CL)

Keywords: remains challenging due, reliable annotated data, remains challenging, annotated data, challenging due

Comments: 6 pages, 2 figures. Dataset of 2,764 Vietnamese speech samples across three emotion classes

Click to view abstract

Abstract:Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases. Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.
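
A confidence-based routing step of the kind described above might look like the following sketch. The threshold of 0.8, the route names and the probability values are assumptions made for illustration; the abstract does not specify them.

```python
def route_sample(class_probs, threshold=0.8):
    """Return (label, route) for one sample.

    `class_probs` maps emotion labels to acoustic-model probabilities.
    Samples whose top probability falls below `threshold` are marked for
    LLM reasoning; the threshold value is a hypothetical choice.
    """
    label = max(class_probs, key=class_probs.get)
    confidence = class_probs[label]
    route = "acoustic" if confidence >= threshold else "llm"
    return label, route

# An easy sample keeps the acoustic prediction; an ambiguous one is delegated.
print(route_sample({"calm": 0.92, "angry": 0.05, "panic": 0.03}))
print(route_sample({"calm": 0.10, "angry": 0.48, "panic": 0.42}))
```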

49. 【2604.01707】Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

Link: https://arxiv.org/abs/2604.01707

Authors: Yanchen Wu,Tenghui Lin,Yingli Zhou,Fangyuan Zhang,Qintian Guo,Xun Zhou,Sibo Wang,Xilin Liu,Yuchi Ma,Yixiang Fang

Categories: Computation and Language (cs.CL); Databases (cs.DB)

Keywords: large language model, long-horizon complex tasks, enable knowledge accumulation, multi-turn dialogue, game playing

Comments:

Click to view abstract

Abstract:Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. A number of memory methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework that incorporates all the existing agent memory methods from a high-level perspective. We then extensively compare representative agent memory methods on two well-known benchmarks and examine the effectiveness of all methods, providing a thorough analysis of those methods. As a byproduct of our experimental analysis, we also design a new memory method by exploiting modules in the existing methods, which outperforms the state-of-the-art methods. Finally, based on these findings, we offer promising future research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide valuable new insights for future research.

50. 【2604.01705】Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

Link: https://arxiv.org/abs/2604.01705

Authors: Ruijie Yang,Yan Zhu,Peiyao Fu,Te Luo,Zhihua Wang,Xian Yang,Quanlin Li,Pinghong Zhou,Shuo Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Automatic speech recognition, Automatic speech, complex acoustic conditions, speech recognition, terminology and complex

Comments: Under review at npj Digital Medicine

Click to view abstract

Abstract:Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.
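
For readers unfamiliar with the two headline metrics, character error rate is the character-level edit distance divided by the reference length, and real-time factor is processing time divided by audio duration. The sketch below uses the standard definitions; the example strings and timings are invented, and the paper's exact text normalization is not known here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def character_error_rate(ref, hyp):
    """Character-level edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

def real_time_factor(processing_seconds, audio_seconds):
    """Values below 1 mean the system runs faster than real time."""
    return processing_seconds / audio_seconds

print(character_error_rate("polyp", "polip"))  # one substitution in five characters
print(real_time_factor(0.5, 100.0))            # 0.5 s of compute for 100 s of audio
```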

51. 【2604.01702】On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning

Link: https://arxiv.org/abs/2604.01702

Authors: Zhaoyi Li,Xiangyu Xi,Zhengyu Chen,Wei Wang,Gangwei Jiang,Ranran Shen,Linqi Song,Ying Wei,Defu Lian

Categories: Computation and Language (cs.CL)

Keywords: Supervised Fine-Tuning, texttt, pivotal phase, Supervised, SFT

Comments: Under Review

Click to view abstract

Abstract:Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, \texttt{DeepSeek-R1-0528} and \texttt{gpt-oss-120b}, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on \texttt{DeepSeek-R1-0528} data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to those trained on \texttt{gpt-oss-120b}. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns. \texttt{gpt-oss-120b} exhibits highly convergent and deductive trajectories, whereas \texttt{DeepSeek-R1-0528} favors a divergent and branch-heavy exploration pattern. Consequently, models trained with \texttt{DeepSeek-R1} data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions. Building upon this insight, we propose a simple yet effective remedy of filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected \texttt{DeepSeek-R1-0528} subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% on five benchmarks.

52. 【2604.01694】MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning

Link: https://arxiv.org/abs/2604.01694

Authors: Sten Rüdiger, Sebastian Raschka

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Minor Component Adaptation, adapting underutilized subspaces, Minor Component, parameter-efficient fine-tuning method, Component Adaptation

Comments:

Abstract:Minor Component Adaptation (MiCA) is a novel parameter-efficient fine-tuning method for large language models that focuses on adapting underutilized subspaces of model representations. Unlike conventional methods such as Low-Rank Adaptation (LoRA), which target dominant subspaces, MiCA leverages Singular Value Decomposition to identify subspaces related to minor singular vectors associated with the least significant singular values and constrains the update of parameters during fine-tuning to those directions. This strategy leads to up to 5.9x improvement in knowledge acquisition under optimized training hyperparameters and a minimal parameter footprint of 6-60% compared to LoRA. These results suggest that constraining adaptation to minor singular directions provides a more efficient and stable mechanism for integrating new knowledge into pre-trained language models.
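
As a rough illustration of the idea in this abstract, the NumPy sketch below finds the left singular vectors of a weight matrix that belong to its smallest singular values and confines an update to that subspace. The parametrization (a frozen basis `U_minor` multiplied by a trainable factor `B`), the rank `r`, and every name here are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def minor_subspace(W, r):
    """Return the r left singular vectors of W with the smallest singular values."""
    # np.linalg.svd returns singular values in descending order,
    # so the last r columns of U span the minor directions.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, -r:]

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))          # stand-in for a pretrained weight matrix
r = 2                                    # assumed adapter rank
U_minor = minor_subspace(W, r)           # frozen basis, shape (8, 2)
B = rng.standard_normal((r, 6)) * 0.01   # hypothetical trainable factor
delta_W = U_minor @ B                    # update confined to the minor directions
W_adapted = W + delta_W
```

Because the columns of `U` are orthonormal, `delta_W` has no component along the dominant left singular directions of `W`, which is the constraint the abstract describes.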

53. 【2604.01683】Coupled Query-Key Dynamics for Attention

Link: https://arxiv.org/abs/2604.01683

Authors: Barak Gahtan, Alex M. Bronstein

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Standard scaled dot-product, dot-product attention computes, attention computes scores, scaled dot-product attention, scores from static

Comments:

Abstract:Standard scaled dot-product attention computes scores from static, independent projections of the input. We show that evolving queries and keys jointly through shared learned dynamics before scoring - which we call coupled QK dynamics - improves language modeling perplexity and training stability. On WikiText-103 at 60M parameters, coupled dynamics achieves 22.55-22.62 perplexity vs. 24.22 for standard attention (-6.6% to -6.9%), with only 0.11% additional parameters (shared across both instantiations). A structural ablation isolates coupling as the active ingredient: a symplectic (Hamiltonian) and a non-symplectic (Euler) integrator perform identically when both couple Q and K, while an uncoupled MLP baseline of matched capacity reaches only 23.81 with 8x higher seed variance. The integration step count (1-7) is similarly irrelevant - a single coupled step suffices. A compute-matched comparison reveals that coupling is a sample-efficiency mechanism: standard attention trained for 2.4x longer (matching wall-clock) reaches the same perplexity, but requires 2.4x more tokens. The advantage scales to 150M (-6.7%) but narrows at 350M (-1.0%), where Differential Attention (18.93) overtakes coupled dynamics (19.35). The benefit is corpus-dependent: coupling helps on domain-coherent text (WikiText-103 -6.6%, PubMed -4.5%) but degrades on heterogeneous web text (+10.3%) and shows no benefit on GLUE. We characterize when coupling helps and when it does not, providing practical guidelines.

54. 【2604.01682】PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment

Link: https://arxiv.org/abs/2604.01682

Authors: Chenning Xu, Mao Zheng, Mingyang Song

Categories: Computation and Language (cs.CL)

Keywords: amplify overconfident imitation, Supervised fine-tuning, token-level hard labels, factually unsupported targets, causing hallucinations

Comments:

Abstract:Supervised fine-tuning (SFT) with token-level hard labels can amplify overconfident imitation of factually unsupported targets, causing hallucinations that propagate in multi-sentence generation. We study an augmented SFT setting in which training instances include coarse sentence-level factuality risk labels and inter-sentence dependency annotations, providing structured signals about where factual commitments are weakly supported. We propose PRISM, a differentiable risk-gated framework that modifies learning only at fact-critical positions. PRISM augments standard SFT with a lightweight, model-aware probability reallocation objective that penalizes high-confidence predictions on risky target tokens, with its scope controlled by span-level risk weights and model-aware gating. Experiments on hallucination-sensitive factual benchmarks and general evaluations show that PRISM improves factual aggregates across backbones while maintaining a competitive overall capability profile. Ablations further show that the auxiliary signal is most effective when used conservatively, and that knowledge masking and model-aware reallocation play complementary roles in balancing factual correction and capability preservation.

55. 【2604.01671】PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation

Link: https://arxiv.org/abs/2604.01671

Authors: Yanxin Luo, Xiaoyu Zhang, Jing Li, Yan Gao, Donghong Han

Categories: Computation and Language (cs.CL)

Keywords: Emotional Support Conversation, Support Conversation, alleviate individual emotional, individual emotional distress, generating empathetic responses

Comments: 14 pages, 6 figures, 5 tables. Submitted to Transactions of the Association for Computational Linguistics (TACL)

Abstract:Emotional Support Conversation (ESC) aims to alleviate individual emotional distress by generating empathetic responses. However, existing methods face challenges in effectively supporting deep contextual understanding. To address this issue, we propose PRCCF, a Persona-guided Retrieval and Causality-aware Cognitive Filtering framework. Specifically, the framework incorporates a persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment to enhance response generation. Furthermore, it employs a causality-aware cognitive filtering module to prioritize causally relevant external knowledge, thereby improving contextual cognitive understanding for emotional reasoning. Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations. Our code is publicly available at: this https URL.

56. 【2604.01657】What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis

Link: https://arxiv.org/abs/2604.01657

Authors: Delip Rao, Chris Callison-Burch

Categories: Computation and Language (cs.CL)

Keywords: rapid progress, progress in claim, lack a systematic, systematic understanding, reasoning

Comments: 11 pages

Abstract:Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning these benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain -- general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.

57. 【2604.01652】ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models

Link: https://arxiv.org/abs/2604.01652

Authors: Delip Rao, Feijiang Han, Chris Callison-Burch

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: grounded claim verification, produces a short, structured rationale, binary verdict, grounded claim

Comments: 15 pages

Abstract:We present ThinknCheck, a 1B-parameter verifier for grounded claim verification that first produces a short, structured rationale and then a binary verdict. We construct LLMAggreFact-Think, a 24.1k reasoning-augmented training set derived from LLMAggreFact, and fine-tune a 4-bit Gemma3 model to follow this format. On LLMAggreFact, ThinknCheck attains 78.1 balanced accuracy (BAcc), surpassing MiniCheck-7B (77.4) with 7x fewer parameters; removing the reasoning step reduces BAcc to 57.5. On SciFact, ThinknCheck reaches 64.7 BAcc, a +14.7 absolute gain over MiniCheck-7B. By contrast, zero-shot chain-of-thought on the base Gemma3-1B harms accuracy relative to direct answers, and preference optimization with a simple format+accuracy reward underperforms supervised reasoning. To probe the latter, we introduce GSMClaims and a domain-specialized variant, ThinknCheck-Science, which improves across benchmarks, including 61.0% accuracy on GSMClaims. Overall, explicit, supervised reasoning enables compact verifiers that are competitive while remaining resource-efficient and interpretable.
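
The abstract reports balanced accuracy (BAcc) rather than plain accuracy. As a small sketch, assuming the standard definition (the mean of per-class recall), the metric can be computed as below; the toy labels are made up for illustration and are not data from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy labels: 1 = claim supported by the document, 0 = not supported.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
bacc = balanced_accuracy(y_true, y_pred)   # (3/4 + 1/2) / 2
```

On imbalanced label sets this weights each class equally, so a verifier cannot score well simply by favoring the majority verdict.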

58. 【2604.01639】Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations

Link: https://arxiv.org/abs/2604.01639

Authors: Shou-Tzu Han, Rodrigue Rizk, KC Santosh

Categories: Computation and Language (cs.CL)

Keywords: Large language models, mathematical reasoning benchmarks, demonstrate strong performance, remain surprisingly fragile, Large language

Comments: Preprint. Under review at COLM 2026

Abstract:Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently more disruptive than name swaps. To trace the mechanistic basis of these failures, we introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) into a unified diagnostic pipeline. CAI, a novel metric quantifying layer-wise divergence amplification, outperforms first divergence layer as a failure predictor for two of three architectures (AUC up to 0.679). Logit lens reveals that flipped samples diverge from correct predictions at significantly earlier layers than stable samples. Activation patching reveals a stark architectural divide in failure localizability: Llama-3 failures are recoverable by patching at specific layers (43/60 samples), while Mistral and Qwen failures are broadly distributed (3/60 and 0/60). Based on these diagnostic signals, we propose a mechanistic failure taxonomy (localized, distributed, and entangled) and validate it through targeted repair experiments: steering vectors and layer fine-tuning recover 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures.

59. 【2604.01634】CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning

Link: https://arxiv.org/abs/2604.01634

Authors: Junyoung Sung, Seungwoo Lyu, Minjun Kim, Sumin An, Arsha Nagrani, Paul Hongsuck Seo

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: connecting textual context, requires combining information, Real-world reasoning, information across modalities, connecting textual

Comments: Accepted to CVPR 2026

Abstract:Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on single images or set of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT consists of diverse domains ranging from natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.

60. 【2604.01630】Grounding AI-in-Education Development in Teachers' Voices: Findings from a National Survey in Indonesia

Link: https://arxiv.org/abs/2604.01630

Authors: Nurul Aisyah, Muhammad Dehan Al Kautsar, Arif Hidayat, Fajri Koto

Categories: Computation and Language (cs.CL)

Keywords: teacher-centred evidence, systems and policies, context-appropriate AI systems, Indonesian classrooms, teachers

Comments:

Abstract:Despite emerging use in Indonesian classrooms, there is limited large-scale, teacher-centred evidence on how AI is used in practice and what support teachers need, hindering the development of context-appropriate AI systems and policies. To address this gap, we conduct a nationwide survey of 349 K-12 teachers across elementary, junior high, and senior high schools. We find increasing use of AI for pedagogy, content development, and teaching media, although adoption remains uneven. Elementary teachers report more consistent use, while senior high teachers engage less; mid-career teachers assign higher importance to AI, and teachers in Eastern Indonesia perceive greater value. Across levels, teachers primarily use AI to reduce instructional preparation workload (e.g., assessment, lesson planning, and material development). However, generic outputs, infrastructure constraints, and limited contextual alignment continue to hinder effective classroom integration.

61. 【2604.01624】OSCAR: Orchestrated Self-verification and Cross-path Refinement

Link: https://arxiv.org/abs/2604.01624

Authors: Yash Shah, Abhijit Chakraborty, Naresh Kumar Devulapally, Vishnu Lokhande, Vivek Gupta

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: ideal hallucination mitigation, hallucination mitigation framework, trained hallucination classifier, externally trained hallucination, hallucination classifier

Comments:

Abstract:Diffusion language models (DLMs) expose their denoising trajectories, offering a natural handle for inference-time control; accordingly, an ideal hallucination mitigation framework should intervene during generation using this model-native signal rather than relying on an externally trained hallucination classifier. Toward this, we formulate commitment uncertainty localization: given a denoising trajectory, identify token positions whose cross-chain entropy exceeds an unsupervised threshold before factually unreliable commitments propagate into self-consistent but incorrect outputs. We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods. We also introduce OSCAR, a training-free inference-time framework operationalizing this formulation. OSCAR runs N parallel denoising chains with randomized reveal orders, computes cross-chain Shannon entropy to detect high-uncertainty positions, and then performs targeted remasking conditioned on retrieved evidence. Ablations confirm that localization and correction contribute complementary gains, robust across N in {4, 8, 16}. On TriviaQA, HotpotQA, RAGTruth, and CommonsenseQA using LLaDA-8B and Dream-7B, OSCAR enhances generation quality by significantly reducing hallucinated content and improving factual accuracy through uncertainty-guided remasking, which also facilitates more effective integration of retrieved evidence. Its native entropy-based uncertainty signal surpasses that of specialized trained detectors, highlighting an inherent capacity of diffusion language models to identify factual uncertainty that is not present in the sequential token commitment structure of autoregressive models. We are releasing the codebase to support future research on localization and uncertainty-aware generation in DLMs.
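
The localization step described in this abstract can be pictured with a toy example: given the tokens that N chains commit at each position, compute the Shannon entropy of the empirical token distribution per position and flag positions above a threshold. This is only a sketch under stated assumptions. The paper may compute entropy over predictive distributions rather than hard tokens, and the fixed `threshold` below is hypothetical (the abstract mentions an unsupervised threshold without specifying it).

```python
import math
from collections import Counter

def cross_chain_entropy(chains):
    """Shannon entropy (in bits) of the tokens proposed at each position across chains."""
    n = len(chains)
    entropies = []
    for tokens in zip(*chains):            # iterate position by position
        counts = Counter(tokens)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        entropies.append(h)
    return entropies

# Four hypothetical chains that agree everywhere except the last position.
chains = [
    ["Paris", "is", "in", "France"],
    ["Paris", "is", "in", "France"],
    ["Paris", "is", "in", "Italy"],
    ["Paris", "is", "in", "Spain"],
]
threshold = 1.0                            # hypothetical fixed threshold
H = cross_chain_entropy(chains)
uncertain = [i for i, h in enumerate(H) if h > threshold]   # candidates for remasking
```

Here only the final position is flagged, which is where a remasking step conditioned on retrieved evidence would be applied.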

62. 【2604.01622】Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models

Link: https://arxiv.org/abs/2604.01622

Authors: Shuibai Zhang, Caspian Zhuang, Chihan Cui, Zhihan Yang, Fred Zhangzhi Peng, Yanxin Zhang, Haoyue Bai, Zack Jia, Yang Zhou, Guanhua Chen, Ming Liu

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: non-autoregressive text generation, Diffusion language models, Diffusion language, models inherit token-choice, enable parallel

Comments: 26 pages

Abstract:Diffusion language models (DLMs) enable parallel, non-autoregressive text generation, yet existing DLM mixture-of-experts (MoE) models inherit token-choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert-choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep-dependent expert capacity, which varies expert allocation according to the denoising step. We find that allocating more capacity to low-mask-ratio steps consistently achieves the best performance under matched FLOPs, and provide a mechanistic explanation: tokens in low-mask-ratio contexts exhibit an order-of-magnitude higher learning efficiency, so concentrating compute on these steps yields the largest marginal return. Finally, we show that existing pretrained TC DLMs can be retrofitted to EC by replacing only the router, achieving faster convergence and improved accuracy across diverse downstream tasks. Together, these results establish EC routing as a superior paradigm for DLM MoE models and demonstrate that computation in DLMs can be treated as an adaptive policy rather than a fixed architectural constant. Code is available at this https URL.
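
To make the contrast with token-choice routing concrete, here is a minimal pure-Python sketch of expert-choice routing in its generic form: each expert selects its `capacity` highest-scoring tokens, so every expert processes the same number of tokens by construction. The router scores and the function name are illustrative assumptions, not the paper's code.

```python
def expert_choice_route(scores, capacity):
    """scores[t][e] is the router affinity of token t for expert e.

    Each expert picks its `capacity` highest-scoring tokens.
    Returns a dict mapping expert index to a sorted list of token indices.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment

# Hypothetical router scores for 4 tokens and 2 experts.
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.4, 0.6],
]
routing = expert_choice_route(scores, capacity=2)
```

With these scores, top-1 token-choice routing would send three tokens to expert 0 and one to expert 1, whereas expert choice gives each expert exactly two. Since `capacity` is an external argument, it could also be varied per denoising step, which is the property the abstract builds on.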

63. 【2604.01609】Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression

Link: https://arxiv.org/abs/2604.01609

Authors: Ruoling Qi, Yirui Liu, Xuaner Wu, Xiangyu Wang, Ming Li, Chen Chen, Jian Chen, Yin Chen, Qizhen Weng

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, deployment of Large, Models is constrained

Comments: Under Review

Abstract:The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and dynamic Key-Value cache. SVD-based compression provides a hardware-friendly solution to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimum, practical efficiency and numerical stability. Swift-SVD incrementally aggregates covariance of output activations given a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We employ effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets demonstrate that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering 3-70X speedups in end-to-end compression time. Our code will be released upon acceptance.
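
The recipe outlined in this abstract (accumulate the covariance of output activations over batches, then run one eigendecomposition) matches a well-known closed form for activation-aware low-rank approximation, sketched below with NumPy. The shapes, the random calibration data, and the factor names `A` and `B` are assumptions for illustration; the paper's actual algorithm, including its rank allocation, may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 12, 4
W = rng.standard_normal((d_in, d_out))     # stand-in layer weight, Y = X @ W

# Incrementally accumulate the covariance of output activations.
C = np.zeros((d_out, d_out))
batches = [rng.standard_normal((32, d_in)) for _ in range(5)]
for X in batches:
    Y = X @ W
    C += Y.T @ Y

# A single symmetric eigendecomposition after aggregation.
# np.linalg.eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
V = eigvecs[:, -rank:]                     # eigenvectors of the top `rank` eigenvalues

# Low-rank factorization W ~ A @ B with A = W V and B = V^T.
A, B = W @ V, V.T
W_low = A @ B
```

Projecting onto the top eigenvectors of the output covariance minimizes the output reconstruction error on the calibration data among rank-`rank` replacements, and the squared error equals the sum of the discarded eigenvalues.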

64. 【2604.01562】Acoustic and perceptual differences between standard and accented Chinese speech and their voice clones

Link: https://arxiv.org/abs/2604.01562

Authors: Tianle Yang, Chengzhe Sun, Phil Rose, Siwei Lyu

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

Keywords: evaluated in terms, perceptual consequences, heavily accented Mandarin, Voice cloning, accented Mandarin speech

Comments:

Abstract:Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses show no reliable accented-standard difference in original-clone distances across systems. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in an off-the-shelf speaker-embedding distance, and they motivate evaluating speaker identity preservation and accent preservation as separable dimensions.

65. 【2604.01560】DeltaMem: Towards Agentic Memory Management via Reinforcement Learning

Link: https://arxiv.org/abs/2604.01560

Authors: Qi Zhang, Shen Huang, Chu Liu, Shouqing Yang, Junbo Zhao, Haobo Wang, Pengjun Xie

Categories: Computation and Language (cs.CL)

Keywords: Recent advances, managing persona memory, revealed the powerful, powerful capability, capability of multi-agent

Comments: preprint, under review

Abstract:Recent advances in persona-centric memory have revealed the powerful capability of multi-agent systems in managing persona memory, especially in conversational scenarios. However, these complex frameworks often suffer from information loss and are fragile across varying scenarios, resulting in suboptimal performance. In this paper, we propose DeltaMem, an agentic memory management system that formulates persona-centric memory management as an end-to-end task within a single-agent setting. To further improve the performance of our agentic memory manager, we draw inspiration from the evolution of human memory and synthesize a user-assistant dialogue dataset along with corresponding operation-level memory updating labels. Building on this, we introduce a novel Memory-based Levenshtein Distance to formalize the memory updating reward, and propose a tailored reinforcement learning framework to further enhance the management capabilities of DeltaMem. Extensive experiments show that both training-free and RL-trained DeltaMem outperform all product-level baselines across diverse long-term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem.

66. 【2604.01538】Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging

Link: https://arxiv.org/abs/2604.01538

Authors: Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, reduce clinician burden, Large language, clinician burden, documentation to reduce

Comments:

Abstract:Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often "forget" a significant amount of instruction-following ability when fine-tuned using a task-specific medical dataset, a critical challenge in adopting general-purpose LLMs for clinical applications. This study presents a model merging framework to efficiently adapt general-purpose LLMs to the medical domain by countering this forgetting issue. By merging a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) via interpolation-based merge methods, we seek to derive a domain-adapted model with strong performance on clinical tasks while retaining instruction-following ability. Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain expertise, and retain instruction-following ability. In addition, our model merging strategies demonstrate training efficiency, achieving performance on par with fully fine-tuned baselines under severely constrained supervision (e.g., 64-shot vs. 256-shot). Consequently, weight-space merging constitutes a highly scalable solution for adapting open-source LLMs to clinical applications, facilitating broader deployment in resource-constrained healthcare environments.
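
The simplest member of the interpolation-based merge family mentioned in this abstract is plain linear interpolation of matching parameters. The sketch below shows it on toy parameter dictionaries; the function name, the parameter names, and `alpha=0.5` are assumptions for illustration, and the paper may use other interpolation variants.

```python
def interpolate_weights(theta_a, theta_b, alpha):
    """Per-parameter linear interpolation: alpha * theta_a + (1 - alpha) * theta_b."""
    if theta_a.keys() != theta_b.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: [alpha * a + (1 - alpha) * b
               for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }

# Toy stand-ins for a clinical model and a general instruct model.
clinical = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
instruct = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}
merged = interpolate_weights(clinical, instruct, alpha=0.5)
```

This only works when both checkpoints share an architecture and parameter layout, which is why merging a clinical model and an instruct model from the same base family is a natural fit.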

67. 【2604.01535】Read More, Think More: Revisiting Observation Reduction for Web Agents

Link: https://arxiv.org/abs/2604.01535

Authors: Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada

Categories: Computation and Language (cs.CL)

Keywords: planning subsequent steps, Web agents based, web pages, Web agents, large language models

Comments:

Abstract:Web agents based on large language models (LLMs) rely on observations of web pages -- commonly represented as HTML -- as the basis for identifying available actions and planning subsequent steps. Prior work has treated the verbosity of HTML as an obstacle to performance and adopted observation reduction as a standard practice. We revisit this trend and demonstrate that the optimal observation representation depends on model capability and thinking token budget: (1) compact observations (accessibility trees) are preferable for lower-capability models, while detailed observations (HTML) are advantageous for higher-capability models; moreover, increasing thinking tokens further amplifies the benefit of HTML. (2) Our error analysis suggests that higher-capability models exploit layout information in HTML for better action grounding, while lower-capability models suffer from increased hallucination under longer inputs. We also find that incorporating observation history improves performance across most models and settings, and a diff-based representation offers a token-efficient alternative. Based on these findings, we suggest practical guidelines: adaptively select observation representations based on model capability and thinking token budget, and incorporate observation history using diff-based representations.
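
One way to picture the diff-based observation history suggested in this abstract is to pass the agent only the lines that changed between consecutive page observations. The sketch below uses Python's standard `difflib`; the toy HTML snippets and the function name are hypothetical, and the paper's actual diff format is not specified in the abstract.

```python
import difflib

def observation_diff(prev_obs, curr_obs):
    """Return only the lines that changed between two page observations."""
    diff = difflib.unified_diff(
        prev_obs.splitlines(), curr_obs.splitlines(), lineterm="", n=0
    )
    lines = list(diff)[2:]                 # drop the '---' / '+++' file headers
    # Drop '@@' hunk markers; content lines always start with '+' or '-'.
    return [line for line in lines if not line.startswith("@@")]

prev_obs = "<button id='add'>Add to cart</button>\n<span id='count'>0</span>"
curr_obs = "<button id='add'>Add to cart</button>\n<span id='count'>1</span>"
changes = observation_diff(prev_obs, curr_obs)
```

Only the changed `<span>` appears in `changes`, so the history costs a couple of lines instead of a second full copy of the page.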

68. 【2604.01514】Why Instruction-Based Unlearning Fails in Diffusion Models?

Link: https://arxiv.org/abs/2604.01514

Authors: Zeliang Zhang, Rui Sun, Jiani Liu, Qi Wu, Chenliang Xu

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: models remains unclear, generative models remains, inference time, remains unclear, modifying the behavior

Comments:

Abstract:Instruction-based unlearning has proven effective for modifying the behavior of large language models at inference time, but whether this paradigm extends to other generative models remains unclear. In this work, we investigate instruction-based unlearning in diffusion-based image generation models and show, through controlled experiments across multiple concepts and prompt variants, that diffusion models systematically fail to suppress targeted concepts when guided solely by natural-language unlearning instructions. By analyzing both the CLIP text encoder and cross-attention dynamics during the denoising process, we find that unlearning instructions do not induce sustained reductions in attention to the targeted concept tokens, causing the targeted concept representations to persist throughout generation. These results reveal a fundamental limitation of prompt-level instruction in diffusion models and suggest that effective unlearning requires interventions beyond inference-time language control.

69. 【2604.01504】Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything, Everywhere, All at Once

Link: https://arxiv.org/abs/2604.01504

Authors: Harnoor Dhingra

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Large Language Models, Research on Large, Large Language, Language Models, representational analysis

Comments: Under review

Abstract:Research on Large Language Models (LLMs) studies output variation across generation, reasoning, alignment, and representational analysis, often under the umbrella of "diversity." Yet the terminology remains fragmented, largely because the normative objectives underlying tasks are rarely made explicit. We introduce the Magic, Madness, Heaven, Sin framework, which models output variation along a homogeneity-heterogeneity axis, where valuation is determined by the task and its normative objective. We organize tasks into four normative contexts: epistemic (factuality), interactional (user utility), societal (representation), and safety (robustness). For each, we examine the failure modes and vocabulary such as hallucination, mode collapse, bias, and erasure through which variation is studied. We apply the framework to analyze all pairwise cross-contextual interactions, revealing that optimizing for one objective, such as improving safety, can inadvertently harm demographic representation or creative diversity. We argue for context-aware evaluation of output variation, reframing it as a property shaped by task objectives rather than a model's intrinsic trait.

70. 【2604.01496】From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents

Link: https://arxiv.org/abs/2604.01496

Authors: Nikolai Ludwig, Wasi Uddin Ahmad, Somshubra Majumdar, Boris Ginsburg

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: two-stage SFT recipe, open-weight frontier LLMs, distilling open-weight frontier, two-stage SFT, SFT recipe

Comments:

Abstract:We introduce SWE-ZERO to SWE-HERO, a two-stage SFT recipe that achieves state-of-the-art results on SWE-bench by distilling open-weight frontier LLMs. Our pipeline replaces resource-heavy dependencies with an evolutionary refinement strategy: (1) SWE-ZERO utilizes large-scale, execution-free trajectories to master code semantics and repository-level reasoning, and (2) SWE-HERO applies targeted, execution-backed refinement to transition these semantic intuitions into rigorous engineering workflows. Our empirical results set a new benchmark for open-source models of comparable size. We release a dataset of 300k SWE-ZERO and 13k SWE-HERO trajectories distilled from Qwen3-Coder-480B, alongside a suite of agents based on the Qwen2.5-Coder series. Notably, SWE-HERO-32B achieves a 62.2% resolution rate on SWE-bench Verified. Furthermore, despite being trained exclusively on Python, our agents demonstrate robust zero-shot transferability on SWE-bench Multilingual, reaching 44.1% and confirming the paradigm's generalizability across diverse languages.

71. 【2604.01476】When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals

Link: https://arxiv.org/abs/2604.01476

Authors: Rui Wu, Ruixiang Tang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Reinforcement learning, learning for LLMs, LLMs is vulnerable, Reinforcement, intended task

Comments: 15 pages, 8 figures

Abstract:Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting, where models can rewrite evaluator code to trivially pass tests without solving the task, as a controlled testbed. Across both studied models, we identify a reproducible three-phase rebound pattern: models first attempt to rewrite the evaluator but fail, as their rewrites embed test cases their own solutions cannot pass. They then temporarily retreat to legitimate solving. When legitimate reward remains scarce, they rebound into successful hacking with qualitatively different strategies. Using representation engineering, we extract concept directions for shortcut, deception, and evaluation awareness from domain-general contrastive pairs and find that the shortcut direction tracks hacking behavior most closely, making it an effective representational proxy for detection. Motivated by this finding, we propose Advantage Modification, which integrates shortcut concept scores into GRPO advantage computation to penalize hacking rollouts before policy updates. Because the penalty is internalized into the training signal rather than applied only at inference time, Advantage Modification provides more robust suppression of hacking compared with generation-time activation steering.
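
The abstract does not give the formula for Advantage Modification, so the following is only one plausible reading: compute the usual group-normalized GRPO advantage, then subtract a penalty proportional to each rollout's shortcut concept score. The subtraction form, the weight `lam`, and the scores are all assumptions made for illustration.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps) within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def penalized_advantages(rewards, shortcut_scores, lam=1.0):
    """Hypothetical variant: subtract lam * shortcut score from each advantage."""
    base = grpo_advantages(rewards)
    return [a - lam * s for a, s in zip(base, shortcut_scores)]

# Two rollouts earn full reward, but the second one scores high on the
# (hypothetical) shortcut direction, suggesting it hacked the evaluator.
rewards = [1.0, 1.0, 0.0, 0.0]
shortcut_scores = [0.0, 2.5, 0.0, 0.1]
adv = penalized_advantages(rewards, shortcut_scores)
```

Under this reading, the suspected hacking rollout ends up with a negative advantage despite its high reward, so the policy update pushes away from it during training instead of relying on steering at generation time.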

72. 【2604.01467】A Dynamic Atlas of Persian Poetic Symbolism: Families, Fields, and the Historical Rewiring of Meaning

Link: https://arxiv.org/abs/2604.01467

Authors: Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: remembered through plot, Persian poetry, remembered, Wine vessels, recurrent symbols

Comments:

Abstract:Persian poetry is often remembered through recurrent symbols before it is remembered through plot. Wine vessels, gardens, flames, sacred titles, bodily beauty, and courtly names return across centuries, yet computational work still tends to flatten this material into isolated words or broad document semantics. That misses a practical unit of organization in Persian poetics: related forms travel as families and gain force through recurring relations. Using a corpus of 129,451 poems, we consolidate recurrent forms into traceable families, separate imagistic material from sacred and courtly reference, and map their relations in a multi-layer graph. The symbolic core is relatively sparse, the referential component much denser, and the attachment zone between them selective rather than diffuse. Across 11 Hijri-century bins, some families remain widely distributed, especially Shab (Night), Ruz (Day), and Khaak (Earth). Wine vessels, garden space, flame, and lyric sound strengthen later, while prestige-coded and heroic-courtly vocabulary is weighted earlier. Century-specific graphs show change in arrangement as well as membership. Modularity rises, cross-scope linkage declines, courtly bridges weaken, and sacred bridges strengthen. Hub positions shift too: Kherqe (Sufi Robe) gains late prominence, Farkhondeh (Blessed) and Banafsheh (Violet) recede, and Saaghar (Wine Cup) stays central across the chronology. In this corpus, Persian symbolism appears less as a fixed repertory than as a long-lived system whose internal weights and connections change over time.

73. 【2604.01457】Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs

Link: https://arxiv.org/abs/2604.01457

Authors: Tianyi Zhao, Yinhan He, Wendy Zheng, Yujie Zhang, Chen Chen

Categories: Computation and Language (cs.CL)

Keywords: Large language models, factually incorrect answers, produce factually incorrect, verbalize overly high, Large language

Comments:

Abstract:Large language models are often not just wrong, but confidently wrong: when they produce factually incorrect answers, they tend to verbalize overly high confidence rather than signal uncertainty. Such verbalized overconfidence can mislead users and weaken confidence scores as a reliable uncertainty signal, yet its internal mechanisms remain poorly understood. We present a circuit-level mechanistic analysis of this inflated verbalized confidence in LLMs, organized around three axes: capturing verbalized confidence as a differentiable internal signal, identifying the circuits that causally inflate it, and leveraging these insights for targeted inference-time recalibration. Across two instruction-tuned LLMs on three datasets, we find that a compact set of MLP blocks and attention heads, concentrated in middle-to-late layers, consistently writes the confidence-inflation signal at the final token position. We further show that targeted inference-time interventions on these circuits substantially improve calibration. Together, our results suggest that verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated through targeted intervention.

74. 【2604.01432】Are Finer Citations Always Better? Rethinking Granularity for Attributed Generation

Link: https://arxiv.org/abs/2604.01432

Authors: Hexuan Wang, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi (Johns Hopkins University)

Categories: Computation and Language (cs.CL)

Keywords: cite individual sentences, critical design choice, individual sentences, cite individual, critical design

Comments:

Abstract:Citation granularity - whether to cite individual sentences, paragraphs, or documents - is a critical design choice in attributed generation. While fine-grained citations are often preferred for precise human verification, their impact on model performance remains under-explored. We analyze four model scales (8B-120B) and demonstrate that enforcing fine-grained citations degrades attribution quality by 16-276% compared to the best-performing granularity. We observe a consistent performance pattern where attribution quality peaks at intermediate granularities (paragraph-level). Our analysis suggests that fine-grained (sentence-level) citations disrupt necessary semantic dependencies for attributing evidence to answer claims, while excessively coarse citations (multi-paragraph) introduce distracting noise. Importantly, the magnitude of this performance gap varies non-monotonically with model scale: fine-grained constraints disproportionately penalize larger models, suggesting that atomic citation units disrupt the multi-sentence information synthesis at which these models excel. Strikingly, citation-optimal granularity leads to substantial gains in attribution quality while preserving or even improving answer correctness. Overall, our findings demonstrate that optimizing solely for human verification via fine-grained citation disregards model constraints, compromising both attribution faithfulness and generation reliability. Instead, effective attribution requires aligning citation granularity with the model's natural semantic scope.

75. 【2604.01425】The power of context: Random Forest classification of near synonyms. A case study in Modern Hindi

Link: https://arxiv.org/abs/2604.01425

Authors: Jacek Bąkowski

Categories: Computation and Language (cs.CL)

Keywords: puzzling linguistic phenomenon, linguistic phenomenon, widespread yet puzzling, puzzling linguistic, cs.CL

Comments:

Abstract:Synonymy is a widespread yet puzzling linguistic phenomenon. Absolute synonyms theoretically should not exist, as they do not expand language's expressive potential. However, it has been suggested that even if synonyms denote the same concept, they may reflect different perspectives or carry distinct cultural associations, claims that have rarely been tested quantitatively. In Hindi, prolonged contact with Persian produced many Perso-Arabic loanwords coexisting with their Sanskrit counterparts, forming numerous synonym pairs. This study investigates whether, centuries after these borrowings appeared in the Subcontinent, their origin can still be distinguished using distributional data alone and regardless of their semantic content. A Random Forest trained on word embeddings of Hindi synonyms successfully classified words by Sanskrit or Perso-Arabic origin, even when they were semantically unrelated, suggesting that usage patterns preserve traces of etymology. These findings provide quantitative evidence that context encodes etymological signals and that synonymy may reflect subtle but systematic distinctions linked to origin. They support the idea that synonymous words can offer different perspectives and that etymologically related words may form distinct conceptual subspaces, creating a new type of semantic frame shaped by historical origin. Overall, the results highlight the power of context in capturing nuanced distinctions beyond traditional semantic similarity.

76. 【2604.01418】Cost-Efficient Estimation of General Abilities Across Benchmarks

Link: https://arxiv.org/abs/2604.01418

Authors: Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner

Categories: Computation and Language (cs.CL)

Keywords: developed to measure, large language models, measure the quality, Thousands of diverse, Thousands

Comments:

Abstract:Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the "Wide-scale Item Level Dataset" (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model's performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.

77. 【2604.01417】ReFormeR: Learning and Applying Explicit Query Reformulation Patterns

Link: https://arxiv.org/abs/2604.01417

Authors: Amin Bigdeli, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke, Ebrahim Bagheri

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: reformulation, query reformulation, reformulation patterns, present ReFormeR, query

Comments:

Abstract:We present ReFormeR, a pattern-guided approach for query reformulation. Instead of prompting a language model to generate reformulations of a query directly, ReFormeR first elicits short reformulation patterns from pairs of initial queries and empirically stronger reformulations, consolidates them into a compact library of transferable reformulation patterns, and then selects an appropriate reformulation pattern for a new query given its retrieval context. The selected pattern constrains query reformulation to controlled operations such as sense disambiguation, vocabulary grounding, or discriminative facet addition, to name a few. As such, our proposed approach makes the reformulation policy explicit through these reformulation patterns, guiding the LLM towards targeted and effective query reformulations. Our extensive experiments on TREC DL 2019, DL 2020, and DL Hard show consistent improvements over classical feedback methods and recent LLM-based query reformulation and expansion approaches.

78. 【2604.01413】Adaptive Stopping for Multi-Turn LLM Reasoning

Link: https://arxiv.org/abs/2604.01413

Authors: Xiaofan Zhou, Huy Nguyen, Bo Yu, Chenxi Liu, Lu Cheng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, adaptive retrieval-augmented generation, retrieval-augmented generation, ReAct-style agents

Comments:

Abstract:Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: when should the model stop? Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.
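
Allocating an error budget across turns can be sketched with generic split conformal prediction. This is not the MiCP algorithm, whose details the abstract does not give. It only shows the standard split-conformal quantile together with a simple proportional split of a total miscoverage level, so that a union bound over turns keeps the overall error within the total. The function names and the weighting scheme are hypothetical.

```python
import math


def split_error_budget(alpha, weights):
    """Split a total miscoverage level alpha across turns in proportion to weights.

    ASSUMPTION: proportional weighting is an illustrative choice. Because the
    per-turn levels sum to alpha, a union bound keeps the total error <= alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be non-empty and positive")
    total = sum(weights)
    return [alpha * w / total for w in weights]


def conformal_threshold(calibration_scores, alpha):
    """Standard split-conformal threshold on nonconformity scores.

    Returns the k-th smallest calibration score with k = ceil((n + 1) * (1 - alpha)),
    or infinity when the calibration set is too small for the requested level.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("calibration_scores must be non-empty")
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


# Three turns; the last turn receives half of a 10% total budget.
budgets = split_error_budget(0.1, [1, 1, 2])
calibration = [i / 10 for i in range(1, 11)]
thresholds = [conformal_threshold(calibration, a) for a in budgets]
```

With only ten calibration scores, the small per-turn levels make every threshold infinite, so each prediction set would include all candidates. This illustrates why tighter budgets call for more calibration data.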

79. 【2604.01411】Test-Time Scaling Makes Overtraining Compute-Optimal

Link: https://arxiv.org/abs/2604.01411

Authors: Nicholas Roberts, Sungjun Cho, Zhiqi Gao, Tzu-Heng Huang, Albert Wu, Gabriel Orlanski, Avi Trost, Kelly Buchanan, Aws Albarghouthi, Frederic Sala

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)

Keywords: repeated sampling, scaling, inference cost grows, scaling laws, pretraining

Comments:

Abstract:Modern LLMs scale at test time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test ($T^2$) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. $T^2$ modernizes pretraining scaling laws with pass@$k$ modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from $T^2$ are robust over distinct modeling approaches: measuring joint scaling effect on the task loss and modeling impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that $T^2$ scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making $T^2$ scaling meaningful in modern deployments.
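
For readers unfamiliar with pass@$k$, the quantity is commonly estimated from $n$ samples, of which $c$ are correct, as $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below implements that widely used estimator. It is not the $T^2$ scaling-law fit from the paper, and the function name is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct solution.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 2 correct out of 4 samples, drawing 2 succeeds unless both are incorrect.
estimate = pass_at_k(n=4, c=2, k=2)
```

Because the estimate rises with $k$ while inference cost also grows with the number of samples, this is the quantity that creates the trade-off between model size and sample count that the abstract refers to.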

80. 【2604.01410】Assessing Pause Thresholds for empirical Translation Process Research

Link: https://arxiv.org/abs/2604.01410

Authors: Devi Sri Bandaru, Michael Carl, Xinyue Ren

Categories: Computation and Language (cs.CL)

Keywords: Text production, interrupted by keystroke, Production Unit Breaks, form of stretches, computing Production Unit

Comments: Accepted for Presentation at "Translation in Transition 8, September 2026"

Abstract:Text production (and translation) proceeds in the form of stretches of typing, interrupted by keystroke pauses. It is often assumed that fast typing reflects unchallenged/automated translation production while long(er) typing pauses are indicative of translation problems, hurdles or difficulties. Building on a long discussion concerning the determination of pause thresholds that separate automated from presumably reflective translation processes (O'Brien, 2006; Alves and Vale, 2009; Timarova et al., 2011; Dragsted and Carl, 2013; Lacruz et al., 2014; Kumpulainen, 2015; Heilmann and Neumann, 2016), this paper compares three recent approaches for computing these pause thresholds, and suggests and evaluates a novel method for computing Production Unit Breaks.

81. 【2604.01404】Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

Link: https://arxiv.org/abs/2604.01404

Authors: Itay Yona, Dan Barzilay, Michael Karasik, Mor Geva

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Language models, multiple language models, remains unclear, unclear which internal, internal mechanisms

Comments:

Abstract:Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.

82. 【2604.01354】Open-Domain Safety Policy Construction

Link: https://arxiv.org/abs/2604.01354

Authors: Di Wu, Siyue Liu, Zixiang Ji, Ya-Liang Chang, Zhe-Yu Liu, Andrew Pleffer, Kai-Wei Chang

Categories: Computation and Language (cs.CL)

Keywords: built on user, layers are increasingly, increasingly a core, core component, products built

Comments: EACL 2026 (Findings)

Abstract:Moderation layers are increasingly a core component of many products built on user- or model-generated content. However, drafting and maintaining domain-specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy based on only human-written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in-house multimodal advertisement moderation benchmark. DPR consistently outperforms definition-only and in-context learning baselines, and in our end-to-end setting it is competitive with expert-written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system, suggesting that a task-specific, structured research loop can be more effective than generic web research for policy drafting. We release our experiment code at this https URL.

83. 【2604.01350】No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents

Link: https://arxiv.org/abs/2604.01350

Authors: Tiankai Yang, Jiate Li, Yi Nian, Shen Dong, Ruiyao Xu, Ryan Rossi, Kaize Ding, Yue Zhao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Keywords: LLM-based agents increasingly, maintaining task states, agents increasingly operate, repeated sessions, maintaining task

Comments:

Abstract:LLM-based agents increasingly operate across repeated sessions, maintaining task states to ensure continuity. In many deployments, a single agent serves multiple users within a team or organization, reusing a shared knowledge layer across user identities. This shared persistence expands the failure surface: information that is locally valid for one user can silently degrade another user's outcome when the agent reapplies it without regard for scope. We refer to this failure mode as unintentional cross-user contamination (UCC). Unlike adversarial memory poisoning, UCC requires no attacker; it arises from benign interactions whose scope-bound artifacts persist and are later misapplied. We formalize UCC through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate the problem in two shared-state mechanisms. Under raw shared state, benign interactions alone produce contamination rates of 57-71%. Write-time sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, with contamination often manifesting as silent wrong answers. These results indicate that shared-state agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures.

84. 【2604.01348】Procedural Knowledge at Scale Improves Reasoning

Link: https://arxiv.org/abs/2604.01348

Authors: Di Wu, Devendra Singh Sachan, Wen-tau Yih, Mingda Chen

Categories: Computation and Language (cs.CL)

Keywords: reasoning, procedural knowledge, challenging reasoning tasks, Reasoning Memory, knowledge

Comments:

Abstract:Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.

85. 【2604.01312】Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences

Link: https://arxiv.org/abs/2604.01312

Authors: Simona-Vasilica Oprea, Adela Bâra

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: reward modeling relies, remains fundamentally challenging, fundamentally challenging, relies on subtle, subjective comparisons

Comments:

Abstract:Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons or shades of gray rather than clear-cut labels. This study investigates the limits of current approaches and proposes a feature-augmented framework to better capture the multidimensional nature of human judgment. Using the Anthropic HH-RLHF dataset, we evaluate ten diverse large language models (LLMs) under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task. To address this, we enrich textual representations with interpretable signals: response length, refusal indicators, toxicity scores and prompt-response semantic similarity, enabling models to explicitly capture key aspects of helpfulness, safety and relevance. The proposed hybrid approach yields consistent improvements across all models, achieving up to 0.84 ROC AUC and significantly higher pairwise accuracy, with DeBERTa-v3-large demonstrating the best performance. Beyond accuracy, we integrate SHAP and LIME to provide fine-grained interpretability, revealing that model decisions depend on contextualized safety and supportive framing rather than isolated keywords. We further analyze bias amplification, showing that while individual features have weak marginal effects, their interactions influence preference learning.

86. 【2604.01306】M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency

Link: https://arxiv.org/abs/2604.01306

Authors: Abolfazl Ansari, Delvin Ce Zhang, Zhuoyang Zou, Wenpeng Yin, Dongwon Lee

Categories: Computation and Language (cs.CL)

Keywords: Evaluating scientific arguments, arguments requires assessing, underlying multimodal evidence, scientific arguments requires, Evaluating scientific

Comments: Preprint. Under Review

Abstract:Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset's utility and provide comprehensive usage guidelines.

87. 【2604.01302】Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming

Link: https://arxiv.org/abs/2604.01302

Authors: Qianfan Zhang, Tianyu Guo, Xuandi Ren, Jiale Chen, Ming Ding, Ran Xin, Xia Xiao

Categories: Computation and Language (cs.CL)

Keywords: training-time reinforcement learning, scale reasoning token, complementary approaches, training-time reinforcement, reinforcement learning

Comments:

Abstract:We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model's oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.

88. 【2604.01280】Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

Link: https://arxiv.org/abs/2604.01280

Authors: Marco Morini, Sara Sarto, Marcella Cornia, Lorenzo Baraldi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, requires combining visual, combining visual understanding, Multimodal Large Language, external knowledge

Comments: Project Page: [this https URL](https://aimagelab.github.io/LoT/)

Abstract:Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.

89. 【2604.01268】The Overlooked Repetitive Lengthening Form in Sentiment Analysis

Link: https://arxiv.org/abs/2604.01268

Authors: Lei Wang, Eduard Dragut

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: frequently express personal, express personal opinions, Individuals engaging, Repetitive Lengthening Form, communication frequently express

Comments: Findings of EMNLP 2024

Abstract:Individuals engaging in online communication frequently express personal opinions with informal styles (e.g., memes and emojis). While Language Models (LMs) with informal communications have been widely discussed, a unique and emphatic style, the Repetitive Lengthening Form (RLF), has been overlooked for years. In this paper, we explore answers to two research questions: 1) Is RLF important for sentiment analysis (SA)? 2) Can LMs understand RLF? Inspired by previous linguistic research, we curate Lengthening, the first multi-domain dataset with 850k samples focused on RLF for SA. Moreover, we introduce Explainable Instruction Tuning (ExpInstruct), a two-stage instruction tuning framework aimed at improving both performance and explainability of LLMs for RLF. We further propose a novel unified approach to quantify LMs' understanding of informal expressions. We show that RLF sentences are expressive expressions and can serve as signatures of document-level sentiment. Additionally, RLF has potential value for online content analysis. Our results show that fine-tuned Pre-trained Language Models (PLMs) can surpass zero-shot GPT-4 in performance but not in explanation for RLF. Finally, we show ExpInstruct can improve the open-sourced LLMs to match zero-shot GPT-4 in performance and explainability for RLF with limited samples. Code and sample data are available at this https URL

Information Retrieval

1. 【2604.02211】Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

Link: https://arxiv.org/abs/2604.02211

Authors: Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha, Debanshu Das

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Keywords: shaping content consumption, Video recommender systems, recommender systems, shaping content, popular and impactful

Comments: Accepted for publication in The Nineteenth ACM International Conference on Web Search and Data Mining (WSDM Companion 2026)

Abstract:Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback, to provide precise, explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)-powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short-form clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures like MACRec and Agent4Rec, to illustrate these patterns. We also outline open challenges in scalability, multimodal understanding, incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization and self-improving recommender systems.

2. 【2604.02156】AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics

Link: https://arxiv.org/abs/2604.02156

Authors: Atilla Kaan Alkan, Felix Grezes, Sergi Blanco-Cuaresma, Jennifer Lynn Bartlett, Daniel Chivvis, Anna Kelbert, Kelly Lockhart, Alberto Accomazzi

Subjects: Computation and Language (cs.CL); Instrumentation and Methods for Astrophysics (astro-ph.IM); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: severe power-law distributions, Unified Astronomy Thesaurus, challenge standard classification, power-law distributions, distributions that challenge

Comments: 9 pages, 2 figures

3. 【2604.02091】Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

Link: https://arxiv.org/abs/2604.02091

Authors: Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai, Rui Xia

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: refining retrieval results, play a pivotal, pivotal role, role in refining, results for Retrieval-Augmented

Comments: 16 pages

4. 【2604.01965】Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models

Link: https://arxiv.org/abs/2604.01965

Authors: Florian Kelber, Matthias Jobst, Yuni Susanti, Michael Färber

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)

Keywords: knowledge discovery increasingly, discovery increasingly relies, Scientific knowledge discovery, billions of parameters, knowledge discovery

Comments: Accepted at NSLP@LREC 2026

5. 【2604.01957】Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite

Link: https://arxiv.org/abs/2604.01957

Authors: Klaudia Thellmann, Bernhard Stadler, Michael Färber

Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: quality weaken confidence, Machine-translated benchmark datasets, uneven quality weaken, loss of structure, datasets reduce costs

Comments: Accepted at LREC 2026

6. 【2604.01733】From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents

Link: https://arxiv.org/abs/2604.01733

Authors: Meftun Akarsu, Recep Kaan Karaman, Christopher Mierbach

Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: systems critically depend, systems critically, tabular data, critically depend, systematic comparison

Comments: 11 pages, 6 figures, 6 tables

7. 【2604.01617】STABLE: Efficient Hybrid Nearest Neighbor Search via Magnitude-Uniformity and Cardinality-Robustness

Link: https://arxiv.org/abs/2604.01617

Authors: Qianyun Yang, Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Liqiang Nie

Subjects: Information Retrieval (cs.IR)

Keywords: Approximate Nearest Neighbor, Hybrid Approximate Nearest, Nearest Neighbor Search, Approximate Nearest, Nearest Neighbor

Comments: Accepted by IEEE TKDE

Abstract:Hybrid Approximate Nearest Neighbor Search (Hybrid ANNS) is a foundational search technology for large-scale heterogeneous data and has gained significant attention in both academia and industry. However, current approaches overlook the heterogeneity in data distribution, thus ignoring two major challenges: the Compatibility Barrier for Similarity Magnitude Heterogeneity and the Tolerance Bottleneck to Attribute Cardinality. To overcome these issues, we propose the robuSt heTerogeneity-Aware hyBrid retrievaL framEwork, STABLE, designed for accurate, efficient, and robust hybrid ANNS under datasets with various distributions. Specifically, we introduce an enhAnced heterogeneoUs semanTic perceptiOn (AUTO) metric to achieve a joint measurement of feature similarity and attribute consistency, addressing similarity magnitude heterogeneity and improving robustness to datasets with various attribute cardinalities. Thereafter, we construct our Heterogeneous sEmantic reLation graPh (HELP) index based on AUTO to organize heterogeneous semantic relations. Finally, we employ a novel Dynamic Heterogeneity Routing method to ensure an efficient search. Extensive experiments on five feature vector benchmarks with various attribute cardinalities demonstrate the superior performance of STABLE.

8. 【2604.01417】ReFormeR: Learning and Applying Explicit Query Reformulation Patterns

Link: https://arxiv.org/abs/2604.01417

Authors: Amin Bigdeli, Mert Incesu, Negar Arabzadeh, Charles L. A. Clarke, Ebrahim Bagheri

Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: reformulation, query reformulation, reformulation patterns, present ReFormeR, query

Comments:

9. 【2604.01262】Transforming OPACs into Intelligent Discovery Systems: An AI-Powered, Knowledge Graph-Driven Smart OPAC for Digital Libraries

Link: https://arxiv.org/abs/2604.01262

Authors: M. S. Rajeevan, B. Mini Devi

Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Public Access Catalogues, Online Public Access, Traditional Online Public, Access Catalogues, Online Public

Comments: 8 pages, 4 tables, 6 figures presented at Intellib 2026 International Conference

Abstract:Traditional Online Public Access Catalogues (OPACs) are becoming less effective due to the rapid growth of scholarly literature. Conventional search methods, such as keyword indexing and Boolean queries, often fail to support efficient knowledge discovery. This paper proposes a Smart OPAC framework that transforms traditional OPACs into intelligent discovery systems using artificial intelligence and knowledge graph techniques. The framework enables semantic search, thematic filtering, and knowledge graph-based visualization to enhance user interaction and exploration. It integrates multiple open scholarly data sources and applies semantic embeddings to improve relevance and contextual understanding. The system supports exploratory search, semantic navigation, and refined result filtering based on user-defined themes. Quantitative evaluation demonstrates improvements in retrieval efficiency, relevance, and reduction of information overload. The proposed approach offers practical implications for modernizing digital library services and supports next-generation research workflows. Future work includes user-centric evaluation, personalization, and dynamic knowledge graph updates.

10. 【2604.01264】OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images

Link: https://arxiv.org/abs/2604.01264

Authors: Okan Uçar, Murat Kurt

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Keywords: Magnetic Resonance Imaging, Medical imaging techniques, Magnetic Resonance, Resonance Imaging, imaging techniques

Comments: 7 pages, 3 figures, 1 table

Abstract:Medical imaging techniques, especially Magnetic Resonance Imaging (MRI), are accepted as the gold standard in the diagnosis and treatment planning of neurological diseases. However, the manual analysis of MRI images is a time-consuming process for radiologists and is prone to human error due to fatigue. In this study, two different Deep Learning approaches were developed and analyzed comparatively for the automatic detection and classification of brain tumors (Glioma, Meningioma, Pituitary, and No Tumor). In the first approach, a custom Convolutional Neural Network (CNN) architecture named "OkanNet", which has a low computational cost and fast training time, was designed from scratch. In the second approach, the Transfer Learning method was applied using the 50-layer ResNet-50 [1] architecture, pre-trained on the ImageNet dataset. In experiments conducted on an extended dataset compiled by Masoud Nickparvar containing a total of $7,023$ MRI images, the Transfer Learning-based ResNet-50 model exhibited superior classification performance, achieving $96.49\%$ Accuracy and $0.963$ Precision. In contrast, the custom OkanNet architecture reached an accuracy rate of $88.10\%$; however, it proved to be a strong alternative for mobile and embedded systems with limited computational power by yielding results approximately $3.2$ times faster ($311$ seconds) than ResNet-50 in terms of training time. This study demonstrates the trade-off between model depth and computational efficiency in medical image analysis through experimental data.

Computer Vision

1. 【2604.02331】EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors

Link: https://arxiv.org/abs/2604.02331

Authors: Luca Bartolomei, Fabio Tosi, Matteo Poggi, Stefano Mattoccia, Guillermo Gallego

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: costly active sensors, ground truth annotations, standard color images, deep-event stereo networks, active sensors

Comments: CVPR 2026. Project Page: [this https URL](https://bartn8.github.io/eventhub/) Code: [this https URL](https://github.com/bartn8/eventhub)

Abstract:We propose EventHub, a novel framework for training deep-event stereo networks without ground truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either proxy annotations and proxy events through state-of-the-art novel view synthesis techniques, or simply proxy annotations when images are already paired with event data. Using the training set generated by our data factory, we repurpose state-of-the-art stereo models from RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities. Experiments on widely used event stereo datasets support the effectiveness of EventHub and show how the same data distillation mechanism can improve the accuracy of RGB stereo foundation models in challenging conditions such as nighttime scenes.

2. 【2604.02330】ActionParty: Multi-Subject Action Binding in Generative Video Games

Link: https://arxiv.org/abs/2604.02330

Authors: Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Recent advances, simulating interactive environments, enabled the development, simulating interactive, Recent

Comments: Project page: [this https URL](https://action-party.github.io/)

Abstract:Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.

3. 【2604.02329】Generative World Renderer

Link: https://arxiv.org/abs/2604.02329

Authors: Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scaling generative inverse, existing synthetic datasets, Scaling generative, scenarios is bottlenecked, limited realism

Comments: Project page: [this https URL](https://alaya-studio.github.io/renderer/)

Abstract:Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.

4. 【2604.02328】Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection

Link: https://arxiv.org/abs/2604.02328

Authors: Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti, Luigi Di Stefano

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: detection and segmentation, natively multiview, multimodal framework, anomaly detection, present ModMap

Comments: Accepted at CVPR Findings 2026

Abstract:We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance by surpassing previous methods by wide margins.

5. 【2604.02327】Steerable Visual Representations

Link: https://arxiv.org/abs/2604.02327

Authors: Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Pretrained Vision Transformers, Pretrained Vision, Vision Transformers, MAE provide generic, MAE provide

Comments: preprint

Abstract:Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.

6. 【2604.02323】Beyond Referring Expressions: Scenario Comprehension Visual Grounding

Link: https://arxiv.org/abs/2604.02323

Authors: Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo, Vicente Ordonez

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing visual grounding, prominent named category, primarily evaluate alignment, literal referring expressions, benchmarks primarily evaluate

Comments: 20 pages, 18 figures, Project Page: [this https URL](https://catherine-r-he.github.io/RSC/)

Abstract:Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.

7. 【2604.02320】Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

Link: https://arxiv.org/abs/2604.02320

Authors: Junxuan Li, Rawal Khirodkar, Chengan He, Zhongshi Jiang, Giljoo Nam, Lingchen Yang, Jihyun Lee, Egor Zakharov, Zhaoen Su, Rinat Abdrashitov, Yuan Dong, Julieta Martinez, Kai Li, Qingyang Tan, Takaaki Shiratori, Matthew Hu, Peihong Guo, Xuhua Huang, Ariyan Zarei, Marco Pesavento, Yichen Xu, He Wen, Teng Deng, Wyatt Borsos, Anjali Thakrar, Jean-Charles Bazin, Carsten Stoll, Ginés Hidalgo, James Booth, Lucy Wang, Xiaowen Ma, Yu Rong, Sairanjith Thalanki, Chen Cao, Christian Häne, Abhishek Kar, Sofien Bouaziz, Jason Saragih, Yaser Sheikh, Shunsuke Saito

Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: avatar modeling faces, faces a critical, critical trade-off, avatar modeling, modeling faces

Comments: Accepted in CVPR2026. Website: [this https URL](https://junxuan-li.github.io/lca)

Abstract:High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.

8. 【2604.02318】Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning

Link: https://arxiv.org/abs/2604.02318

Authors: Xueying Li, Feng Lyu, Hao Wu, Mingliu Liu, Jia-Nan Liu, Guozi Liu

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Training-free Vision-Language Navigation, Training-free Vision-Language, instructions and explore, powered by foundation, foundation models

Comments: 10 pages, 6 figures

Abstract:Training-free Vision-Language Navigation (VLN) agents powered by foundation models can follow instructions and explore 3D environments. However, existing approaches rely on greedy frontier selection and passive spatial memory, leading to inefficient behaviors such as local oscillation and redundant revisiting. We argue that this stems from a lack of metacognitive capabilities: the agent cannot monitor its exploration progress, diagnose strategy failures, or adapt accordingly. To address this, we propose MetaNav, a metacognitive navigation agent integrating spatial memory, history-aware planning, and reflective correction. Spatial memory builds a persistent 3D semantic map. History-aware planning penalizes revisiting to improve efficiency. Reflective correction detects stagnation and uses an LLM to generate corrective rules that guide future frontier selection. Experiments on GOAT-Bench, HM3D-OVON, and A-EQA show that MetaNav achieves state-of-the-art performance while reducing VLM queries by 20.7%, demonstrating that metacognitive reasoning significantly improves robustness and efficiency.
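
The history-aware planning step mentioned in the abstract (penalizing revisits during frontier selection) might look roughly like the sketch below. It is not the MetaNav implementation: `pick_frontier`, the linear penalty, and the `revisit_penalty` value are all assumptions made for illustration.

```python
def pick_frontier(base_scores, visit_counts, revisit_penalty=0.5):
    """Return the frontier id with the best penalised score.

    base_scores: dict mapping frontier id -> relevance score (higher is better)
    visit_counts: dict mapping frontier id -> number of earlier visits
    revisit_penalty: hypothetical weight subtracted once per earlier visit
    """
    def penalised(frontier_id):
        visits = visit_counts.get(frontier_id, 0)
        return base_scores[frontier_id] - revisit_penalty * visits

    return max(base_scores, key=penalised)


# A greedy selector would pick "hallway"; the penalty steers away from it.
scores = {"hallway": 1.0, "kitchen": 0.8}
visits = {"hallway": 2}
choice = pick_frontier(scores, visits)
```

With `revisit_penalty=0` the function reduces to plain greedy selection, which is the behavior the abstract describes as leading to oscillation and redundant revisits.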

9. 【2604.02317】A Simple Baseline for Streaming Video Understanding

Link: https://arxiv.org/abs/2604.02317

Authors: Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understanding methods increasingly, methods increasingly rely, long video streams, handle long video, video understanding methods

Comments: Project page: [this https URL](https://simple-stream.github.io/)

Abstract:Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
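
The sliding-window baseline described in the abstract (keep only the most recent N frames) can be sketched in a few lines. This is an illustration rather than the authors' code: `RecentFrameWindow`, `push`, and `context` are hypothetical names, and the frames are placeholder strings where a real system would hold images and pass the window to a VLM.

```python
from collections import deque


class RecentFrameWindow:
    """Keep only the N most recent frames of a stream (hypothetical helper)."""

    def __init__(self, n_frames=4):
        # A deque with maxlen drops the oldest item automatically when full.
        self._frames = deque(maxlen=n_frames)

    def push(self, frame):
        """Add the newest frame, evicting the oldest one if the window is full."""
        self._frames.append(frame)

    def context(self):
        """Return the frames that would be handed to the model, oldest first."""
        return list(self._frames)


window = RecentFrameWindow(n_frames=4)
for t in range(10):
    window.push(f"frame_{t}")
```

After ten pushes, `window.context()` holds only the last four frames, so memory use stays constant no matter how long the stream runs.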

10. 【2604.02296】VOID: Video Object and Interaction Deletion

Link: https://arxiv.org/abs/2604.02296

Authors: Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: correcting appearance-level artifacts, Existing video object, video object removal, Existing video, object removal

Comments:

Abstract:Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.

11. 【2604.02290】AdamFlow: Adam-based Wasserstein Gradient Flows for Surface Registration in Medical Imaging

Link: https://arxiv.org/abs/2604.02290

Authors: Qiang Ma, Qingjie Meng, Xin Hu, Yicheng Wu, Wenjia Bai

Subjects: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

Keywords: Surface registration, anatomical shape analysis, medical imaging, Surface registration plays, plays an important

Comments:

Abstract:Surface registration plays an important role for anatomical shape analysis in medical imaging. Existing surface registration methods often face a trade-off between efficiency and robustness. Local point matching methods are computationally efficient, but vulnerable to noise and initialisation. Methods designed for global point set alignment tend to incur a high computational cost. To address the challenge, here we present a fast surface registration method, which formulates surface meshes as probability measures and surface registration as a distributional optimisation problem. The discrepancy between two meshes is measured using an efficient sliced Wasserstein distance with log-linear computational complexity. We propose a novel optimisation method, AdamFlow, which generalises the well-known Adam optimisation method from the Euclidean space to the probability space for minimising the sliced Wasserstein distance. We theoretically analyse the asymptotic convergence of AdamFlow and empirically demonstrate its superior performance in both affine and non-rigid surface registration across various anatomical structures.
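
To make the "log-linear" cost of the sliced Wasserstein distance concrete, here is a minimal pure-Python sketch. It assumes two equal-sized 3D point sets with uniform weights, which is a simplification of the mesh measures in the abstract; the function name, the number of projections, and the Monte Carlo averaging are illustrative choices, not details taken from the paper.

```python
import math
import random


def sliced_wasserstein_sq(xs, ys, n_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    xs, ys: equal-length lists of 3D points (tuples), uniform weights assumed.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-sized point sets")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_projections):
        # Random unit direction: normalise a Gaussian sample.
        d = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        # Project onto the direction and sort: O(n log n) per projection.
        px = sorted(sum(p[i] * d[i] for i in range(3)) for p in xs)
        py = sorted(sum(q[i] * d[i] for i in range(3)) for q in ys)
        # In 1D, optimal transport matches sorted samples in order.
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections
```

Sorting is the only super-linear step, which is where the log-linear complexity comes from: each projection reduces the transport problem to matching two sorted lists.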

12. 【2604.02289】Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

Link: https://arxiv.org/abs/2604.02289

Authors: Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu, Xiaoguang Han

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remains challenging due, achieved strong performance, Recent multimodal large, large language models, multimodal large language

Comments:

Abstract:Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.

13. 【2604.02282】Deep Neural Network Based Roadwork Detection for Autonomous Driving

Link: https://arxiv.org/abs/2604.02282

Authors: Sebastian Wullrich, Nicolai Steinke, Daniel Goehring

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: create major challenges, human drivers due, sites create major, heterogeneous nature, Road construction sites

Comments: 7 pages, 10 figures

Abstract:Road construction sites create major challenges for both autonomous vehicles and human drivers due to their highly dynamic and heterogeneous nature. This paper presents a real-time system that detects and localizes roadworks by combining a YOLO neural network with LiDAR data. The system identifies individual roadwork objects while driving, merges them into coherent construction sites and records their outlines in world coordinates. The model training was based on an adapted US dataset and a new dataset collected from test drives with a prototype vehicle in Berlin, Germany. Evaluations on real-world road construction sites showed a localization accuracy below 0.5 m. The system can support traffic authorities with up-to-date roadwork data and could enable autonomous vehicles to navigate construction sites more safely in the future.

14. 【2604.02280】Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency

Link: https://arxiv.org/abs/2604.02280

Authors: Payal Fofadiya, Sunil Tiwari

Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: agents require persistent, conversational agents require, false memory propagation, require persistent memory, agents require

Comments:

Abstract:Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation. Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with 6.8% false memory rate under persistent retention. This work introduces an adaptive budgeted forgetting framework that regulates memory through relevance-guided scoring and bounded optimization. The approach integrates recency, frequency, and semantic alignment to maintain stability under constrained context. Comparative analysis demonstrates improved long-horizon F1 beyond 0.583 baseline levels, higher retention consistency, and reduced false memory behavior without increasing context usage. These findings confirm that structured forgetting preserves reasoning performance while preventing unbounded memory growth in extended conversational settings.
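
One way to picture budgeted forgetting driven by recency, frequency, and semantic alignment is the sketch below. The abstract does not give the scoring formula, so everything here is an assumption for illustration: the `MemoryItem` fields, the weights, the squashing functions, and the simple top-k selection standing in for the bounded optimization.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    age: int           # turns since the item was last accessed
    hits: int          # how many times the item has been retrieved
    similarity: float  # semantic alignment with the current query, in [0, 1]


def score(item, w_recency=0.4, w_frequency=0.2, w_semantic=0.4):
    """Combine the three signals into one relevance score (hypothetical weights)."""
    recency = 1.0 / (1.0 + item.age)             # newer items score higher
    frequency = item.hits / (1.0 + item.hits)    # saturates towards 1
    return w_recency * recency + w_frequency * frequency + w_semantic * item.similarity


def forget(items, budget):
    """Keep the `budget` highest-scoring items and drop the rest."""
    return sorted(items, key=score, reverse=True)[:budget]


memory = [
    MemoryItem("user prefers vegetarian food", age=0, hits=5, similarity=0.9),
    MemoryItem("small talk about the weather", age=50, hits=0, similarity=0.1),
    MemoryItem("user lives in Berlin", age=2, hits=1, similarity=0.5),
]
kept = forget(memory, budget=2)
```

Because the budget is fixed, the retained set never grows with conversation length, which matches the abstract's goal of avoiding unbounded memory growth.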

15. 【2604.02265】Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models

Link: https://arxiv.org/abs/2604.02265

Authors: Yaoteng Tan,Zikui Cai,M. Salman Asif

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Controlling the behavior, practical deployment, critical for safe, safe and practical, Controlling

Comments:

Click to view abstract

Abstract:Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.

16. 【2604.02252】SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation

Link: https://arxiv.org/abs/2604.02252

Authors: Naomi Kombol,Ivan Martinović,Siniša Šegvić,Giorgos Tolias

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Foundational Vision Transformers, Foundational Vision, Vision Transformers, coarse patch-level representations, inherently coarse patch-level

Comments: Accepted to CVPR 2026

Click to view abstract

Abstract:Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: this https URL

17. 【2604.02241】UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models

Link: https://arxiv.org/abs/2604.02241

Authors: Qiyao Zhang,Shuhua Zheng,Jianli Sun,Chengxiang Li,Xianke Wu,Zihan Song,Zhiyong Cui,Yisheng Lv,Yonglin Tian

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, Embodied visual tracking, executing complex real-world

Comments:

Click to view abstract

Abstract:Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset encompassing over 890K frames, 176 tasks, and 85 diverse objects. Furthermore, to address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we propose an improved VLA tracking model, UAV-Track VLA. Built upon the $\pi_{0.5}$ architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. Additionally, a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow matching action expert is designed to decouple cross-modal features and generate fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. Furthermore, it demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4% (to 0.0571s) compared to the original $\pi_{0.5}$, enabling highly efficient, real-time UAV control. Data samples and demonstration videos are available at: this https URL\_VLA.

18. 【2604.02222】SCALE: Semantic- and Confidence-Aware Conditional Variational Autoencoder for Zero-shot Skeleton-based Action Recognition

Link: https://arxiv.org/abs/2604.02222

Authors: Soroush Oraki,Feng Ding,Jie Liang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Zero-shot skeleton-based action, skeleton-based action recognition, recognize action classes, Zero-shot skeleton-based, aims to recognize

Comments: Accepted to ICPR 2026

Click to view abstract

Abstract:Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned Conditional Variational Autoencoder where frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty to adapt decision margins and reweight ambiguous training instances. Additionally, we utilize a latent prototype contrast objective to align posterior means with text-derived latent prototypes, improving semantic organization and class separability without direct feature matching. Experiments on NTU-60 and NTU-120 datasets show that SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.

19. 【2604.02190】UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

Link: https://arxiv.org/abs/2604.02190

Authors: Yongkang Li,Lijun Zhou,Sixu Yan,Bencheng Liao,Tianyi Yan,Kaixin Xiong,Long Chen,Hongwei Xie,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Haiyang Sun,Xinggang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: leveraging rich world, rich world knowledge, recently emerged, promise of leveraging, leveraging rich

Comments: Code has been released at [this https URL](https://github.com/xiaomi-research/unidrivevla)

Click to view abstract

Abstract:Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at this https URL

20. 【2604.02188】Lightweight Spatiotemporal Highway Lane Detection via 3D-ResNet and PINet with ROI-Aware Attention

Link: https://arxiv.org/abs/2604.02188

Authors: Sorna Shanmuga Raja,Abdelhafid Zenati

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: jointly captures spatial, Point Instance Network, presents a lightweight, real-world driving scenarios, Feature Pyramid Network

Comments:

Click to view abstract

Abstract:This paper presents a lightweight, end-to-end highway lane detection architecture that jointly captures spatial and temporal information for robust performance in real-world driving scenarios. Building on the strengths of 3D convolutional neural networks and instance segmentation, we propose two models that integrate a 3D-ResNet encoder with a Point Instance Network (PINet) decoder. The first model enhances multi-scale feature representation using a Feature Pyramid Network (FPN) and Self-Attention mechanism to refine spatial dependencies. The second model introduces a Region of Interest (ROI) detection head to selectively focus on lane-relevant regions, thereby improving precision and reducing computational complexity. Experiments conducted on the TuSimple dataset (highway driving scenarios) demonstrate that the proposed second model achieves 93.40% accuracy while significantly reducing false negatives. Compared to existing 2D and 3D baselines, our approach achieves improved performance with fewer parameters and reduced latency. The architecture has been validated through offline training and real-time inference in the Autonomous Systems Laboratory at City, St George's University of London. These results suggest that the proposed models are well-suited for integration into Advanced Driver Assistance Systems (ADAS), with potential scalability toward full Lane Assist Systems (LAS).

21. 【2604.02185】CXR-LT 2026 Challenge: Projection-Aware Multi-Label and Zero-Shot Chest X-Ray Classification

Link: https://arxiv.org/abs/2604.02185

Authors: Juno Cho(1),Dohui Kim(2),Mingeon Kim(1),Hyunseo Jang(3),Chang Sun Lee(4),Jong Chul Ye(4) ((1) KAIST, (2) GIST, (3) Korea University, (4) KAIST Graduate School of AI)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: challenge tackles multi-label, chest X-ray, tackles multi-label classification, challenge tackles, tackles multi-label

Comments: 5 pages, 3 figures. Accepted to the IEEE ISBI 2026 CXR-LT Challenge

Click to view abstract

Abstract:This challenge tackles multi-label classification for known chest X-ray (CXR) lesions and zero-shot classification for unseen ones. To handle diverse CXR projections, we integrate projection-specific models via a classification network into a unified framework. For zero-shot classification (Task 2), we extend CheXzero with a novel dual-branch architecture that combines contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts. This effectively mitigates severe long-tail imbalances and maximizes zero-shot generalization. Additionally, strong data and test-time augmentations (TTA) ensure robustness across both tasks.

22. 【2604.02182】ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline

Link: https://arxiv.org/abs/2604.02182

Authors: Juan Manuel Hernandez,Mariana Fernandez-Espinosa,Denis Parra,Diego Gomez-Zara

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: natural language processing, Transformer-based architectures, shared backbone, backbone of natural, natural language

Comments: 7 pages, 4 figures

Click to view abstract

Abstract:Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.

23. 【2604.02168】Reflection Generation for Composite Image Using Diffusion Model

Link: https://arxiv.org/abs/2604.02168

Authors: Haonan Zhao,Qingyang Liu,Jiaxuan Chen,Li Niu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image composition involves, composition involves inserting, synthesizing environment-consistent effects, Image composition, composition involves

Comments:

Click to view abstract

Abstract:Image composition involves inserting a foreground object into the background while synthesizing environment-consistent effects such as shadows and reflections. Although shadow generation has been extensively studied, reflection generation remains largely underexplored. In this work, we focus on reflection generation. We inject the prior information of reflection placement and reflection appearance into a foundation diffusion model. We also divide reflections into two types and adopt a type-aware model design. To support training, we construct the first large-scale object reflection dataset, DEROBA. Experiments demonstrate that our method generates reflections that are physically coherent and visually realistic, establishing a new benchmark for reflection generation.

24. 【2604.02162】Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation

Link: https://arxiv.org/abs/2604.02162

Authors: Saurabh Hinduja,Gurmeet Kaur,Maneesh Bilalpur,Jeffrey Cohn,Shaun Canavan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: facial Action Unit, Action Unit, facial Action, standard evaluation protocol, standard evaluation

Comments: CVPR 2026

Click to view abstract

Abstract:Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of $\pm 0.065$ in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model ranking can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains often reported in cross-fold validation may fall within protocol variance. Leave-one-dataset-out cross-validation yields more stable and interpretable findings.

25. 【2604.02160】CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection

Link: https://arxiv.org/abs/2604.02160

Authors: Weidong Tang,Hanbin Sun,Zihan Li,Yikai Wang,Feifan Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Remote sensing change, arbitrary user-defined queries, fixed label space, answer arbitrary user-defined, sensing change detection

Comments:

Click to view abstract

Abstract:Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus. Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.

26. 【2604.02103】CASHG: Context-Aware Stylized Online Handwriting Generation

Link: https://arxiv.org/abs/2604.02103

Authors: Jinsu Shin,Sungeun Hong,Jin Yeong Bak

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: makes handwritten content, handwritten content easier, Online handwriting represents, handwriting represents strokes, Online handwriting

Comments: 42 pages, 19 figures

Click to view abstract

Abstract:Online handwriting represents strokes as time-ordered trajectories, which makes handwritten content easier to transform and reuse in a wide range of applications. However, generating natural sentence-level online handwriting that faithfully reflects a writer's style remains challenging, since sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at the sentence scale and under limited compositional diversity. We propose CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis. CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory and fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor--current transitions, complemented by gated context fusion for sentence-level this http URL proceeds through a three-stage curriculum from isolated glyphs to full sentences, improving robustness under sparse transition coverage. We further introduce Connectivity and Spacing Metrics (CSM), a boundary-aware evaluation suite that quantifies cursive connectivity and spacing similarity. Under benchmark-matched evaluation protocols, CASHG consistently improves CSM over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by a human evaluation.

27. 【2604.02097】LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

Link: https://arxiv.org/abs/2604.02097

Authors: Jiachun Jin,Zetong Zhou,Xiao Yang,Hao Zhang,Pengfei Liu,Jun Zhu,Zhijie Deng

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: hold promise, ability to understand, understand and generate, visual, generate content

Comments:

Click to view abstract

Abstract:Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.

28. 【2604.02093】GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding

Link: https://arxiv.org/abs/2604.02093

Authors: Rong Fan,Kaiyan Xiao,Minghao Zhu,Liuyi Wang,Kai Dai,Zhao Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: extending video large, video large language, large language models, broader applications, Video temporal grounding

Comments: Published as a conference paper at CVPR 2026

Click to view abstract

Abstract:Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Furthermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and 12.0-point improvement in mAP for highlight detection. Code is available at this https URL.

29. 【2604.02090】Center-Aware Detection with Swin-based Co-DETR Framework for Cervical Cytology

Link: https://arxiv.org/abs/2604.02090

Authors: Yan Kong,Yuan Yin,Hongan Chen,Yuqi Fang,Caifeng Shan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains challenging due, dense cell distribution, cervical cancer screening, Pap smear images, Pap smear

Comments: ISBI 2026 Accepted Paper; Winning Solution for the RIVA Cervical Cytology Challenge

Click to view abstract

Abstract:Automated analysis of Pap smear images is critical for cervical cancer screening but remains challenging due to dense cell distribution and complex morphology. In this paper, we present our winning solution for the RIVA Cervical Cytology Challenge, achieving 1st place in Track B and 2nd place in Track A. Our approach leverages a powerful baseline, integrating the Co-DINO framework with a Swin-Large backbone for robust multi-scale feature extraction. To address the dataset's unique fixed-size bounding box annotations, we formulate the detection task as a center-point prediction problem. Tailoring our approach to this formulation, we introduce a center-preserving data augmentation strategy and an analytical geometric box optimization to effectively absorb localization jitter. Finally, we apply track-specific loss tuning to adapt the loss weights for each task. Experiments demonstrate that our targeted optimizations improve detection performance, providing an effective pipeline for cytology image analysis. Our code is available at this https URL.
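
The abstract above recasts detection with fixed-size box annotations as center-point prediction. A minimal sketch of what that reformulation implies geometrically is shown below. It is an illustration only, not the authors' released code: the helper names and the box size of 100 pixels are hypothetical, and the paper's actual "analytical geometric box optimization" is not described in the abstract.

```python
def box_center(box):
    """Return the (cx, cy) center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def fixed_box(cx, cy, box_size):
    """Build a square box of side `box_size` centered on (cx, cy)."""
    half = box_size / 2.0
    return (cx - half, cy - half, cx + half, cy + half)


def snap_to_fixed_box(pred_box, box_size):
    """Keep only the predicted center and re-emit a box of the known size.

    Because every ground-truth box has the same size, width and height
    errors in the raw prediction carry no information; only the center does.
    """
    cx, cy = box_center(pred_box)
    return fixed_box(cx, cy, box_size)


# A raw prediction with the wrong extent but a usable center
# (coordinates and box size are invented for the example).
raw_prediction = (10.0, 20.0, 50.0, 40.0)
print(snap_to_fixed_box(raw_prediction, box_size=100.0))
```

Under this view, any augmentation that keeps a cell's center inside the crop preserves the full label, which is one plausible reading of the "center-preserving" augmentation the abstract mentions.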

30. 【2604.02088】FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition

Link: https://arxiv.org/abs/2604.02088

Authors: Taichi Endo,Guoqing Hao,Kazuhiko Sumi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: consistent edit direction, image editing aims, preserving source-image fidelity, preserving source-image, maintaining a consistent

Comments: HuggingFace Space: [this https URL](https://huggingface.co/spaces/dominoer/FlowSlider)

Click to view abstract

Abstract:Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose \textit{FlowSlider}, a training-free method for continuous editing in Rectified Flow that requires no post-training. \textit{FlowSlider} decomposes FlowEdit's update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, \textit{FlowSlider} provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.

31. 【2604.02073】PLUME: Latent Reasoning Based Universal Multimodal Embedding

Link: https://arxiv.org/abs/2604.02073

Authors: Chenwei He,Xiangzhao Hao,Tianyu Yang,Yuxiang Ma,Yuheng Jia,Lingxiang Wu,Chaoyang Zhao,Haiyun Guo,Jinqiao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Universal multimodal embedding, maps heterogeneous inputs, Universal multimodal, shared retrieval space, maps heterogeneous

Comments:

Click to view abstract

Abstract:Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.

32. 【2604.02071】Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection

Link: https://arxiv.org/abs/2604.02071

Authors: Soo Won Seo,KyungChae Lee,Hyungchan Cho,Taein Son,Nam Ik Cho,Jun Won Choi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: localize human-object pairs, demands strong visual, strong visual understanding, single image, localize human-object

Comments: Accepted to CVPR 2026. Code: [this https URL](https://github.com/nowuss/InCoM-Net)

Click to view abstract

Abstract:Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at this https URL.

33. 【2604.02068】Network Structure in UK Payment Flows: Evidence on Economic Interdependencies and Implications for Real-Time Measurement

Link: https://arxiv.org/abs/2604.02068

Authors: Aditya Humnabadkar

Categories: Computer Vision and Pattern Recognition (cs.CV); Econometrics (econ.EM)

Keywords: bilateral measurement approaches, traditional bilateral measurement, economic relationships invisible, inter-industry payment flows, payment flows reveals

Comments: Accepted for Poster presentation at the ESCoE Conference on Economic Measurement 2026

Abstract:Network analysis of inter-industry payment flows reveals structural economic relationships invisible to traditional bilateral measurement approaches, with significant implications for real-time economic monitoring. Analysing 532,346 UK payment records (2017-2024) across 89 industry sectors, we demonstrate that graph-theoretic features, including centrality measures and clustering coefficients, improve payment flow forecasting by 8.8 percentage points beyond traditional time-series methods. Critically, network features prove most valuable during economic disruptions: during the COVID-19 pandemic, when traditional forecasting accuracy collapsed ($R^2$ falling from 0.38 to 0.19), network-enhanced models maintained substantially better performance, with network contributions reaching +13.8 percentage points. The analysis identifies Financial Services, Wholesale Trade, and Professional Services as structurally central industries whose network positions indicate systemic importance beyond their transaction volumes. Network density increased 12.5% over the sample period, with visible disruption during 2020 followed by recovery exceeding pre-pandemic integration levels. These findings suggest payment network monitoring could enhance official statistics production by providing leading indicators of structural economic change and improving nowcasting accuracy during periods when traditional temporal patterns prove unreliable.
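
To make the "graph-theoretic features" mentioned in this abstract concrete, here is a minimal sketch of degree centrality and the local clustering coefficient on a toy payment network. The sector names, amounts, and the choice to collapse payments into an undirected, unweighted graph are assumptions for illustration; the abstract does not say which centrality variants or graph construction the paper uses.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy payment records: (payer sector, payee sector, amount).
records = [
    ("Finance", "Wholesale", 120.0),
    ("Finance", "Professional", 80.0),
    ("Wholesale", "Professional", 40.0),
    ("Wholesale", "Retail", 60.0),
]


def build_adjacency(records):
    """Collapse payment records into an undirected, unweighted adjacency map."""
    adj = defaultdict(set)
    for payer, payee, _amount in records:
        if payer != payee:
            adj[payer].add(payee)
            adj[payee].add(payer)
    return adj


def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def clustering_coefficient(adj):
    """Share of each node's neighbour pairs that are themselves connected."""
    coeffs = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0  # fewer than two neighbours: no pairs to close
            continue
        links = sum(1 for u, v in combinations(sorted(nbrs), 2) if v in adj[u])
        coeffs[node] = 2 * links / (k * (k - 1))
    return coeffs


adj = build_adjacency(records)
centrality = degree_centrality(adj)
clustering = clustering_coefficient(adj)
```

In a forecasting setup, per-sector values like these would be appended to the time-series features for each period; how the paper does this is not stated in the abstract.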

34. 【2604.02060】CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

Link: https://arxiv.org/abs/2604.02060

Authors: Jingliang Li, Jindou Jia, Tuo An, Chuhao Zhou, Xiangyu Chen, Shilin Shan, Boyu Ma, Bofan Lyu, Gen Li, Jianfei Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: cut the apple, nearby scissors, cutting function, robot must choose, choose the knife

Comments: Code available at: [this http URL](http://github.com/Lorenzo-0-0/CompassAD)

Abstract:When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.

35. 【2604.02056】COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing

Link: https://arxiv.org/abs/2604.02056

Authors: Hao Wang, Yanyu Qian, Pengcheng Weng, Zixuan Xia, William Dan, Yangxin Xu, Fei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dropping absent branches, reconstructing missing features, existing methods adapt, absent branches, Missing modalities remain

Comments:

Abstract:Missing modalities remain a major challenge for multimodal sensing, because most existing methods adapt the fusion process to the observed subset by dropping absent branches, using subset-specific fusion, or reconstructing missing features. As a result, the fusion head often receives an input structure different from the one seen during training, leading to incomplete fusion and degraded cross-modal interaction. We propose COMPASS, a missing-modality fusion framework built on the principle of fusion completeness: the fusion head always receives a fixed N-slot multimodal input, with one token per modality slot. For each missing modality, COMPASS synthesizes a target-specific proxy token from the observed modalities using pairwise source-to-target generators in a shared latent space, and aggregates them into a single replacement token. To make these proxies both representation-compatible and task-informative, we combine proxy alignment, shared-space regularization, and per-proxy discriminative supervision. Experiments on XRF55, MM-Fi, and OctoNet under diverse single- and multiple-missing settings show that COMPASS outperforms prior methods on the large majority of scenarios. Our results suggest that preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.
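
The "fixed N-slot" idea in this abstract can be sketched roughly as follows. This is an illustrative toy, not the COMPASS implementation: the modality names, the lambda "generators", and the use of a plain mean to aggregate proxies are all assumptions, since the abstract does not specify the generator architecture or the aggregation rule.

```python
def complete_slots(observed, modalities, generators):
    """Return one token per modality slot, filling gaps with averaged proxies.

    observed:   dict mapping modality name -> token (list of floats)
    modalities: ordered list of all N modality names (the fixed slot layout)
    generators: dict mapping (source, target) -> callable(token) -> proxy token
    """
    if not observed:
        raise ValueError("at least one modality must be observed")
    slots = []
    for target in modalities:
        if target in observed:
            slots.append(list(observed[target]))
            continue
        # One proxy per observed source modality, then aggregate (assumed: mean).
        proxies = [generators[(source, target)](token)
                   for source, token in observed.items()]
        dim = len(proxies[0])
        slots.append([sum(p[i] for p in proxies) / len(proxies) for i in range(dim)])
    return slots


# Hypothetical three-modality setup with stand-in generators.
modalities = ["wifi", "mmwave", "rfid"]
generators = {
    ("wifi", "rfid"): lambda token: [x + 1.0 for x in token],
    ("mmwave", "rfid"): lambda token: [x - 1.0 for x in token],
}
observed = {"wifi": [1.0, 2.0], "mmwave": [3.0, 4.0]}  # "rfid" is missing
slots = complete_slots(observed, modalities, generators)
```

The point of the design is that the downstream fusion head always sees `len(modalities)` tokens in the same order, regardless of which inputs were missing.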

36. 【2604.02055】True to Tone? Quantifying Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines

Link: https://arxiv.org/abs/2604.02055

Authors: Gabriel Ferri Schneider, Erick Menezes, Rafael Mecenas, Paulo Knob, Victor Araujo, Soraia Raupp Musse

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Virtual Human, Accurate reproduction, fairness in Virtual, identity preservation, essential for realism

Comments: 20 pages, 10 figures

Abstract:Accurate reproduction of facial skin tone is essential for realism, identity preservation, and fairness in Virtual Human (VH) rendering. However, most accessible avatar creation pipelines rely on photographic inputs that lack colorimetric calibration, which can introduce inconsistencies and bias. We propose a fully automatic and scalable methodology to systematically evaluate skin tone fidelity across the VH generation pipeline. Our approach defines a full workflow that integrates skin color and illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis. Using facial images from the Chicago Face Database (CFD), we compare skin tone extraction strategies based on cheek-region sampling, following the literature, and multidimensional masking derived from full-face analysis. Additionally, we test both strategies with lighting isolation, using the pre-trained TRUST framework, employed without any training or optimization within our pipeline. Extracted skin tones are applied to MetaHuman textures and rendered under multiple lighting configurations. Skin tone consistency is evaluated objectively in the CIELAB color space using the $\Delta E$ metric and the Individual Typology Angle (ITA). The proposed methodology operates without manual intervention and, with the exception of pre-trained illumination compensation modules, the pipeline does not include learning or training stages, enabling low computational cost and large-scale evaluation. Using this framework, we generate and analyze approximately 19,848 rendered instances. Our results show phenotype-dependent behavior of extraction strategies and consistently higher colorimetric errors for darker skin tones.
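
For readers unfamiliar with the two colorimetric measures named in this abstract, the sketch below computes them from CIELAB values. It assumes the CIE76 form of $\Delta E$ (plain Euclidean distance in L\*a\*b\*), because the abstract does not say which $\Delta E$ variant is used, and it uses the standard ITA definition, arctan((L\* - 50) / b\*) expressed in degrees. The sample colour values are made up.

```python
import math


def delta_e_cie76(lab_1, lab_2):
    """CIE76 colour difference: Euclidean distance between two (L*, a*, b*) triples."""
    return math.dist(lab_1, lab_2)


def ita_degrees(l_star, b_star):
    """Individual Typology Angle in degrees.

    atan2 matches arctan((L* - 50) / b*) whenever b* > 0, which is the usual
    case for skin colours, and avoids a division by zero when b* == 0.
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))


# Hypothetical reference (photo) and rendered skin colours in CIELAB.
reference = (65.0, 12.0, 18.0)
rendered = (62.0, 16.0, 18.0)

error = delta_e_cie76(reference, rendered)
angle = ita_degrees(reference[0], reference[2])
```

A higher ITA corresponds to a lighter skin tone, so comparing `error` across ITA bands is one way to look for the tone-dependent errors the abstract reports.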

37. 【2604.02048】Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models

Link: https://arxiv.org/abs/2604.02048

Authors: Issa Sugiura, Keito Sasagawa, Keisuke Nakao, Koki Maeda, Ziqi Yin, Zhishen Yang, Shuhei Kurita, Yusuke Oda, Ryoko Tokuhisa, Daisuke Kawahara, Naoaki Okazaki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Developing vision-language models, Developing vision-language, tasks requires large-scale, requires large-scale training, requires large-scale

Comments: 18 pages, 7 figures

Abstract:Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.

38. 【2604.02040】Efficient Reasoning via Thought Compression for Language Segmentation

Link: https://arxiv.org/abs/2604.02040

Authors: Qing Zhou, Shiyu Zhang, Yuyu Jia, Junyu Gao, Weiping Ni, Junzheng Wu, Qi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: limits real-world applicability, prohibitive computational cost, large multimodal models, generating verbose rationales, language-guided segmentation

Comments:

Abstract:Chain-of-thought (CoT) reasoning has significantly improved the performance of large multimodal models in language-guided segmentation, yet its prohibitive computational cost, stemming from generating verbose rationales, limits real-world applicability. We introduce WISE (Wisdom from Internal Self-Exploration), a novel paradigm for efficient reasoning guided by the principle of thinking twice: once for learning, once for speed. WISE trains a model to generate a structured sequence: a concise rationale, the final answer, and then a detailed explanation. By placing the concise rationale first, our method leverages autoregressive conditioning to enforce that the concise rationale acts as a sufficient summary for generating the detailed explanation. This structure is reinforced by a self-distillation objective that jointly rewards semantic fidelity and conciseness, compelling the model to internalize its detailed reasoning into a compact form. At inference, the detailed explanation is omitted. To address the resulting conditional distribution shift, our inference strategy, WISE-S, employs a simple prompting technique that injects a brevity-focused instruction into the user's query. This final adjustment facilitates the robust activation of the learned concise policy, unlocking the full benefits of our framework. Extensive experiments show that WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while reducing the average reasoning length by nearly 5×, from 112 to just 23 tokens. Code is available at this https URL.

39. 【2604.02032】IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline

Link: https://arxiv.org/abs/2604.02032

Authors: Sebastian-Ion Nae, Radu Moldoveanu, Alexandra Stefania Ghita, Adina Magda Florea

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Understanding human behaviour, rarely capture real-world, capture real-world indoor, crowded indoor environments, real-world indoor complexity

Comments: Accepted at Conference on Computer Vision and Pattern Recognition Workshops 2026

Abstract:Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises 31 videos (9,913 frames at 5 fps) with human-verified, per-instance segmentation masks. A 620-frame control subset benchmarks three foundation-model auto-annotators (SAM3, GroundingSAM, and EfficientGroundingSAM) against human labels using Cohen's $\kappa$, AP, precision, recall, and mask IoU. A further 2,552-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with 79.3% dense frames and a mean instance scale of 60.8 px, is the most challenging scene. The project page is available at this https URL.
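
Two of the agreement measures listed in this abstract, Cohen's $\kappa$ and mask IoU, are simple enough to sketch directly. The label vectors and masks below are invented, and the abstract does not state at what granularity (per pixel, per instance, per frame) the paper applies $\kappa$, so treat the inputs as placeholders.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as nested lists."""
    intersection = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            intersection += 1 if (pixel_a and pixel_b) else 0
            union += 1 if (pixel_a or pixel_b) else 0
    return intersection / union if union else 1.0


# Hypothetical human vs. auto-annotator decisions (1 = person present).
human = [1, 1, 0, 0]
auto = [1, 0, 0, 0]
kappa = cohens_kappa(human, auto)

# Hypothetical 2x2 masks for one instance.
iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```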

40. 【2604.02031】Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data

Link: https://arxiv.org/abs/2604.02031

Authors: Alejandro Castañeda Garcia, Jan van Gemert, Daan Brinks, Nergis Tömen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: spatially non-uniform sampling, challenged by spatially, spatially non-uniform, non-uniform sampling, spatial imbalance

Comments:

Abstract:Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates, as background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns, resulting in the loss of fine-grained detail and blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance with two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate on a simulated dataset with controlled spatial imbalance conditions and on three diverse, uncontrolled real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatially imbalanced distributions. These results highlight the importance of data representation in a batch and of emphasizing rare samples in unsupervised image reconstruction. We will make all code and related data available.

41. 【2604.02020】Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence

Link: https://arxiv.org/abs/2604.02020

Authors: Dian Liu, Jie Feng, Di Li, Yuhui Zheng, Guanbin Li, Weisheng Dong, Guangming Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: uniquely integrates macro-scale, integrates macro-scale global, macro-scale global coverage, Synergistic spatial intelligence, real-time local perception

Comments:

Abstract:Synergistic spatial intelligence between UAVs and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated Unmanned Aerial Vehicle (UAV) videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs' wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance. Furthermore, fine-tuning experiments underscore the potential of LinkS$^2$Bench in advancing VLM adaptation for complex spatial reasoning.

42. 【2604.02010】Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation

Link: https://arxiv.org/abs/2604.02010

Authors: Jie Feng, Fengze Li, Junpeng Zhang, Siyu Chen, Yuping Liang, Junying Chen, Ronghua Shang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fine-grained spatial delineation, Open-vocabulary semantic segmentation, Open-vocabulary semantic, remote sensing, field requires

Comments:

Abstract:Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its global-aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP's semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework in this paper. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art.

43. 【2604.02009】Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models

Link: https://arxiv.org/abs/2604.02009

Authors: Osher Rafaeli, Tal Svoray, Ariel Nahlieli

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate digital surface, including urban monitoring, Accurate digital, digital surface models, environmental analyses

Comments:

Abstract:Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation, which assumes spatial continuity and therefore fails when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation-based methods, prior-based height rescaling approaches, and state-of-the-art monocular depth estimation models. Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.
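
The "linear fitting of MDE" baseline that this abstract compares against can be illustrated with a global least-squares fit of one scale and one shift, mapping relative depth to metric height on the pixels where a height prior exists. This is a sketch of that baseline idea only, under the assumption that it means ordinary least squares; Prior2DSM itself predicts spatially varying scale and shift with a LoRA-adapted model and an MLP, which is not shown here. All numbers are invented.

```python
def fit_scale_shift(relative_depth, metric_height):
    """Ordinary least-squares fit of height ~= scale * depth + shift."""
    if len(relative_depth) != len(metric_height) or len(relative_depth) < 2:
        raise ValueError("need at least two paired samples")
    n = len(relative_depth)
    mean_d = sum(relative_depth) / n
    mean_h = sum(metric_height) / n
    var_d = sum((d - mean_d) ** 2 for d in relative_depth)
    if var_d == 0.0:
        raise ValueError("relative depth is constant; scale is undetermined")
    cov = sum((d - mean_d) * (h - mean_h)
              for d, h in zip(relative_depth, metric_height))
    scale = cov / var_d
    shift = mean_h - scale * mean_d
    return scale, shift


# Hypothetical pixels where both a relative depth and a prior height are known.
known_depth = [0.0, 0.5, 1.0]
known_height = [10.0, 15.0, 20.0]  # metres
scale, shift = fit_scale_shift(known_depth, known_height)

# Fill a pixel whose height is missing from the prior.
completed_height = scale * 0.25 + shift
```

A single global pair like this cannot account for errors that vary across the scene, which is presumably the motivation for predicting the parameters per location.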

44. 【2604.02003】ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction

Link: https://arxiv.org/abs/2604.02003

Authors: Sirshapan Mitra, Yogesh S. Rawat

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: missing intermediate observations, Generating ground-level views, site models, missing intermediate, intermediate observations

Comments:

Abstract:Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground-truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Altitude Gaussian Splatting), a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in terms of visual quality, geometric consistency, and robustness to extreme viewpoint changes.

45. 【2604.01995】MTLSI-Net: A Linear Semantic Interaction Network for Parameter-Efficient Multi-Task Dense Prediction

Link: https://arxiv.org/abs/2604.01995

Authors: Chen Liu, Hengyu Man, Xiaopeng Fan, Debin Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pixel-level tasks simultaneously, dense prediction aims, Multi-task dense prediction, perform multiple pixel-level, multiple pixel-level tasks

Comments: accepted by ICME 2026, to be published

Abstract:Multi-task dense prediction aims to perform multiple pixel-level tasks simultaneously. However, capturing global cross-task interactions remains non-trivial due to the quadratic complexity of standard self-attention on high-resolution features. To address this limitation, we propose a Multi-Task Linear Semantic Interaction Network (MTLSI-Net), which facilitates cross-task interaction through linear attention. Specifically, MTLSI-Net incorporates three key components: a Multi-Task Multi-scale Query Linear Fusion Block, which captures cross-task dependencies across multiple scales with linear complexity using a shared global context matrix; a Semantic Token Distiller that compresses redundant features into compact semantic tokens, distilling essential cross-task knowledge; and a Cross-Window Integrated attention Block that injects global semantics into local features via a dual-branch architecture, preserving both global consistency and spatial precision. These components collectively enable the network to capture comprehensive cross-task interactions at linear complexity with reduced parameters. Extensive experiments on NYUDv2 and PASCAL-Context demonstrate that MTLSI-Net achieves state-of-the-art performance, validating its effectiveness and efficiency in multi-task learning.
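
The complexity argument in this abstract rests on linear attention, where keys and values are first summarised into a small context matrix that every query then reuses. The sketch below shows the generic kernelised form with an `elu(x) + 1` feature map; it is a swapped-in textbook variant for illustration, not the MTLSI-Net block, whose exact feature map and normalisation the abstract does not describe.

```python
import math


def feature_map(vector):
    """elu(x) + 1, a positive feature map commonly used for linear attention."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vector]


def global_context(keys, values):
    """Summarise all tokens once: a (d_k x d_v) context matrix and a key sum."""
    d_k, d_v = len(keys[0]), len(values[0])
    context = [[0.0] * d_v for _ in range(d_k)]
    key_sum = [0.0] * d_k
    for key, value in zip(keys, values):
        phi_k = feature_map(key)
        for i in range(d_k):
            key_sum[i] += phi_k[i]
            for j in range(d_v):
                context[i][j] += phi_k[i] * value[j]
    return context, key_sum


def linear_attention(queries, context, key_sum):
    """Each query reads the shared context; cost grows linearly with token count."""
    outputs = []
    for query in queries:
        phi_q = feature_map(query)
        norm = sum(q * s for q, s in zip(phi_q, key_sum))
        outputs.append([
            sum(phi_q[i] * context[i][j] for i in range(len(phi_q))) / norm
            for j in range(len(context[0]))
        ])
    return outputs


# Hypothetical tokens; queries from two tasks reuse one shared context matrix.
keys = [[0.0, 0.0], [1.0, -1.0]]
values = [[2.0, 3.0], [4.0, 1.0]]
context, key_sum = global_context(keys, values)
task_a_out = linear_attention([[1.0, -1.0]], context, key_sum)
task_b_out = linear_attention([[0.5, 0.5]], context, key_sum)
```

Because the context matrix has a fixed size independent of the number of tokens, it is never necessary to form the full token-by-token attention map.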

46. 【2604.01994】Resonance4D: Frequency-Domain Motion Supervision for Preset-Free Physical Parameter Learning in 4D Dynamic Physical Scene Simulation

Link: https://arxiv.org/abs/2604.01994

Authors: Changshe Zhang, Jie Feng, Siyu Chen, Guanbin Li, Ronghua Shang, Junpeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: online video diffusion, scenes remains constrained, computational cost exceeds, overlooked contradiction, reliable motion supervision

Comments:

Abstract:Physics-driven 4D dynamic simulation from static 3D scenes remains constrained by an overlooked contradiction: reliable motion supervision often relies on online video diffusion or optical-flow pipelines whose computational cost exceeds that of the simulator itself. Existing methods further simplify inverse physical modeling by optimizing only partial material parameters, limiting realism in scenes with complex materials and dynamics. We present Resonance4D, a physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with the Material Point Method through lightweight yet physically expressive supervision. Our key insight is that dynamic consistency can be enforced without dense temporal generation by jointly constraining motion in complementary domains. To this end, we introduce Dual-domain Motion Supervision (DMS), which combines spatial structural consistency for local deformation with frequency-domain spectral consistency for oscillatory and global dynamic patterns, substantially reducing training cost and memory overhead while preserving physically meaningful motion cues. To enable stable full-parameter physical recovery, we further combine zero-shot text-prompted segmentation with simulation-guided initialization to automatically decompose Gaussians into object-part-level regions and support joint optimization of full material parameters. Experiments on both synthetic and real scenes show that Resonance4D achieves strong physical fidelity and motion consistency while reducing peak GPU memory from over 35 GB to around 20 GB, enabling high-fidelity physics-driven 4D simulation on a single consumer-grade GPU.

47. 【2604.01989】Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

Link: https://arxiv.org/abs/2604.01989

Authors: Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou, Jiwen Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: large language models, remaining largely static, multimodal large language, early decoding steps, exhibits pronounced inertia

Comments:

Abstract:Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.

48. 【2604.01987】Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models

Link: https://arxiv.org/abs/2604.01987

Authors: Antoine Saporta, Baptiste Callard, Corentin Dancette, Julien Khlaut, Charles Corbière, Leo Butsanets, Amaury Prat, Pierre Manceron

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: development of Foundation, Foundation Models, reduce the growing, unsustainable workload, workload on radiologists

Comments:

Abstract:The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training for CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fares competitively against vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.

49. 【2604.01974】Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation

Link: https://arxiv.org/abs/2604.01974

Authors: Yuqing Huang, Guotian Zeng, Zhenqiao Yuan, Zhenyu He, Xin Li, Yaowei Wang, Ming-Hsuan Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing visual trackers, Existing visual, making them impractical, impractical for real-world, Existing

Comments:

Abstract:Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios; strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at this https URL.

50. 【2604.01973】NearID: Identity Representation Learning via Near-identity Distractors

链接：https://arxiv.org/abs/2604.01973

作者：Aleksandar Cvejic,Rameen Abdal,Abdelrahman Eldesokey,Bernard Ghanem,Peter Wonka

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：evaluating identity-focused tasks, existing vision encoders, vision encoders entangle, encoders entangle object, entangle object identity

备注： Code at [this https URL](https://github.com/Gorluxor/NearID)

点击查看摘要

Abstract:When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this setting, pre-trained encoders perform poorly, achieving Sample Success Rates (SSR), a strict margin-based identity discrimination metric, as low as 30.7% and often ranking distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: this https URL
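
The ordering of pair types described in this abstract can be illustrated with a small ranking loss. This is only a sketch under stated assumptions: the abstract does not give the form of the contrastive objective, so a plain hinge-style loss is swapped in here, the hierarchy is read as "same identity scores above NearID distractor, which scores above random negative", and the function name `two_tier_margin_loss` and the margins `m1`, `m2` are hypothetical.

```python
def two_tier_margin_loss(s_pos, s_near, s_rand, m1=0.2, m2=0.2):
    """Hinge-style stand-in for a two-tier ranking objective (not the paper's loss).

    s_pos  : similarity between the reference and a same-identity view
    s_near : similarity between the reference and a near-identity distractor
    s_rand : similarity between the reference and a random negative
    m1, m2 : hypothetical margins for the two tiers
    """
    # Tier 1: the same-identity pair should beat the distractor by at least m1.
    tier1 = max(0.0, m1 - (s_pos - s_near))
    # Tier 2: the distractor should beat the random negative by at least m2.
    tier2 = max(0.0, m2 - (s_near - s_rand))
    return tier1 + tier2


# Ordering respected with room to spare: no penalty.
print(two_tier_margin_loss(0.9, 0.5, 0.1))
# Distractor ranked above the true match: tier 1 is penalized.
print(two_tier_margin_loss(0.5, 0.6, 0.1))
```

The loss is zero only when both gaps exceed their margins, which is one simple way to encode a strict ordering between the three pair types.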

51. 【2604.01972】SDesc3D: Towards Layout-Aware 3D Indoor Scene Generation from Short Descriptions

链接：https://arxiv.org/abs/2604.01972

作者：Jie Feng,Jiawei Shen,Junjia Huang,Junpeng Zhang,Mingtao Feng,Weisheng Dong,Guanbin Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：indoor scene generation, labor-intensive layout specification, short textual descriptions, indoor scene

备注：

点击查看摘要

Abstract:3D indoor scene generation conditioned on short textual descriptions provides a promising avenue for interactive 3D environment construction without the need for labor-intensive layout specification. Despite recent progress in text-conditioned 3D scene generation, existing works suffer from poor physical plausibility and insufficient detail richness in such semantic condensation cases, largely due to their reliance on explicit semantic cues about compositional objects and their spatial relationships. This limitation highlights the need for enhanced 3D reasoning capabilities, particularly in terms of prior integration and spatial […] by this, we propose SDesc3D, a short-text conditioned 3D indoor scene generation framework that leverages multi-view structural priors and regional functionality implications to enable 3D layout reasoning under sparse textual […], we introduce a Multi-view scene prior augmentation that enriches underspecified textual inputs with aggregated multi-view structural knowledge, shifting from inaccessible semantic relation cues to multi-view relational prior aggregation. Building on this, we design a Functionality-aware layout grounding, employing regional functionality grounding for implicit spatial anchors and conducting hierarchical layout reasoning to enhance scene organization and semantic […], an Iterative reflection-rectification scheme is employed for progressive structural plausibility refinement via […] experiments show that our method outperforms existing approaches on short-text conditioned 3D indoor scene […] will be publicly available.

52. 【2604.01966】Ego-Grounding for Personalized Question-Answering in Egocentric Videos

链接：https://arxiv.org/abs/2604.01966

作者：Junbin Xiao,Shenglang Zhang,Pengxiang Zhu,Angela Yao

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词：multimodal large language, personalized question-answering requiring, question-answering requiring ego-grounding, ability to understand, systematic analysis

备注： To appear at CVPR'26

点击查看摘要

Abstract:We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at this https URL

53. 【2604.01964】Automated Prostate Gland Segmentation in MRI Using nnU-Net

链接：https://arxiv.org/abs/2604.01964

作者：Pablo Rodriguez-Belenguer,Gloria Ribas,Javier Aquerreta Escribano,Rafael Moreno-Calatayud,Leonor Cerda-Alberich,Luis Marti-Bonmati

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：including image registration, multiparametric MRI, Accurate segmentation, volume estimation, image registration

备注： 9 pages, 2 tables, 1 figure

点击查看摘要

Abstract:Accurate segmentation of the prostate gland in multiparametric MRI (mpMRI) is a fundamental step for a wide range of clinical and research applications, including image registration, volume estimation, and radiomic analysis. However, manual delineation is time-consuming and subject to inter-observer variability, while general-purpose segmentation tools often fail to provide sufficient accuracy for prostate-specific tasks. In this work, we propose a dedicated deep learning-based approach for automatic prostate gland segmentation using the nnU-Net v2 framework. The model leverages multimodal mpMRI data, including T2-weighted imaging, diffusion-weighted imaging (DWI), and apparent diffusion coefficient (ADC) maps, to exploit complementary tissue information. Training was performed on 981 cases from the PI-CAI dataset using whole-gland annotations, and model performance was assessed through 5-fold cross-validation and external validation on an independent cohort of 54 patients from Hospital La Fe. The proposed model achieved a mean Dice score of 0.96 +/- 0.00 in cross-validation and 0.82 on the external test set, demonstrating strong generalization despite domain shift. In comparison, a general-purpose approach (TotalSegmentator) showed substantially lower performance, with a Dice score of 0.15, primarily due to under-segmentation of the gland. These results highlight the importance of task-specific, multimodal segmentation strategies and demonstrate the potential of the proposed approach for reliable integration into clinical research workflows. To facilitate reproducibility and deployment, the model has been fully containerized and is available as a ready-to-use inference tool.
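
For readers unfamiliar with the Dice score reported above, the standard definition is twice the overlap between the predicted and reference masks, divided by the sum of their sizes. The snippet below is a minimal sketch of that formula on flat binary masks; the function name `dice_score` and the empty-mask convention are assumptions, and this is not the evaluation code used in the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 0, 0]
print(dice_score(pred, target))  # 2 * 2 / (3 + 2) = 0.8
```

Under this definition a score near 0.15 means the predicted and reference masks share very little foreground, which is consistent with the under-segmentation the abstract reports for the general-purpose tool.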

54. 【2604.01958】MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse Interaction

链接：https://arxiv.org/abs/2604.01958

作者：Xilai Li,Weijun Jiang,Xiaosong Li,Yang Liu,Hongbin Wang,Tao Ye,Huafeng Li,Haishu Tan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：produce semantically rich, rich fusion results, semantically rich fusion, video fusion combines, combines the object

备注：

点击查看摘要

Abstract:Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often incur high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16 FPS at $640 \times 480$ resolution. The source code will be available at this https URL.
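
The idea of spending expensive computation only on moving regions can be illustrated with a toy routing step. This is a rough sketch, assuming a precomputed optical-flow field and a fixed magnitude threshold; the names `motion_mask` and `route_regions` and the threshold value are hypothetical, and the abstract does not say how MAVFusion selects dynamic regions.

```python
import math


def motion_mask(flow, threshold=1.0):
    """Mark locations whose flow magnitude exceeds a (hypothetical) threshold.

    flow: 2D grid (list of rows) of (dx, dy) displacement vectors.
    """
    return [[math.hypot(dx, dy) > threshold for dx, dy in row] for row in flow]


def route_regions(flow, threshold=1.0):
    """Split grid coordinates into dynamic and static sets.

    In a real system the dynamic set would go to the heavy cross-modal
    attention path and the static set to a lightweight path; here we only
    return the coordinates.
    """
    dynamic, static = [], []
    for y, row in enumerate(motion_mask(flow, threshold)):
        for x, moving in enumerate(row):
            (dynamic if moving else static).append((y, x))
    return dynamic, static


flow = [[(0.0, 0.0), (3.0, 4.0)],
        [(0.5, 0.0), (0.0, 0.0)]]
dynamic, static = route_regions(flow)
print(dynamic)  # [(0, 1)]
print(static)   # [(0, 0), (1, 0), (1, 1)]
```

When most of a scene is static, the dynamic set is small, so the cost of the heavy path scales with the amount of motion instead of the frame size.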

55. 【2604.01947】A Self supervised learning framework for imbalanced medical imaging datasets

链接：https://arxiv.org/abs/2604.01947

作者：Yash Kumar Sharma,Charan Ramtej Kodi,Vineet Padmanabhan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Non-availability of large, labeled training data, medical image classification, plague medical imaging, frequent classes

备注：

点击查看摘要

Abstract:Two problems often plague medical imaging analysis: 1) Non-availability of large quantities of labeled training data, and 2) Dealing with imbalanced data, i.e., abundant data are available for frequent classes, whereas data are highly limited for the rare class. Self-supervised learning (SSL) methods have been proposed to deal with the first problem to a certain extent, but the issue of investigating the robustness of SSL to imbalanced data has rarely been addressed in the domain of medical image classification. In this work, we make the following contributions: 1) The MIMV method proposed by us in an earlier work is extended with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs to address both data scarcity and dataset imbalance in medical image classification. 2) We carry out a data analysis to evaluate the robustness of AMIMV under varying degrees of class imbalance in medical imaging. 3) We evaluate eight representative SSL methods on 11 medical imaging datasets (MedMNIST) under long-tailed distributions and limited supervision. Our experimental results on the MedMNIST dataset show an improvement of 4.25% on RetinaMNIST, 1.88% on TissueMNIST, and 3.1% on DermaMNIST.

56. 【2604.01941】Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm

链接：https://arxiv.org/abs/2604.01941

作者：Sixing Li,Zhibin Gu,Ziqi Zhang,Weiguo Pan,Bing Li,Ying Wang,Hongzhe Liu

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Early Childhood Education, Childhood Education, Early Childhood, automated activity understanding, essential for automated

备注：

点击查看摘要

Abstract:Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
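
The "advantage collapse" mentioned in this abstract can be seen in a few lines. The sketch below assumes a group-relative advantage (rewards normalized within a group of sampled captions for the same image), which the abstract does not confirm; the names `group_advantages` and `route_sample` are hypothetical and this is not the RSRS implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts (an assumed setup)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def route_sample(rewards):
    """Send a sample to supervised fine-tuning when every rollout scored zero.

    With all-zero rewards the normalized advantages are all zero, so an
    RL update would carry no learning signal for this sample.
    """
    return "sft" if all(r == 0 for r in rewards) else "rl"


hard = [0, 0, 0, 0]
mixed = [0, 1, 0, 1]
print(group_advantages(hard), route_sample(hard))
print(group_advantages(mixed), route_sample(mixed))
```

Routing the all-zero group to a supervised loss gives the model a target to imitate on exactly the samples where the reward signal is uninformative.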

57. 【2604.01934】Rethinking Representations for Cross-Domain Infrared Small Target Detection: A Generalizable Perspective from the Frequency Domain

链接：https://arxiv.org/abs/2604.01934

作者：Yimin Fu,Songbo Wang,Feiyan Wu,Jialin Lyu,Zhunga Liu,Michael K. Ng

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：accurate target-background separation, highly depends, accurate target-background, target-background separation, infrared small

备注： The code will be released at [this https URL](https://github.com/fuyimin96/S2CPNet) upon acceptance

点击查看摘要

Abstract:Accurate target-background separation in infrared small target detection (IRSTD) depends heavily on the discriminability of extracted representations. However, most existing methods are confined to domain-consistent settings, while overlooking whether such discriminability can generalize to unseen domains. In practice, distribution shifts between training and testing data are inevitable due to variations in observational conditions and environmental factors. Meanwhile, the intrinsic indistinctiveness of infrared small targets aggravates overfitting to domain-specific patterns. Consequently, the detection performance of models trained on source domains can be severely degraded when deployed in unseen domains. To address this challenge, we propose a spatial-spectral collaborative perception network (S$^2$CPNet) for cross-domain IRSTD. Moving beyond conventional spatial learning pipelines, we rethink IRSTD representations from a frequency perspective and reveal inconsistencies in spectral phase as the primary manifestation of domain discrepancies. Based on this insight, we develop a phase rectification module (PRM) to derive generalizable target awareness. Then, we employ an orthogonal attention mechanism (OAM) in skip connections to preserve positional information while refining informative representations. Moreover, the bias toward domain-specific patterns is further mitigated through selective style recomposition (SSR). Extensive experiments have been conducted on three IRSTD datasets, and the proposed method consistently achieves state-of-the-art performance under diverse cross-domain settings.

58. 【2604.01921】Learning Spatial Structure from Pre-Beamforming Per-Antenna Range-Doppler Radar Data via Visibility-Aware Cross-Modal Supervision

链接：https://arxiv.org/abs/2604.01921

作者：George Sebastian,Philipp Berthold,Bianca Forkel,Leon Pohl,Mirko Maehlisch

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

关键词：applying learning-based models, perception pipelines commonly, pipelines commonly construct, radar perception pipelines, Automotive radar perception

备注：

点击查看摘要

Abstract:Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird's-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability. The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.

59. 【2604.01915】Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts

链接：https://arxiv.org/abs/2604.01915

作者：Yifan Gao,Tao Zhou,Yi Zhou,Ke Zou,Yizhe Zhang,Huazhu Fu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：support clinical decision-making, identify diagnostically relevant, free-text radiology reports, providing interpretable visual, interpretable visual evidence

备注： 10 pages, 6 figures

点击查看摘要

Abstract:Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding lacks sufficient spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase-related medical knowledge into compact embeddings, together with a global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
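
Both metrics quoted above build on the intersection-over-union (IoU) between a predicted and a reference box: mIoU averages it, and AP50 counts a prediction as correct when its IoU reaches 0.5. The following is a minimal sketch of those building blocks, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the helper names are hypothetical and the benchmarks' exact protocols may differ.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average IoU over paired predictions and ground-truth boxes."""
    scores = [box_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


def is_hit_at_50(pred, gt):
    """True when the prediction meets the IoU >= 0.5 criterion behind AP50."""
    return box_iou(pred, gt) >= 0.5


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))       # overlap 1, union 7, so 1/7
print(is_hit_at_50((0, 0, 2, 2), (0, 0, 2, 1)))  # IoU is exactly 0.5
```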

60. 【2604.01909】Night Eyes: A Reproducible Framework for Constellation-Based Corneal Reflection Matching

链接：https://arxiv.org/abs/2604.01909

作者：Virmarie Maquiling,Yasmeen Abdrabou,Enkelejda Kasneci

类目：Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词：Corneal reflection, making reproducibility difficult, pupil-corneal reflection, hardware setups, plays an important

备注： 6 pages, 3 figures, 2 algorithms, ETRA26

点击查看摘要

Abstract:Corneal reflection (glint) detection plays an important role in pupil-corneal reflection (P-CR) eye tracking, but in practice it is often handled as heuristics embedded within larger systems, making reproducibility difficult across hardware setups. We introduce a 2D geometry-driven, constellation-based pipeline for multi-glint detection and matching, focusing on reproducibility and clear evaluation. Inspired by lost-in-space star identification, we treat glints as structured constellations rather than independent blobs. We propose a Similarity-Layout Alignment (SLA) procedure which adapts constellation matching to the specific constraints of multi-LED eye tracking. The framework brings together controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors while keeping detection and correspondence explicitly separated. Evaluated on a public multi-LED dataset, the system provides stable identity-preserving correspondence under noisy conditions. We release code, presets, and evaluation scripts to enable transparent replication, comparison, and dataset annotation.

61. 【2604.01907】Lifting Unlabeled Internet-level Data for 3D Scene Understanding

链接：https://arxiv.org/abs/2604.01907

作者：Yixin Chen,Yaowei Zhang,Huangyue Yu,Junchao He,Yan Wang,Jiangyong Huang,Hongyu Shen,Junfeng Ni,Shaofei Wang,Baoxiong Jia,Song-Chun Zhu,Siyuan Huang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：abundant unlabeled videos, expensive to acquire, scarce and expensive, Visual Question Answering, unlabeled videos

备注： CVPR 2026. Project page: [this https URL](https://sv-pp.github.io/)

点击查看摘要

Abstract:Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning from low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.

62. 【2604.01903】Light-ResKAN: A Parameter-Sharing Lightweight KAN with Gram Polynomials for Efficient SAR Image Recognition

链接：https://arxiv.org/abs/2604.01903

作者：Pan Yi,Weijie Li,Xiaodong Chen,Jiehua Zhang,Li Liu,Yongxiang Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, military reconnaissance, disaster monitoring

备注： 16 pages, 8 figures, accepted by JSTARS

点击查看摘要

Abstract:Synthetic Aperture Radar (SAR) image recognition is vital for disaster monitoring, military reconnaissance, and ocean observation. However, large SAR image sizes hinder deep learning deployment on resource-constrained edge devices, and existing lightweight models struggle to balance high-precision feature extraction with low computational requirements. The emerging Kolmogorov-Arnold Network (KAN) enhances fitting by replacing fixed activations with learnable ones, reducing parameters and computation. Inspired by KAN, we propose Light-ResKAN to achieve a better balance between precision and efficiency. First, Light-ResKAN modifies ResNet by replacing convolutions with KAN convolutions, enabling adaptive feature extraction for SAR images. Second, we use Gram Polynomials as activations, which are well-suited for SAR data to capture complex non-linear relationships. Third, we employ a parameter-sharing strategy: each kernel shares parameters per channel, preserving unique features while reducing parameters and FLOPs. Our model achieves 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD datasets, respectively. Experiments on MSTAR resized to $1024 \times 1024$ show that compared to VGG16, our model reduces FLOPs by $82.90 \times$ and parameters by $163.78 \times$. This work establishes an efficient solution for edge SAR image recognition.

63. 【2604.01900】FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation

链接：https://arxiv.org/abs/2604.01900

作者：Xilai Li,Chusheng Fang,Xiaosong Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：low-light monitoring, plays a critical, critical role, role in intelligent, intelligent surveillance

备注：

点击查看摘要

Abstract:Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at this https URL.

64. 【2604.01894】SHARC: Reference point driven Spherical Harmonic Representation for Complex Shapes

链接：https://arxiv.org/abs/2604.01894

作者：Panagiotis Sapoutzoglou,George Terzakis,Maria Pateraki

类目：Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)

关键词：Spherical Harmonic Transform, Fast Spherical Harmonic, Spherical Harmonic, synthesizes arbitrary, genus-agnostic shapes

备注： Accepted at ICPR 2026

点击查看摘要

Abstract:We propose SHARC, a novel framework that synthesizes arbitrary, genus-agnostic shapes by means of a collection of Spherical Harmonic (SH) representations of distance fields. These distance fields are anchored at optimally placed reference points in the interior volume of the surface in a way that maximizes learning of the finer details of the surface. To achieve this, we employ a cost function that jointly maximizes sparsity and centrality in terms of positioning, as well as visibility of the surface from their location. For each selected reference point, we sample the visible distance field to the surface geometry via ray-casting and compute the SH coefficients using the Fast Spherical Harmonic Transform (FSHT). To enhance geometric fidelity, we apply a configurable low-pass filter to the coefficients and refine the output using a local consistency constraint based on proximity. Evaluation of SHARC against state-of-the-art methods demonstrates that the proposed method outperforms existing approaches in both reconstruction accuracy and time efficiency without sacrificing model parsimony. The source code is available at this https URL.

65. 【2604.01893】ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery

链接：https://arxiv.org/abs/2604.01893

作者：Ke Li,Ting Wang,Di Wang,Yongshan Zhu,Yiming Zhang,Tao Lei,Quan Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：aims to localize, remote sensing imagery, Remote sensing, natural language expressions

备注：

点击查看摘要

Abstract:Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as \textit{spatial relations} and \textit{object attributes}, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose \textbf{ProVG}, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a \textit{survey-locate-verify} scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to mitigate the large-scale variations in remote sensing imagery, along with a language-guided calibration decoder to refine cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks. Extensive experiments on two benchmarks, \textit{i.e.}, RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.

66. 【2604.01888】Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters

Link: https://arxiv.org/abs/2604.01888

Authors: Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: online platforms, widely deployed, deployed in creative, creative tools, tools and online

Comments: Text-to-Image version of the Anyone can Jailbreak paper. Accepted in CVPR-W AIMS 2026

Abstract:Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy violating content. In this work we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories we observe an attack success rate (ASR) of up to 74.47%.

67. 【2604.01884】GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting

Link: https://arxiv.org/abs/2604.01884

Authors: Xianben Yang, Tao Wang, Yuxuan Li, Yi Jin, Haibin Ling

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated breakthrough performance, Gaussian Splatting, Gaussian, Gaussian points, demonstrated breakthrough

Comments:

Abstract:3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS^2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting. Extensive experiments validate that GS^2 achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5% of the Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.

68. 【2604.01882】A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes

Link: https://arxiv.org/abs/2604.01882

Authors: Di Li, Jie Feng, Guanbin Li, Ronghua Shang, Yuhui Zheng, Weisheng Dong, Guangming Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian scenes aims, Affordance reasoning, aims to identify, identify the region, region that supports

Comments:

Abstract:Affordance reasoning in 3D Gaussian scenes aims to identify the region that supports the action specified by a given text instruction in complex environments. Existing methods typically cast this problem as one-shot prediction from static scene observations, assuming sufficient evidence is already available for reasoning. However, in complex 3D scenes, many failure cases arise not from weak prediction capacity, but from incomplete task-relevant evidence under fixed observations. To address this limitation, we reformulate fine-grained affordance reasoning as a sequential evidence acquisition process, where ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence. Building on this formulation, we propose A3R, an agentic affordance reasoning framework that enables an MLLM-based policy to iteratively select evidence acquisition actions and update the affordance belief through cross-dimensional evidence acquisition. To optimize such sequential decision making, we further introduce a GRPO-based policy learning strategy that improves evidence acquisition efficiency and reasoning accuracy. Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.

69. 【2604.01881】HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models

Link: https://arxiv.org/abs/2604.01881

Authors: Yansong Guo, Chaoyang Zhu, Jiayi Ji, Jianghang Lin, Liujuan Cao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large Language Models, demonstrated impressive capabilities, significant computational burden, Language Models, Video Large Language

Comments:

70. 【2604.01869】GeoAI Agency Primitives

Link: https://arxiv.org/abs/2604.01869

Authors: Akram Zaytar, Rohan Sawahn, Caleb Robinson, Gilles Q. Hacheme, Girmaw A. Tadesse, Inbal Becker-Reshef, Rahul Dodhia, Juan Lavista Ferres

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: connect Foundation models, present ongoing research, connect Foundation, GeoAI assistants, Foundation models

Comments:

Abstract:We present ongoing research on agency primitives for GeoAI assistants: core capabilities that connect foundation models to the artifact-centric, human-in-the-loop workflows where GIS practitioners actually work. Despite advances in satellite image captioning, visual question answering, and promptable segmentation, these capabilities have not translated into productivity gains for practitioners who spend most of their time producing vector layers, raster maps, and cartographic products. The gap is not model capability alone but the absence of an agency layer that supports iterative collaboration. We propose a vocabulary of 9 primitives for such a layer (including navigation, perception, geo-referenced memory, and dual modeling), along with a benchmark that measures human productivity. Our goal is a vocabulary that makes agentic assistance in GIS implementable, testable, and comparable.

71. 【2604.01864】MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation

Link: https://arxiv.org/abs/2604.01864

Authors: Kai Dong, Tingting Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated significant success, demonstrated significant, significant success, model

Comments: Accepted by AMME 2025

Abstract:Autoregressive (AR) models have demonstrated significant success in text-to-image generation. However, they usually face two major challenges. First, the generated images may not always meet the quality standards expected by humans. Second, these models have difficulty with ambiguous prompts that could be interpreted in several valid ways. To address these issues, we introduce MAR-MAER, a hierarchical autoregressive framework that combines two main components: a metric-aware embedding regularization method and a probabilistic latent model for handling ambiguous semantics. Our method uses a lightweight projection head trained with an adaptive kernel regression loss function. This aligns the model's internal representations with human-preferred quality metrics, such as CLIPScore and HPSv2, so that the learned embedding space more accurately reflects human judgment. We also introduce a conditional variational module, which incorporates controlled randomness into the hierarchical token generation process. This allows the model to produce a diverse array of coherent images from ambiguous or open-ended prompts. We conducted extensive experiments using COCO and a newly developed Ambiguous-Prompt Benchmark. The results show that MAR-MAER achieves excellent performance in both metric consistency and semantic flexibility. It exceeds the baseline Hi-MAR model, improving CLIPScore by +1.6 and HPSv2 by +5.3. For ambiguous inputs, it produces a notably wider range of outputs. These findings have been confirmed through both human evaluation and automated metrics.

72. 【2604.01859】Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation

Link: https://arxiv.org/abs/2604.01859

Authors: Hinako Mitsuoka, Kazuhiro Hotta

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hinder practical deployment, Temporal Action Segmentation, Recent progress, Temporal Action, Action Segmentation

Comments: Accepted by CVPR2026 Workshop "AI-driven Skilled Activity Understanding, Assessment Feedback Generation (SAUAFG)"

Abstract:Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
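
As a rough illustration of the CDF-matching idea mentioned in the abstract above, the sketch below compares the cumulative distribution of a predicted per-frame score with that of a ground-truth segment mask. This is not the authors' implementation: the function names, the normalization, and the choice of a mean absolute difference are all assumptions made for the example, since the abstract does not give the exact loss.

```python
from itertools import accumulate


def to_cdf(weights):
    """Normalize non-negative per-frame weights and return their cumulative distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [c / total for c in accumulate(weights)]


def cdf_matching_loss(pred, target):
    """Mean absolute difference between two CDFs over the same frames (hypothetical loss)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must cover the same frames")
    p, t = to_cdf(pred), to_cdf(target)
    return sum(abs(a - b) for a, b in zip(p, t)) / len(p)


# Ground truth: the action occupies frames 2..5 of an 8-frame window.
target = [0, 0, 1, 1, 1, 1, 0, 0]
aligned = [0.0, 0.1, 0.9, 1.0, 1.0, 0.9, 0.1, 0.0]  # mass roughly in the right place
shifted = [0.0, 0.0, 0.0, 0.0, 0.1, 0.9, 1.0, 1.0]  # similar shape, wrong place

loss_aligned = cdf_matching_loss(aligned, target)
loss_shifted = cdf_matching_loss(shifted, target)
print(loss_aligned, loss_shifted)
```

Because the CDF accumulates mass along the time axis, a prediction that is displaced in time is penalized over every frame between the predicted and true positions, which is one reason a CDF-based term can encourage coherent segments rather than isolated frame-level matches.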

73. 【2604.01848】Semantic Richness or Geometric Reasoning? The Fragility of VLM's Visual Invariance

Link: https://arxiv.org/abs/2604.01848

Authors: Jason Qiu, Zachary Meurer, Xavier Thomas, Deepti Ghadiyaram

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: basic geometric transformations, work investigates, Vision-Language Models, fundamental fragility, basic geometric

Comments:

Abstract:This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.

74. 【2604.01844】FaCT-GS: Fast and Scalable CT Reconstruction with Gaussian Splatting

Link: https://arxiv.org/abs/2604.01844

Authors: Pawel Tomasz Pieta, Rasmus Juul Pedersen, Sina Borgi, Jakob Sauer Jørgensen, Jens Wenzel Andreasen, Vedrana Andersen Dahl

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: X-ray Computed Tomography, Computed Tomography, X-ray Computed, Gaussian Splatting, dominating technique

Comments:

Abstract:Gaussian Splatting (GS) has emerged as a dominant technique for image rendering and has quickly been adapted for the X-ray Computed Tomography (CT) reconstruction task. However, despite being on par with or better than many of its predecessors, the benefits of GS are typically not substantial enough to motivate a transition from well-established reconstruction algorithms. This paper addresses the most significant remaining limitations of the GS-based approach by introducing FaCT-GS, a framework for fast and flexible CT reconstruction. Enabled by an in-depth optimization of the voxelization and rasterization pipelines, our new method is significantly faster than its predecessors and scales well with projection and output volume size. Furthermore, the improved voxelization enables rapid fitting of Gaussians to pre-existing volumes, which can serve as a prior for warm-starting the reconstruction, or simply as an alternative, compressed representation. FaCT-GS is over 4x faster than the state-of-the-art GS CT reconstruction on standard 512x512 projections, and over 13x faster on 2k projections. Implementation available at: this https URL.

75. 【2604.01843】Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images

Link: https://arxiv.org/abs/2604.01843

Authors: Jamie S. J. Stirling, Noura Al-Moubayed, Hubert P. H. Shum

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: learn discrete neural, Vector quantization approaches, inherently position-dependent, contextually entangled, requiring autoregressive

Comments: 15 pages plus references; 5 figures; supplementary appended; accepted to ICPR 2026

Abstract:Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by 3.5× relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
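
To make the contrast between nearest-neighbour quantization and a matching-based assignment concrete, here is a toy sketch. It reflects one plausible reading of "optimal bipartite matching" (each latent vector receives a distinct codebook entry, chosen to minimize total squared distance); the abstract does not spell out the algorithm, so the function names, the distinct-code constraint, and the brute-force search are assumptions for illustration only.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbour_quantize(latents, codebook):
    """Each latent independently picks its closest code, so codes may repeat."""
    return [
        min(range(len(codebook)), key=lambda j: sq_dist(z, codebook[j]))
        for z in latents
    ]


def matching_quantize(latents, codebook):
    """Assign every latent a distinct code, minimizing the total squared distance.

    Brute force over permutations: acceptable for a toy example only. A practical
    version would use an assignment solver (e.g. the Hungarian algorithm).
    """
    if len(latents) > len(codebook):
        raise ValueError("need at least as many codes as latents")
    best = min(
        permutations(range(len(codebook)), len(latents)),
        key=lambda perm: sum(sq_dist(z, codebook[j]) for z, j in zip(latents, perm)),
    )
    return list(best)


codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latents = [(0.1, 0.0), (0.2, 0.1), (0.1, 0.3)]

nn_codes = nearest_neighbour_quantize(latents, codebook)
matched_codes = matching_quantize(latents, codebook)
print(nn_codes)       # [0, 0, 0]
print(matched_codes)  # [0, 1, 2]
```

In this example all three latents collapse onto code 0 under nearest-neighbour lookup, whereas the matching spreads them over three different codes. That kind of spreading is one way a matching step could raise the effective capacity of an unordered set of codes.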

76. 【2604.01836】Semantic Segmentation of Textured Non-manifold 3D Meshes using Transformers

Link: https://arxiv.org/abs/2604.01836

Authors: Mohammadreza Heidarianbaei, Max Mehltretter, Franz Rottensteiner

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: jointly represent geometry, irregular structure poses, structure poses significant, poses significant challenges, meshes jointly represent

Comments:

Abstract:Textured 3D meshes jointly represent geometry, topology, and appearance, yet their irregular structure poses significant challenges for deep-learning-based semantic segmentation. While a few recent methods operate directly on meshes without imposing geometric constraints, they typically overlook the rich textural information also provided by such meshes. We introduce a texture-aware transformer that learns directly from raw pixels associated with each mesh face, coupled with a new hierarchical learning scheme for multi-scale feature aggregation. A texture branch summarizes all face-level pixels into a learnable token, which is fused with geometrical descriptors and processed by a stack of Two-Stage Transformer Blocks (TSTB), which allow for both a local and a global information flow. We evaluate our model on the Semantic Urban Meshes (SUM) benchmark and a newly curated cultural-heritage dataset comprising textured roof tiles with triangle-level annotations for damage types. Our method achieves 81.9% mF1 and 94.3% OA on SUM and 49.7% mF1 and 72.8% OA on the new dataset, substantially outperforming existing approaches.

77. 【2604.01834】Ranking-Guided Semi-Supervised Domain Adaptation for Severity Classification

Link: https://arxiv.org/abs/2604.01834

Authors: Shota Harada, Ryoma Bise, Kiyohito Tanaka, Seiichi Uchida

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: medical image analysis, Semi-supervised domain adaptation, making it promising, image analysis, Semi-supervised domain

Comments:

Abstract:Semi-supervised domain adaptation leverages a few labeled and many unlabeled target samples, making it promising for addressing domain shifts in medical image analysis. However, existing methods struggle with severity classification due to unclear class boundaries. Severity classification involves naturally ordered class labels, complicating adaptation. We propose a novel method that aligns source and target domains using rank scores learned via ranking with class order. Specifically, Cross-Domain Ranking ranks sample pairs across domains, while Continuous Distribution Alignment aligns rank score distributions. Experiments on ulcerative colitis and diabetic retinopathy classification validate the effectiveness of our approach, demonstrating successful alignment of class-specific rank score distributions.

78. 【2604.01833】Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks

Link: https://arxiv.org/abs/2604.01833

Authors: Yaxin Luo, Zhiqiang Shen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: pre-training models differs, models differs significantly, pre-training models, language pre-training models, differs significantly

Comments:

79. 【2604.01826】SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers

Link: https://arxiv.org/abs/2604.01826

Authors: Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Min Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high generative fidelity, achieve high generative, rectified-flow transformers, multi-token interactions, based on rectified-flow

Comments: CVPR26

Abstract:Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images. Motivated by these insights, we propose SafeRoPE, a lightweight and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. SafeRoPE combines both head-wise LRS and RoPE perturbations to perform risk-specific head-wise rotation on query and key vector embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation of MMDiT. Codes are available at this https URL.
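
The two ingredients named in the abstract above, a projection-based risk score and a rotation applied to query/key vectors, can be sketched in a few lines. This is only an illustration under stated assumptions: the score is taken to be the fraction of a vector's squared norm that lies in a given subspace, the basis is assumed orthonormal, and the rotation angle is assumed to scale linearly with the score. None of these details, nor the function names, come from the paper.

```python
import math


def latent_risk_score(x, basis):
    """Fraction of x's squared norm inside span(basis); basis rows assumed orthonormal."""
    norm_sq = sum(v * v for v in x)
    if norm_sq == 0:
        return 0.0
    proj_sq = sum(sum(b_i * x_i for b_i, x_i in zip(b, x)) ** 2 for b in basis)
    return proj_sq / norm_sq


def rotate_pairs(x, angle):
    """Rotate consecutive coordinate pairs (x0, x1), (x2, x3), ... by `angle` (RoPE-style)."""
    if len(x) % 2:
        raise ValueError("vector length must be even")
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def perturb(query, basis, max_angle=math.pi / 2):
    """Rotate the query by an angle proportional to its risk score (assumed mapping)."""
    return rotate_pairs(query, max_angle * latent_risk_score(query, basis))


unsafe_basis = [(1.0, 0.0, 0.0, 0.0)]  # hypothetical one-dimensional "unsafe" subspace
benign_query = [0.0, 1.0, 1.0, 0.0]    # orthogonal to the subspace, score 0
risky_query = [2.0, 0.0, 0.0, 0.0]     # lies entirely in the subspace, score 1

print(perturb(benign_query, unsafe_basis))
print(perturb(risky_query, unsafe_basis))
```

In this toy setup a vector orthogonal to the subspace is left untouched, while a vector lying inside it is rotated away. The rotation preserves the vector's norm, which is consistent with the abstract's claim that benign content need not be degraded.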

80. 【2604.01824】STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

Link: https://arxiv.org/abs/2604.01824

Authors: Emad Bahrami, Olga Zatsarynna, Parth Pathak, Sunando Sengupta, Juergen Gall, Mohsen Fayyaz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: video question answering, Importance-aware Variant Exploration, STRIVE, reinforcement learning framework, question answering

Comments:

Abstract:We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
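
The low-variance problem and the effect of normalizing jointly over visual variants can be seen in a small numerical sketch. The reward values, the `normalize` helper, and the use of a simple mean/standard-deviation normalization are assumptions chosen for illustration; the abstract does not specify STRIVE's exact estimator.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-6):
    """Group-relative advantages: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# rewards[v][g]: reward of generation g on visual variant v (made-up numbers).
rewards = [
    [1.0, 1.0, 1.0, 1.0],  # original video: every response is correct
    [1.0, 0.0, 1.0, 1.0],  # variant 1: one response fails
    [0.0, 0.0, 1.0, 0.0],  # variant 2: most responses fail
]

# Normalizing each variant on its own: the first group has zero variance,
# so all of its advantages are zero and it contributes no learning signal.
per_variant = [normalize(row) for row in rewards]

# Normalizing jointly over all (variant, generation) pairs.
n_gen = len(rewards[0])
joint_flat = normalize([r for row in rewards for r in row])
joint = [joint_flat[i * n_gen:(i + 1) * n_gen] for i in range(len(rewards))]

print(per_variant[0])
print(joint[0])
```

With joint normalization, the responses on the original video receive positive advantages because they are compared against responses on the harder variants as well, rather than only against each other.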

81. 【2604.01798】A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection

Link: https://arxiv.org/abs/2604.01798

Authors: Arezoo Borji, Gernot Kronreif, Bernhard Angermayr, Francisco Mario Calisto, Wolfgang Birkfellner, Inna Servetnyk, Yinyin Yuan, Sepideh Hatamikia

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: diverse molecular profiles, highly heterogeneous disease, classifying breast cancer, Breast cancer, heterogeneous disease

Comments:

Abstract:Breast cancer is a highly heterogeneous disease with diverse molecular profiles. The PAM50 gene signature is widely recognized as a standard for classifying breast cancer into intrinsic subtypes, enabling more personalized treatment strategies. In this study, we introduce a novel optimization-driven deep learning framework that aims to reduce reliance on costly molecular assays by directly predicting PAM50 subtypes from H&E-stained whole-slide images (WSIs). Our method jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count by combining the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation. The proposed method can identify a small but highly informative patch subset for classification. We used a ResNet18 backbone for feature extraction and a custom CNN head for classification. For evaluation, we used the internal TCGA-BRCA dataset as the training cohort and the external CPTAC-BRCA dataset as the test cohort. On the internal dataset (627 WSIs from the TCGA-BRCA cohort), the method achieved an F1-score of 0.8812 and an AUC of 0.9841. On the external validation dataset, it achieved an F1-score of 0.7952 and an AUC of 0.9512. These findings indicate that the proposed optimization-guided, uncertainty-aware patch selection can achieve high performance and improve the computational efficiency of histopathology-based PAM50 classification compared to existing methods, suggesting a scalable imaging-based replacement that has the potential to support clinical decision-making.

82. 【2604.01791】PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency

Link: https://arxiv.org/abs/2604.01791

Authors: Leezy Han, Seunggyu Kim, Dongseok Shim, Hyeonbeom Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Monocular depth estimation, depth estimation, depth, widely adopted, perception systems

Comments: Accepted at CVPR 2026

Abstract:Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes jitter but can also lead to estimation failures when the depth range changes abruptly. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we estimate camera pose and sparse depth from triangulation using optical flow between consecutive frames. The sparse depth estimates are used to update a recursive Bayesian estimate of the metric scale, which is then applied to rescale the relative depth predicted by a pre-trained depth estimation foundation model. The proposed method is evaluated on the KITTI, TartanAir, MS2, and our own dataset, demonstrating robust and accurate depth estimation performance.
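
A recursive Bayesian estimate of a single scale factor can be sketched as a scalar Kalman-style update. The Gaussian model, the random-walk process noise, the variable names, and the numbers below are all assumptions for illustration; the abstract does not describe the paper's actual filter. Each observation here stands for a scale ratio derived from a sparse triangulated depth and the relative depth predicted at the same pixel.

```python
def update_scale(scale_mean, scale_var, obs, obs_var, process_var=0.0):
    """One recursive Bayesian (scalar Kalman) update of the metric scale estimate."""
    # Predict: the scale may drift slightly between frames (random-walk assumption).
    scale_var = scale_var + process_var
    # Correct: blend prior and observation according to their variances.
    gain = scale_var / (scale_var + obs_var)
    new_mean = scale_mean + gain * (obs - scale_mean)
    new_var = (1.0 - gain) * scale_var
    return new_mean, new_var


# Vague prior, then four per-frame scale observations as (value, variance).
# The third one is an outlier with a large variance, so it barely moves the estimate.
scale_mean, scale_var = 1.0, 100.0
observations = [(4.2, 1.0), (3.8, 1.0), (9.0, 25.0), (4.0, 1.0)]

for obs, obs_var in observations:
    scale_mean, scale_var = update_scale(
        scale_mean, scale_var, obs, obs_var, process_var=0.01
    )

# Rescale relative depth values from a (hypothetical) foundation-model prediction.
relative_depth = [0.5, 1.0, 2.0]
metric_depth = [scale_mean * d for d in relative_depth]
print(scale_mean, scale_var, metric_depth)
```

Because the estimate is updated incrementally rather than recomputed per frame, a single noisy observation changes the scale only slightly, which is the kind of behaviour that suppresses frame-to-frame jitter.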

83. 【2604.01777】GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents

Link: https://arxiv.org/abs/2604.01777

Authors: Mengtian Li, Fan Yang, Ruixue Xiong, Yiyan Fan, Zhifeng Xie, Zeyu Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Chinese classical gardens, hold great potential, style of Chinese, Chinese classical, Jiangnan gardens

Comments: CVPR 2026, Project page: [this https URL](https://monad-cube.github.io/GardenDesigner)

Abstract:Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. The water-centric terrain and explorative pathway rules are applied by terrain distribution and road generation agents. Selection and spatial layout of garden assets follow the aesthetic and cultural constraints. Consequently, we propose asset selection and layout optimization agents to select and arrange objects for each area in the garden. Additionally, we introduce GardenVerse for Jiangnan garden construction, including expert-annotated garden knowledge to enhance the asset arrangement process. To enable interaction and editing, we develop an interactive interface and tools in Unity, in which non-expert users can construct Jiangnan gardens via text input within one minute. Experiments and human evaluations demonstrate that GardenDesigner can generate diverse and aesthetically pleasing Jiangnan gardens. Project page is available at this https URL.

84. 【2604.01766】FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation

Link: https://arxiv.org/abs/2604.01766

Authors: Taimur Khan, Hannes Feilhauer, Muhammad Jazib Zafar

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: High Resolution, Foliage Height Diversity, Plant Area Index, forest structure data, Canopy Height Model

Comments: Paper in-review

Abstract:Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 km² of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, R² = 0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29-46% in MAE (5.81 m vs. 8.14-10.84 m) with stronger correlation coefficients (0.713 vs. 0.166-0.652). Ablations show that multi-modal fusion improves performance by 10-26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.
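
The reported 29 to 46% MAE improvement can be checked directly from the MAE figures quoted in the abstract. The helper name below is ours; only the three MAE values come from the text.

```python
def relative_reduction(ours, baseline):
    """Relative error reduction of `ours` with respect to `baseline`."""
    return (baseline - ours) / baseline


student_mae = 5.81             # metres, from the abstract
baseline_maes = (8.14, 10.84)  # metres, best and worst baseline MAE from the abstract

reductions = [
    round(100 * relative_reduction(student_mae, b)) for b in baseline_maes
]
print(reductions)  # [29, 46]
```

The two endpoints of the baseline range reproduce the quoted percentages, so the improvement is expressed relative to each baseline's MAE rather than as an absolute difference in metres.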

85. 【2604.01765】DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning

Link: https://arxiv.org/abs/2604.01765

Authors: Yang Zhou, Xiaofeng Wang, Hao Shao, Letian Wang, Guosheng Zhao, Jiangnan Shao, Jiagang Zhu, Tingdong Yu, Zheng Zhu, Guan Huang, Steven L. Waslander

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: spatio-temporal world modeling, emerged to bridge, unifying their reasoning, reasoning and instruction-following, instruction-following capabilities

Comments: 11 pages, 4 figures; Project Website: [this https URL](https://drivedreamer-policy.github.io/)

Abstract:Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying the reasoning and instruction-following capabilities of the former with the spatio-temporal world modeling of the latter. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.

86. 【2604.01764】Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning

Link: https://arxiv.org/abs/2604.01764

Authors: Seyed Amir Kasaei,Arash Marioriyad,Mahbod Khaleti,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, achieved remarkable proficiency, Large Vision-Language, explicit visual recognition, effectively describing

Comments: Accepted at ICLR 2026 Workshop: From Human Cognition to AI Reasoning (HCAIR)

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at this https URL.
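
The Exact Match figure quoted above can be illustrated with a minimal sketch. The normalization rules here (lowercasing, stripping punctuation, collapsing whitespace) and the function names are assumptions for illustration; the abstract does not say how RebusBench normalizes answers.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace (assumed rules)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def exact_match(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical puzzle answers: the first matches after normalization, the second does not.
score = exact_match(["Piece of cake!", "break a leg"], ["piece of cake", "Break the ice"])
```

A metric this strict gives no credit for near-synonyms, which is presumably why the abstract reports a separate semantic accuracy alongside it.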

87. 【2604.01763】Cosine-Normalized Attention for Hyperspectral Image Classification

Link: https://arxiv.org/abs/2604.01763

Authors: Muhammad Ahmad,Manuel Mazzara

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: mechanisms typically rely, hyperspectral image classification, improved hyperspectral image, attention mechanisms typically, long-range spatial-spectral dependencies

Comments:

Abstract:Transformer-based methods have improved hyperspectral image classification (HSIC) by modeling long-range spatial-spectral dependencies; however, their attention mechanisms typically rely on dot-product similarity, which mixes feature magnitude and orientation and may be suboptimal for hyperspectral data. This work revisits attention scoring from a geometric perspective and introduces a cosine-normalized attention formulation that aligns similarity computation with the angular structure of hyperspectral signatures. By projecting query and key embeddings onto a unit hypersphere and applying a squared cosine similarity, the proposed method emphasizes angular relationships while reducing sensitivity to magnitude variations. The formulation is integrated into a spatial-spectral Transformer and evaluated under extremely limited supervision. Experiments on three benchmark datasets demonstrate that the proposed approach consistently achieves higher performance, outperforming several recent Transformer- and Mamba-based models despite using a lightweight backbone. In addition, a controlled analysis of multiple attention score functions shows that cosine-based scoring provides a reliable inductive bias for hyperspectral representation learning.
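
As a rough illustration of the scoring idea described above, the sketch below L2-normalizes queries and keys and squares their cosine similarity. The row-wise softmax and its `temperature` parameter are assumptions added to turn scores into weights; the abstract does not specify how the scores are normalized, and this is not the authors' implementation.

```python
import math


def unit(v):
    """Project a vector onto the unit hypersphere (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def squared_cosine_scores(queries, keys):
    """Score matrix S[i][j] = cos(q_i, k_j) ** 2, which lies in [0, 1]."""
    qs = [unit(q) for q in queries]
    ks = [unit(k) for k in keys]
    return [[sum(a * b for a, b in zip(q, k)) ** 2 for k in ks] for q in qs]


def attention_weights(scores, temperature=0.1):
    """Row-wise softmax over scores; `temperature` is an assumed hyperparameter."""
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp((s - m) / temperature) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Keys with very different magnitudes: only their direction affects the score.
scores = squared_cosine_scores([[1.0, 0.0]], [[10.0, 0.0], [0.0, 3.0], [-2.0, 0.0]])
weights = attention_weights(scores)
```

Because the cosine is squared, a key pointing in the opposite direction scores the same as an aligned one, so the score depends only on how collinear the two vectors are, not on their magnitudes or sign.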

88. 【2604.01761】Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion

Link: https://arxiv.org/abs/2604.01761

Authors: Edoardo A. Dominici,Thomas Deixelberger,Konstantinos Vardis,Markus Steinberger

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: world simulation, view synthesis, recently been applied, applied with success, success to problems

Comments: Project page: [this https URL](https://dedoardo.github.io/projects/control-dino/)

Abstract:Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject-specific editing and for aligning and training video diffusion models, but not in the role of a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, like DINO, contain a lot of entangled information about the style, lighting, and semantics of the scene. This makes them great at reconstruction tasks but limits their generative capabilities. In this paper, we show how we can use the features for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from other features that we wish to preserve, enabling robust control for appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.

89. 【2604.01749】Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding

Link: https://arxiv.org/abs/2604.01749

Authors: Jiayun Jin,Haolong Chai,Xueying Huang,Xiaoqing Guo,Zengwei Zheng,Zhan Zhou,Junmei Wang,Xinyu Wang,Jie Liu,Binbin Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: clinical diagnostics due, radiation-free nature, imaging is widely, real-time capability, capability and radiation-free

Comments: Accepted by CVPR 2026

Abstract:Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for other modalities, and are difficult to directly apply to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish Ultrasonographic Diagnostic Taxonomy (UDT) containing two hierarchical knowledge frameworks. Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, and Ultrasonographic Diagnostic Attribute Framework formalizes nine diagnostic dimensions, including body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF's textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.

90. 【2604.01747】Unifying UAV Cross-View Geo-Localization via 3D Geometric Perception

Link: https://arxiv.org/abs/2604.01747

Authors: Haoyuan Li,Wen Yang,Fang Xu,Hong Tan,Haijian Zhang,Shengyang Li,Gui-Song Xia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, remains challenging due, severe geometric discrepancy

Comments: 15 pages, 10 figures

Abstract:Cross-view geo-localization for Unmanned Aerial Vehicles (UAVs) operating in GNSS-denied environments remains challenging due to the severe geometric discrepancy between oblique UAV imagery and orthogonal satellite maps. Most existing methods address this problem through a decoupled pipeline of place retrieval and pose estimation, implicitly treating perspective distortion as appearance noise rather than an explicit geometric transformation. In this work, we propose a geometry-aware UAV geo-localization framework that explicitly models the 3D scene geometry to unify coarse place recognition and fine-grained pose estimation within a single inference pipeline. Our approach reconstructs a local 3D scene from multi-view UAV image sequences using a Visual Geometry Grounded Transformer (VGGT), and renders a virtual Bird's-Eye View (BEV) representation that orthorectifies the UAV perspective to align with satellite imagery. This BEV serves as a geometric intermediary that enables robust cross-view retrieval and provides spatial priors for accurate 3 Degrees of Freedom (3-DoF) pose regression. To efficiently handle multiple location hypotheses, we introduce a Satellite-wise Attention Block that isolates the interaction between each satellite candidate and the reconstructed UAV scene, preventing inter-candidate interference while maintaining linear computational complexity. In addition, we release a recalibrated version of the University-1652 dataset with precise coordinate annotations and spatial overlap analysis, enabling rigorous evaluation of end-to-end localization accuracy. Extensive experiments on the refined University-1652 benchmark and SUES-200 demonstrate that our method significantly outperforms state-of-the-art baselines, achieving robust meter-level localization accuracy and improved generalization in complex urban environments.

91. 【2604.01742】Dense Point-to-Mask Optimization with Reinforced Point Selection for Crowd Instance Segmentation

Link: https://arxiv.org/abs/2604.01742

Authors: Hongru Chen,Jiyang Huang,Jia Wan,Antoni B. Chan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: range of applications, including surveillance, surveillance and transportation, crucial task, wide range

Comments:

Abstract:Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying currently popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations for traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from a sampling of the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new loss functions supervised by masks that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.
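
The GRPO training signal mentioned above is usually built from group-relative advantages: each sampled candidate's reward is standardized against the other candidates in its group. The sketch below shows only that normalization step. The reward values are hypothetical (for example, a mask-quality score per sampled point), and the abstract does not describe the paper's actual reward design.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate points sampled around one initial prediction.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Candidates that beat their group's average get a positive advantage and the others a negative one, so no separately learned value function is needed to provide a baseline.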

92. 【2604.01736】Setup-Independent Full Projector Compensation

Link: https://arxiv.org/abs/2604.01736

Authors: Haibo Li,Qingyue Deng,Jijiang Li,Haibin Ling,Bingyao Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: seeks to correct, distortions that occur, occur when images, images are projected, projected onto nonplanar

Comments: 16 pages, 17 figures

Abstract:Projector compensation seeks to correct geometric and photometric distortions that occur when images are projected onto nonplanar or textured surfaces. However, most existing methods are highly setup-dependent, requiring fine-tuning or retraining whenever the surface, lighting, or projector-camera pose changes. Progress has been limited by two key challenges: (1) the absence of large, diverse training datasets and (2) existing geometric correction models are typically constrained by specific spatial setups; without further retraining or fine-tuning, they often fail to generalize directly to novel geometric configurations. We introduce SIComp, the first Setup-Independent framework for full projector Compensation, capable of generalizing to unseen setups without fine-tuning or retraining. To enable this, we construct a large-scale real-world dataset spanning 277 distinct projector-camera setups. SIComp adopts a co-adaptive design that decouples geometry and photometry: A carefully tailored optical flow module performs online geometric correction, while a novel photometric network handles photometric compensation. To further enhance robustness under varying illumination, we integrate intensity-varying surface priors into the network design. Extensive experiments demonstrate that SIComp consistently produces high-quality compensation across diverse unseen setups, substantially outperforming existing methods in terms of generalization ability and establishing the first generalizable solution to projector compensation. The code and dataset are available on our project page: this https URL

93. 【2604.01715】SteerFlow: Steering Rectified Flows for Faithful Inversion-Based Image Editing

Link: https://arxiv.org/abs/2604.01715

Authors: Thinh Dao,Zhen Wang,Kien T. Pham,Long Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: target conditional guidance, Recent advances, flow-based generative models, text-guided image editing, text-guided image

Comments:

Abstract:Recent advances in flow-based generative models have enabled training-free, text-guided image editing by inverting an image into its latent noise and regenerating it under a new target conditional guidance. However, existing methods struggle to preserve source fidelity: higher-order solvers incur additional model inferences, truncated inversion constrains editability, and feature injection methods lack architectural transferability. To address these limitations, we propose SteerFlow, a model-agnostic editing framework with strong theoretical guarantees on source fidelity. In the forward process, we introduce an Amortized Fixed-Point Solver that implicitly straightens the forward trajectory by enforcing velocity consistency across consecutive timesteps, yielding a high-fidelity inverted latent. In the backward process, we introduce Trajectory Interpolation, which adaptively blends target-editing and source-reconstruction velocities to keep the editing trajectory anchored to the source. To further improve background preservation, we introduce an Adaptive Masking mechanism that spatially constrains the editing signal with concept-guided segmentation and source-target velocity differences. Extensive experiments on FLUX.1-dev and Stable Diffusion 3.5 Medium demonstrate that SteerFlow consistently achieves better editing quality than existing methods. Finally, we show that SteerFlow extends naturally to a complex multi-turn editing paradigm without accumulating drift.

94. 【2604.01714】End-to-End Shared Attention Estimation via Group Detection with Feedback Refinement

Link: https://arxiv.org/abs/2604.01714

Authors: Chihiro Nakatani,Norimichi Ukita,Jean-Marc Odobez

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shared attention estimation, shared attention, attention estimation, group, attention

Comments: Accepted to CVPR 2026 Workshop (GAZE 2026)

Abstract:This paper proposes an end-to-end shared attention estimation method via group detection. Most previous methods estimate shared attention (SA) without detecting the actual group of people focusing on it, or assume that there is a single SA point in a given image. These issues limit the applicability of SA detection in practice and impact performance. To address them, we propose to simultaneously achieve group detection and shared attention estimation using a two-step process: (i) the generation of SA heatmaps relying on individual gaze attention heatmaps and group membership scalars estimated in a group inference; (ii) a refinement of the initial group memberships that accounts for the initial SA heatmaps, followed by the final prediction of the SA heatmap. Experiments demonstrate that our method outperforms other methods in group detection and shared attention estimation. Additional analyses validate the effectiveness of the proposed components. Code: this https URL.

95. 【2604.01709】Bias mitigation in graph diffusion models

Link: https://arxiv.org/abs/2604.01709

Authors: Meng Yu,Kun Zhan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing graph diffusion, significant bias problems, standard Gaussian distribution, standard Gaussian, graph diffusion models

Comments: Accepted to ICLR 2025!

Abstract:Most existing graph diffusion models have significant bias problems. We observe that the forward diffusion's maximum perturbation distribution in most models deviates from the standard Gaussian distribution, while reverse sampling consistently starts from a standard Gaussian distribution, which results in a reverse-starting bias. Together with the inherent exposure bias of diffusion models, this results in degraded generation quality. This paper proposes a comprehensive approach to mitigate both biases. To mitigate reverse-starting bias, we employ a newly designed Langevin sampling algorithm to align with the forward maximum perturbation distribution, establishing a new reverse-starting point. To address the exposure bias, we introduce a score correction mechanism based on a newly defined score difference. Our approach, which requires no network modifications, is validated across multiple models, datasets, and tasks, achieving state-of-the-art results. Code is at this https URL

96. 【2604.01700】Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation

Link: https://arxiv.org/abs/2604.01700

Authors: Lingyu Liu,Yaxiong Wang,Li Zhu,Zhedong Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Video frame interpolation, frame interpolation aims, realistic intermediate frames, synthesize realistic intermediate, Video frame

Comments:

Abstract:Video frame interpolation aims to synthesize realistic intermediate frames between given endpoints while adhering to specific motion semantics. While recent generative models have improved visual fidelity, they predominantly operate in a unidirectional manner, lacking mechanisms to self-verify temporal consistency. This often leads to motion drift, directional ambiguity, and boundary misalignment, especially in long-range sequences. Inspired by the principle of temporal cycle-consistency in self-supervised learning, we propose a novel bidirectional framework that enforces symmetry between forward and backward generation trajectories. Our approach introduces learnable directional tokens to explicitly condition a shared backbone on temporal orientation, enabling the model to jointly optimize forward synthesis and backward reconstruction within a single unified architecture. This cycle-consistent supervision acts as a powerful regularizer, ensuring that generated motion paths are logically reversible. Furthermore, we employ a curriculum learning strategy that progressively trains the model from short to long sequences, stabilizing dynamics across varying durations. Crucially, our cyclic constraints are applied only during training; inference requires a single forward pass, maintaining the high efficiency of the base model. Extensive experiments show that our method achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines while incurring no additional computational overhead.

97. 【2604.01693】From Understanding to Erasing: Towards Complete and Stable Video Object Removal

Link: https://arxiv.org/abs/2604.01693

Authors: Dingming Liu,Wenjing Wang,Chen Li,Jing Lyu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserving spatio-temporal consistency, plausibly completing missing, completing missing regions, eliminate target objects, spatio-temporal consistency

Comments:

Abstract:Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clear and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: this https URL.

98. 【2604.01679】BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography

Link: https://arxiv.org/abs/2604.01679

Authors: Ba-Thinh Nguyen,Thi-Duyen Ngo,Thanh-Trung Huynh,Thanh-Ha Le,Huy-Hieu Pham

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: analyzing subtle appearance, subtle appearance variations, appearance variations induced, Remote photoplethysmography, enables contactless physiological

Comments:

Abstract:Remote photoplethysmography (rPPG) enables contactless physiological sensing from facial videos by analyzing subtle appearance variations induced by blood circulation. However, modeling the temporal dynamics of these signals remains challenging, as many deep learning methods rely on temporal shifting or convolutional operators that aggregate information primarily from neighboring frames, resulting in predominantly local temporal modeling and limited temporal receptive fields. To address this limitation, we propose BTS-rPPG, a temporal modeling framework based on Orthogonal Butterfly Temporal Shifting (BTS). Inspired by the butterfly communication pattern in the Fast Fourier Transform (FFT), BTS establishes structured frame interactions via an XOR-based butterfly pairing schedule, progressively expanding the temporal receptive field and enabling efficient propagation of information across distant frames. Furthermore, we introduce an orthogonal feature transfer mechanism (OFT) that filters the source feature with respect to the target context before temporal shifting, retaining only the orthogonal component for cross-frame transmission. This reduces redundant feature propagation and encourages complementary temporal interaction. Extensive experiments on multiple benchmark datasets demonstrate that BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.
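
The two mechanisms named above can be sketched in a few lines. This assumes the classic FFT butterfly layout, where stage `s` pairs frame `i` with frame `i XOR 2**s`, and treats the orthogonal transfer as plain vector rejection. The stage ordering, the power-of-two frame count, and the function names are illustrative assumptions, not details confirmed by the abstract.

```python
def butterfly_schedule(num_frames):
    """For each stage s, list the partner of every frame i, namely i XOR 2**s."""
    if num_frames < 1 or num_frames & (num_frames - 1):
        raise ValueError("num_frames must be a power of two")
    stages = num_frames.bit_length() - 1
    return [[i ^ (1 << s) for i in range(num_frames)] for s in range(stages)]


def reachable_frames(num_frames):
    """Track which source frames can influence each frame after all stages."""
    reach = [{i} for i in range(num_frames)]
    for partners in butterfly_schedule(num_frames):
        reach = [reach[i] | reach[p] for i, p in enumerate(partners)]
    return reach


def orthogonal_component(source, target):
    """Remove from `source` its projection onto `target` (vector rejection)."""
    tt = sum(t * t for t in target)
    if tt == 0:
        return list(source)
    coef = sum(s * t for s, t in zip(source, target)) / tt
    return [s - coef * t for s, t in zip(source, target)]


schedule = butterfly_schedule(8)   # 3 stages for 8 frames
reach = reachable_frames(8)        # every frame sees all 8 after log2(8) stages
residual = orthogonal_component([1.0, 1.0], [1.0, 0.0])
```

With this pairing, information from any frame reaches every other frame after log2(T) stages, whereas a shift that only touches adjacent frames would need on the order of T steps.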

99. 【2604.01678】Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding

Link: https://arxiv.org/abs/2604.01678

Authors: Yuheng Jiang,Yiwen Cai,Zihao Wang,Yize Wu,Sicheng Li,Zhuo Su,Shaohui Jiao,Lan Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Volumetric video seeks, Volumetric video, video seeks, Volumetric, recent Gaussian-based approaches

Comments: Project page: [this https URL](https://caiyw2023.github.io/Director/)

Abstract:Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance but are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and finetune their motions, yielding reliable initialization and reducing drift. For training, we further introduce geometry-aware SDF constraints, along with regularization terms that enforce surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.

100. 【2604.01676】GPA: Learning GUI Process Automation from Demonstrations

Link: https://arxiv.org/abs/2604.01676

Authors: Zirui Zhao,Jun Hao Liew,Yan Yang,Wenzhuo Yang,Ziyang Luo,Doyen Sahoo,Silvio Savarese,Junnan Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Keywords: Robotic Process Automation, GUI Process Automation, vision-based Robotic Process, Process Automation, general vision-based Robotic

Comments:

Abstract:GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA), which enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Determinism and reliability safeguarded by readiness calibration; and (3) Privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves a higher success rate with 10 times faster execution in finishing long-horizon GUI tasks.

101. 【2604.01675】HOT: Harmonic-Constrained Optimal Transport for Remote Photoplethysmography Domain Adaptation

Link: https://arxiv.org/abs/2604.01675

Authors: Ba-Thinh Nguyen,Thi-Duyen Ngo,Thanh-Trung Huynh,Thanh-Ha Le,Huy-Hieu Pham

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables non-contact physiological, non-contact physiological measurement, Remote photoplethysmography, substantial performance degradation, enables non-contact

Comments:

Abstract:Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos; however, its practical deployment is often hindered by substantial performance degradation under domain shift. While recent deep learning-based rPPG methods have achieved strong performance on individual datasets, they frequently overfit to appearance-related factors, such as illumination, camera characteristics, and color response, that vary significantly across domains. To address this limitation, we introduce frequency domain adaptation (FDA) as a principled strategy for modeling appearance variation in rPPG. By transferring low-frequency spectral components that encode domain-dependent appearance characteristics, FDA encourages rPPG models to learn invariance to appearance variations while retaining cardiac-induced signals. To further support physiologically consistent alignment under such appearance variation, we propose Harmonic-Constrained Optimal Transport (HOT), which leverages the harmonic property of cardiac signals to guide alignment between original and FDA-transferred representations. Extensive cross-dataset experiments demonstrate that the proposed FDA and HOT framework effectively enhances the robustness and generalization of rPPG models across diverse datasets.
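
The low-frequency transfer described above can be illustrated on a 1D signal: the source keeps its phase and its high-frequency amplitudes, while the amplitudes of the lowest frequency bins are taken from the target. This is a simplified sketch under stated assumptions. The paper presumably operates on video frames rather than 1D signals, and the `band` parameter and function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), adequate for short signals."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def low_frequency_transfer(source, target, band):
    """Replace source amplitudes in bins within `band` of DC with the target's,
    keeping the source phase in every bin."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    n = len(source)
    src, tgt = dft(source), dft(target)
    out = list(src)
    for k in range(n):
        if min(k, n - k) <= band:  # covers positive and negative frequencies
            out[k] = cmath.rect(abs(tgt[k]), cmath.phase(src[k]))
    return idft(out)


n = 16
source = [2.0 + math.sin(2 * math.pi * 3 * t / n) for t in range(n)]  # offset 2, fast ripple
target = [5.0 + 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]  # offset 5
mixed = low_frequency_transfer(source, target, band=0)  # swap only the DC amplitude
```

With `band=0` only the mean level changes, so `mixed` carries the target's offset while the source's fast ripple is untouched. This mirrors, in a toy setting, the idea of altering appearance-like low frequencies while retaining the higher-frequency content.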

102. 【2604.01669】Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion

Link: https://arxiv.org/abs/2604.01669

Authors: Juncen Guo,Xiaoguang Zhu,Jingyi Wu,Jingyu Zhang,Jingnan Cai,Zhenghao Niu,Liang Song

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: face severe challenges, perception systems face, systems face severe, open physical spaces, Embodied perception systems

Comments: Accepted by ICME2026

Abstract:Embodied perception systems face severe challenges of dynamic environment distribution drift when they continuously interact in open physical spaces. However, the existing domain incremental awareness methods often rely on the domain id obtained in advance during the testing phase, which limits their practicability in unknown interaction scenarios. At the same time, the model often overfits to the context-specific perceptual noise, which leads to insufficient generalization ability and catastrophic forgetting. To address these limitations, we propose a domain-id and exemplar-free incremental learning framework for embodied multimedia systems, which aims to achieve robust continuous environment adaptation. This method designs a disentangled representation mechanism to remove non-essential environmental style interference, and guide the model to focus on extracting semantic intrinsic features shared across scenes, thereby eliminating perceptual uncertainty and improving generalization. We further use the weight fusion strategy to dynamically integrate the old and new environment knowledge in the parameter space, so as to ensure that the model adapts to the new distribution without storing historical data and maximally retains the discrimination ability of the old environment. Extensive experiments on multiple standard benchmark datasets show that the proposed method significantly reduces catastrophic forgetting in a completely exemplar-free and domain-id free setting, and its accuracy is better than the existing state-of-the-art methods.

103. 【2604.01667】M3D-BFS: a Multi-stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis

Link: https://arxiv.org/abs/2604.01667

Authors: Rui Dong,Xiaotong Zhang,Jiaxing Li,Yueying Li,Jiayin Wei,Youyong Kong

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: downstream tasks, great significance, significance in neuroscience, neuroscience which integrates, integrates information

Comments:

Abstract:Multi-modal fusion, which integrates information from different modalities, is of great significance in neuroscience and can achieve better performance than uni-modal methods in downstream tasks. Current multi-modal fusion methods in brain networks, which mainly focus on structural connectivity (SC) and functional connectivity (FC) modalities, are static in nature. They feed different samples into the same model with identical computation, ignoring inherent differences between input samples. This lack of sample adaptation limits model performance. To this end, we propose a multi-stage dynamic fusion strategy (M3D-BFS) for sample-adaptive multi-modal brain network analysis. Unlike other static fusion methods, we design different mixture-of-experts (MoEs) for uni- and multi-modal representations, whose modules adapt to each input sample during inference. To alleviate the MoE issue in which expert training may collapse, we divide our method into three stages. We first train the uni-modal encoders separately, then pretrain the individual experts of the MoEs before finally finetuning the whole model. A multi-modal disentanglement loss is designed to enhance the final representations. To the best of our knowledge, this is the first work on dynamic fusion for multi-modal brain network analysis. Extensive experiments on different real-world datasets demonstrate the superiority of M3D-BFS.
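
To make the sample-adaptive idea concrete, here is a minimal sketch of a generic mixture-of-experts layer with a linear softmax gate. It shows how different inputs can be routed to different experts at inference time. The gate form, the toy experts, and all names are illustrative assumptions; the abstract does not describe the actual architecture of M3D-BFS.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, experts, gate_weights):
    """Mix expert outputs using gate probabilities computed from the input itself."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    outputs = [expert(x) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(p * out[d] for p, out in zip(probs, outputs)) for d in range(dim)]
    return mixed, probs


# Two toy experts and a gate with one weight row per expert (all hypothetical).
experts = [lambda v: [2.0 * a for a in v], lambda v: [-a for a in v]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]

mixed_a, probs_a = moe_forward([3.0, 0.0], experts, gate_weights)  # favors expert 0
mixed_b, probs_b = moe_forward([0.0, 3.0], experts, gate_weights)  # favors expert 1
```

The two samples pass through the same parameters but receive different expert mixtures, which is the contrast with static fusion that the abstract draws.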

104. 【2604.01666】DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data

Link: https://arxiv.org/abs/2604.01666

Authors: Wonjoon Jin, Jiyun Won, Janghyeok Han, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requiring fine-grained motion, motion, involving highly dynamic, realistic videos involving, videos involving highly

Comments: Accepted to CVPR 2026. Website: https://jinwonjoon.github.io/DynaVid/

Abstract:Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. This approach offers two key advantages. First, synthetic motion offers diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and then a motion-guided video generator produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving visual realism from real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves the realism and controllability in dynamic motion generation and camera motion control.

105. 【2604.01654】Moiré Video Authentication: A Physical Signature Against AI Video Generation

Link: https://arxiv.org/abs/2604.01654

Authors: Yuan Qing, Kunyu Zheng, Lingxiao Li, Boqing Gong, Chang Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

Keywords: made AI-synthesized content, AI-synthesized content increasingly, content increasingly difficult, Recent advances, generation have made

Comments: 17 pages, 14 figures

Abstract:Recent advances in video generation have made AI-synthesized content increasingly difficult to distinguish from real footage. We propose a physics-based authentication signature that real cameras produce naturally, but that generative models cannot faithfully reproduce. Our approach exploits the Moiré effect: the interference fringes formed when a camera views a compact two-layer grating structure. We derive the Moiré motion invariant, showing that fringe phase and grating image displacement are linearly coupled by optical geometry, independent of viewing distance and grating structure. A verifier extracts both signals from video and tests their correlation. We validate the invariant on both real-captured and AI-generated videos from multiple state-of-the-art generators, and find that real and AI-generated videos produce significantly different correlation signatures, suggesting a robust means of differentiating them. Our work demonstrates that deterministic optical phenomena can serve as physically grounded, verifiable signatures against AI-generated video.
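
The verifier's last step, checking whether two extracted signals are linearly coupled, can be illustrated with a Pearson correlation. This is a minimal sketch, not the authors' verifier: the function name, the synthetic per-frame signals, and the choice of Pearson correlation as the test statistic are assumptions made for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-frame measurements: grating image displacement (pixels)
# and unwrapped fringe phase (radians).
displacement = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
coupled_phase = [0.5 * d + 0.1 for d in displacement]  # linear coupling
uncoupled_phase = [0.3, -0.2, 0.4, -0.1, 0.2, -0.3]    # no consistent coupling

r_coupled = pearson_correlation(displacement, coupled_phase)
r_uncoupled = pearson_correlation(displacement, uncoupled_phase)
```

A signal pair that obeys a linear invariant gives a correlation magnitude close to 1, while a pair without that coupling gives a clearly lower value; where to place the decision threshold is left open here.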

106. 【2604.01646】MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label

Link: https://arxiv.org/abs/2604.01646

Authors: Junyoung Jung, Seokwon Kim, Jun Uk Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved impressive performance, densely annotated datasets, achieved impressive, impressive performance, performance on densely

Comments: Accepted to CVPR 2026

Abstract:Monocular 3D object detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely annotated setting is common in real-world scenarios where annotating every object is impractical. To address this, we propose a novel framework for sparsely annotated monocular 3D object detection with two key modules. First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. It maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates. Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision. Extensive experiments demonstrate the effectiveness of the proposed method. The source code is available at this https URL .
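
The filtering rule described for PBF, keeping only predictions that agree with a class prototype and have a reliable depth estimate, can be sketched as follows. The thresholds, the dictionary fields, and the use of cosine similarity are assumptions for illustration; the abstract does not specify them.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pseudo_labels(predictions, prototypes,
                         sim_threshold=0.8, max_depth_uncertainty=0.5):
    """Keep predictions that match their class prototype and have low depth uncertainty."""
    kept = []
    for pred in predictions:
        prototype = prototypes.get(pred["cls"])
        if prototype is None:
            continue  # no prototype learned for this class yet
        similar = cosine_similarity(pred["feature"], prototype) >= sim_threshold
        reliable = pred["depth_uncertainty"] <= max_depth_uncertainty
        if similar and reliable:
            kept.append(pred)
    return kept


# Hypothetical 2D RoI features and depth uncertainties.
prototypes = {"car": [1.0, 0.0]}
predictions = [
    {"id": 0, "cls": "car", "feature": [0.9, 0.1], "depth_uncertainty": 0.2},
    {"id": 1, "cls": "car", "feature": [0.1, 0.9], "depth_uncertainty": 0.2},
    {"id": 2, "cls": "car", "feature": [1.0, 0.0], "depth_uncertainty": 0.9},
]
pseudo_labels = filter_pseudo_labels(predictions, prototypes)
```

Only the first prediction survives: the second fails the feature-consistency check and the third fails the depth-reliability check.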

107. 【2604.01644】TOL: Textual Localization with OpenStreetMap

Link: https://arxiv.org/abs/2604.01644

Authors: Youqi Liao, Shuhao Kang, Jingyu Xu, Olaf Wysocki, Yan Xia, Jianping Li, Zhen Dong, Bisheng Yang, Xieyuanli Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: express spatial intent, Natural language, geospatial applications, express spatial, spatial intent

Comments: Tech repo

Abstract:Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O global localization task, which aims to estimate accurate 2 degree-of-freedom (DoF) positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.31% at 5m, 10m, and 25m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: this https URL.

108. 【2604.01641】LivingWorld: Interactive 4D World Generation with Environmental Dynamics

Link: https://arxiv.org/abs/2604.01641

Authors: Hyeongju Mun, In-Hwan Jin, Sohyeong Kim, Kyeongbo Kong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: environmental dynamics, motion, dynamics, scene, environmental

Abstract:We introduce LivingWorld, an interactive framework for generating 4D worlds with environmental dynamics from a single image. While recent advances in 3D scene generation enable large-scale environment creation, most approaches focus primarily on reconstructing static geometry, leaving scene-scale environmental dynamics such as clouds, water, or smoke largely unexplored. Modeling such dynamics is challenging because motion must remain coherent across an expanding scene while supporting low-latency user feedback. LivingWorld addresses this challenge by progressively constructing a globally coherent motion field as the scene expands. To maintain global consistency during expansion, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities across views. We further represent motion using a compact hash-based motion field, enabling efficient querying and stable propagation of dynamics throughout the scene. This representation also supports bidirectional motion propagation during rendering, producing long and temporally coherent 4D sequences without relying on expensive video-based refinement. On a single RTX 5090 GPU, generating each new scene expansion step requires 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation with globally coherent environmental dynamics. Video demonstrations are available at this http URL.

109. 【2604.01619】Automatic Image-Level Morphological Trait Annotation for Organismal Images

Link: https://arxiv.org/abs/2604.01619

Authors: Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: provide vital clues, organisms interact, physical characteristics, provide vital, vital clues

Comments: ICLR 2026

Abstract:Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.

110. 【2604.01618】Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Link: https://arxiv.org/abs/2604.01618

Authors: Jiawei Chen, Simin Huang, Jiawei Du, Shuaihang Chen, Yu Tian, Mingjie Wei, Chao Yu, Zhaoxia Yin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: shown strong performance, models have shown, shown strong, VLA, physically realizable adversarial

Abstract:Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making it difficult to optimize in an end-to-end manner. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long horizons and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.

111. 【2604.01612】NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy

Link: https://arxiv.org/abs/2604.01612

Authors: Kyeonghun Kim, Hyeonseok Jung, Youngung Han, Hyunsu Go, Eunseob Choi, Seongbin Park, Junsu Lim, Jiwon Yang, Sumin Lee, Insung Hwang, Ken Ying-Kai Liao, Nam-Joon Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: motivating self-supervised learning, clinical diagnosis, volumes is expensive, expensive and time-consuming, motivating self-supervised

Comments: 5 pages, 5 figures, 5 tables

Abstract:Volumetric CT imaging is essential for clinical diagnosis, yet annotating 3D volumes is expensive and time-consuming, motivating self-supervised learning (SSL) from unlabeled data. However, applying SSL to 3D CT remains challenging due to the high memory cost of full-volume transformers and the anisotropic spatial structure of CT data, which is not well captured by conventional masking strategies. We propose NEMESIS, a masked autoencoder (MAE) framework that operates on local 128x128x128 superpatches, enabling memory-efficient training while preserving anatomical detail. NEMESIS introduces three key components: (i) noise-enhanced reconstruction as a pretext task, (ii) Masked Anatomical Transformer Blocks (MATB) that perform dual-masking through parallel plane-wise and axis-wise token removal, and (iii) NEMESIS Tokens (NT) for cross-scale context aggregation. On the BTCV multi-organ classification benchmark, NEMESIS with a frozen backbone and a linear classifier achieves a mean AUROC of 0.9633, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387). Under a low-label regime with only 10% of available annotations, it retains an AUROC of 0.9075, demonstrating strong label efficiency. Furthermore, the superpatch-based design reduces computational cost to 31.0 GFLOPs per forward pass, compared to 985.8 GFLOPs for the full-volume baseline, providing a scalable and robust foundation for 3D medical imaging.
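
To put the reported compute figures in proportion, the short calculation below derives the reduction factor from the two GFLOPs values quoted in the abstract. Only those two numbers come from the source; the variable names are illustrative.

```python
# GFLOPs per forward pass, as quoted in the abstract.
superpatch_gflops = 31.0
full_volume_gflops = 985.8

# How many times cheaper the superpatch design is, and the relative saving.
reduction_factor = full_volume_gflops / superpatch_gflops
relative_saving = 1.0 - superpatch_gflops / full_volume_gflops

summary = f"{reduction_factor:.1f}x fewer GFLOPs ({relative_saving:.1%} saving)"
print(summary)
```

The superpatch design therefore needs roughly one thirty-second of the full-volume baseline's compute per forward pass.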

112. 【2604.01605】F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling

Link: https://arxiv.org/abs/2604.01605

Authors: Morui Zhu, Mohammad Dehghani Tezerjani, Mátyás Szántó, Márton Vaitkus, Song Fu, Qing Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Gaussian Splatting framework, Splatting framework, Gaussian Splatting, Splatting, centralized

Comments: Accepted to the CVPR 2026 SPAR-3D Workshop

Abstract:We present F3DGS, a federated 3D Gaussian Splatting framework for decentralized multi-agent 3D reconstruction. Existing 3DGS pipelines assume centralized access to all observations, which limits their applicability in distributed robotic settings where agents operate independently, and centralized data aggregation may be restricted. Directly extending centralized training to multi-agent systems introduces communication overhead and geometric inconsistency. F3DGS first constructs a shared geometric scaffold by registering locally merged LiDAR point clouds from multiple clients to initialize a global 3DGS model. During federated optimization, Gaussian positions are fixed to preserve geometric alignment, while each client updates only appearance-related attributes, including covariance, opacity, and spherical harmonic coefficients. The server aggregates these updates using visibility-aware aggregation, weighting each client's contribution by how frequently it observed each Gaussian, resolving the partial-observability challenge inherent to multi-agent exploration. To evaluate decentralized reconstruction, we collect a multi-sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements. Experiments show that F3DGS achieves reconstruction quality comparable to centralized training while enabling distributed optimization across agents. The dataset, development kit, and source code will be publicly released.
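
The visibility-aware aggregation step, weighting each client's update by how often it observed a Gaussian, can be sketched as a weighted average. This is a minimal illustration under assumptions: the function name, the data layout, and the handling of unobserved Gaussians are hypothetical and not taken from the paper.

```python
def visibility_weighted_average(client_values, client_counts):
    """Aggregate one attribute of one Gaussian across clients.

    client_values: one vector (list of floats) per client
    client_counts: how many times each client observed this Gaussian
    Returns None when no client observed it, so the caller can keep
    the previous server value (an assumed fallback).
    """
    total = sum(client_counts)
    if total == 0:
        return None
    dim = len(client_values[0])
    return [
        sum(count * value[i] for count, value in zip(client_counts, client_values)) / total
        for i in range(dim)
    ]


# Hypothetical opacity updates for one Gaussian from three clients.
opacity_updates = [[0.2], [0.8], [0.5]]
observation_counts = [30, 10, 0]  # the third client never saw this Gaussian
aggregated_opacity = visibility_weighted_average(opacity_updates, observation_counts)
```

The client that never observed the Gaussian contributes nothing, and the client with 30 observations pulls the result toward its own estimate, which is the behavior the partial-observability argument calls for.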

113. 【2604.01603】Towards Minimal Focal Stack in Shape from Focus

Link: https://arxiv.org/abs/2604.01603

Authors: Khurram Ashfaq, Muhammad Tariq Mahmood

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: focus variations observed, estimates scene structure, depth reconstruction technique, focal stack augmentation, focal stack

Comments: Accepted to CVPRW 2026 (3DMV)

Abstract:Shape from Focus (SFF) is a depth reconstruction technique that estimates scene structure from focus variations observed across a focal stack, that is, a sequence of images captured at different focus settings. A key limitation of SFF methods is their reliance on densely sampled, large focal stacks, which limits their practical applicability. In this study, we propose a focal stack augmentation that enables SFF methods to estimate depth using a reduced stack of just two images, without sacrificing precision. We introduce a simple yet effective physics-based focal stack augmentation that enriches the stack with two auxiliary cues: an all-in-focus (AiF) image estimated from two input images, and Energy-of-Difference (EOD) maps, computed as the energy of differences between the AiF and input images. Furthermore, we propose a deep network that computes a deep focus volume from the augmented focal stacks and iteratively refines depth using convolutional Gated Recurrent Units (ConvGRUs) at multiple scales. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed augmentation benefits existing state-of-the-art SFF models, enabling them to achieve comparable accuracy. The results also show that our approach maintains state-of-the-art performance with a minimal stack size.

114. 【2604.01598】Riemannian and Symplectic Geometry for Hierarchical Text-Driven Place Recognition

Link: https://arxiv.org/abs/2604.01598

Authors: Tianyi Shang, Zhenyu Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural language descriptions, localization enables robots, understand spatial positions, language descriptions, last-mile delivery

Comments: 9 pages

Abstract:Text-to-point-cloud localization enables robots to understand spatial positions through natural language descriptions, which is crucial for human-robot collaboration in applications such as autonomous driving and last-mile delivery. However, existing methods employ pooled global descriptors for similarity retrieval, which suffer from severe information loss and fail to capture discriminative scene structures. To address these issues, we propose SympLoc, a novel coarse-to-fine localization framework with multi-level alignment in the coarse stage. Different from previous methods that rely solely on global descriptors, our coarse stage consists of three complementary alignment levels: 1) Instance-level alignment establishes direct correspondence between individual object instances in point clouds and textual hints through Riemannian self-attention in hyperbolic space; 2) Relation-level alignment explicitly models pairwise spatial relationships between objects using the Information-Symplectic Relation Encoder (ISRE), which reformulates relation features through Fisher-Rao metric and Hamiltonian dynamics for uncertainty-aware geometrically consistent propagation; 3) Global-level alignment synthesizes discriminative global descriptors via the Spectral Manifold Transform (SMT) that extracts structural invariants through graph spectral analysis. This hierarchical alignment strategy progressively captures fine-grained to coarse-grained scene semantics, enabling robust cross-modal retrieval. Extensive experiments on the KITTI360Pose dataset demonstrate that SympLoc achieves a 19% improvement in Top-1 recall@10m compared to existing state-of-the-art approaches.

115. 【2604.01589】Mitigating the ID-OOD Tradeoff in Open-Set Test-Time Adaptation

Link: https://arxiv.org/abs/2604.01589

Authors: Wenjie Zhao, Jia Li, Xin Dong, Yapeng Tian, Yu Xiang, Yunhui Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: coexist with in-distribution, OOD, OOD detection, distribution shifts

Abstract:Open-set test-time adaptation (OSTTA) addresses the challenge of adapting models to new environments where out-of-distribution (OOD) samples coexist with in-distribution (ID) samples affected by distribution shifts. In such settings, covariate shift (for example, changes in weather conditions such as snow) can alter ID samples, reducing model reliability. Consequently, models must not only correctly classify covariate-shifted ID (csID) samples but also effectively reject covariate-shifted OOD (csOOD) samples. Entropy minimization is a common strategy in test-time adaptation to maintain ID performance under distribution shifts, while entropy maximization is widely applied to enhance OOD detection. Several studies have sought to combine these objectives to tackle the challenges of OSTTA. However, the intrinsic conflict between entropy minimization and maximization inevitably leads to a trade-off between csID classification and csOOD detection. In this paper, we first analyze the limitations of entropy maximization in OSTTA and then introduce an angular loss to regulate feature norm magnitudes, along with a feature-norm loss to suppress csOOD logits, thereby improving OOD detection. These objectives form ROSETTA, a Robust Open-SEt Test-Time Adaptation method. Our method achieves strong OOD detection while maintaining high ID classification performance on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C and ImageNet-C. Furthermore, experiments on Cityscapes validate the method's effectiveness in real-world semantic segmentation, and results on the HAC dataset demonstrate its applicability across different open-set TTA setups.
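
The quantity that entropy minimization and entropy maximization both act on is the Shannon entropy of the softmax output. The sketch below computes it for a confident and a uniform prediction to show why the two objectives pull in opposite directions. It illustrates the baseline objectives named in the abstract, not ROSETTA's angular or feature-norm losses, and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical logits for a 4-class problem.
confident_entropy = entropy(softmax([8.0, 0.0, 0.0, 0.0]))  # low: what minimization favors for ID samples
uniform_entropy = entropy(softmax([0.0, 0.0, 0.0, 0.0]))    # maximal: what maximization favors for OOD samples
```

Because a single update step cannot push the same shared features toward both low and high entropy, applying the two objectives to imperfectly separated sample groups produces the trade-off the abstract describes.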

116. 【2604.01586】SHOE: Semantic HOI Open-Vocabulary Evaluation Metric

Link: https://arxiv.org/abs/2604.01586

Authors: Maja Noack, Qinqian Lei, Taipeng Tian, Bihan Dong, Robby T. Tan, Yixin Chen, John Young, Saijun Zhang, Bo Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: building scalable systems, grounded multimodal systems, support grounded multimodal, HOI, human-object relationships

Comments: Accepted to GRAIL-V Workshop at CVPR 2026

Abstract:Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., "lean on couch" vs. "sit on couch"), limiting their applicability for evaluating open-vocabulary predictions that go beyond any predefined set of HOI labels. We introduce SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework that incorporates semantic similarity between predicted and ground-truth HOI labels. SHOE decomposes each HOI prediction into its verb and object components, estimates their semantic similarity using the average of multiple large language models (LLMs), and combines them into a similarity score to evaluate alignment beyond exact string match. This enables a flexible and scalable evaluation of both existing HOI detection methods and open-ended generative models using standard benchmarks such as HICO-DET. Experimental results show that SHOE scores align more closely with human judgments than existing metrics, including LLM-based and embedding-based baselines, achieving an agreement of 85.73% with the average human ratings. Our work underscores the need for semantically grounded HOI evaluation that better mirrors human understanding of interactions. We will release our evaluation metric to the public to facilitate future research.
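
The scoring scheme described above (split each HOI label into verb and object, average the similarity estimates from several models, then combine) can be sketched as follows. The abstract does not state how the two components are combined, so the product used here is an assumption, as are the function name and the example ratings.

```python
def semantic_hoi_score(verb_similarities, object_similarities):
    """Combine averaged verb and object similarities into one score in [0, 1].

    Each argument holds one similarity estimate per rater model.
    Multiplying the two averages is an assumed combination rule.
    """
    if not verb_similarities or not object_similarities:
        raise ValueError("each component needs at least one similarity estimate")
    verb_score = sum(verb_similarities) / len(verb_similarities)
    object_score = sum(object_similarities) / len(object_similarities)
    return verb_score * object_score


# Hypothetical ratings from two models for "lean on couch" vs. "sit on couch":
# the verbs are related but not identical, the objects match exactly.
score = semantic_hoi_score(verb_similarities=[0.8, 0.6],
                           object_similarities=[1.0, 1.0])
```

Under exact string matching this prediction would score 0; a similarity-based score gives it partial credit instead.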

117. 【2604.01581】Satellite-Free Training for Drone-View Geo-Localization

Link: https://arxiv.org/abs/2604.01581

Authors: Tao Liu, Yingzhi Zhang, Kan Ren, Xiaoqi Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Drone-view geo-localization, aims to determine, GPS-denied environments, environments by retrieving, reference gallery

Abstract:Drone-view geo-localization (DVGL) aims to determine the location of drones in GPS-denied environments by retrieving the corresponding geotagged satellite tile from a reference gallery given UAV observations of a location. In many existing formulations, these observations are represented by a single oblique UAV image. In contrast, our satellite-free setting is designed for multi-view UAV sequences, which are used to construct a geometry-normalized UAV-side location representation before cross-view retrieval. Existing approaches rely on satellite imagery during training, either through paired supervision or unsupervised alignment, which limits practical deployment when satellite data are unavailable or restricted. In this paper, we propose a satellite-free training (SFT) framework that converts drone imagery into cross-view compatible representations through three main stages: drone-side 3D scene reconstruction, geometry-based pseudo-orthophoto generation, and satellite-free feature aggregation for retrieval. Specifically, we first reconstruct dense 3D scenes from multi-view drone images using 3D Gaussian splatting and project the reconstructed geometry into pseudo-orthophotos via PCA-guided orthographic projection. This rendering stage operates directly on reconstructed scene geometry without requiring camera parameters at rendering time. Next, we refine these orthophotos with lightweight geometry-guided inpainting to obtain texture-complete drone-side views. Finally, we extract DINOv3 patch features from the generated orthophotos, learn a Fisher vector aggregation model solely from drone data, and reuse it at test time to encode satellite tiles for cross-view retrieval. Experimental results on University-1652 and SUES-200 show that our SFT framework substantially outperforms satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery.

118. 【2604.01579】Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning

Link: https://arxiv.org/abs/2604.01579

Authors: Longfei Huang, Yang Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: received increasing attention, emerging task, received increasing, increasing attention, GAAL

Comments: ICME 26

Abstract:Multimodal tabular-image fusion is an emerging task that has received increasing attention in various domains. However, existing methods may be hindered by gradient conflicts between modalities, misleading the optimization of the unimodal learner. In this paper, we propose a novel Gradient-Aligned Alternating Learning (GAAL) paradigm to address this issue by aligning modality gradients. Specifically, GAAL adopts an alternating unimodal learning and shared classifier to decouple the multimodal gradient and facilitate interaction. Furthermore, we design uncertainty-based cross-modal gradient surgery to selectively align cross-modal gradients, thereby steering the shared parameters to benefit all modalities. As a result, GAAL can provide effective unimodal assistance and help boost the overall fusion performance. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SoTA) tabular-image fusion baselines and test-time tabular missing baselines. The source code is available at this https URL.
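
Gradient surgery in general can be illustrated with a PCGrad-style projection: when two gradients conflict (negative inner product), the conflicting component of one is removed. This sketch shows only that generic projection. GAAL's uncertainty-based selection of which gradients to align is not specified in the abstract and is not modeled here; the function names and example gradients are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_if_conflicting(grad, other):
    """Remove from `grad` its component along `other` when the two conflict.

    Gradients that already agree (non-negative inner product) are returned unchanged.
    """
    inner = dot(grad, other)
    other_sq = dot(other, other)
    if inner >= 0 or other_sq == 0:
        return list(grad)
    scale = inner / other_sq
    return [g - scale * o for g, o in zip(grad, other)]


# Hypothetical gradients of the shared parameters from each modality.
tabular_grad = [1.0, 0.0]
image_grad = [-1.0, 1.0]  # conflicts with the tabular gradient
aligned_image_grad = project_if_conflicting(image_grad, tabular_grad)
```

After projection the image gradient no longer opposes the tabular one, so an update along it does not undo progress made on the other modality.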

119. 【2604.01569】VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Link: https://arxiv.org/abs/2604.01569

Authors: Jiahao Meng, Tan Yue, Qi Xu, Haochen Wang, Zhongwei Ren, Weisong Liu, Yuhao Wang, Renrui Zhang, Yunhai Tong, Haodong Duan

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Recent video multimodal, multimodal large language, large language models, video multimodal large, Recent video

Abstract:Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: no model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5), with most failing to achieve any correct grounded predictions. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.

120. 【2604.01561】ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction

Link: https://arxiv.org/abs/2604.01561

Authors: Yanzhe Liang, Ruijie Zhu, Hanzhi Chang, Zhuoyuan Li, Jiahao Lu, Tianzhu Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: raw video, flow matching, dynamic scene reconstruction, dynamic scene, dynamic

Comments: Project page: https://rosetta-leong.github.io/ReFlow_Page/

Abstract:We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, and therefore often resort to external dense motion guidance, such as pre-computed optical flow, to stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction. Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.

121. 【2604.01553】Cross-Domain Vessel Segmentation via Latent Similarity Mining and Iterative Co-Optimization

Link: https://arxiv.org/abs/2604.01553

Authors: Zhanqiang Guo, Jianjiang Feng, Jie Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Convolutional Neural Networks, critical prerequisite, prerequisite for automated, automated diagnosis, Convolutional Neural

Comments:

Abstract:Retinal vessel segmentation serves as a critical prerequisite for automated diagnosis of retinal pathologies. While recent advances in Convolutional Neural Networks (CNNs) have demonstrated promising performance in this task, significant performance degradation occurs when domain shifts exist between training and testing data. To address these limitations, we propose a novel domain transfer framework that leverages latent vascular similarity across domains and iterative co-optimization of generation and segmentation networks. Specifically, we first pre-train generation networks for source and target domains. Subsequently, the pretrained source-domain conditional diffusion model performs deterministic inversion to establish intermediate latent representations of vascular images, creating domain-agnostic prototypes for target synthesis. Finally, we develop an iterative refinement strategy where segmentation network and generative model undergo mutual optimization through cyclic parameter updating. This co-evolution process enables simultaneous enhancement of cross-domain image synthesis quality and segmentation accuracy. Experiments demonstrate that our framework achieves state-of-the-art performance in cross-domain retinal vessel segmentation, particularly in challenging clinical scenarios with significant modality discrepancies.

122. 【2604.01550】Prototype-Based Low Altitude UAV Semantic Segmentation

Link: https://arxiv.org/abs/2604.01550

Authors: Da Zhang, Gao Junyu, Zhao Zhiyuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex object boundaries, extreme scale variations, imagery presents unique, presents unique challenges, unique challenges due

Comments: Accepted to ICME 2026

Abstract:Semantic segmentation of low-altitude UAV imagery presents unique challenges due to extreme scale variations, complex object boundaries, and limited computational resources on edge devices. Existing transformer-based segmentation methods achieve remarkable performance but incur high computational overhead, while lightweight approaches struggle to capture fine-grained details in high-resolution aerial scenes. To address these limitations, we propose PBSeg, an efficient prototype-based segmentation framework tailored for UAV applications. PBSeg introduces a novel prototype-based cross-attention (PBCA) that exploits feature redundancy to reduce computational complexity while maintaining segmentation quality. The framework incorporates an efficient multi-scale feature extraction module that combines deformable convolutions (DConv) with context-aware modulation (CAM) to capture both local details and global semantics. Experiments on two challenging UAV datasets demonstrate the effectiveness of the proposed approach. PBSeg achieves 71.86% mIoU on UAVid and 80.92% mIoU on UDD6, establishing competitive performance while maintaining computational efficiency. Code is available at this https URL.

123. 【2604.01542】Universal computational thermal imaging overcoming the ghosting effect

Link: https://arxiv.org/abs/2604.01542

Authors: Hongyi Xu, Du Wang, Chenjun Zhao, Jiashuo Chen, Jiale Lin, Liqin Cao, Yanfei Zhong, Yiyuan She, Fanglin Bao

Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)

Keywords: Thermal imaging, computational thermal imaging, night vision, fundamentally hampered, loss of detailed

Comments: 9 pages, 6 figures

Abstract:Thermal imaging is crucial for night vision but fundamentally hampered by the ghosting effect, a loss of detailed texture in cluttered photon streams. While conventional ghosting mitigation has relied on data post-processing, the recent breakthrough in heat-assisted detection and ranging (HADAR) opens a promising frontier for hyperspectral computational thermal imaging that produces night vision with day-like visibility. However, universal anti-ghosting imaging remains elusive, as state-of-the-art HADAR applies only to limited scenes with uniform materials, whereas material non-uniformity is ubiquitous in the real world. Here, we propose a universal computational thermal imaging framework, TAG (thermal anti-ghosting), to address material non-uniformity and overcome ghosting for high-fidelity night vision. TAG takes hyperspectral photon streams for nonparametric texture recovery, enabling our experimental demonstration of unprecedented expression recovery in thus-far-elusive ghostly human faces -- the archetypal, long-recognized ghosting phenomenon. Strikingly, TAG not only universally outperforms HADAR across various scenes, but also reveals the influence of material non-uniformity, shedding light on HADAR's effectiveness boundary. We extensively test facial texture and expression recovery across day and night, and demonstrate, for the first time, thermal 3D topological alignment and mood detection. This work establishes a universal foundation for high-fidelity computational night vision, with potential applications in autonomous navigation, reconnaissance, healthcare, and wildlife monitoring.

124. 【2604.01514】Why Instruction-Based Unlearning Fails in Diffusion Models?

Link: https://arxiv.org/abs/2604.01514

Authors: Zeliang Zhang, Rui Sun, Jiani Liu, Qi Wu, Chenliang Xu

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: models remains unclear, generative models remains, inference time, remains unclear, modifying the behavior

Comments:

125. 【2604.01479】UniRecGen: Unifying Multi-View 3D Reconstruction and Generation

Link: https://arxiv.org/abs/2604.01479

Authors: Zhisheng Huang, Jiahao Chen, Cheng Lin, Chenyu Hu, Hanzhuo Huang, Zhengming Yu, Mengfei Li, Yuheng Liu, Zekai Gu, Zibo Zhao, Yuan Liu, Xin Li, Wenping Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modeling represents, generative plausibility, represents a fundamental, fundamental tension, Sparse-view

Comments:

Abstract:Sparse-view 3D modeling represents a fundamental tension between reconstruction fidelity and generative plausibility. While feed-forward reconstruction excels in efficiency and input alignment, it often lacks the global priors needed for structural completeness. Conversely, diffusion-based generation provides rich geometric details but struggles with multi-view consistency. We present UniRecGen, a unified framework that integrates these two paradigms into a single cooperative system. To overcome inherent conflicts in coordinate spaces, 3D representations, and training objectives, we align both models within a shared canonical space. We employ disentangled cooperative learning, which maintains stable training while enabling seamless collaboration during inference. Specifically, the reconstruction module is adapted to provide canonical geometric anchors, while the diffusion generator leverages latent-augmented conditioning to refine and complete the geometric structure. Experimental results demonstrate that UniRecGen achieves superior fidelity and robustness, outperforming existing methods in creating complete and consistent 3D models from sparse observations.

126. 【2604.01474】Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation

Link: https://arxiv.org/abs/2604.01474

Authors: Yunbei Zhang, Chengyi Cai, Feng Liu, Jihun Hamm

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: target tasks typically, Zeroth-Order Optimization, Optimization, API, tasks typically relies

Comments: CVPR 2026

Abstract:Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS's effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by over 99.99%. AReS thus provides a robust and practical solution for adapting modern closed-box models.
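To make the cost argument in the abstract above concrete, here is a minimal sketch of a generic two-point zeroth-order gradient estimator. It is not the AReS method and not any specific ZOO baseline from the paper; the function names `zoo_gradient` and `count_queries`, the Gaussian direction sampling, and the smoothing parameter `mu` are all illustrative assumptions. The point is only that each gradient estimate needs two black-box queries per random direction, so the number of API calls grows with both the number of directions and the number of optimization steps.

```python
import random


def zoo_gradient(f, x, num_samples=20, mu=1e-3, rng=None):
    """Estimate the gradient of a black-box function f at x from queries only.

    Generic two-point estimator (an assumption for illustration, not the
    paper's algorithm): average the central-difference slope along random
    Gaussian directions.
    """
    rng = rng or random.Random(0)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_samples):
        # Random probing direction.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Two black-box queries per direction.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_samples
    return grad


def count_queries(num_steps, num_samples):
    """Total black-box calls for num_steps updates with this estimator."""
    return 2 * num_samples * num_steps
```

With these hypothetical settings, 100 update steps at 20 directions each already cost 4000 queries, which illustrates why a single priming pass followed by local optimization can reduce API usage so sharply.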

127. 【2604.01466】Efficient Equivariant Transformer for Self-Driving Agent Modeling

Link: https://arxiv.org/abs/2604.01466

Authors: Scott Xu, Dian Chen, Kelvin Wong, Chris Zhang, Kion Fallah, Raquel Urtasun

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Accurately modeling agent, Accurately modeling, modeling agent behaviors, important task, Accurately

Comments: CVPR 2026

Abstract:Accurately modeling agent behaviors is an important task in self-driving. It is also a task with many symmetries, such as equivariance to the order of agents and objects in the scene or equivariance to arbitrary roto-translations of the entire scene as a whole; i.e., SE(2)-equivariance. The transformer architecture is a ubiquitous tool for modeling these symmetries. While standard self-attention is inherently permutation equivariant, explicit pairwise relative positional encodings have been the standard for introducing SE(2)-equivariance. However, this approach introduces an additional cost that is quadratic in the number of agents, limiting its scalability to larger scenes and batch sizes. In this work, we propose DriveGATr, a novel transformer-based architecture for agent modeling that achieves SE(2)-equivariance without the computational cost of existing methods. Inspired by recent advances in geometric deep learning, DriveGATr encodes scene elements as multivectors in the 2D projective geometric algebra $\mathbb{R}^*_{2,0,1}$ and processes them with a stack of equivariant transformer blocks. Crucially, DriveGATr models geometric relationships using standard attention between multivectors, eliminating the need for costly explicit pairwise relative positional encodings. Experiments on the Waymo Open Motion Dataset demonstrate that DriveGATr is comparable to the state-of-the-art in traffic simulation and establishes a superior Pareto front for performance vs computational cost.

128. 【2604.01460】Reinforcing Consistency in Video MLLMs with Structured Rewards

Link: https://arxiv.org/abs/2604.01460

Authors: Yihao Quan, Zeru Shi, Jinman Zhao, Ruixiang Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, achieved remarkable progress, large language models, Multimodal large, video understanding

Comments:

Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.

129. 【2604.01453】Nonlinear Methods for Analyzing Pose in Behavioral Research

Link: https://arxiv.org/abs/2604.01453

Authors: Carter Sale, Margaret C. Macpherson, Gaurav Patil, Kelly Miles, Rachel W. Kallen, Sebastian Wallot, Michael J. Richardson

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Advances in markerless, capture detailed human, markerless pose estimation, standard video, enabling new forms

Comments: 40 pages, 13 figures

Abstract:Advances in markerless pose estimation have made it possible to capture detailed human movement in naturalistic settings using standard video, enabling new forms of behavioral analysis at scale. However, the high dimensionality, noise, and temporal complexity of pose data raise significant challenges for extracting meaningful patterns of coordination and behavioral change. This paper presents a general-purpose analysis pipeline for human pose data, designed to support both linear and nonlinear characterizations of movement across diverse experimental contexts. The pipeline combines principled preprocessing, dimensionality reduction, and recurrence-based time series analysis to quantify the temporal structure of movement dynamics. To illustrate the pipeline's flexibility, we present three case studies spanning facial and full-body movement, 2D and 3D data, and individual versus multi-agent behavior. Together, these examples demonstrate how the same analytic workflow can be adapted to extract theoretically meaningful insights from complex pose time series.
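As a rough illustration of what "recurrence-based time series analysis" can mean for a pose signal, the sketch below builds a binary recurrence matrix for a one-dimensional series and computes its recurrence rate. This is a textbook-style simplification, not the paper's pipeline: the function names, the scalar input (for example, one principal component of a pose sequence), and the fixed `radius` threshold are assumptions made for illustration.

```python
def recurrence_matrix(series, radius):
    """Mark pairs of time points whose values lie within `radius` of each other."""
    n = len(series)
    return [
        [1 if abs(series[i] - series[j]) <= radius else 0 for j in range(n)]
        for i in range(n)
    ]


def recurrence_rate(matrix):
    """Fraction of recurrent pairs, excluding the trivial main diagonal."""
    n = len(matrix)
    if n < 2:
        return 0.0
    off_diagonal = sum(
        matrix[i][j] for i in range(n) for j in range(n) if i != j
    )
    return off_diagonal / (n * (n - 1))
```

For a perfectly alternating series such as `[0, 1, 0, 1]` with a small radius, only the same-phase pairs recur, giving a recurrence rate of one third; a constant series gives a rate of 1.0.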

130. 【2604.01447】Better Rigs, Not Bigger Networks: A Body Model Ablation for Gaussian Avatars

Link: https://arxiv.org/abs/2604.01447

Authors: Derek Austin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Gaussian splatting methods, splatting methods built, methods built atop, remarkable visual fidelity, built atop SMPL

Comments:

Abstract:Recent 3D Gaussian splatting methods built atop SMPL achieve remarkable visual fidelity while continually increasing the complexity of the overall training architecture. We demonstrate that much of this complexity is unnecessary: by replacing SMPL with the Momentum Human Rig (MHR), estimated via SAM-3D-Body, a minimal pipeline with no learned deformations or pose-dependent corrections achieves the highest reported PSNR and competitive or superior LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap. To disentangle pose estimation quality from body model representational capacity, we perform two controlled ablations: translating SAM-3D-Body meshes to SMPL-X, and translating the original dataset's SMPL poses into MHR, both retrained under identical conditions. These ablations confirm that body model expressiveness has been a primary bottleneck in avatar reconstruction, with both mesh representational capacity and pose estimation quality contributing meaningfully to the full pipeline's gains.

131. 【2604.01421】EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation

Link: https://arxiv.org/abs/2604.01421

Authors: Abhishek Saroha, Huajian Zeng, Xingxing Zuo, Daniel Cremers, Xi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: predicting object motion, perception and interaction, predicting object, video is fundamental, fundamental to embodied

Comments: CVPR 2026: [this https URL](https://abhi-rf.github.io/egoflow/)

Abstract:Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on real-world datasets HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79% and showing strong generalization to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.

132. 【2604.01388】LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding

Link: https://arxiv.org/abs/2604.01388

Authors: Fusang Wang, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Fabien Moutarde

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, Recent advancements, scene understanding heavily, register vision-language features, understanding heavily rely

Comments:

Abstract:Recent advancements in open-vocabulary 3D scene understanding heavily rely on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emerging dense alignment properties of foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.

133. 【2604.01383】GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization

Link: https://arxiv.org/abs/2604.01383

Authors: Syed Ahsan Masud Zaidi, Lior Shamir, William Hsu, Scott Dietrich, Talha Zaidi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: American football practice, American football, football practice generates, interest occupies, practice generates video

Comments: 9 pages, 5 figures, accepted to the CVPR 2026 Workshop on Computer Vision in Sports (CVSports); code: [this https URL](https://github.com/AhsanZaidi12/GRAZE)

Abstract:American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within $\pm$ 10 frames on 77.5% of all clips and within $\pm$ 20 frames on 82.7% of all clips. These results show that frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training.
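The "within ±10 frames on 77.5% of all clips" style of result reported above can be read as a tolerance-based accuracy whose denominator is every clip, including clips with no valid output. The sketch below shows one way such a metric might be computed; the function name, the use of `None` for a missing prediction, and the sample frame indices are hypothetical and are not taken from the GRAZE code.

```python
def tolerance_accuracy(predicted, reference, tolerance):
    """Share of all clips whose predicted frame is within `tolerance` frames.

    `predicted` may contain None for clips without a valid output; those
    clips still count in the denominator (an assumption matching the
    "of all clips" wording in the abstract).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(
        1
        for p, r in zip(predicted, reference)
        if p is not None and abs(p - r) <= tolerance
    )
    return hits / len(reference)


# Hypothetical frame indices for four clips; the second has no valid output.
predicted_frames = [100, None, 130, 45]
reference_frames = [105, 80, 100, 45]
```

On this toy data, two of the four clips fall within 10 frames of the reference and three fall within 30 frames, so loosening the tolerance raises the score while missing outputs keep capping it.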

134. 【2604.01371】AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction

Link: https://arxiv.org/abs/2604.01371

Authors: Aiza Maksutova, Lalithkumar Seenivasan, Hao Ding, Jiru Xu, Chenhao Yu, Chenyan Jing, Yiqing Shen, Mathias Unberath

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Image and Video Processing (eess.IV)

Keywords: surgeon-like dexterous control, achieving surgeon-like dexterous, dexterous control, driven primarily, progressed rapidly

Comments:

Abstract:Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.

135. 【2604.01361】IGLOSS: Image Generation for Lidar Open-vocabulary Semantic Segmentation

Link: https://arxiv.org/abs/2604.01361

Authors: Nermin Samet, Gilles Puy, Renaud Marlet

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: automotive lidar data, open-vocabulary semantic segmentation, zero-shot open-vocabulary semantic, Vision Language Models, Vision Foundation Model

Comments:

Abstract:This paper presents a new method for the zero-shot open-vocabulary semantic segmentation (OVSS) of 3D automotive lidar data. To circumvent the recognized image-text modality gap that is intrinsic to approaches based on Vision Language Models (VLMs) such as CLIP, our method relies instead on image generation from text, to create prototype images. Given a 3D network distilled from a 2D Vision Foundation Model (VFM), we then label a point cloud by matching 3D point features with 2D image features of these prototypes. Our method is state-of-the-art for OVSS on nuScenes and SemanticKITTI. Code, pre-trained models, and generated images are available at this https URL.

136. 【2604.01341】Perceptual misalignment of texture representations in convolutional neural networks

Link: https://arxiv.org/abs/2604.01341

Authors: Ludovica de Paolis, Fabio Anselmi, Alessio Ansuini, Eugenio Piasini

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Mathematical modeling, back to Julesz, Julesz intuition, textures traces back, visual textures traces

Comments:

Abstract:Mathematical modeling of visual textures traces back to Julesz's intuition that texture perception in humans is based on local correlations between image features. An influential approach for texture analysis and generation generalizes this notion to linear correlations between the nonlinear features computed by convolutional neural networks (CNNs), compiled into Gram matrices. Given that CNNs are often used as models for the visual system, it is natural to ask whether such "texture representations" spontaneously align with the textures' perceptual content, and in particular whether those CNNs that are regarded as better models for the visual system also possess more human-like texture representations. Here we compare the perceptual content captured by feature correlations computed for a diverse pool of CNNs, and we compare it to the models' perceptual alignment with the mammalian visual system as measured by Brain-Score. Surprisingly, we find that there is no connection between conventional measures of CNN quality as a model of the visual system and its alignment with human texture perception. We conclude that texture perception involves mechanisms that are distinct from those that are commonly modeled using approaches based on CNNs trained on object recognition, possibly depending on the integration of contextual information.
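The Gram-matrix texture representation mentioned above can be sketched in a few lines. The version below assumes each feature channel has already been flattened into a list of spatial activations and normalizes by the number of positions; the function name and that normalization choice are assumptions for illustration, not details from the paper.

```python
def gram_matrix(features):
    """Channel-by-channel correlations of feature maps.

    `features` is a list of C channels, each a flat list of N spatial
    activations. Returns a C x C matrix of dot products averaged over
    positions (the 1/N normalization is an illustrative assumption).
    """
    num_positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / num_positions for fj in features]
        for fi in features
    ]


# Two toy channels over two positions, and the same channels with the
# positions swapped consistently across channels.
original = [[1.0, 2.0], [3.0, 4.0]]
shuffled = [[2.0, 1.0], [4.0, 3.0]]
```

Because the sum runs over positions, `gram_matrix(original)` and `gram_matrix(shuffled)` are identical: the representation keeps which features co-occur but discards where they occur, which is what makes it a texture descriptor rather than a shape descriptor.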

137. 【2604.01339】Regularizing Attention Scores with Bootstrapping

Link: https://arxiv.org/abs/2604.01339

Authors: Neo Christopher Chung, Maxim Laletin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)

Keywords: attention scores, attention, scores, decision-making process, mechanism to weigh

Comments:

Abstract:Vision transformers (ViT) rely on an attention mechanism to weigh input features, and therefore attention scores have naturally been considered as explanations for their decision-making process. However, attention scores are almost always non-zero, resulting in noisy and diffused attention maps and limiting interpretability. Can we quantify uncertainty measures of attention scores and obtain regularized attention scores? To this end, we consider attention scores of ViT in a statistical framework where independent noise would lead to insignificant yet non-zero scores. Leveraging statistical learning techniques, we introduce bootstrapping for attention scores, which generates a baseline distribution of attention scores by resampling input features. Such a bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical images, the proposed Attention Regularization approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real-world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT. Code available: this https URL

Journal reference: Artificial Intelligence and Statistics (AISTATS) 2026
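The general idea of testing attention scores against a resampled baseline, as described in the abstract for this entry, can be sketched as follows. This is a deliberately simplified stand-in, not the authors' procedure: it assumes the baseline scores (obtained in the paper by resampling input features) are already available as a plain list, uses a standard empirical p-value with add-one smoothing, and zeroes out scores that are not significant at a hypothetical level `alpha`. The posterior probabilities mentioned in the abstract are not modeled here.

```python
def empirical_p_value(score, baseline):
    """Share of baseline scores at least as large as `score` (add-one smoothed)."""
    exceed = sum(1 for b in baseline if b >= score)
    return (exceed + 1) / (len(baseline) + 1)


def regularize_scores(scores, baseline, alpha=0.05):
    """Keep scores that stand out from the baseline; set the rest to zero."""
    return [
        s if empirical_p_value(s, baseline) <= alpha else 0.0
        for s in scores
    ]


# Hypothetical baseline distribution: 100 evenly spaced scores in [0, 0.99].
baseline_scores = [i / 100 for i in range(100)]
```

Under this toy baseline, a score of 2.0 exceeds every baseline value and is kept, while a score of 0.5 sits in the middle of the baseline and is set to zero, which is the kind of sparsification the abstract describes.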

138. 【2604.01337】SECURE: Stable Early Collision Understanding via Robust Embeddings in Autonomous Driving

Link: https://arxiv.org/abs/2604.01337

Authors: Wenjing Wang, Wenxuan Wang, Songning Lai

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced accident anticipation, real-world perturbations remains, significantly advanced accident, Stable Early Collision, Understanding Robust Embeddings

Comments: 13 pages, 2 figures

Abstract:While deep learning has significantly advanced accident anticipation, the robustness of these safety-critical systems against real-world perturbations remains a major challenge. We reveal that state-of-the-art models like CRASH, despite their high performance, exhibit significant instability in predictions and latent representations when faced with minor input perturbations, posing serious reliability risks. To address this, we introduce SECURE - Stable Early Collision Understanding Robust Embeddings, a framework that formally defines and enforces model robustness. SECURE is founded on four key attributes: consistency and stability in both prediction space and latent feature space. We propose a principled training methodology that fine-tunes a baseline model using a multi-objective loss, which minimizes divergence from a reference model and penalizes sensitivity to adversarial perturbations. Experiments on DAD and CCD datasets demonstrate that our approach not only significantly enhances robustness against various perturbations but also improves performance on clean data, achieving new state-of-the-art results.

139. 【2604.01322】Human Pose Estimation in Trampoline Gymnastics: Improving Performance Using a New Synthetic Dataset

Link: https://arxiv.org/abs/2604.01322

Authors: Léa Drolet-Roy, Victor Nogues, Sylvain Gaudet, Eve Charbonneau, Mickaël Begon, Lama Séoud

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: art pose estimation, Trampoline gymnastics involves, gymnastics involves extreme, estimation models tend, uncommon viewpoints

Comments:

Abstract:Trampoline gymnastics involves extreme human poses and uncommon viewpoints, on which state-of-the-art pose estimation models tend to under-perform. We demonstrate that this problem can be addressed by fine-tuning a pose estimation model on a dataset of synthetic trampoline poses (STP). STP is generated from motion capture recordings of trampoline routines. We develop a pipeline to fit noisy motion capture data to a parametric human model, then generate multiview realistic images. We use this data to fine-tune a ViTPose model, and test it on real multi-view trampoline images. The resulting model exhibits accuracy improvements in 2D, which translate to improved 3D triangulation. In 2D, we obtain state-of-the-art results on such challenging data, bridging the performance gap between common and extreme poses. In 3D, we reduce the MPJPE by 12.5 mm with our best model, which represents an improvement of 19.6% compared to the pretrained ViTPose model.
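For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and reference joint positions. The sketch below computes it for a single pose and also backs out the baseline error implied by the two reported numbers. The function names are hypothetical, and the implied values are only approximate because the abstract's figures are rounded.

```python
import math


def mpjpe(predicted, reference):
    """Mean Euclidean distance between corresponding joints (same units as input)."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty joint lists of equal length")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(reference)


def implied_baseline(absolute_reduction, relative_improvement):
    """Baseline error implied by an absolute and a relative improvement."""
    return absolute_reduction / relative_improvement


# Figures quoted in the abstract: 12.5 mm reduction, 19.6% improvement.
baseline_mm = implied_baseline(12.5, 0.196)
fine_tuned_mm = baseline_mm - 12.5
```

Taken at face value, the reported numbers imply a pretrained MPJPE of roughly 64 mm and a fine-tuned MPJPE of roughly 51 mm; the abstract itself does not state these values.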

140. 【2604.01318】ViTs for Action Classification in Videos: An Approach to Risky Tackle Detection in American Football Practice Videos

Link: https://arxiv.org/abs/2604.01318

Authors: Syed Ahsan Masud Zaidi,William Hsu,Scott Dietrich

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sports enables timely, enables timely intervention, Early identification, contact sports enables, improves player safety

Comments: 15 pages, 4 figures. Accepted to ICPR 2026 (28th International Conference on Pattern Recognition)

Abstract:Early identification of hazardous actions in contact sports enables timely intervention and improves player safety. We present a method for detecting risky tackles in American football practice videos and introduce a substantially expanded dataset for this task. Our dataset contains 733 single-athlete-dummy tackle clips, each temporally localized around first point contact and labeled with a strike zone component of the standardized Assessment for Tackling Technique (SATT-3), extending prior work that reported 178 annotated videos. Using a vision transformer-based model with imbalance-aware training, we obtain a risky recall of 0.67 and a risky F1 of 0.59 under cross-validation. Relative to the previous baseline on a smaller subset (risky recall of 0.58; risky F1 of 0.56), our approach improves risky recall by more than 8 percentage points on a much larger dataset. These results indicate that vision transformer-based video analysis, coupled with careful handling of class imbalance, can reliably detect rare but safety-critical tackling patterns, offering a practical pathway toward coach-centered injury prevention tools.

141. 【2604.01310】Sparse Spectral LoRA: Routed Experts for Medical VLMs

Link: https://arxiv.org/abs/2604.01310

Authors: Omid Nejati Manzari,Hojat Asgariandehkordi,Taha Koleilat,Yiming Xiao,Hassan Rivaz

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large vision-language models, heterogeneous supervision induces, supervision induces cross-dataset, induces cross-dataset interference, Large vision-language

Comments:

Abstract:Large vision-language models (VLMs) excel on general benchmarks but often lack robustness in medical imaging, where heterogeneous supervision induces cross-dataset interference and sensitivity to data regime (i.e., how the supervisory signals are mixed). In realistic clinical workflows, data and tasks arrive sequentially, so naive continual training further leads to catastrophic forgetting. To address these challenges, we propose MedQwen, a parameter-efficient medical VLM that couples a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE, without changing the base architecture. Concretely, we initialize each expert from non-overlapping singular value decomposition (SVD) segments of the pretrained weight and introduce a residual compensation and scaling scheme to enable stable expert specialization and consistent routing under distribution shift. Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen achieves strong, reliable performance: it approaches full fine-tuning on zero-shot classification with 339$\times$ fewer trainable parameters, and reduces sequential forgetting to $\sim$5\% where strong baselines degrade by 20-50\%.

142. 【2604.01280】Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

Link: https://arxiv.org/abs/2604.01280

Authors: Marco Morini,Sara Sarto,Marcella Cornia,Lorenzo Baraldi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, requires combining visual, combining visual understanding, Multimodal Large Language, external knowledge

Comments: Project Page: [this https URL](https://aimagelab.github.io/LoT/)

143. 【2604.01274】Non-Rigid 3D Shape Correspondences: From Foundations to Open Challenges and Opportunities

Link: https://arxiv.org/abs/2604.01274

Authors: Aleksei Zhuravlev,Lennart Bastian,Dongliang Cao,Nafie El Amrani,Paul Roetzer,Viktoria Ehm,Riccardo Marin,Hiroki Nishizawa,Shigeo Morishima,Christian Theobalt,Nassir Navab,Daniel Cremers,Florian Bernard,Zorah Lähner,Vladislav Golyanik

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: deformed shape instances, computer graphics, statistical modelling, rely on recovering, numerous applications

Comments: 35 pages and 15 figures; Eurographics 2026 STAR; Project page: [this https URL](https://nonrigid-shape-correspondences.github.io)

Abstract:Estimating correspondences between deformed shape instances is a long-standing problem in computer graphics; numerous applications, from texture transfer to statistical modelling, rely on recovering an accurate correspondence map. Many methods have thus been proposed to tackle this challenging problem from varying perspectives, depending on the downstream application. This state-of-the-art report is geared towards researchers, practitioners, and students seeking to understand recent trends and advances in the field. We categorise developments into three paradigms: spectral methods based on functional maps, combinatorial formulations that impose discrete constraints, and deformation-based methods that directly recover a global alignment. Each school of thought offers different advantages and disadvantages, which we discuss throughout the report. Meanwhile, we highlight the latest developments in each area and suggest new potential research directions. Finally, we provide an overview of emerging challenges and opportunities in this growing field, including the recent use of vision foundation models for zero-shot correspondence and the particularly challenging task of matching partial shapes.

144. 【2604.01251】Camouflage-aware Image-Text Retrieval via Expert Collaboration

Link: https://arxiv.org/abs/2604.01251

Authors: Yao Jiang,Zhongkuan Mao,Xuan Wu,Keren Fu,Qijun Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: broad practical implications, Camouflaged scene understanding, significant attention due, attracted significant attention, practical implications

Comments:

Abstract:Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, in this field, robust image-text cross-modal alignment remains under-explored, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task, and formulate a new task dubbed "camouflage-aware image-text retrieval" (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising $\sim$10.5K samples with multi-granularity textual annotations. Benchmark results on CamoIT reveal the underlying challenges of CA-ITR for existing cutting-edge retrieval techniques, which are mainly caused by objects' camouflage properties as well as complex image content. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C²GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves a $\sim$29% overall CA-ITR accuracy boost, surpassing seven representative retrieval models. The dataset and code will be available at this https URL.

145. 【2604.01234】CLPIPS: A Personalized Metric for AI-Generated Image Similarity

Link: https://arxiv.org/abs/2604.01234

Authors: Khoi Trinh,Jay Rothenberger,Scott Seidenberger,Dimitrios Diochnos,Anindya Maiti

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: Iterative prompt refinement, Iterative prompt, image generative models, human, generative models

Comments:

Abstract:Iterative prompt refinement is central to reproducing target images with text-to-image generative models. Previous studies have incorporated image similarity metrics (ISMs) as additional feedback to human users. Existing ISMs such as LPIPS and CLIP provide objective measures of image likeness but often fail to align with human judgments, particularly in context-specific or user-driven tasks. In this paper, we introduce Customized Learned Perceptual Image Patch Similarity (CLPIPS), a customized extension of LPIPS that adapts a metric's notion of similarity directly to human judgments. We aim to explore whether lightweight, human-augmented fine-tuning can meaningfully improve perceptual alignment, positioning similarity metrics as adaptive components for human-in-the-loop workflows with text-to-image tools. We evaluate CLPIPS on a human-subject dataset in which participants iteratively regenerate target images and rank generated outputs by perceived similarity. Using margin ranking loss on human-ranked image pairs, we fine-tune only the LPIPS layer combination weights and assess alignment via Spearman rank correlation and Intraclass Correlation Coefficient. Our results show that CLPIPS achieves stronger correlation and agreement with human judgments than baseline LPIPS. Rather than optimizing absolute metric performance, our work emphasizes improving alignment consistency between metric predictions and human ranks, demonstrating that even limited human-specific fine-tuning can meaningfully enhance perceptual alignment in human-in-the-loop text-to-image workflows.
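
To make the training signal described above concrete, here is a minimal Python sketch of a margin ranking loss applied to an LPIPS-style weighted sum of per-layer distances. It is a simplified illustration, not the authors' implementation: the one-scalar-weight-per-layer form, the margin of 0.1, the function names, and all numbers are assumptions.

```python
def weighted_distance(layer_distances, weights):
    """LPIPS-style score (simplified): weighted sum of per-layer feature distances."""
    return sum(w * d for w, d in zip(weights, layer_distances))

def margin_ranking_loss(d_closer, d_farther, margin=0.1):
    """Zero when the image humans ranked closer scores lower by at least `margin`."""
    return max(0.0, margin + d_closer - d_farther)

# Hypothetical per-layer distances to the target for two generated images.
weights = [0.5, 0.3, 0.2]        # the only trainable parameters in this sketch
human_closer = [0.2, 0.4, 0.1]   # image the participant ranked as more similar
human_farther = [0.3, 0.1, 0.6]  # image the participant ranked as less similar

d_closer = weighted_distance(human_closer, weights)    # 0.10 + 0.12 + 0.02 ≈ 0.24
d_farther = weighted_distance(human_farther, weights)  # 0.15 + 0.03 + 0.12 ≈ 0.30
loss = margin_ranking_loss(d_closer, d_farther)        # 0.1 + 0.24 - 0.30 ≈ 0.04
```

Here the metric already orders the pair the same way as the human, but by less than the margin, so a small positive loss remains and a gradient step would adjust the layer weights to widen the gap.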

146. 【2604.01226】DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation

Link: https://arxiv.org/abs/2604.01226

Authors: Xinhao Huang,Jinke Yu,Wenhao Xu,Zeyi Wen,Ying Zhou,Junzhuo Liu,Junhao Ji,Zulong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)

Keywords: Vision Language Models, Language Models, Vision Language, reconcile high-level structural, high-level structural hierarchy

Comments:

Abstract:While Vision Language Models (VLMs) have shown promise in Design-to-Code generation, they suffer from a "holistic bottleneck," failing to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout distortions or generic placeholders. To bridge this gap, we propose DOne, an end-to-end framework that decouples structure understanding from element rendering. DOne introduces (1) a learned layout segmentation module to decompose complex designs, avoiding the limitations of heuristic cropping; (2) a specialized hybrid element retriever to handle the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code. To rigorously assess performance, we introduce HiFi2Code, a benchmark featuring significantly higher layout complexity than existing datasets. Extensive evaluations on HiFi2Code demonstrate that DOne outperforms existing methods in both high-level visual similarity (e.g., over 10% in GPT Score) and fine-grained element alignment. Human evaluations confirm a threefold productivity gain with higher visual fidelity.

147. 【2401.15855】Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing

Link: https://arxiv.org/abs/2401.15855

Authors: Maofeng Tang,Andrei Cozma,Konstantinos Georgiou,Hairong Qi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: extensive geographic coverage, image analysis due, present unique challenges, Remote sensing, remote sensing MAE

Comments:

Abstract:Remote sensing images present unique challenges to image analysis due to the extensive geographic coverage, hardware limitations, and misaligned multi-scale images. This paper revisits the classical multi-scale representation learning problem but under the general framework of self-supervised learning for remote sensing image understanding. We present Cross-Scale MAE, a self-supervised model built upon the Masked Auto-Encoder (MAE). During pre-training, Cross-Scale MAE employs scale augmentation techniques and enforces cross-scale consistency constraints through both contrastive and generative losses to ensure consistent and meaningful representations well-suited for a wide range of downstream tasks. Further, our implementation leverages the xFormers library to accelerate network pre-training on a single GPU while maintaining the quality of learned representations. Experimental evaluations demonstrate that Cross-Scale MAE exhibits superior performance compared to standard MAE and other state-of-the-art remote sensing MAE methods.

148. 【2604.02105】DenOiS: Dual-Domain Denoising of Observation and Solution in Ultrasound Image Reconstruction

Link: https://arxiv.org/abs/2604.02105

Authors: Can Deniz Bezek,Orcun Goksel

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Medical imaging aims, underlying tissue properties, recover underlying tissue, Medical imaging, tissue properties

Comments:

Abstract:Medical imaging aims to recover underlying tissue properties, using inexact (simplified/linearized) imaging models and often from inaccurate and incomplete measurements. Analytical reconstruction methods rely on hand-crafted regularization, sensitive to noise assumptions and parameter tuning. Among deep learning alternatives, plug-and-play (PnP) approaches learn regularization while incorporating imaging physics during inference, outperforming purely data-driven methods. The performance of all these approaches, however, still strongly depends on measurement quality and imaging model accuracy. In this work, we propose DenOiS, a framework that denoises both input observations and resulting solution in their respective domains. It consists of an observation refinement strategy that corrects degraded measurements while compensating for imaging model simplifications, and a diffusion-based PnP reconstruction approach that remains robust under missing measurements. DenOiS enables generalization to real data from training only in simulations, resulting in high-fidelity image reconstruction with noisy observations and inexact imaging models. We demonstrate this for speed-of-sound imaging as a challenging setting of quantitative ultrasound image reconstruction.

149. 【2604.02074】Country-wide, high-resolution monitoring of forest browning with Sentinel-2

Link: https://arxiv.org/abs/2604.02074

Authors: Samantha Biegel,David Brüggemann,Francesco Grossi,Michele Volpi,Konrad Schindler,Benjamin D. Stocker

Categories: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Natural and anthropogenic, impacting the health, Monitoring forest disturbances, forests worldwide, anthropogenic disturbances

Comments: 9 pages, 7 figures, to be published in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Congress)

Abstract:Natural and anthropogenic disturbances are impacting the health of forests worldwide. Monitoring forest disturbances at scale is important to inform conservation efforts. Here, we present a scalable approach for country-wide mapping of forest greenness anomalies at the 10 m resolution of Sentinel-2. Using relevant ecological and topographical context and an established representation of the vegetation cycle, we learn a predictive quantile model of the normalised difference vegetation index (NDVI) derived from Sentinel-2 data. The resulting expected seasonal cycles are used to detect NDVI anomalies across Switzerland between April 2017 and August 2025. Goodness-of-fit evaluations show that the conditional model explains 65% of the observed variations in the median seasonal cycle. The model consistently benefits from the local context information, particularly during the green-up period. The approach produces coherent spatial anomaly patterns and enables country-wide quantification of forest browning. Case studies with independent reference data from known events illustrate that the model reliably detects different types of disturbances.
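
As background for the index used above, NDVI is computed per pixel as (NIR - Red) / (NIR + Red); for Sentinel-2 the 10 m near-infrared and red bands are B8 and B4. The Python sketch below shows that formula together with a simple below-quantile anomaly check. It is only an illustration of the general idea: the function names, the reflectance values, and the quantile threshold are assumptions, and the abstract does not specify the exact anomaly rule the authors apply.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index from NIR and red reflectance."""
    denom = nir + red
    if denom == 0:
        raise ValueError("NIR + red must be non-zero")
    return (nir - red) / denom

def is_browning_anomaly(observed_ndvi, lower_quantile_ndvi):
    """Flag an observation that falls below the expected lower quantile."""
    return observed_ndvi < lower_quantile_ndvi

# Made-up reflectances for one pixel and a hypothetical predicted lower quantile.
observed = ndvi(nir=0.30, red=0.10)  # (0.30 - 0.10) / (0.30 + 0.10) ≈ 0.5
flagged = is_browning_anomaly(observed, lower_quantile_ndvi=0.65)  # True
```

In a setup like the one described, the lower quantile would come from the learned seasonal model for that location and date rather than from a fixed constant.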

150. 【2604.01857】Enhanced Polarization Locking in VCSELs

Link: https://arxiv.org/abs/2604.01857

Authors: Zifeng Yuan,Dewen Zhang,Lei Shi,Yutong Liu,Aaron Danner

Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)

Keywords: vertical-cavity surface-emitting lasers, surface-emitting lasers, vertical-cavity surface-emitting, widely studied, optical injection locking

Comments:

Abstract:While optical injection locking (OIL) of vertical-cavity surface-emitting lasers (VCSELs) has been widely studied in the past, the polarization dynamics of OIL have received far less attention. Recent studies suggest that polarization locking via OIL could enable novel computational applications such as polarization-encoded Ising computers. However, the inherent polarization preference and limited polarization switchability of VCSELs hinder their use for such purposes. To address these challenges, we fabricate VCSELs with tailored oxide aperture designs and combine these with bias current tuning to study the overall impact on polarization locking. Experimental results demonstrate that this approach reduces the required injection power (to as low as 3.6 μW) and expands the locking range. To investigate the impact of the approach, the spin-flip model (SFM) is used to analyze the effects of amplitude anisotropy and bias current on polarization locking, demonstrating strong coherence with experimental results.