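This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

550 papers were updated today, including:

  • Natural language processing: 81
  • Information retrieval: 19
  • Computer vision: 86

Natural Language Processing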

1. 【2604.25917】Recursive Multi-Agent Systems

Link: https://arxiv.org/abs/2604.25917

Authors: Xiyuan Yang, Jiaru Zou, Rui Pan, Ruizhong Qiu, Pan Lu, Shizhe Diao, Jindong Jiang, Hanghang Tong, Tong Zhang, Markus J. Buehler, Jingrui He, James Zou

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: looped language models, deepen reasoning, looped language, recently emerged, axis by iteratively

Comments: 36 pages. Project Website: [this https URL](https://recursivemas.github.io)

Abstract:Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend such scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion? To this end, we introduce RecursiveMAS, a recursive multi-agent framework that casts the entire system as a unified latent-space recursive computation. RecursiveMAS connects heterogeneous agents as a collaboration loop through the lightweight RecursiveLink module, enabling in-distribution latent thoughts generation and cross-agent latent state transfer. To optimize our framework, we develop an inner-outer loop learning algorithm for iterative whole-system co-optimization through shared gradient-based credit assignment across recursion rounds. Theoretical analyses of runtime complexity and learning dynamics establish that RecursiveMAS is more efficient than standard text-based MAS and maintains stable gradients during recursive training. Empirically, we instantiate RecursiveMAS under 4 representative agent collaboration patterns and evaluate across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. In comparison with advanced single/multi-agent and recursive computation baselines, RecursiveMAS consistently delivers an average accuracy improvement of 8.3%, together with 1.2$\times$-2.4$\times$ end-to-end inference speedup, and 34.6%-75.6% token usage reduction. Code and Data are provided in this https URL.

2. 【2604.25914】DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

Link: https://arxiv.org/abs/2604.25914

Authors: Jinxiang Meng, Shaoping Huang, Fangyu Lei, Jingyu Guo, Haoxiang Liu, Jiahao Su, Sihan Wang, Yao Wang, Enrui Wang, Ye Yang, Hongze Chai, Jinming Lv, Anbang Yu, Huangjing Zhang, Yitong Zhang, Yiming Huang, Zeyao Ma, Shizhu He, Jun Zhao, Kang Liu

Categories: Computation and Language (cs.CL)

Keywords: requires native environmental, native environmental grounding, cross-platform evolution, proactive intent alignment, environmental grounding

Abstract:Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet, existing benchmarks often suffer from code-sandbox confinement, single-language creation-only tasks, and assumption of perfect intent. To bridge these gaps, we introduce DV-World, a benchmark of 260 tasks designed to evaluate DV agents across real-world professional lifecycles. DV-World spans three domains: DV-Sheet for native spreadsheet manipulation including chart and dashboard creation as well as diagnostic repair; DV-Evolution for adapting and restructuring reference visual artifacts to fit new data across diverse programming paradigms; and DV-Interact for proactive intent alignment with a user simulator that mimics real-world ambiguous requirements. Our hybrid evaluation framework integrates Table-value Alignment for numerical precision and MLLM-as-a-Judge with rubrics for semantic-visual assessment. Experiments reveal that state-of-the-art models achieve less than 50% overall performance, exposing critical deficits in handling the complex challenges of real-world data visualization. DV-World provides a realistic testbed to steer development toward the versatile expertise required in enterprise workflows. Our data and code are available at this https URL.

3. 【2604.25905】A paradox of AI fluency

Link: https://arxiv.org/abs/2604.25905

Authors: Christopher Potts, Moritz Sudhof

Categories: Computation and Language (cs.CL)

Keywords: user skill, novices, Abstract, fluent users, complex tasks

Abstract:How much does a user's skill with AI shape what AI actually delivers for them? This question is critical for users, AI product builders, and society at large, but it remains underexplored. Using a richly annotated sample of 27K transcripts from WildChat-4.8M, we show that fluent users take on more complex tasks than novices and adopt a fundamentally different interactional mode: they iterate collaboratively with the AI, refining goals and critically assessing outputs, whereas novices take a passive stance. These differences lead to a paradox of AI fluency: fluent users experience more failures than novices -- but their failures tend to be visible (a direct consequence of their engagement), they are more likely to lead to partial recovery, and they occur alongside greater success on complex tasks. Novices, by contrast, more often experience invisible failures: conversations that appear to end successfully but in fact miss the mark. Taken together, these results reframe what success with AI depends on. Individuals should adopt a stance of active engagement rather than passive acceptance. AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall. Our code and data are available at this https URL

4. 【2604.25902】Toward a Functional Geometric Algebra for Natural Language Semantics

Link: https://arxiv.org/abs/2604.25902

Authors: James Pustejovsky

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: natural language semantics, conventional linear algebra, approaches to natural, natural language, built almost exclusively

Comments: 43 pages. Keywords: geometric algebra, Clifford algebra, compositional semantics, natural language semantics, type coercion, multivector representations, graded type system, Generative Lexicon, neural language models, distributional semantics

Abstract:Distributional and neural approaches to natural language semantics have been built almost exclusively on conventional linear algebra: vectors, matrices, tensors, and the operations that accompany them. These methods have achieved remarkable empirical success, yet they face persistent structural limitations in compositional semantics, type sensitivity, and interpretability. I argue in this paper that geometric algebra (GA) -- specifically, Clifford algebras -- provides a mathematically superior foundation for semantic representation, and that a Functional Geometric Algebra (FGA) framework extends GA toward a typed, compositional semantics capable of supporting inference, transformation, and interpretability while retaining full compatibility with distributional learning and modern neural architectures. I develop the formal foundations, identify three core capabilities that GA provides and linear algebra does not, present a detailed worked example illustrating operator-level semantic contrasts, and show how GA-based operations already implicit in current transformer architectures can be made explicit and extended. The central claim is not merely increased dimensionality but increased structural organization: GA expands an $n$-dimensional embedding space into a $2^n$ multivector algebra where base semantic concepts and their higher-order interactions are represented within a single, principled algebraic framework.
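
As a small worked illustration of the $2^n$ count mentioned in the abstract (a standard fact about Clifford algebras, not an example taken from the paper): for $n = 3$ with basis vectors $e_1, e_2, e_3$, the multivector basis has one element per subset of the basis vectors, grouped by grade.

```latex
\[
\underbrace{1}_{\text{grade 0}},\quad
\underbrace{e_1,\ e_2,\ e_3}_{\text{grade 1}},\quad
\underbrace{e_1 e_2,\ e_1 e_3,\ e_2 e_3}_{\text{grade 2}},\quad
\underbrace{e_1 e_2 e_3}_{\text{grade 3}}
\qquad\Longrightarrow\qquad
\sum_{k=0}^{3} \binom{3}{k} = 1 + 3 + 3 + 1 = 8 = 2^3 .
\]
```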

5. 【2604.25895】Three Models of RLHF Annotation: Extension, Evidence, and Authority

Link: https://arxiv.org/abs/2604.25895

Authors: Steve Coyne

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: prominently Reinforcement Learning, Reinforcement Learning, Human Feedback, Preference-based alignment methods, shape large language

Comments: 17 pages. Accepted to ACM FAccT '26, June 25-28, Montreal

Abstract:Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that come from unintentionally or intentionally conflating them, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor each pipeline to the model most appropriate for that dimension, rather than seeking a single unified pipeline.

6. 【2604.25866】From Syntax to Emotion: A Mechanistic Analysis of Emotion Inference in LLMs

Link: https://arxiv.org/abs/2604.25866

Authors: Bangzhao Shu, Arinjay Singh, Mai ElSherief

Categories: Computation and Language (cs.CL)

Keywords: sensitive human-AI applications, emotionally sensitive human-AI, Large language models, Large language, emotion recognition

Comments: 18 pages including appendix

Abstract:Large language models (LLMs) are increasingly used in emotionally sensitive human-AI applications, yet little is known about how emotion recognition is internally represented. In this work, we investigate the internal mechanisms of emotion recognition in LLMs using sparse autoencoders (SAEs). By analyzing sparse feature activations across layers, we identify a consistent three-phase information flow, in which emotion-related features emerge only in the final phase. We further show that emotion representations comprise both shared features across emotions and emotion-specific features. Using phase-stratified causal tracing, we identify a small set of features that strongly influence emotion predictions, and show that both their number and causal impact vary across emotions; in particular, Disgust is more weakly and diffusely represented than other emotions. Finally, we propose an interpretable and data-efficient causal feature steering method that significantly improves emotion recognition performance across multiple models while largely preserving language modeling ability, and demonstrate that these improvements generalize across multiple emotion recognition datasets. Overall, our findings provide a systematic analysis of the internal mechanisms underlying emotion recognition in LLMs and introduce an efficient, interpretable, and controllable approach for improving model performance.

7. 【2604.25860】Luminol-AIDetect: Fast Zero-shot Machine-Generated Text Detection based on Perplexity under Text Shuffling

Link: https://arxiv.org/abs/2604.25860

Authors: Lucio La Cava, Andrea Tagarelli

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: requires identifying structurally, identifying structurally invariant, structurally invariant signals, detection requires identifying, model-specific fingerprints

Abstract:Machine-generated text (MGT) detection requires identifying structurally invariant signals across generation models, rather than relying on model-specific fingerprints. In this respect, we hypothesize that while large language models excel at local semantic consistency, their autoregressive nature results in a specific kind of structural fragility compared to human writing. We propose Luminol-AIDetect, a novel, zero-shot statistical approach that exposes this fragility through coherence disruption. By applying a simple randomized text-shuffling procedure, we demonstrate that the resulting shift in perplexity serves as a principled, model-agnostic discriminant, as MGT displays a characteristic dispersion in perplexity-under-shuffling that differs markedly from the more stable structural variability of human-written text. Luminol-AIDetect leverages this distinction to inform its decision process, where a handful of perplexity-based scalar features are extracted from an input text and its shuffled version, then detection is performed via density estimation and ensemble-based prediction. Evaluated across 8 content domains, 11 adversarial attack types, and 18 languages, Luminol-AIDetect demonstrates state-of-the-art performance, with gains up to 17x lower FPR while being cheaper than prior methods.
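
The perplexity-under-shuffling idea can be illustrated with a toy sketch. This is not the authors' implementation: an add-one-smoothed bigram model stands in for an LLM scorer, and the function names, tiny corpus, and shuffle count are all assumptions made for illustration. The feature extraction, density estimation, and ensemble steps named in the abstract are not shown.

```python
import math
import random
from collections import Counter

def train_bigram(corpus_tokens):
    """Fit bigram counts for add-one smoothing; a toy stand-in for an LLM scorer."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(unigrams) + 1  # one extra slot for unseen tokens
    return unigrams, bigrams, vocab_size

def perplexity(tokens, model):
    """Perplexity of a token sequence under the smoothed bigram model."""
    unigrams, bigrams, vocab_size = model
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    n_transitions = max(len(tokens) - 1, 1)
    return math.exp(-log_prob / n_transitions)

def shuffle_shift(tokens, model, n_shuffles=20, seed=0):
    """Return (original perplexity, mean and std of perplexity over shuffles)."""
    rng = random.Random(seed)
    base = perplexity(tokens, model)
    shuffled = []
    for _ in range(n_shuffles):
        perm = tokens[:]
        rng.shuffle(perm)  # randomized token-order disruption
        shuffled.append(perplexity(perm, model))
    mean = sum(shuffled) / len(shuffled)
    var = sum((x - mean) ** 2 for x in shuffled) / len(shuffled)
    return base, mean, math.sqrt(var)

# Hypothetical toy data: a reference corpus and one text to score.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
model = train_bigram(corpus)
base, mean_shuffled, std_shuffled = shuffle_shift("the cat sat on the mat".split(), model)
```

In a detector along these lines, scalar features such as the gap between `base` and `mean_shuffled`, or the dispersion `std_shuffled`, would presumably feed the downstream decision step.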

8. 【2604.25853】G-Loss: Graph-Guided Fine-Tuning of Language Models

Link: https://arxiv.org/abs/2604.25853

Authors: Sharma Aditya, Agarwal Vinti, Kumar Rajesh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: supervised contrastive losses, fine-tuning pre-trained language, Traditional loss functions, pre-trained language models, global semantic structure

Comments: 20 pages, Learning on Graphs (LoG2025)

Abstract:Traditional loss functions, including cross-entropy, contrastive, triplet, and supervised contrastive losses, used for fine-tuning pre-trained language models such as BERT, operate only within local neighborhoods and fail to account for the global semantic structure. We present G-Loss, a graph-guided loss function that incorporates semi-supervised label propagation to use structural relationships within the embedding manifold. G-Loss builds a document-similarity graph that captures global semantic relationships, thereby guiding the model to learn more discriminative and robust embeddings. We evaluate G-Loss on five benchmark datasets covering key downstream classification tasks: MR (sentiment analysis), R8 and R52 (topic categorization), Ohsumed (medical document classification), and 20NG (news categorization). In the majority of experimental setups, G-Loss converges faster and produces semantically coherent embedding spaces, resulting in higher classification accuracy than models fine-tuned with traditional loss functions.
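
The abstract does not say how G-Loss turns label propagation into a training loss, so the sketch below shows only a generic label-propagation iteration on a similarity graph. The Gaussian kernel, row normalisation, `alpha`, and the toy 2-D "embeddings" are all assumptions for illustration, not details from the paper.

```python
import math

def similarity_graph(points, sigma=1.0):
    """Gaussian-kernel similarity matrix with zero diagonal, row-normalised."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    for i in range(n):
        row_sum = sum(w[i])
        w[i] = [v / row_sum for v in w[i]]
    return w

def propagate_labels(s, labels, n_classes, alpha=0.9, n_iters=50):
    """Iterate F <- alpha * S F + (1 - alpha) * Y; unlabeled rows of Y are zero."""
    n = len(s)
    y = [[0.0] * n_classes for _ in range(n)]
    for i, lab in enumerate(labels):
        if lab is not None:
            y[i][lab] = 1.0
    f = [row[:] for row in y]
    for _ in range(n_iters):
        f = [
            [alpha * sum(s[i][j] * f[j][c] for j in range(n)) + (1 - alpha) * y[i][c]
             for c in range(n_classes)]
            for i in range(n)
        ]
    # Each node takes the class with the largest propagated score.
    return [max(range(n_classes), key=lambda c: row[c]) for row in f]

# Hypothetical toy embeddings: two well-separated clusters, one label each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 4.8)]
labels = [0, None, None, 1, None, None]
predicted = propagate_labels(similarity_graph(points), labels, n_classes=2)
```

With one labeled point per cluster, the unlabeled points inherit the label of their cluster, which is the kind of global structure a purely pairwise loss does not see.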

9. 【2604.25850】Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Link: https://arxiv.org/abs/2604.25850

Authors: Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui

Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: interact with repositories, execution environments, central determinant, Agentic Harness Engineering, harness

Abstract:Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token trajectories, and edits whose effect is hard to attribute to the next round's outcomes. We introduce Agentic Harness Engineering (AHE), a framework that automates harness-level evolution by instrumenting the three stages of any engineering loop (component editing, trajectory inspection, and decision making) with matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it tops aggregate success at 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields +5.1 to +10.1pp cross-family gains across three alternate model families, indicating the evolved components encode general engineering experience rather than benchmark-specific tuning. These results position observability-driven evolution as a practical pathway to keep coding-agent harnesses continually improving.

10. 【2604.25840】PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators

Link: https://arxiv.org/abs/2604.25840

Authors: Nguyen Khoi Hoang, Shuhaib Mehri, Tse-An Hsu, Yi-Jyun Sun, Quynh Xuan Nguyen Truong, Khoa D Doan, Dilek Hakkani-Tür

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: mental health training, providing scalable exposure, sensitive patient interactions, gaining traction, traction in mental

Abstract:Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions. Simulating depressed patients is particularly challenging, as safety constraints and high patient variability complicate simulations and underscore the need for simulators that capture diverse and realistic patient behaviors. However, existing evaluations heavily rely on LLM-judges with poorly specified prompts and do not assess behavioral diversity. We introduce PSI-Bench, an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior across turn-, dialogue-, and population-level dimensions. Using PSI-Bench, we benchmark seven LLMs across two simulator frameworks and find that simulators produce overly long, lexically diverse responses, show reduced variability, resolve emotions too quickly, and follow a uniform negative-to-positive trajectory. We also show that the simulation framework has a larger impact on fidelity than the model scale. Results from a human study demonstrate that our benchmark is strongly aligned with expert judgments. Our work reveals key limitations of current depression patient simulators and provides an interpretable, extensible benchmark to guide future simulator design and evaluation.

11. 【2604.25806】MAIC-UI: Making Interactive Courseware with Generative UI

Link: https://arxiv.org/abs/2604.25806

Authors: Shangqing Tu, Yanjia Li, Keyu Chen, Sichen Zhang, Jifan Yu, Daniel Zhang-Li, Lei Hou, Juanzi Li, Yu Zhang, Huiqin Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Creating interactive STEM, traditionally requires HTML, JavaScript expertise, Creating interactive, leaving barriers

Comments: You can try our demo at [this https URL](https://open.maic.chat/)

Abstract:Creating interactive STEM courseware traditionally requires HTML/CSS/JavaScript expertise, leaving barriers for educators. While generative AI can produce HTML codes, existing tools generate static presentations rather than interactive simulations, struggle with long documents, and lack pedagogical accuracy mechanisms. Furthermore, full regeneration for modifications requires 200--600 seconds, disrupting creative flow. We present MAIC-UI, a zero-code authoring system that enables educators to create and rapidly edit interactive courseware from textbooks, PPTs, and PDFs. MAIC-UI employs: (1) structured knowledge analysis with multi-modal understanding to ensure pedagogical rigor; (2) a two-stage generate-verify-optimize pipeline separating content alignment from visual refinement; and (3) Click-to-Locate editing with Unified Diff-based incremental generation achieving sub-10-second iteration cycles. A controlled lab study with 40 participants shows MAIC-UI reduces editing iterations (4.9 vs. 7.0) and significantly improves learnability and controllability compared to direct Text-to-HTML generation. A three-month classroom deployment with 53 high school students demonstrates that MAIC-UI fosters learning agency and reduces outcome disparities -- the pilot class achieved 9.21-point gains in STEM subjects compared to -2.32 points in control classes. Our code is available at this https URL.

12. 【2604.25800】Barriers to Universal Reasoning With Transformers (And How to Overcome Them)

Link: https://arxiv.org/abs/2604.25800

Authors: Oliver Kraus, Yash Sarrof, Yuekun Yao, Alexander Koller, Michael Hahn

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: improve Transformers' performance, Transformers' performance, empirically improve Transformers', theoretically increase, Turing completeness

Comments: Oliver Kraus and Yash Sarrof contributed equally as first authors. Alexander Koller and Michael Hahn are co-senior authors. Code: [this https URL](https://github.com/coli-saar/BarriersToUniversalReasoningWTransformers)

Abstract:Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied. We use recent theoretical frameworks for Transformer length generalization and find that -- under standard positional encodings and a finite alphabet -- Transformers with CoT cannot solve problems beyond $TC^0$, i.e. the expressivity benefits do not hold under the stricter requirement of length-generalizable learnability. However, if we allow the vocabulary to grow with problem size, we attain a length-generalizable simulation of Turing machines where the CoT trace length is linear in the simulated runtime up to a constant. Our construction overcomes two core obstacles to reliable length generalization: repeated copying and last-occurrence retrieval. We assign each tape position a unique signpost token, and log only value changes to enable recovery of the current tape symbol through counts, circumventing both barriers. Further, we empirically show that the use of such signpost tokens and value change encodings provides actionable guidance to improve length generalization on hard problems.

13. 【2604.25783】Subliminal Steering: Stronger Encoding of Hidden Signals

Link: https://arxiv.org/abs/2604.25783

Authors: George Morgulis, John Hewitt

Categories: Computation and Language (cs.CL)

Keywords: student language model, language model inheriting, innocuous data generated, biased teacher model, seemingly innocuous data

Abstract:Subliminal learning describes a student language model inheriting a behavioral bias by fine-tuning on seemingly innocuous data generated by a biased teacher model. Prior work has begun to characterize this phenomenon but leaves open questions about the scope of signals it can transfer, the mechanisms that explain it, and the precision with which a bias can be encoded by seemingly unrelated data. We tackle all three problems by introducing subliminal steering, a variant of subliminal learning in which the teacher's bias is implemented not via a system prompt, as in prior work, but through a steering vector trained to maximize the likelihood of a set of target samples. First, we show that subliminal steering transfers complex multi-word biases, whereas prior work focused on single-word preferences, demonstrating a large scope of subliminally transferrable signals. Second, we provide mechanistic evidence that subliminal learning transfers not only the target behavioral bias, but also the steering vector itself, localized to the layers at which the teacher was steered. Finally, we show that the bias is encoded with surprising precision. We train a new steering vector directly on the subliminally-laden dataset and find that it attains high cosine similarity with the original vector.

14. 【2604.25776】Unrequited Emotions: Investigating the Gaps in Motivation and Practice in Speech Emotion Recognition Research

Link: https://arxiv.org/abs/2604.25776

Authors: Taryn Wong, Zeerak Talat, Hanan Aldarmaki, Anjalie Field

Categories: Computation and Language (cs.CL)

Keywords: Critical analyses, potential downstream impacts, emotion recognition technology, urging researchers, technology have raised

Comments: Accepted to the Workshop on Computational Affective Science (CAS) at LREC 2026

Abstract:Critical analyses of emotion recognition technology have raised ethical concerns around task validity and potential downstream impacts, urging researchers to ensure alignment between their stated motivations and practice. However, these discussions have not adequately influenced or drawn from research on speech emotion recognition (SER). We address this gap by conducting a systematic survey of SER research to uncover what stated motivations drive this work and if they align with the datasets and emotions studied. We find that while SER research identifies appealing goals, such as well-situated voice-activated systems or healthcare applications, commonly-used datasets do not reflect these proposed deployment contexts, thus presenting a gap between motivations and research practices. We argue that such gaps engender ethical concerns, and that SER research should reassert itself with concrete use-cases to prevent misinterpretations, misuse, and downstream harms.

15. 【2604.25774】CGU-ILALab at FoodBench-QA 2026: Comparing Traditional and LLM-based Approaches for Recipe Nutrient Estimation

Link: https://arxiv.org/abs/2604.25774

Authors: Wei-Chun Chen, Yu-Xuan Chen, I-Fang Chung, Ying-Jia Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: variable quantity expressions, unstructured recipe text, highly variable quantity, Accurate nutrient estimation, quantity expressions

Comments: Accepted by the Third Workshop on Patient-oriented Language Processing (CL4Health) at LREC 2026

Abstract:Accurate nutrient estimation from unstructured recipe text is an important yet challenging problem in dietary monitoring, due to ambiguous ingredient terminology and highly variable quantity expressions. We systematically evaluate models spanning a wide range of representational capacity, from lexical matching methods (TF-IDF with Ridge Regression), to deep semantic encoders (DeBERTa-v3), to generative reasoning with large language models (LLMs). Under the strict tolerance criteria defined by EU Regulation 1169/2011, our empirical results reveal a clear trade-off between predictive accuracy and computational efficiency. The TF-IDF baseline achieves moderate nutrient estimation performance with near-instantaneous inference, whereas the DeBERTa-v3 encoder performs poorly under task-specific data scarcity. In contrast, few-shot LLM inference (e.g., Gemini 2.5 Flash) and a hybrid LLM refinement pipeline (TF-IDF combined with Gemini 2.5 Flash) deliver the highest validation accuracy across all nutrient categories. These improvements likely arise from the ability of LLMs to leverage pre-trained world knowledge to resolve ambiguous terminology and normalize non-standard units, which remain difficult for purely lexical approaches. However, these gains come at the cost of substantially higher inference latency, highlighting a practical deployment trade-off between real-time efficiency and nutritional precision in dietary monitoring systems.

16. 【2604.25720】Toward Multimodal Conversational AI for Age-Related Macular Degeneration

Link: https://arxiv.org/abs/2604.25720

Authors: Ran Gu, Benjamin Hou, Mélanie Hébert, Asmita Indurkar, Yifan Yang, Emily Y. Chew, Tiarnán D. L. Keenan, Zhiyong Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: retinal disease detection, deep learning models, systems produce static, disease detection, deep learning

Comments: 38 pages, 4 figures

Abstract:Despite strong performance of deep learning models in retinal disease detection, most systems produce static predictions without clinical reasoning or interactive explanation. Recent advances in multimodal large language models (MLLMs) integrate diagnostic predictions with clinically meaningful dialogue to support clinical decision-making and patient counseling. In this study, OcularChat, an MLLM, was fine-tuned from Qwen2.5-VL using simulated patient-physician dialogues to diagnose age-related macular degeneration (AMD) through visual question answering on color fundus photographs (CFPs). A total of 705,850 simulated dialogues paired with 46,167 CFPs were generated to train OcularChat to identify key AMD features and produce reasoned predictions. OcularChat demonstrated strong classification performance in AREDS, achieving accuracies of 0.954, 0.849, and 0.678 for the three diagnostic tasks: advanced AMD, pigmentary abnormalities, and drusen size, significantly outperforming existing MLLMs. On AREDS2, OcularChat remained the top-performing method on all tasks. Across three independent ophthalmologist graders, OcularChat achieved higher mean scores than a strong baseline model for advanced AMD (3.503 vs. 2.833), pigmentary abnormalities (3.272 vs. 2.828), drusen size (3.064 vs. 2.433), and overall impression (2.978 vs. 2.464) on a 5-point clinical grading rubric. Beyond strong objective performance in AMD severity classification, OcularChat demonstrated the ability to provide diagnostic reasoning, clinically relevant explanations, and interactive dialogue, with high performance in subjective ophthalmologist evaluation. These findings suggest that MLLMs may enable accurate, interpretable, and clinically useful image-based diagnosis and classification of AMD.

17. 【2604.25716】Cross-Lingual Jailbreak Detection via Semantic Codebooks

Link: https://arxiv.org/abs/2604.25716

Authors: Shirin Alanova, Bogdan Minko, Sabrina Sadiekh, Evgeniy Kokuykin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: remain predominantly English-centric, predominantly English-centric, creating systematic vulnerabilities, remain predominantly, mechanisms for large

Abstract:Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates, exposing a structural cross-lingual security gap. We investigate whether such attacks can be mitigated through language-agnostic semantic similarity without retraining or language-specific adaptation. Our approach compares multilingual query embeddings against a fixed English codebook of jailbreak prompts, operating as a training-free external guardrail for black-box LLMs. We conduct a systematic evaluation across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5). Our results reveal two distinct regimes of cross-lingual transfer. On curated benchmarks containing canonical jailbreak templates, semantic similarity generalizes reliably across languages, achieving near-perfect separability (AUC up to 0.99) and substantial reductions in absolute attack success rates under strict low-false-positive constraints. However, under distribution shift - on behaviorally diverse and heterogeneous unsafe benchmarks - separability degrades markedly (AUC $\approx$ 0.60-0.70), and recall in the security-critical low-FPR regime drops across all embedding models.
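
The codebook comparison described above can be sketched as a nearest-neighbour cosine check. This is a minimal illustration under stated assumptions: the three-dimensional vectors are made up, the threshold is arbitrary, and in practice the embeddings would come from a multilingual embedding model that is not shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def max_codebook_similarity(query_vec, codebook):
    """Highest similarity between the query and any codebook entry."""
    return max(cosine(query_vec, entry) for entry in codebook)

def is_flagged(query_vec, codebook, threshold):
    """Flag the query when it is close enough to a known jailbreak prompt."""
    return max_codebook_similarity(query_vec, codebook) >= threshold

# Hypothetical embeddings: two codebook entries and two incoming queries.
codebook = [(0.9, 0.1, 0.0), (0.8, 0.0, 0.2)]
suspicious_query = (0.85, 0.05, 0.1)
benign_query = (0.0, 0.2, 0.9)

flag_suspicious = is_flagged(suspicious_query, codebook, threshold=0.9)
flag_benign = is_flagged(benign_query, codebook, threshold=0.9)
```

Because the check runs outside the target model, the threshold is the main tuning knob; the abstract's low-false-positive setting would correspond to choosing it conservatively.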

18. 【2604.25702】Backtranslation Augmented Direct Preference Optimization for Neural Machine Translation

链接https://arxiv.org/abs/2604.25702

作者:Mehrdad Ghassabi,Spehr Rajabi,Hamidreza Baradaran Kashani,Sadra Hakim,Mahshid Keivandarian

类目:Computation and Language (cs.CL)

关键词:Contemporary neural machine, supervised parallel data, Contemporary neural, neural machine translation, parallel data

备注: 5 pages, 2 figures

点击查看摘要

Abstract:Contemporary neural machine translation (NMT) systems are almost exclusively built by training on supervised parallel data. Despite the tremendous progress achieved, these systems still exhibit persistent translation errors. This paper proposes that a post-training paradigm based on reinforcement learning (RL) can effectively rectify such mistakes. We introduce a novel framework that requires only a general text corpus and an expert translator, which can be either a human or an AI system, to provide iterative feedback. In our experiments, we focus specifically on English-to-German translation as a representative high-resource language pair. Crucially, we implement this RL-based post-training using Direct Preference Optimization (DPO). Applying our DPO-driven framework to the gemma3-1b model yields a significant improvement in translation quality, elevating its COMET score from 0.703 to 0.747 on the English-to-German task. The results demonstrate that DPO offers an efficient and stable pathway for enhancing pre-trained NMT models through preference-based post-training.
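For readers unfamiliar with DPO, the standard per-pair objective can be sketched as below. The abstract does not say how preference pairs are built or which hyperparameters are used, so the `beta` value and the example log-probabilities are assumptions chosen only for illustration.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full translation under
    the trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen - ref_chosen
    rejected_margin = policy_rejected - ref_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Hypothetical log-probabilities: the policy already prefers the chosen output.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss equals `log 2` when the policy and reference agree, and it shrinks as the policy raises the chosen translation relative to the rejected one.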

19. 【2604.25676】CORAL: Adaptive Retrieval Loop for Culturally-Aligned Multilingual RAG

链接https://arxiv.org/abs/2604.25676

作者:Nayeon Lee,Jiwoo Song,Byeongcheol Kang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Multilingual retrieval-augmented generation, multilingual embedding vector, embedding vector representations, Multilingual retrieval-augmented, multilingual embedding

备注: 23 pages, 9 figures. Accepted at ACL 2026 (Findings)

点击查看摘要

Abstract:Multilingual retrieval-augmented generation (mRAG) is often implemented within a fixed retrieval space, typically via query or document translation or multilingual embedding vector representations. However, this approach may be inadequate for culturally grounded queries, in which retrieval-condition misalignment may occur. Even strong retrievers and generators may struggle to produce culturally relevant answers when sourcing evidence from inappropriate linguistic or regional contexts. To this end, we introduce CORAL (COntext-aware Retrieval with Agentic Loop), an adaptive retrieval methodology for mRAG that enables iterative refinement of both the retrieval space (corpora) and the retrieval probe (query) based on the quality of the evidence. The overall process includes: (1) selecting corpora, (2) retrieving documents, (3) critiquing evidence for relevance and cultural alignment, and (4) checking sufficiency. If the retrieved documents are insufficient to answer the query correctly, the system (5) reselects corpora and rewrites the query. Across two cultural QA benchmarks, CORAL achieves up to a 3.58%p accuracy improvement on low-resource languages relative to the strongest baselines.
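The five numbered steps can be read as a simple control loop. The sketch below is a hypothetical rendering of that loop: every callable (`select_corpora`, `retrieve`, `critique`, `is_sufficient`, `rewrite`), the `max_rounds` cap, and the toy corpora are assumptions standing in for the LLM and retriever calls a real system would make.

```python
def adaptive_retrieval(query, select_corpora, retrieve, critique,
                       is_sufficient, rewrite, max_rounds=3):
    """Refine corpora and query until the evidence is judged sufficient."""
    corpora = select_corpora(query, previous=None)            # (1) select corpora
    evidence = []
    for _ in range(max_rounds):
        documents = retrieve(query, corpora)                  # (2) retrieve
        evidence = [d for d in documents if critique(query, d)]  # (3) critique
        if is_sufficient(query, evidence):                    # (4) check sufficiency
            break
        corpora = select_corpora(query, previous=corpora)     # (5) reselect corpora
        query = rewrite(query, evidence)                      # (5) rewrite query
    return query, corpora, evidence

# Toy stand-ins: a generic corpus and a regional corpus.
TOY_CORPORA = {"generic": ["generic overview"], "regional": ["regional detail"]}

final_query, final_corpora, final_evidence = adaptive_retrieval(
    "toy question",
    select_corpora=lambda q, previous: ["generic"] if previous is None else ["regional"],
    retrieve=lambda q, corpora: [d for name in corpora for d in TOY_CORPORA[name]],
    critique=lambda q, doc: doc.startswith("regional"),
    is_sufficient=lambda q, evidence: len(evidence) > 0,
    rewrite=lambda q, evidence: q + " (regional context)",
)
print(final_query, final_corpora, final_evidence)
```

Passing the components in as callables keeps the loop independent of any particular retriever or model.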

20. 【2604.25674】Modeling Human-Like Color Naming Behavior in Context

链接https://arxiv.org/abs/2604.25674

作者:Yuqing Zhang,Ecesu Ürker,Tessa Verhoef,Gemma Boleda,Arianna Bisazza

类目:Computation and Language (cs.CL)

关键词:Modeling the emergence, interacting neural agents, communicative pressures, neural agents, interacting neural

备注: Cognitive Science Society Annual Conference 2026

点击查看摘要

Abstract:Modeling the emergence of human-like lexicons in computational systems has advanced through the use of interacting neural agents, which simulate both learning and communicative pressures. The NeLLCom-Lex framework (Zhang et al., 2025) allows neural agents to develop pragmatic color naming behavior and human-like lexicons through supervised learning (SL) from human data and reinforcement learning (RL) in referential games. Despite these successes, the lexicons that emerge diverge systematically from human color categories, producing highly non-convex regions in color space, which contrast with the convexity typical of human categories. To address this, we introduce two factors, upsampling rare color terms during SL and multi-listener RL interactions, and adopt a convexity measure to quantify geometric coherence. We find that upsampling improves lexical diversity and system-level informativeness of the color lexicon, while many-listener setups promote more convex color categories. The combination of moderate upsampling and multiple listeners produces lexicons most similar to human systems.

21. 【2604.25665】LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

链接https://arxiv.org/abs/2604.25665

作者:Huyen Nguyen,Haoxuan Zhang,Yang Zhang,Junhua Ding,Haihua Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

关键词:generated summaries remains, large language model, Reliable evaluation, open challenge, large language

备注: 15 pages, 3 figures, 5 tables

点击查看摘要

Abstract:Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released on GitHub.

22. 【2604.25654】Progressing beyond Art Masterpieces or Touristic Clichés: how to assess your LLMs for cultural alignment?

链接https://arxiv.org/abs/2604.25654

作者:António Branco,João Silva,Nuno Marques,Luis Gomes,Ricardo Campos,Raquel Sequeira,Sara Nerea,Rodrigo Silva,Miguel Marques,Rodrigo Duarte,Artur Putyato,Diogo Folques,Tiago Valente

类目:Computation and Language (cs.CL)

关键词:Large Language Models, attracted increasing attention, Large Language, alignment of Large, cultural bias

备注: RESOURCEFUL-2026 Workshop at LREC 2026

点击查看摘要

Abstract:Although the cultural (mis)alignment of Large Language Models (LLMs) has attracted increasing attention -- often framed in terms of cultural bias -- until recently there has been limited work on the design and development of datasets for cultural assessment. Here, we review existing approaches to such datasets and identify their main limitations. To address these issues, we propose design guidelines for annotators and report on the construction of a dataset built according to these principles. We further present a series of contrastive experiments conducted with this dataset. The results demonstrate that our design yields test sets with greater discriminative power, effectively distinguishing between models specialized for a given culture and those that are not, ceteris paribus.

23. 【2604.25634】The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive

链接https://arxiv.org/abs/2604.25634

作者:Alex Bogdan,Adrian de Valois-Franklin

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:striking statistical regularity, frontier LLM outputs, existing sampling-based detectors

备注: 25 pages, 6 figures, 6 tables, 37 references. Code and data: [this https URL](https://github.com/Evolutionairy-AI/Ranking-Inference)

点击查看摘要

Abstract:We report a striking statistical regularity in frontier LLM outputs that enables a CPU-only scoring primitive running at 2.6 microseconds per token, with estimated latency up to 100,000$\times$ (five orders of magnitude) below existing sampling-based detectors. Across six contemporary models from five independent vendors, two generation sizes, and five held-out domains, token rank-frequency distributions converge to the same two-parameter Mandelbrot ranking distribution, with 34 of 36 model-by-domain fits exceeding $R^{2} = 0.94$ and 35 of 36 favoring Mandelbrot over Zipf by AIC. The shared family does not collapse the models into statistical duplicates. Fitted Mandelbrot parameters remain cleanly separable between models: the cross-model spread in $q$ (1.63 to 3.69) exceeds its per-model bootstrap standard deviation (0.03 to 0.10) by more than an order of magnitude, yielding tens of standard deviations of separation per few thousand output tokens. Two capabilities follow. First, statistical model fingerprinting: text from a vendor-delivered LLM can be tested against its claimed model family without cryptographic watermarks or access to model internals, supporting provenance verification and silent-substitution audits. Second, a model-agnostic reference distribution for black-box output assessment, from which we derive a single-pass scoring primitive that composes with model log probabilities when available and degrades to a rank-only mode usable on closed APIs. Pilot results on FRANK, TruthfulQA, and HaluEval map where the primitive helps (lexical anomalies, unsupported entities) and where it structurally cannot (reasoning errors in domain-appropriate vocabulary). We position the primitive as a first-pass triage layer in compound evaluation stacks, not as a replacement for sampling-based or source-conditioned verifiers.
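As background, the Zipf-Mandelbrot family assigns rank `r` a probability proportional to `(r + q) ** (-s)`, and setting `q = 0` recovers plain Zipf. The sketch below computes an empirical rank-frequency profile and the corresponding model distribution. The abstract names the parameter `q`; the exponent name `s`, the finite rank truncation, and the toy tokens are assumptions, and this is not the paper's scoring primitive.

```python
from collections import Counter

def rank_frequencies(tokens):
    """Relative frequency of each distinct token, ordered from most to least frequent."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    return [c / total for c in counts]

def mandelbrot_pmf(num_ranks, q, s):
    """Zipf-Mandelbrot probabilities over ranks 1..num_ranks."""
    weights = [(rank + q) ** (-s) for rank in range(1, num_ranks + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]

observed = rank_frequencies("a a a b b c".split())
model = mandelbrot_pmf(len(observed), q=2.0, s=1.5)
print(observed)
print(model)
```

Fitting `q` and `s` to an observed profile, for example by least squares in log space, is what would let two models' outputs be compared through their fitted parameters.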

24. 【2604.25611】WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

链接https://arxiv.org/abs/2604.25611

作者:Erfan Ramezani,Mohammad Mahdi Giahi,Mohammad Erfan Zarabadipour,Amir Reza Yosefian,Hamid Ghadiri

类目:Computation and Language (cs.CL); Sound (cs.SD)

关键词:deploying large-scale transformer, automatic speech recognition, Voice Activity Detection, large-scale transformer models, computational efficiency

备注: 36 pages, 14 figures. Open-source implementation available at PyPI

点击查看摘要

Abstract:Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models like Whisper. Existing streaming approaches either sacrifice accuracy through aggressive chunking or incur prohibitive memory costs through unbounded context accumulation. We present WhisperPipe, a novel streaming architecture that achieves bounded memory consumption while maintaining transcription quality through three key innovations: a hybrid Voice Activity Detection (VAD) pipeline combining Silero VAD with energy-based filtering to reduce false activations by 34%, a dynamic buffering mechanism with overlapping context windows that prevents information loss at segment boundaries, and an adaptive processing strategy that balances latency and accuracy based on speech characteristics. Evaluated on 2.5 hours of diverse audio data, WhisperPipe demonstrates a median end-to-end latency of 89ms (90th percentile: 142ms) while consuming 48% less peak GPU memory and 80.9% lower average GPU utilization compared to baseline Whisper implementations. The system maintains stable memory usage over extended sessions, with zero growth rate across 150-minute continuous operation. Comparative analysis against related work shows that WhisperPipe achieves competitive accuracy (WER within 2% of offline Whisper) while operating at 3-5x lower latency than existing streaming solutions. The architecture's modular design enables deployment across resource-constrained environments, from edge devices to cloud infrastructure. Our results demonstrate that careful architectural design can reconcile the competing demands of real-time responsiveness and model sophistication in production ASR systems.
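The idea of overlapping context windows can be illustrated with fixed-size windows over a sample buffer. This is a deliberate simplification: the abstract describes a dynamic mechanism whose details are not given, and the function name and parameters here are hypothetical.

```python
def overlapping_windows(num_samples, window, overlap):
    """Return (start, end) index pairs covering num_samples, where
    consecutive windows share `overlap` samples."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        windows.append((start, end))
        if end == num_samples:
            break
        start += step
    return windows

print(overlapping_windows(10, window=4, overlap=1))
```

Because each window starts `overlap` samples before the previous one ends, speech that straddles a boundary appears in full in at least one window as long as it is shorter than the overlap.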

25. 【2604.25580】Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation

链接https://arxiv.org/abs/2604.25580

作者:David Hartmann,Manuel Tonneau,Angelie Kraft,LK Seiling,Dimitri Staufer,Pieter Delobelle,Jan Fillies,Anna Ricarda Luther,Jan Batzner,Mareike Lisker

类目:Computation and Language (cs.CL)

关键词:Perspective API, LLM evaluation research, automated toxicity measurement, collective research efforts, facto standard

备注: 13 pages, 1 figure, 1 table

点击查看摘要

Abstract:The closure of Perspective API at the end of 2026 discards what has functioned as the de facto standard for automated toxicity measurement in NLP, CSS, and LLM evaluation research. We document the structural dependence that the communities built on this single proprietary tool and discuss how this dependence caused epistemic problems that have affected - and will likely continue to affect - collective research efforts. Perspective's model was periodically updated without versioning or disclosure, its annotation structure reflected a single corporate operationalisation of a contested concept, and its scores were used simultaneously as an evaluation target and an evaluation standard. Its closure leaves behind non-updatable benchmarks, irreproducible results, and ultimately a field at risk of perpetuating these issues by turning to closed-source LLMs. We use Perspective's announced termination as an opportunity to call for an independent, valid, adaptable, and reproducible toxicity and hate speech measurement infrastructure, with the technical and governance requirements outlined in this paper.

26. 【2604.25578】Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling

链接https://arxiv.org/abs/2604.25578

作者:Fan Jiang,Yu Zhao,Chenyang Lyu,Tianqi Shi,Yichao Du,Feihu Jiang,Longyue Wang,Weihua Luo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:fully open multilingual, open multilingual sparse, suite of fully, fully open, models

备注

点击查看摘要

Abstract:We present Marco-MoE, a suite of fully open multilingual sparse Mixture-of-Experts (MoE) models. Marco-MoE features a highly sparse design in which only around 5% of the total parameters are activated per input token. This extreme sparsity, combined with upcycling from dense models, enables efficient pre-training on 5T tokens. Our models surpass similarly-sized competitors on English and multilingual benchmarks, achieving a best-in-class performance-to-compute ratio. We further post-train these models to create Marco-MoE-Instruct variants, which surpass the performance of competing models possessing $3$--$14\times$ more activated parameters. Our analysis reveals that Marco-MoE learns structured expert activation patterns shared across related languages, while maintaining highly specialized utilization for linguistically isolated ones. We further show that Marco-MoE allows for scalable language expansion without the interference typical of dense models. To support the community, we disclose our full training datasets, recipes, and model weights.

27. 【2604.25525】From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support

链接https://arxiv.org/abs/2604.25525

作者:Natalia Amat-Lefort,Mert Yazan,Amanda Cercas Curry,Flor Miriam Plaza-del-Arco

类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:Large Language Models, Large Language, emotional support, Language Models, LLM emotional support

备注: 28 pages (9 pages main text, 19 pages references and appendices), 14 figures. The first two authors contributed equally

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used not only for instrumental tasks, but as always-available and non-judgmental confidants for emotional support. Yet what drives adoption and how users perceive emotional support interactions across countries remains unknown. To address this gap, we present the first large-scale cross-cultural study of LLM use for emotional support, surveying 4,641 participants across seven countries (USA, UK, Germany, France, Spain, Italy, and The Netherlands). Our results show that adoption rates vary dramatically across countries (from 20% to 59%). Using mixed models that separate cultural effects from demographic composition, we find that being aged 25-44, religious, married, and of higher socioeconomic status are predictors of positive perceptions (trust, usage, perceived benefits), with socioeconomic status being the strongest. English-speaking countries consistently show more positive perceptions than Continental European countries. We further collect a corpus of 731 real multilingual prompts from user interactions, showing that users mainly seek help for loneliness, stress, relationship conflicts, and mental health struggles. Our findings reveal that LLM emotional support use is shaped by a complex sociotechnical landscape and call for a broader research agenda examining how these systems can be developed, deployed, and governed to ensure safe and informed access.

28. 【2604.25482】From World-Gen to Quest-Line: A Dependency-Driven Prompt Pipeline for Coherent RPG Generation

链接https://arxiv.org/abs/2604.25482

作者:Dominik Borawski,Marta Szulc,Robert Chudy,Małgorzata Giedrowicz,Piotr Mironowicz

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, multi-layered role-playing game, shown strong potential, Language Models

备注: 13 pages, 1 figure, 5 listings

点击查看摘要

Abstract:Large Language Models (LLMs) have shown strong potential for narrative generation, but their use in complex, multi-layered role-playing game (RPG) worlds is still limited by issues of coherence, controllability, and structural consistency. This paper explores a dependency-aware, multi-stage prompt pipeline for procedural RPG content generation that models narrative dependencies through structured intermediate representations. The approach decomposes generation into sequential stages: world building, non-player character creation, player character creation, campaign-level quest planning, and quest expansion. Each stage conditions on structured JSON outputs from previous stages. By enforcing schemas and explicit data flow, the pipeline reduces narrative drift, limits hallucinations, and supports scalable creation of interconnected narrative elements. The system is evaluated qualitatively through human-centered analysis across multiple independent runs. Outputs are assessed using criteria such as structural completeness, internal consistency, narrative coherence, diversity, and actionability. Results show that the pipeline consistently generates logically sound and structurally valid RPG content, without quality degradation as complexity increases. Separating high-level campaign planning from detailed quest expansion improves both global structure and local storytelling. These findings suggest that dependency-aware prompt pipelines with structured intermediate representations are an effective design pattern for LLM-based procedural content generation. This approach may also generalize to other domains requiring sequential reasoning over evolving contextual states.
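The staged design, in which each stage conditions on validated JSON from earlier stages, can be sketched as follows. The stage names follow the abstract; the required keys, the `generate` callable, and the canned outputs are hypothetical placeholders for real prompts and schemas.

```python
import json

# Stage order from the abstract; the required keys are illustrative assumptions.
STAGES = [
    ("world", ["name", "factions"]),
    ("npcs", ["characters"]),
    ("player", ["character"]),
    ("campaign", ["quests"]),
    ("quest_details", ["steps"]),
]

def run_pipeline(generate):
    """Run each stage in order, passing all earlier outputs as context."""
    context = {}
    for stage, required_keys in STAGES:
        raw = generate(stage, context)   # would be an LLM call in practice
        data = json.loads(raw)           # reject output that is not valid JSON
        missing = [key for key in required_keys if key not in data]
        if missing:
            raise ValueError(f"stage {stage!r} is missing keys: {missing}")
        context[stage] = data
    return context

def fake_generate(stage, context):
    """Stand-in generator returning canned JSON for each stage."""
    canned = {
        "world": {"name": "Placeholder Realm", "factions": ["A", "B"]},
        "npcs": {"characters": ["npc-1"]},
        "player": {"character": "hero-1"},
        "campaign": {"quests": ["quest-1"]},
        "quest_details": {"steps": ["step-1"]},
    }
    return json.dumps(canned[stage])

result = run_pipeline(fake_generate)
print(list(result))
```

Failing fast on a missing key stops a malformed early stage from propagating into later ones, which is one way to read the abstract's point about limiting narrative drift.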

29. 【2604.25476】PSP: An Interpretable Per-Dimension Accent Benchmark for Indic Text-to-Speech

链接https://arxiv.org/abs/2604.25476

作者:Venkata Pushpak Teja Menta

类目:Sound (cs.SD); Computation and Language (cs.CL)

关键词:evaluation measures intelligibility, evaluation measures, measures intelligibility, Phoneme Substitution Profile, Standard

备注: 8 pages, 7 tables. Companion paper to Praxy Voice (arXiv:submission id - 7506231). Code: [this https URL](https://github.com/praxelhq/psp-eval); Centroids: [this https URL](https://huggingface.co/datasets/Praxel/psp-native-centroids)

点击查看摘要

Abstract:Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the latter two are corpus-level distributional distances. In this v1 we benchmark four commercial and open-source systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS) on Hindi, Telugu, and Tamil pilot sets, with a fifth system (Praxy Voice) included on all three languages, plus an R5-R6 case study on Telugu. Three findings: (i) retroflex collapse grows monotonically with phonological difficulty, Hindi < Telugu < Tamil (~1%, ~40%, ~68%); (ii) PSP ordering diverges from WER ordering -- commercial WER-leaders do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto-optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000-clip embeddings for FAD, 500-clip prosodic feature matrices for PSD, 300-utterance golden sets per language, scoring code under MIT, and centroids under CC-BY. Formal MOS-correlation is deferred to v2; v1 reports five internal-consistency signals plus a native-audio sanity check.

30. 【2604.25456】An Investigation of Linguistic Biases in LLM-Based Recommendations

链接https://arxiv.org/abs/2604.25456

作者:Nitin Venkateswaran,Jason Ang,Deep Adhikari,Tarun Krishna Dasari

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Yelp Open dataset, Southern American English, Yelp Inc., Walmart product reviews, Yelp Open

备注

点击查看摘要

Abstract:We investigate linguistic biases in LLM-based restaurant and product recommendations given prompts varying across Southern American English (AE), Indian English (IE), and Code-Switched Hindi-English dialects, using the Yelp Open dataset (Yelp Inc., 2023) and Walmart product reviews dataset (PromptCloud, 2020). We add lists of restaurant and product names balanced by cuisine type and product category to the prompts given to the LLM, and we zero-shot prompt the LLMs in a cold-start setting to select the top-20 restaurant and product recommendations from these lists for each of the dialect-varied prompts. We prompt LLMs using different list samples across 20 seeds for better generalization, and aggregate per cuisine-type and per category response counts for each seed, question/prompt, and LLM model. We run mixed-effects regression models for each model family and topic (restaurant/product) with the aggregate response counts as the dependent variable, and conduct likelihood ratio tests for the fixed effects with post-hoc pairwise testing of estimated marginal means differences, to investigate group-level differences in recommendation counts by model size and dialect type. Results show that dialect plays a role in the type of restaurant selected across the models tested, with the mistral-small-3.1 model and both llama-3.1 family models showing more sensitivity to Indian English and Code-Switched prompts. In terms of product recommendations, the llama-3.1-70B-model is particularly sensitive to Code-Switched prompts in four out of seven categories, and more beauty and home category recommendations are seen when using the Indian English and Code-Switched prompts for larger and smaller models, respectively. No broad trends are seen in the model-size based differences, with differing recommendations based on model sizes conditioned by the type of dialect.

31. 【2604.25452】Benchmarking Logistic Regression, SVM, and LightGBM Against BiLSTM with Attention for Sentiment Analysis on Indonesian Product Reviews

链接https://arxiv.org/abs/2604.25452

作者:Razin Hafid Hamdi,Ivana Margareth Hutabarat,Hanna Gresia Sinaga,Luluk Muthoharoh,Ardika Satria,Martin C.T. Manullang

类目:Computation and Language (cs.CL)

关键词:e-commerce platforms plays, automatically understanding customer, understanding customer satisfaction, providing actionable insights, improve product quality

备注: 6 pages, 2 figures. Benchmarking study comparing PyCaret-based machine learning models (Logistic Regression, SVM, LightGBM) with a BiLSTM+Attention model for sentiment analysis on Indonesian product reviews

点击查看摘要

Abstract:Sentiment analysis of product reviews on e-commerce platforms plays a critical role in automatically understanding customer satisfaction and providing actionable insights for sellers seeking to improve product quality. This paper presents a comprehensive benchmarking study comparing a Machine Learning (ML) approach via the PyCaret AutoML framework against a Deep Learning (DL) approach based on a Bidirectional Long Short-Term Memory (BiLSTM) architecture with an Attention mechanism for binary sentiment classification on Indonesian product reviews. The dataset comprises 19,728 samples balanced equally between positive and negative reviews. For the ML approach, three prominent algorithms were evaluated via 10-fold stratified cross-validation: Logistic Regression (LR), Support Vector Machine (SVM) with a linear kernel, and Light Gradient Boosting Machine (LightGBM). Logistic Regression achieved the best ML performance with an accuracy of 97.26% and an F1-score of 97.26%. The BiLSTM with Attention model, evaluated on 3,946 held-out test samples, achieved an accuracy of 97.24% and an F1-score of 97.24%. These comparative results demonstrate that traditional ML algorithms with proper preprocessing and feature extraction can compete closely with, and even marginally outperform, more complex sequential DL architectures on high-dimensional datasets, while simultaneously offering greater computational efficiency.

32. 【2604.25448】Navigating Global AI Regulation: A Multi-Jurisdictional Retrieval-Augmented Generation System

链接https://arxiv.org/abs/2604.25448

作者:Courtney Ford,Ojas Rane,Susan Leavy

类目:Computation and Language (cs.CL)

关键词:average answer relevancy, difficult for policymakers, average faithfulness, average answer, increasingly difficult

备注: Preprint. Accepted at PoliticalNLP Workshop, LREC 2026. 10 pages, 1 figure

点击查看摘要

Abstract:Navigating AI regulation across jurisdictions is increasingly difficult for policymakers, legal professionals, and researchers. To address this, we present a multi-jurisdictional Retrieval-Augmented Generation system for global AI regulation. Our corpus includes 242 documents across 68 jurisdictions, ranging from formal legislation like the EU AI Act to unstructured policy documents such as national AI strategies. The system makes three technical contributions: type-specific chunking that preserves legal structure across heterogeneous documents; conditional retrieval routing with entity detection and metadata for legal citations; and priority-based re-ranking to boost enacted legislation over policy and secondary sources. Evaluation of 50 queries reveals strong performance across both single-entity and multi-jurisdictional questions, achieving 0.87 average faithfulness and 0.84 average answer relevancy. Single-entity queries achieve 0.86 average faithfulness and 0.92 average answer relevancy, while multi-jurisdictional comparison queries achieve 0.88 average faithfulness and 0.75 average answer relevancy. These findings highlight the effectiveness of domain-specific retrieval strategies for navigating complex, heterogeneous regulatory corpora.
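Priority-based re-ranking of this kind can be sketched as an additive boost on the retriever score. The boost values, the `doc_type` labels, and the sample hits below are assumptions for illustration; the abstract does not state how the priority is combined with relevance.

```python
# Hypothetical boosts: enacted legislation outranks policy, which outranks secondary sources.
PRIORITY_BOOST = {"legislation": 0.2, "policy": 0.1, "secondary": 0.0}

def rerank(hits):
    """Sort retrieved chunks by retriever score plus a document-type boost."""
    return sorted(
        hits,
        key=lambda hit: hit["score"] + PRIORITY_BOOST.get(hit["doc_type"], 0.0),
        reverse=True,
    )

hits = [
    {"id": "national-strategy", "score": 0.80, "doc_type": "policy"},
    {"id": "enacted-act", "score": 0.75, "doc_type": "legislation"},
    {"id": "commentary", "score": 0.85, "doc_type": "secondary"},
]
print([hit["id"] for hit in rerank(hits)])
```

An additive boost lets a highly relevant secondary source still outrank a weakly relevant statute, whereas a strict sort by document type would not.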

33. 【2604.25444】One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement

链接https://arxiv.org/abs/2604.25444

作者:Yixiao Zhou,Dongzhou Cheng,Zhiliang Wu,Yi Yang,Yu Cheng,Hehe Fan

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, ambiguous human inquiries, structured logic required, latent reasoning capabilities

备注: Accepted to ACL26

点击查看摘要

Abstract:Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive $O(N)$ costs by fine-tuning each model individually or rely on static prompts that fail to resolve query-level structural complexity. In this paper, we propose ReQueR (Reinforcement Query Refinement), a modular framework that treats reasoning elicitation as an inference-time alignment task. We train a specialized Refiner policy via Reinforcement Learning to rewrite raw queries into explicit logical decompositions, treating frozen LLMs as the environment. Rooted in the classical Zone of Proximal Development from educational psychology, we introduce the Adaptive Solver Hierarchy, a curriculum mechanism that stabilizes training by dynamically aligning environmental difficulty with the Refiner's evolving competence. ReQueR yields consistent absolute gains of 1.7%--7.2% across diverse architectures and benchmarks, outperforming strong baselines by 2.1% on average. Crucially, it provides a promising paradigm for one-to-many inference-time reasoning elicitation, enabling a single Refiner trained on a small set of models to effectively unlock reasoning in diverse unseen models. Code is available at this https URL.

34. 【2604.25441】Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

Link: https://arxiv.org/abs/2604.25441

Authors: Venkata Pushpak Teja Menta

Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: TTS systems produce, measured phonological dimensions, produce near-native Indic, Commercial TTS systems, systems produce near-native

Comments: 9 pages, 6 figures, 6 tables. Companion paper to PSP benchmark. Code: [this https URL](https://github.com/praxelhq/praxy) ; Model: [this https URL](https://huggingface.co/Praxel/praxy-voice-r6) ; Demo: [this https URL](https://huggingface.co/spaces/Praxel/praxy-voice-demo)

Abstract:Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.

35. 【2604.25423】Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives

Link: https://arxiv.org/abs/2604.25423

Authors: Yu Wang,Emmanuele Chersoni,Chu-Ren Huang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: acquire embodied cognition, large language models, acquire embodied, embodied cognition, large language

Comments: Accepted to ACL 2026

Abstract:Do large language models (LLMs) truly acquire embodied cognition and cultural conventions from text? We introduce demonstratives, fundamental spatial expressions like "this/that" in English and "zhè/nà" in Chinese, as a novel probe for grounded knowledge. Using 6,400 responses from 320 native speakers, we establish a human baseline: English speakers reliably distinguish proximal-distal referents but struggle with perspective-taking, while Chinese speakers switch perspectives fluently but tolerate distal ambiguity. In contrast, five state-of-the-art LLMs fail to inherently understand the proximal-distal contrast and show no cultural differences, defaulting to English-centric reasoning. Our study contributes (i) a new task, based on demonstratives, as a new lens for evaluating embodied cognition and cultural conventions; (ii) empirical evidence of cross-cultural asymmetries in human interpretation; (iii) a new perspective on the egocentric-sociocentric debate, showing both orientations coexist but vary across languages; and (iv) a call to address individual variation in future model design.

36. 【2604.25409】Scaling Probabilistic Transformer via Efficient Cross-Scale Hyperparameter Transfer

Link: https://arxiv.org/abs/2604.25409

Authors: Penghao Kuang,Haoyi Wu,Kewei Tu

Categories: Computation and Language (cs.CL)

Keywords: contextual word representation, medium sized datasets, demonstrated substantial similarity, downstream task performance, Maximal Update Parametrization

Abstract:Probabilistic Transformer (PT), a white-box probabilistic model for contextual word representation, has demonstrated substantial similarity to standard Transformers in both computational structure and downstream task performance on small models and small- to medium-sized datasets. However, PT is less robust to hyperparameter choices than standard Transformers, making it harder to scale efficiently. In this work, we follow Maximal Update Parametrization (muP) to rescale PT's parameters, so that hyperparameters optimized on small models can be transferred to larger models without additional tuning. With this approach, we successfully scale PT to models with up to 0.4B parameters. Experiments show that PT consistently outperforms standard Transformers under the same parameter budget on Masked Language Modeling (MLM) tasks. We hope this work will contribute to the practical deployment of probabilistic models at substantially larger scales in the future.

37. 【2604.25392】Benchmarking PyCaret AutoML Against IndoBERT Fine-Tuning for Sentiment Analysis on Indonesian IKN Twitter Data

Link: https://arxiv.org/abs/2604.25392

Authors: Mutia Alfi Mayzaroh,Dwi Fitria Ningsih,Nindi Destriani,Martin C.T. Manullang

Categories: Computation and Language (cs.CL)

Keywords: Ibu Kota Nusantara, Indonesian-language Twitter comments, learning approach based, Twitter comments related, Kota Nusantara

Comments: 10 pages, 5 figures, 4 tables. Presented as a benchmarking study on Indonesian sentiment analysis using PyCaret and IndoBERT

Abstract:This paper benchmarks a classical machine learning approach based on PyCaret AutoML against a deep learning approach based on IndoBERT fine-tuning for binary sentiment analysis of Indonesian-language Twitter comments related to Ibu Kota Nusantara (IKN). The dataset contains 1,472 manually labeled samples, consisting of 780 negative and 692 positive comments. In the machine learning setting, Logistic Regression, Naive Bayes, and Support Vector Machine were evaluated using 10-fold cross-validation, with Logistic Regression achieving the best performance among the classical models at 77.57% accuracy and 77.17% F1-score. In the deep learning setting, the indobenchmark/indobert-base-p1 model was fine-tuned for five epochs and achieved 89.59% test accuracy and 89.37% F1-score. The results show that IndoBERT substantially outperforms the machine learning baselines, highlighting the effectiveness of Transformer-based contextual representations for informal Indonesian social media text.

38. 【2604.25384】Wiki Dumps to Training Corpora: South Slavic Case

Link: https://arxiv.org/abs/2604.25384

Authors: Mihailo Škorić

Categories: Computation and Language (cs.CL)

Keywords: transforming raw Wikimedia, raw Wikimedia dumps, Wikimedia dumps, raw Wikimedia, South Slavic

Abstract:This paper presents a methodology for transforming raw Wikimedia dumps into quality textual corpora for seven South Slavic languages. The work is divided into two major phases. The first involves extracting and cleaning text from raw dumps of Wikipedia, Wikisource, Wikibooks, Wikinews, and Wikiquote, where available. This step requires careful handling of raw wiki markup to isolate, first of all, textual articles, and then usable natural language text within them. The second phase addresses the challenge of suspicious or low-quality articles, which are often generated from databases or structured knowledge bases. These articles are characterised by repetitive patterns, generic phrasing, and minimal to no original content. To mitigate their impact, an n-gram-based filtering strategy was employed to detect high levels of textual redundancy between articles and then remove such articles from the corpora entirely. The resulting datasets aim to provide linguistically rich texts suitable for training language models or conducting comparative research across South Slavic languages. By combining systematic extraction with quality control, this work contributes to the creation of reliable, high-information corpora that reflect authentic language use and cultural context. While focused on the South Slavic case in the paper, the approach is mostly language-agnostic and can be generalised to other languages and language families.
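
The n-gram-based redundancy filter described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: word trigrams, Jaccard overlap, the 0.5 threshold, and the helper names are all choices made for this example.

```python
def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_redundant(articles: dict[str, str], n: int = 3,
                     threshold: float = 0.5) -> dict[str, str]:
    """Drop every article whose n-grams overlap heavily with another article."""
    grams = {title: word_ngrams(text, n) for title, text in articles.items()}
    titles = list(articles)
    flagged: set[str] = set()
    for i, a in enumerate(titles):
        for b in titles[i + 1:]:
            if jaccard(grams[a], grams[b]) >= threshold:
                flagged.update((a, b))  # both sides of a redundant pair go
    return {t: articles[t] for t in titles if t not in flagged}


articles = {
    "Village A": "Village A is a village in the municipality of X "
                 "in the district of Y with a population of 120",
    "Village B": "Village B is a village in the municipality of X "
                 "in the district of Y with a population of 340",
    "Essay": "The river winds through limestone gorges before reaching "
             "the plain where farmers grow wheat",
}
kept = filter_redundant(articles)
print(list(kept))
```

The two templated village stubs share 15 of their 21 distinct trigrams and are both removed, while the essay survives. The pairwise loop is quadratic in the number of articles, so a real dump would need hashing or an index rather than this brute-force comparison.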

39. 【2604.25374】Language corpora for the Dutch medical domain

Link: https://arxiv.org/abs/2604.25374

Authors: B. van Es

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: limiting NLP development, Dutch medical, Dutch medical corpora, Dutch medical resources

Comments: 11 pages, no figures

Abstract:Background: Dutch medical corpora are scarce, limiting NLP development. Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. Results: The resulting corpus comprises approximately 35 billion tokens across the medical domain in about 100 million documents, freely available on Hugging Face. Conclusion: This work establishes the first large-scale Dutch medical language corpus for pre-training and downstream NLP tasks.

40. 【2604.25359】The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

Link: https://arxiv.org/abs/2604.25359

Authors: Abhinav Kumar Singh,Harsha Vardhan Khurdula,Yoeven D Khemlani,Vineet Agarwal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, extract structured data, Structured Output Benchmark, parsing invoices

Comments: 19 pages, 4 figures, 11 tables, submitted to NeurIPS 2026

Abstract:Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources: parsing invoices and medical records, and converting PDF documents to database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone, or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context regardless of source modality; this deliberate design isolates structured-output capability from raw vision or speech-processing quality, ensuring a fair, source-agnostic comparison. Our benchmark comprises 5,000 text evaluation records derived from multi-hop QA drawn from a 25,091-record full corpus, 209 image records from OCR-processed PDFs across seven document types including multi-column layouts, dense tables, scanned historical documents, small-print text, and mathematical typesetting, and 115 audio records from the AMI corpus. Each record pairs a natural-language question with a JSON schema that the model must follow and a ground-truth answer verified against the source context. We evaluate 21 frontier and open-weight models across three source domains and seven metrics. Our results reveal a consistent pattern: models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0% on text, 67.2% on images, and 23.7% on audio, where longer context makes extraction substantially harder. We release the dataset, evaluation pipeline, and all related code.

41. 【2604.25325】R$^3$-SQL: Ranking Reward and Resampling for Text-to-SQL

Link: https://arxiv.org/abs/2604.25325

Authors: Hojae Han,Yeonseok Jeong,Seung-won Hwang,Zhewei Yao,Yuxiong He

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: systems generate multiple, SQL, generate multiple candidate, systems generate, final prediction

Comments: Accepted by Findings of ACL 2026

Abstract:Modern Text-to-SQL systems generate multiple candidate SQL queries and rank them to select a final prediction. However, existing methods face two limitations. First, they often score functionally equivalent SQL queries inconsistently despite identical execution results. Second, ranking cannot recover when the correct SQL is absent from the candidate pool. We propose R$^3$-SQL, a Text-to-SQL framework that addresses both issues through a unified reward for ranking and resampling. R$^3$-SQL first groups candidates by execution result and ranks groups for consistency. To score each group, it combines a pairwise preference across groups with a pointwise utility from the best group rank and size, capturing relative preference, consistency, and candidate quality. To improve candidate recall, R$^3$-SQL introduces agentic resampling, which judges the generated candidate pool and selectively resamples when the correct SQL is likely absent. R$^3$-SQL achieves 75.03 execution accuracy on BIRD-dev, a new state of the art among methods using models with disclosed sizes, with consistent gains across five benchmarks.
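
The first step the abstract mentions, grouping candidate queries by execution result, can be illustrated with the standard-library `sqlite3` module. This is only a sketch of that grouping step: the toy table, the candidate queries, and the `group_by_execution` helper are assumptions for this example, and the paper's pairwise and pointwise scoring is not reproduced here.

```python
import sqlite3
from collections import defaultdict


def group_by_execution(conn: sqlite3.Connection, candidates: list[str]) -> dict:
    """Group SQL strings that return the same rows; skip ones that fail."""
    groups = defaultdict(list)
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # a candidate that does not execute joins no group
        # Order-insensitive signature. This is a simplification: queries
        # with ORDER BY would need an order-sensitive comparison instead.
        signature = tuple(sorted(rows, key=repr))
        groups[signature].append(sql)
    return dict(groups)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER);
INSERT INTO emp VALUES ('Ann', 'eng', 120), ('Bob', 'eng', 100), ('Cy', 'ops', 90);
""")

candidates = [
    "SELECT name FROM emp WHERE salary > 95",
    "SELECT name FROM emp WHERE dept = 'eng'",
    "SELECT name FROM emp WHERE salary >= 90",
    "SELECT nme FROM emp",  # deliberately broken column name
]
groups = group_by_execution(conn, candidates)
largest = max(groups.values(), key=len)
print(len(groups), largest)
```

The first two queries differ textually but return the same rows, so they land in one group and would receive one shared score, which is how grouping avoids scoring equivalent queries inconsistently.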

42. 【2604.25318】Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

Link: https://arxiv.org/abs/2604.25318

Authors: Lanshan He,Haozhou Pang,Qi Gan,Xin Shen,Ziwei Zhang,Yibo Liu,Gang Fang,Bo Liu,Kai Sheng,Shengfeng Zeng,Chaofan Li,Zhen Hui,Keer Zhou,Lan Zhou,Shujun Dai

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: carefully choreographed cinematic, choreographed cinematic sequences, cinematic sequences embedded, interactive media, narrative delivery

Comments: 27 pages excluding appendix

Abstract:Cutscenes are carefully choreographed cinematic sequences embedded in video games and interactive media, serving as the primary vehicle for narrative delivery, character development, and emotional engagement. Producing cutscenes is inherently complex: it demands seamless coordination across screenwriting, cinematography, character animation, voice acting, and technical direction, often requiring days to weeks of collaborative effort from multidisciplinary teams to produce minutes of polished content. In this work, we present Cutscene Agent, an LLM agent framework for automated end-to-end cutscene generation. The framework makes three contributions: (1) a Cutscene Toolkit built on the Model Context Protocol (MCP) that establishes bidirectional integration between LLM agents and the game engine -- agents not only invoke engine operations but continuously observe real-time scene state, enabling closed-loop generation of editable engine-native cinematic assets; (2) a multi-agent system where a director agent orchestrates specialist subagents for animation, cinematography, and sound design, augmented by a visual reasoning feedback loop for perception-driven refinement; and (3) CutsceneBench, a hierarchical evaluation benchmark for cutscene generation. Unlike typical tool-use benchmarks that evaluate short, isolated function calls, cutscene generation requires long-horizon, multi-step orchestration of dozens of interdependent tool invocations with strict ordering constraints -- a capability dimension that existing benchmarks do not cover. We evaluate a range of LLMs on CutsceneBench and analyze their performance across this challenging task.

43. 【2604.25313】Faithfulness-QA: A Counterfactual Entity Substitution Dataset for Training Context-Faithful RAG Models

Link: https://arxiv.org/abs/2604.25313

Authors: Li Ju,Junzhe Wang,Qi Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: frequently produce answers, produce answers grounded, Retrieval-Augmented Generation, models frequently produce, undermining the core

Abstract:Retrieval-Augmented Generation (RAG) models frequently produce answers grounded in parametric memory rather than the retrieved context, undermining the core promise of retrieval augmentation. A fundamental obstacle to fixing this unfaithfulness is the lack of training data that explicitly requires models to prefer context over internal knowledge. We introduce Faithfulness-QA, a large-scale dataset of 99,094 samples constructed through counterfactual entity substitution. Starting from two established extractive QA benchmarks--SQuAD and TriviaQA--we automatically identify answer-bearing named entities in each context, replace them with type-consistent alternatives drawn from a curated bank of 76,953 entities, and thereby manufacture controlled knowledge conflicts between context and parametric memory. Rigorous quality filtering ensures 100% pass rates across four automated checks on random 200-sample audits. We release the full dataset, the construction pipeline, and a typed entity bank covering eight named entity categories. Faithfulness-QA is designed as a training resource for attention-based faithfulness objectives and as an evaluation benchmark for measuring context-grounding behavior in RAG systems. Data and code are available at this https URL.
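
Counterfactual entity substitution, as the abstract describes it, swaps the answer-bearing entity in a context for another entity of the same type. The sketch below shows the idea on one sentence; the three-entry entity bank, the `substitute` helper, and the example context are hypothetical, and the paper's entity detection and quality checks are not modelled.

```python
import random

# Hypothetical miniature entity bank keyed by entity type.
# The dataset described in the abstract uses a curated bank of 76,953 entities.
ENTITY_BANK = {
    "PERSON": ["Marie Curie", "Alan Turing", "Ada Lovelace"],
    "CITY": ["Lisbon", "Nairobi", "Osaka"],
}


def substitute(context: str, answer: str, entity_type: str,
               rng: random.Random) -> tuple[str, str]:
    """Replace every occurrence of `answer` with a different same-type entity."""
    if answer not in context:
        raise ValueError("answer must appear in the context")
    options = [e for e in ENTITY_BANK[entity_type] if e != answer]
    replacement = rng.choice(options)
    return context.replace(answer, replacement), replacement


rng = random.Random(0)  # seeded so the example is repeatable
context = ("The theory of computation was shaped by Alan Turing, "
           "who worked at Bletchley Park.")
new_context, new_answer = substitute(context, "Alan Turing", "PERSON", rng)
print(new_answer)
print(new_context)
```

After substitution the context supports an answer that conflicts with what a model is likely to have memorised, so a context-faithful model should return `new_answer` rather than the original entity.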

44. 【2604.25297】LegalMidm: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model

Link: https://arxiv.org/abs/2604.25297

Authors: Youngjoon Jang,Chanhee Park,Hyeonseok Moon,Young-kyoung Ham,Jiwon Moon,Jinhyeon Kim,JuKyung Jung,Heuiseok Lim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, turn general-purpose models, open-source large language, recent years, language models

Comments: ICLR 2026 DATA-FM Workshop

Abstract:In recent years, the rapid proliferation of open-source large language models (LLMs) has spurred efforts to turn general-purpose models into domain specialists. However, many domain-specialized LLMs are developed using datasets and training protocols that are not aligned with the nuanced requirements of real-world applications. In the legal domain, where precision and reliability are essential, this lack of consideration limits practical utility. In this study, we propose a systematic training framework grounded in the practical needs of the legal domain, with a focus on Korean law. We introduce LegalMidm, a Korean legal-domain LLM, and present a methodology for constructing high-quality, use-case-driven legal datasets and optimized training pipelines. Our approach emphasizes collaboration with legal professionals and rigorous data curation to ensure relevance and factual accuracy, and demonstrates effectiveness in key legal tasks.

45. 【2604.25296】Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs

Link: https://arxiv.org/abs/2604.25296

Authors: Jianghang Lin,Haihua Yang,Deli Yu,Kai Wu,Kai Ye,Jinghao Lin,Zihan Wang,Yuhang Wu,Liujuan Cao

Categories: Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

Abstract:Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or department. Such fragmented approaches fail to capture the hierarchical and interconnected nature of clinical medical knowledge, limiting the models' ability to perform fine-grained recognition and complex reasoning. In this paper, we propose a novel Entity-Centric Medical Data Engineering framework. We automatically extract entities from authoritative medical literature to construct a Medical Entity Tree (MET), a hierarchical structure that systematically encodes diseases, anatomical structures, modalities, and symptoms into a unified knowledge repository. Building upon the MET, we propose an advanced data engine that includes: (1) node-guided retrieval to anchor raw data to specific medical concepts, (2) a two-stage hybrid filtering and alignment pipeline to ensure precise visual-semantic correspondence, and (3) knowledge-aware data synthesis to generate enriched captions and targeted reasoning VQA pairs, leveraging structural constraints. Extensive evaluations across six medical benchmarks demonstrate that our approach significantly enhances the medical capabilities of general-purpose MLLMs, improving their ability to handle complex clinical queries and achieve state-of-the-art performance in diverse medical contexts.

46. 【2604.25249】Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance

Link: https://arxiv.org/abs/2604.25249

Authors: Jon-Paul Cacioli

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: open problem, sandbagging instruction, sandbagging, capability evaluations, below-chance performance

Comments: 10 pages, 2 figures, 2 tables. Pre-registered: [this https URL](https://osf.io/6zftv/)

Abstract:Detecting sandbagging--the deliberate underperformance on capability evaluations--is an open problem in AI safety. We tested whether symptom validity testing (SVT) logic from clinical malingering detection could identify sandbagging through below-chance performance (BCB) on forced-choice items. In a pre-registered pilot at the 7-9 billion parameter instruction-tuned scale (3 models, 4 MMLU-Pro domains, 4 conditions, 500 items per cell, 24,000 total trials), the plausibility gate failed. Zero of 12 model-domain cells showed significant below-chance performance under sandbagging instruction. Exploratory analyses revealed three qualitatively distinct failure modes. Qwen-2.5-7B and Phi-3.5-mini largely ignored the sandbagging instruction, with 62-88% response identity with the honest baseline. Llama-3-8B complied substantially but implemented underperformance as a positional heuristic, collapsing its response distribution onto middle-alphabet options (E at 31.8%, F at 26.1%) regardless of where the correct answer fell. This produced accuracy boosts of up to 33 percentage points when the correct answer coincidentally occupied the model's preferred position. An explicit anti-task instruction ("pick the least likely answer") drove two of three models below chance, with accuracy as low as 0.024. The capability for answer-aware avoidance therefore exists but is not activated by "deliberately underperform." BCB did not fail as a logical marker of answer-aware avoidance. It was not observed in this regime because the model showing the largest behavioural shift exhibited behaviour consistent with a position-dominant response policy rather than content-aware answer avoidance. We propose that positional-distribution shift may be a more effective behavioural signature than below-chance accuracy for detecting prompted underperformance at this model scale.
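
The below-chance logic borrowed from symptom validity testing comes down to a one-sided binomial test: is the number of correct answers lower than guessing alone would plausibly produce? The sketch below works through that arithmetic. The 500 items per cell come from the abstract; the chance rate of 0.1 assumes ten answer options per item, and applying the 0.024 accuracy figure to a 500-item cell (12 correct) is also an assumption made for illustration.

```python
from math import comb


def below_chance_p_value(correct: int, trials: int, chance: float) -> float:
    """One-sided p-value: P(X <= correct) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct + 1)
    )


# Assumed: ten options per item, so random guessing is right 10% of the time.
CHANCE = 0.1

# 12 of 500 correct corresponds to an accuracy of 0.024.
p_low = below_chance_p_value(correct=12, trials=500, chance=CHANCE)

# 50 of 500 correct is exactly what guessing would give on average.
p_at_chance = below_chance_p_value(correct=50, trials=500, chance=CHANCE)

print(f"p (12/500 correct): {p_low:.3g}")
print(f"p (50/500 correct): {p_at_chance:.3g}")
```

A tiny p-value means the model must have known the right answers in order to avoid them so reliably, which is why below-chance accuracy is a marker of answer-aware avoidance while chance-level accuracy is not.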

47. 【2604.25235】VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

Link: https://arxiv.org/abs/2604.25235

Authors: Divake Kumar,Sina Tayebati,Devashri Naik,Ranganath Krishnan,Amit Ranjan Trivedi

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: Vision-language models, provide no indication, Vision-language, multimodal systems, automated judges

Abstract:Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: this https URL
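
For readers unfamiliar with conformal prediction, here is a minimal split-conformal sketch that turns a point score into an interval. It is a swapped-in textbook variant based on absolute residuals against human scores, not the paper's method, which works from score-token log-probabilities. The 1-5 scale, the calibration data, and the helper names are made up for this example.

```python
import math


def conformal_quantile(residuals: list[float], alpha: float) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(residuals)[rank - 1]


def predict_interval(score: float, q: float,
                     lo: float = 1.0, hi: float = 5.0) -> tuple[float, float]:
    """Interval [score - q, score + q], clipped to the scoring scale."""
    return max(lo, score - q), min(hi, score + q)


# Made-up calibration set: judge scores and human reference scores.
calib_judge = [4.0, 3.0, 5.0, 2.0, 4.0, 3.0, 1.0, 4.0, 2.0]
calib_human = [4.5, 3.0, 4.0, 2.5, 3.0, 3.5, 2.0, 4.0, 2.0]

residuals = [abs(j - h) for j, h in zip(calib_judge, calib_human)]
q = conformal_quantile(residuals, alpha=0.2)  # target 80% coverage
interval = predict_interval(3.0, q)
print(q, interval)
```

An interval half-width of 1.0 on a 1-5 scale covers half of the score range. This is the kind of width the abstract reports as a share of the range (about 40% for aesthetics, about 70% for chart and mathematical reasoning), and it can be wide even when the judge ranks responses correctly.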

48. 【2604.25231】DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams

Link: https://arxiv.org/abs/2604.25231

Authors: Anirudh Iyengar Kaniyar Narayana Iyengar,Tampu Ravi Kumar,Gaurav Najpande,Manan Suri,Dinesh Manocha,Puneet Mathur,Vivek Gupta

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: interpret structured visual, structured visual representations, circuit schematics, DQA, interpret structured

Comments: 22 Pages, 14 Figures

Abstract:Diagram question answering (DQA) requires models to interpret structured visual representations such as charts, maps, infographics, circuit schematics, and scientific diagrams. Recent vision-language models (VLMs) often achieve high answer accuracy on these tasks, yet correct answers do not guarantee that models ground their reasoning in the diagram regions that support the prediction. Models may instead rely on textual correlations or dataset artifacts without identifying the visual evidence required to verify the answer. This limitation prevents reliable evaluation of diagram reasoning and reduces interpretability. We introduce DRAGON, a benchmark for evaluating evidence-grounded visual reasoning in diagrams. Given a diagram, a question, and the correct answer, a model must predict bounding boxes that correspond to the visual elements required to justify the answer. These evidence regions may include answer-bearing components, textual labels, legends, axes, connectors, and other supporting structures involved in the reasoning process. The DRAGON dataset contains 11,664 annotated question instances collected from six diagram QA datasets: ChartQA, Circuit-VQA, InfographicsVQA, MapIQ, MapWise, and AI2D. We release a 2,445-instance benchmark test set with human-verified reasoning evidence annotations and a standardized evaluation framework. We evaluate eight recent VLMs and analyze their ability to localize reasoning evidence across diverse diagram domains. DRAGON enables systematic evaluation of diagram reasoning and supports future research on models that ground their predictions in visual evidence.
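
Since the task asks models to predict bounding boxes for evidence regions, one common way to compare a predicted box with an annotated one is intersection over union (IoU). The abstract does not name the metric DRAGON uses, so IoU is an assumption here, as are the box format and the example coordinates.

```python
Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (10, 10, 50, 50)   # hypothetical model output
annotated = (30, 30, 70, 70)   # hypothetical human-verified evidence region
print(round(iou(predicted, annotated), 4))
```

Here the overlap is a 20 by 20 square (400 px²) and the union is 2,800 px², giving an IoU of 1/7. A benchmark would typically count a prediction as correct only above some threshold, but that threshold is not stated in the abstract.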

49. 【2604.25203】BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate

Link: https://arxiv.org/abs/2604.25203

Authors: Arnon Mazza,Elad Levi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: capture task-specific requirements, high inference costs, inconsistent boundary-case performance, policies remains challenging, safety models fail

Abstract:Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models finetuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.

50. 【2604.25182】CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.25182

Authors: Rui Qi,Fengran Mo,Sijin Lu,Yufeng Chen,Jian-Yun Nie,Kaiyu Huang

Categories: Computation and Language (cs.CL)

Keywords: Retrieval-Augmented Generation, Relative Policy Optimization, Group Relative Policy, supplement and correct, correct the facts

Comments: Accepted to SIGIR 2026 (Short Paper)

Abstract:A multilingual collection may contain useful knowledge in other languages to supplement and correct the facts in the original language for Retrieval-Augmented Generation (RAG). However, the vanilla approach that simply concatenates multiple pieces of knowledge from different languages into the context may fail to improve effectiveness due to the potential disparities across languages. To better leverage multilingual knowledge, we propose CroSearch-R1, a search-augmented reinforcement learning framework to integrate multilingual knowledge into the Group Relative Policy Optimization (GRPO) process. In particular, the approach adopts a multi-turn retrieval strategy with cross-lingual knowledge integration to dynamically align the knowledge from other languages as supplementary evidence into a unified representation space. Furthermore, we introduce a multilingual rollout mechanism to optimize reasoning transferability across languages. Experimental results demonstrate that our framework effectively leverages cross-lingual complementarity and improves the effectiveness of RAG with multilingual collections.

51. 【2604.25152】MGTEVAL: An Interactive Platform for Systematic Evaluation of Machine-Generated Text Detectors

Link: https://arxiv.org/abs/2604.25152

Authors: Yuanfan Li,Qi Zhou,Chengzhengxu Li,Zhaohan Zhang,Chenxu Zhao,Zepu Ruan,Chao Shen,Xiaoming Liu

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: present MGTEVAL, systematic evaluation, Machine-Generated Text, MGT, Dataset Building

Abstract:We present MGTEVAL, an extensible platform for systematic evaluation of Machine-Generated Text (MGT) detectors. Despite rapid progress in MGT detection, existing evaluations are often fragmented across datasets, preprocessing, attacks, and metrics, making results hard to compare and reproduce. MGTEVAL organizes the workflow into four components: Dataset Building, Dataset Attack, Detector Training, and Performance Evaluation. It supports constructing custom benchmarks by generating MGT with configurable LLMs, applying 12 text attacks to test sets, training detectors via a unified interface, and reporting effectiveness, robustness, and efficiency. The platform provides both command-line and Web-based interfaces for user-friendly experimentation without code rewriting.

52. 【2604.25136】Frictive Policy Optimization for LLMs: Epistemic Intervention, Risk-Sensitive Control, and Reflective Alignment

Link: https://arxiv.org/abs/2604.25136

Authors: James Pustejovsky,Nikhil Krishnaswamy

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Frictive Policy Optimization, Policy Optimization, language model policies, propose Frictive Policy, learning language model

Comments: Frictive Policy Optimization; epistemic alignment; risk-sensitive control; LLM alignment; clarification and refusal; preference learning; trust regions; dialogue agents

Abstract:We propose Frictive Policy Optimization (FPO), a framework for learning language model policies that regulate not only what to say, but when and how to intervene in order to manage epistemic and normative risk. Unlike standard alignment methods that optimize surface-level preference or task utility, FPO treats clarification, verification, challenge, redirection, and refusal as explicit control actions whose purpose is to shape the evolution of belief, commitment, and uncertainty over time. We formalize alignment as a risk-sensitive epistemic control problem in which intervention decisions are selected based on their expected effect on downstream epistemic quality rather than on immediate reward alone. We introduce a compact taxonomy of frictive interventions, a structured friction functional that operationalizes multiple alignment failure modes, and a unified family of FPO methods spanning reward shaping, preference pairing, group-relative ranking, and risk-conditioned trust regions. We further propose an evaluation framework that measures epistemic competence directly through clarification behavior, calibration, contradiction repair, refusal proportionality, and information efficiency. Together, these results provide a formal and algorithmic foundation for learning agents that are aligned not only in outcome, but in epistemic conduct.

53. 【2604.25135】FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments

Link: https://arxiv.org/abs/2604.25135

Authors: Amir Saeidi, Venkatesh Mishra, Souradeep Mukhopadhyay, Gaowen Liu, Ali Payani, Jayanth Srinivasa, Chitta Baral

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, autonomous agents capable, external environments

Comments: Accepted to ACL 2026 Findings

Abstract:Large Language Models are being increasingly deployed as the decision-making core of autonomous agents capable of effecting change in external environments. Yet, in conversational benchmarks, which simulate real-world customer-centric issue resolution scenarios, these agents frequently fail due to the cascading effects of incorrect decision-making. These challenges are particularly pronounced for open-source LLMs with smaller parameter sizes, limited context windows, and constrained inference budgets, which contribute to increased error accumulation in agentic settings. To tackle these challenges, we present the Failure-Aware Meta-Agentic (FAMA) framework. FAMA operates in two stages: first, it analyzes failure trajectories from baseline agents to identify the most prevalent errors; second, it employs an orchestration mechanism that activates a minimal subset of specialized agents tailored to address these failures by injecting a targeted context for the tool-use agent before the decision-making step. Experiments across open-source LLMs demonstrate performance gains up to 27% across evaluation modes over standard baselines. These results highlight that targeted curation of context through specialized agents to address common failures is a valuable design principle for building reliable, multi-turn tool-use LLM agents that simulate real-world conversational scenarios.

54. 【2604.25133】Korean aegyo speech shows systematic F1 increase to signal childlike qualities

Link: https://arxiv.org/abs/2604.25133

Authors: Ji-eun Kim, Volker Dellwo

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: twelve Seoul Korean, socially recognized childlike, recognized childlike speaking, Seoul Korean speakers, Seoul Korean

Comments: 18 pages, 2 figures, under review

Abstract:Korean aegyo is a socially recognized childlike speaking style used predominantly in romantic interactions among adults. This study examined vowel space modification in aegyo by analyzing formant frequencies from twelve Seoul Korean speakers who produced identical scripts in aegyo and non-aegyo styles. Results show that aegyo speech features a significant increase in F1 values across vowels and selective fronting of front vowels, leading to vowel space expansion but mainly a shift to higher F1. These findings suggest that adult speakers stylize childlike speech by imitating the shorter vocal tract of children, mainly through global vowel lowering and partial fronting.

55. 【2604.25132】What Makes Good Instruction-Tuning Data? An In-Context Learning Perspective

Link: https://arxiv.org/abs/2604.25132

Authors: Guangzeng Han, Xiaolei Huang

Categories: Computation and Language (cs.CL)

Keywords: Instruction-tuning datasets, necessitating effective data, substantial redundancy, redundancy and low-quality, in-context influence

Comments: ACL 2026, main conference

Abstract:Instruction-tuning datasets often contain substantial redundancy and low-quality samples, necessitating effective data selection methods. We propose an instruction data selection framework based on weighted in-context influence (wICI), which measures how effectively each candidate example reduces instruction-following difficulty for semantically related peers. Through systematic experiments, we address three key questions: what constitutes effective instruction tuning data from an in-context perspective, whether sample difficulty correlates with in-context influence, and how in-context influence translates to instruction tuning effectiveness. Experiments across multiple models and benchmarks demonstrate that our method consistently outperforms existing baselines under constrained data budgets, while empirically showing that sample difficulty negatively correlates with in-context influence.

56. 【2604.25130】LongSumEval: Question-Answering Based Evaluation and Feedback-Driven Refinement for Long Document Summarization

Link: https://arxiv.org/abs/2604.25130

Authors: Huyen Nguyen, Haoxuan Zhang, Yang Zhang, Haihua Chen, Junhua Ding

Categories: Computation and Language (cs.CL)

Keywords: Evaluating long document, long document summaries, document summaries remains, Evaluating long, summarization research

Comments: 13 pages, 3 figures

Abstract:Evaluating long document summaries remains the primary bottleneck in summarization research. Existing metrics correlate weakly with human judgments and produce aggregate scores without explaining deficiencies or guiding improvement, preventing effective refinement in applications requiring verifiable accuracy. We introduce LongSumEval, a unified framework bridging evaluation and generation through structured question-answering feedback. The framework operationalizes summary quality as answerability and factual alignment of question-answer pairs, generating interpretable scores and actionable feedback that identifies coverage gaps and factual inconsistencies. This resolves the misalignment where evaluation operates independently of generation objectives. Meta-evaluation of our QA-based evaluation module across seven benchmarks demonstrates substantially stronger agreement with human judgments compared to established metrics. Structured feedback enables significant quality improvements through self-refinement without retraining. By demonstrating that evaluation feedback can serve as executable instructions for generation, this work establishes a generalizable paradigm for aligning assessment with improvement, with direct implications for controllable text generation requiring verifiable accuracy and transparent quality control. All code and datasets will be released in GitHub for reproducibility.

57. 【2604.25120】Diagnosis, Bad Planning Reasoning. Treatment, SCOPE -- Planning for Hybrid Querying over Clinical Trial Data

Link: https://arxiv.org/abs/2604.25120

Authors: Suparno Roy Chowdhury, Manan Roy Choudhury, Tejas Anvekar, Muhammad Ali Khan, Kaneez Zahra Rubab Khakwani, Mohamad Bassam Sonbol, Irbaz Bin Riaz, Vivek Gupta

Categories: Computation and Language (cs.CL)

Keywords: lightweight domain reasoning, study clinical trial, directly stored, stored in visible, visible cells

Abstract:We study clinical trial table reasoning, where answers are not directly stored in visible cells but must be reasoned from semantic understanding through normalization, classification, extraction, or lightweight domain reasoning. Motivated by the observation that current LLM approaches often suffer from "bad reasoning" under implicit planning assumptions, we focus on settings in which the model must recover implicit attributes such as therapy type, added agents, endpoint roles, or follow-up status from partially observed clinical-trial tables. We propose SCOPE (Structured Clinical hybrid Planning for Evidence retrieval in clinical trials), a multi-LLM planner-based framework that decomposes the task into row selection, structured planning, and execution. The planner makes the source field, reasoning rules, and output constraints explicit before answer generation, reducing ambiguity relative to direct prompting. We evaluate SCOPE on 1,500 hybrid reasoning questions over oncology clinical-trial tables against zero-shot, few-shot, chain-of-thought, TableGPT2, Blend-SQL, and EHRAgent. Results show that explicit multi-LLM planning improves accuracy for reasoning-based questions while offering a stronger accuracy-efficiency tradeoff than heavier agentic baselines. Our findings position clinical trial reasoning as a distinct table understanding problem and highlight hybrid planner-based decomposition as an effective solution.

58. 【2604.25098】Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

Link: https://arxiv.org/abs/2604.25098

Authors: Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, current Large Language, Large Language, Language Models, test-time compute scaling

Abstract:While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing performance. However, specific to reasoning LLMs, prior work has shown that structured pruning (methods that remove entire sets of layer blocks) significantly degrades TTS reasoning performance. In this work, we revisit this assumption and instead investigate whether unstructured pruning (methods that carefully remove only certain redundant/detrimental weights) exhibits similar limitations. Surprisingly, our extensive experiments across four reasoning benchmarks on two reasoning LLMs, s1.1-7B and Qwen3-8B, consistently show that unstructured pruning augments TTS performance compared to structured pruning, and at times can even outperform the unpruned full-weight LLMs. Furthermore, we also empirically study the impact of different layer-wise sparsity allocation strategies, which are an important parametric choice for instantiating unstructured pruning methods. These findings challenge the conventional notion that pruning always reduces TTS performance and in fact suggest that carefully undertaken pruning can improve TTS effectiveness even further.
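
To make the structured/unstructured distinction concrete, here is a minimal sketch of magnitude-based unstructured pruning, which zeroes individual small weights rather than dropping whole blocks. Magnitude pruning is a stand-in chosen for illustration, not necessarily one of the methods the paper evaluates. The function name, the toy layer, and the 50% sparsity are all assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of weights in a 2-D list."""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of individual weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    pruned, removed = [], 0
    for row in weights:
        new_row = []
        for w in row:
            # Cap removals at k so ties at the threshold do not over-prune.
            if abs(w) <= threshold and removed < k:
                new_row.append(0.0)
                removed += 1
            else:
                new_row.append(w)
        pruned.append(new_row)
    return pruned

# Hypothetical 2x3 layer; a per-layer `sparsity` is the allocation knob.
layer = [[0.9, -0.05, 0.3], [0.01, -0.7, 0.2]]
pruned = magnitude_prune(layer, sparsity=0.5)
```

The layer keeps its shape and only scattered entries become zero. A layer-wise sparsity allocation strategy amounts to choosing a different `sparsity` value for each layer.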

59. 【2604.25096】The Dynamics of Delusion: Modeling Bidirectional False Belief Amplification in Human-Chatbot Dialogue

Link: https://arxiv.org/abs/2604.25096

Authors: Ashish Mehta, Jared Moore, Jacy Reese Anthis, William Agnew, Eric Lin, Peggy Yin, Desmond C. Ong, Nick Haber, Carol Dweck

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: fuel delusional beliefs, growing concern, chatbots, humans, fuel delusional

Abstract:There is growing concern that AI chatbots might fuel delusional beliefs in users. Some have suggested that humans and chatbots mutually reinforce false beliefs over time, but quantitative evidence is lacking. Using a unique dataset of chat logs from individuals who exhibited delusional thinking, we developed a latent state model that captures accumulating and decaying influences between humans and chatbots. We find that a bidirectional influence model substantially outperforms a unidirectional alternative where humans are the primary driver of delusion. We find that humans exert strong but short-lived influence on chatbots, whereas chatbots exert longer-lasting influence on humans. Moreover, chatbots exert strong, stable self-influence over their own future outputs that tends to perpetuate delusions over long stretches of conversation. In fact, this chatbot self-influence constituted the dominant pathway when considering accumulated influence over time. Overall, these results indicate that humans tend to drive sharp, immediate increases in delusion, whereas chatbots sustain and propagate these effects over longer timescales. Together, these findings provide the first quantitative evidence that human-chatbot interactions can form feedback loops of delusion, decomposable into distinct pathways with dissociable temporal dynamics. By doing so, they can inform the development of safer AI systems.

60. 【2604.25088】Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest

Link: https://arxiv.org/abs/2604.25088

Authors: Abigail O'Neill, Alan Zhu, Mihran Miroyan, Narges Norouzi, Joseph E. Gonzalez

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Language Model, long-term competitive goals, remain largely untested, leverage short-term cooperation, based agents remain

Abstract:Language Model (LM)-based agents remain largely untested in mixed-motive settings where agents must leverage short-term cooperation for long-term competitive goals (e.g., multi-party politics). We introduce Cooperate to Compete (C2C), a multi-agent environment where players can engage in private negotiations while competing to be the first to achieve their secret objective. Players have asymmetric objectives and negotiations are non-binding, allowing alliances to form and break as players' short-term interests align and diverge. We run AI-only games and conduct a user study pitting human players against AI opponents. We identify significant differences between human and AI negotiation behaviors, finding that humans favor lower-complexity deals and are significantly less reliable partners compared to LM-based agents. We also find that humans are more aggressive negotiators, accepting deals without a counteroffer only 56.3% of the time compared to 67.6% for LM-based agents. Through targeted prompting inspired by these findings, we modify agents' negotiation behavior and improve win rates from 22.2% to 32.7%. We run over 1,100 games with over 16,000 private conversations totaling 15.2 million tokens and over 150,000 player actions. Our results establish C2C as a testbed for studying and building LM-based agents that can navigate the sophisticated coordination required for real-world deployments. The game, code, and dataset may be found at this https URL.

61. 【2604.25053】Analyzing LLM Reasoning to Uncover Mental Health Stigma

Link: https://arxiv.org/abs/2604.25053

Authors: Sreehari Sankar, Aliakbar Nafar, Mona Barman, Hannah K. Heitz, Ashwin Kumar, Pouria Tohidi, Dailun Li, Danish Hussain, Russell DuBois, Hamed Hasheminia, Farshad Majzoubi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: recent studies reveal, recent studies, increasingly being explored, studies reveal, mental health applications

Abstract:While large language models (LLMs) are increasingly being explored for mental health applications, recent studies reveal that they can exhibit stigma toward individuals with psychological conditions. Existing evaluations of this stigma primarily rely on multiple-choice questions (MCQs), which fail to capture the biases embedded within the models' underlying logic. In this paper, we analyze the intermediate reasoning steps of LLMs to uncover hidden stigmatizing language and the internal rationales driving it. We leverage clinical expertise to categorize common patterns of stigmatizing language directed at individuals with psychological conditions and use this framework to identify and tag problematic statements in LLM reasoning. Furthermore, we rate the severity of these statements, distinguishing between overt prejudice and more subtle, less immediately harmful biases. To broaden the reasoning domain and capture a wider array of patterns, we also extend an existing mental health stigma benchmark by incorporating additional psychological conditions. Our findings demonstrate that evaluating model reasoning not only exposes substantially more stigma than traditional MCQ-based methods but also helps identify the flaws in the LLMs' logic and their understanding of mental health conditions.

62. 【2604.25040】Leverage Laws: A Per-Task Framework for Human-Agent Collaboration

Link: https://arxiv.org/abs/2604.25040

Authors: Stan Loosmore

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: resolve mid-run interrupts, human time required, human-agent collaboration, resolve mid-run, mid-run interrupts

Comments: 10 pages, 2 figures

Abstract:We propose a per-task leverage ratio for human-agent collaboration: human work displaced by an agent, divided by the human time required to specify the task, resolve mid-run interrupts, and review the result. The denominator decomposes into three channels through which a conserved per-task information requirement must flow, each with its own time-cost scalar. We show that information density itself is directional and bounded by separate ceilings on human-to-agent and agent-to-human flow, and that the asymptotic behavior of leverage decomposes into two scaling axes (capability and memory) with a non-zero floor on the planning term set by irreducible task novelty bounded by human throughput. We extend this per-task analysis to a windowed leverage measure that accommodates recurring tasks, spawned subtasks, and amortized system-design investment. The per-task ceiling does not bind the windowed measure, though both remain bounded: $L_{\text{task}}$ by per-task novelty, $L_{\text{window}}$ by the stock of accumulated planning investment that pays out within the window. The framework operationalizes aspects of earlier qualitative work on supervisory control (Sheridan, 1992), common ground (Clark & Brennan, 1991), and mixed-initiative interaction (Horvitz, 1999) within a single normative ratio, and produces a list of testable empirical questions that we leave as open problems.
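
The per-task ratio described in the abstract can be written down directly. This is a minimal sketch assuming all four quantities are measured in hours. The function name, argument names, and example numbers are hypothetical, and the paper's per-channel time-cost scalars are not modeled.

```python
def task_leverage(displaced_hours, spec_hours, interrupt_hours, review_hours):
    """Human work displaced, divided by human time spent specifying, unblocking, and reviewing."""
    human_cost = spec_hours + interrupt_hours + review_hours
    if human_cost <= 0:
        raise ValueError("human time in the denominator must be positive")
    return displaced_hours / human_cost

# Hypothetical task: the agent displaces a full working day of effort.
ratio = task_leverage(
    displaced_hours=8.0,
    spec_hours=0.5,
    interrupt_hours=0.25,
    review_hours=1.25,
)
```

With these made-up numbers the human spends 2 hours to displace 8, a leverage of 4. Shrinking any of the three denominator terms raises the ratio.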

63. 【2604.25039】Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

Link: https://arxiv.org/abs/2604.25039

Authors: Sagnik Chatterjee, Atharva Patil, Sricharan Ramesh

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Small Language Models, Large Language Models, Language Models, struggle with multi-step, tight compute

Abstract:Large Language Models (LLMs) solve many reasoning tasks via chain-of-thought (CoT) prompting, but smaller models (about 7 to 8B parameters) still struggle with multi-step reasoning under tight compute and token budgets. Existing test-time reasoning methods such as self-consistency (sampling multiple rationales and voting), Tree-of-Thoughts (search over intermediate thoughts), and critique-revise loops improve performance, but often at high token cost and without fine-grained step-level control. This project aims to address that gap: can Small Language Models (SLMs) reason reliably using the same or fewer tokens? This question is both scientific and practical. Scientifically, it probes whether process supervision and simple test-time controls (such as token budgets and rejection of redundant steps) can substitute for model scale or large sampling counts. Practically, many deployments (on-device, low-latency, or cost-constrained settings) cannot afford huge models or dozens of sampled rationales per query. A method that improves SLM reasoning at fixed cost would therefore be directly useful.

64. 【2604.25031】Faithful Autoformalization via Roundtrip Verification and Repair

Link: https://arxiv.org/abs/2604.25031

Authors: Daneshvar Amrollahi, Jerry Lopez, Clark Barrett

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: LLM formalizes natural, formalizes natural language, LLM formalizes, natural language, formalizes natural

Abstract:When an LLM formalizes natural language, how do we know the output is faithful? We propose a roundtrip verification approach which does not require ground-truth annotations: formalize a statement, translate the result back to natural language, re-formalize, and use a formal tool to check logical equivalence. When the two formalizations agree, this provides evidence of a faithful formalization. When they disagree, a diagnosis step identifies which translation stage failed, and a targeted repair operator attempts to correct that stage. We evaluate our approach on 150 traffic rules using Claude Opus 4.6 and GPT-5.2. Diagnosis-guided repair raises formal equivalence from 45--61% to 83--85% for both models, outperforming a random-repair baseline. An independent NLI analysis confirms that formal equivalence is correlated with less semantic drift.
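
The roundtrip loop (formalize, translate back, re-formalize, compare) can be sketched as follows. The sketch assumes formalizations are propositional formulas written as Python Boolean expressions. The lookup tables are hypothetical stand-ins for LLM calls, and the truth-table comparison stands in for whatever formal tool the authors use.

```python
from itertools import product

def equivalent(expr_a, expr_b, variables):
    """Truth-table equivalence check; a stand-in for a real formal tool."""
    safe_globals = {"__builtins__": {}}
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if eval(expr_a, safe_globals, env) != eval(expr_b, safe_globals, env):
            return False
    return True

def roundtrip_agrees(statement, formalize, informalize, variables):
    """Formalize, translate back to text, re-formalize, then compare both formulas."""
    first = formalize(statement)
    back = informalize(first)
    second = formalize(back)
    return equivalent(first, second, variables)

# Hypothetical lookup tables standing in for LLM translation calls.
FORMALIZE = {
    "If the light is red, the vehicle must stop.": "(not red) or stop",
    "A vehicle must stop whenever the light is red.": "stop or (not red)",
    "A vehicle must stop.": "stop",
}
FAITHFUL_BACK = {"(not red) or stop": "A vehicle must stop whenever the light is red."}
LOSSY_BACK = {"(not red) or stop": "A vehicle must stop."}

rule = "If the light is red, the vehicle must stop."
faithful = roundtrip_agrees(rule, FORMALIZE.get, FAITHFUL_BACK.get, ["red", "stop"])
drifted = roundtrip_agrees(rule, FORMALIZE.get, LOSSY_BACK.get, ["red", "stop"])
```

The faithful back-translation yields two logically equivalent formulas, while the lossy one drops the condition and fails the check. A failed check is the case in which a diagnosis and repair step would be triggered.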

65. 【2604.25011】Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

Link: https://arxiv.org/abs/2604.25011

Authors: Dan Shi, Zhuowen Han, Simon Ostermann, Renren Jin, Josef van Genabith, Deyi Xiong

Categories: Computation and Language (cs.CL)

Keywords: general capabilities forgetting, Reinforcement learning, large language models, supervised fine-tuning, frequently leads

Comments: ACL 2026 Main Conference

Abstract:Reinforcement learning (RL)-based post-training often improves the reasoning performance of large language models (LLMs) beyond the training domain, while supervised fine-tuning (SFT) frequently leads to general capabilities forgetting. However, the mechanisms underlying this contrast remain unclear. To bridge this gap, we present a feature-level mechanistic analysis methodology to probe RL generalization using a controlled experimental setup, where RL- and SFT-tuned models are trained from the same base model on identical data. Leveraging our interpretability framework, we align internal activations across models within a shared feature space and analyze how features evolve during post-training. We find that SFT rapidly introduces many highly specialized features that stabilize early in training, whereas RL induces more restrained and continually evolving feature changes that largely preserve base models' representations. Focusing on samples where RL succeeds but the base model fails, we identify a compact, task-agnostic set of features that directly mediate generalization across diverse tasks. Feature-level interventions confirm their causal role: disabling these features significantly degrades RL models' generalization performance, while amplifying them improves base models' performance. The code is available at this https URL.

66. 【2604.24978】Don't Stop Early: Scalable Enterprise Deep Research with Controlled Information Flow and Evidence-Aware Termination

Link: https://arxiv.org/abs/2604.24978

Authors: Prafulla Kumar Choubey, Kung-Hsiang Huang, Pranav Narayanan Venkit, Jiaxin Zhang, Vaibhav Vats, Yu Li, Xiangyu Peng, Chien-Sheng Wu

Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: Enterprise deep research, produce decision-ready reports, decision-ready reports due, uneven information coverage, scalable Enterprise Deep

Comments: ACL Industry 2026

Abstract:Enterprise deep research often fails to produce decision-ready reports due to uneven information coverage, context explosion, and premature stopping. We propose a scalable Enterprise Deep Research (EDR) architecture to address these failures. Our system (i) decomposes requests into coverage-driven objectives via outline generation with reflection, (ii) localizes context with dependency-guided execution and explicit information sharing, and (iii) enforces evidence-based completion criteria so agents iteratively collect information until sufficiency conditions are met. We evaluate on an internal sales enablement task and the public DeepResearch Bench benchmark, where our proposed system design achieves the strongest overall performance compared with competitive deep-research baselines. The results show that dependency-controlled context and explicit evidence sufficiency criteria reduce premature stopping and improve the consistency and depth of enterprise research outputs.

67. 【2604.24977】A Survey on LLM-based Conversational User Simulation

Link: https://arxiv.org/abs/2604.24977

Authors: Bo Ni, Leyao Wang, Yu Wang, Branislav Kveton, Franck Dernoncourt, Yu Xia, Hongjie Chen, Reuben Leura, Samyadeep Basu, Subhojyoti Mukherjee, Puneet Mathur, Nesreen Ahmed, Junda Wu, Li Li, Huixin Zhang, Ruiyi Zhang, Tong Yu, Sungchul Kim, Jiuxiang Gu, Zhengzhong Tu, Alexa Siu, Zichao Wang, David Seunghyun Yoon, Nedim Lipka, Namyong Park, Zihao Lin, Trung Bui, Yue Zhao, Tyler Derr, Ryan A. Rossi

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: computer science due, range of applications, long played, played a vital, vital role

Comments: Submitted in August 2025. MOD-81000 approved survey

Abstract:User simulation has long played a vital role in computer science due to its potential to support a wide range of applications. Language, as the primary medium of human communication, forms the foundation of social interaction and behavior. Consequently, simulating conversational behavior has become a key area of study. Recent advancements in large language models (LLMs) have significantly catalyzed progress in this domain by enabling high-fidelity generation of synthetic user conversation. In this paper, we survey recent advancements in LLM-based conversational user simulation. We introduce a novel taxonomy covering user granularity and simulation objectives. Additionally, we systematically analyze core techniques and evaluation methodologies. We aim to keep the research community informed of the latest advancements in conversational user simulation and to further facilitate future research by identifying open challenges and organizing existing work under a unified framework.

68. 【2604.24972】Dynamic Decision Learning: Test-Time Evolution for Abnormality Grounding in Rare Diseases

Link: https://arxiv.org/abs/2604.24972

Authors: Jun Li, Mingxuan Liu, Jiazhen Pan, Che Liu, Wenjia Bai, Cosmin I. Bercea, Julia A. Schnabel

Categories: Computation and Language (cs.CL)

Keywords: Clinical abnormality grounding, inference highly unstable, single-pass inference highly, Clinical abnormality, Dynamic Decision Learning

Abstract:Clinical abnormality grounding for rare diseases is often hindered by data scarcity, making supervised fine-tuning impractical and single-pass inference highly unstable. We propose Dynamic Decision Learning (DDL), a framework that enables frozen large vision-language models (LVLMs) to refine their decisions across both language and visual spaces by optimizing instructions and consolidating predictions under visual perturbations. This process improves localization quality and produces a consensus-based reliability score that quantifies model confidence. Results on brain imaging benchmarks, including a rare-disease dataset with 281 pathology types across models ranging from 3B to 72B parameters, show that DDL improves mAP@75 by up to 105% on rare-disease cases and outperforms adaptation baselines and supervised fine-tuning. Furthermore, DDL demonstrates stronger calibration between reliability scores and localization accuracy under severe distribution shifts and increasing task difficulty. Code is available at: this https URL

69. 【2604.24971】PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

Link: https://arxiv.org/abs/2604.24971

Authors: Ishan Patel, Ishan Joshi

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: multiple concurrent inference, inference agents share, concurrent inference agents, Fast Walsh-Hadamard Transform, asymmetrically compressed

Comments: 10 pages, 6 tables. Code: [this https URL](https://github.com/ishan1410/PolyKV). Keywords: KV cache compression, multi-agent LLM inference, asymmetric quantization, FWHT, TurboQuant, shared memory

Abstract:We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE -- a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct), three context lengths (600-7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama-3-8B with 15 agents sharing a 4K-token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB -- a 97.7% reduction -- while maintaining only +0.57% perplexity degradation and a mean BERTScore F1 of 0.928. PPL delta does not grow with agent count and improves as context length increases, inverting to -0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy-compressed KV pool with multi-reader concurrent agent access.
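
The Value-side compression pairs an FWHT rotation with a 3-bit codebook lookup. Below is a minimal sketch of those two steps in pure Python, assuming power-of-two vector lengths. The centroid values are approximate 8-level Lloyd-Max points for a unit Gaussian and should be treated as illustrative. Function names and the sample vector are hypothetical, and any scaling PolyKV applies before quantizing is not modeled.

```python
import math

# Approximate 8-level Lloyd-Max centroids for N(0,1); illustrative values only.
CENTROIDS = [-2.152, -1.344, -0.756, -0.2451, 0.2451, 0.756, 1.344, 2.152]

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def quantize(values):
    """Map each value to the index (0..7, i.e. 3 bits) of its nearest centroid."""
    return [min(range(len(CENTROIDS)), key=lambda i: abs(v - CENTROIDS[i])) for v in values]

def dequantize(codes):
    return [CENTROIDS[c] for c in codes]

# Hypothetical slice of a Value vector.
vector = [0.3, -1.2, 0.8, 0.1, -0.4, 2.0, -0.9, 0.5]
rotated = fwht(vector)
codes = quantize(rotated)             # what would be stored, 3 bits per entry
restored = fwht(dequantize(codes))    # the orthonormal FWHT is its own inverse
```

Because the orthonormal transform is self-inverse, decompression reuses the same `fwht` routine. The reconstruction is lossy, and all of the loss comes from the codebook step.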

70. 【2604.24964】Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

Link: https://arxiv.org/abs/2604.24964

Authors: Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Existing web agent, Existing web, converged on short, largely converged, approaching saturation

Comments: 29 pages

Abstract:Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real-world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with humans and provides a more fine-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We tested several leading frontier models and find that the strongest models achieve a success rate of 44.5%, which leaves substantial room for future improvements. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, marking an evident need for agents that can succeed efficiently and not simply eventually. Odysseys isolates the critical evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark to measure progress towards computer-use agents that can potentially productively operate for hours. We release our tasks, evaluation scripts, and other results at this https URL
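
The abstract defines Trajectory Efficiency as rubric score per step. A minimal sketch of that ratio follows, assuming each rubric is graded in [0, 1] and the task score is their mean. The aggregation rule, function names, and sample grades are assumptions, not the benchmark's released scoring code.

```python
def rubric_score(grades):
    """Mean of per-rubric grades, each assumed to lie in [0, 1]."""
    if not grades:
        raise ValueError("a task needs at least one rubric")
    return sum(grades) / len(grades)

def trajectory_efficiency(grades, num_steps):
    """Rubric score earned per agent step."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score(grades) / num_steps

# Hypothetical task with four graded rubrics, finished in 50 browser steps.
grades = [1.0, 1.0, 0.5, 0.0]
efficiency = trajectory_efficiency(grades, num_steps=50)
```

Two agents with the same rubric score are separated by how many steps they needed, which is the sense in which the metric rewards succeeding efficiently rather than eventually.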

71. 【2604.24955】BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

Link: https://arxiv.org/abs/2604.24955

Authors: Xinming Tu, Tianze Wang, Yingzhou (Minta) Lu, Kexin Huang, Yuanhao Qu, Sara Mostafavi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Keywords: valid alternative approaches, penalize valid alternative, apparent agent failures, rigid evaluation scripts, broken specifications

Abstract:As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the first automated auditing framework for task-oriented, execution-based agent benchmarks. BenchGuard cross-verifies all benchmark artifacts via structured LLM protocols, optionally incorporating agent solutions or execution traces as additional diagnostic evidence. Deployed on two prominent scientific benchmarks, BenchGuard identified 12 author-confirmed issues in ScienceAgentBench - including fatal errors rendering tasks unsolvable - and exactly matched 83.3% of expert-identified issues on the BIXBench Verified-50 subset, catching defects that prior human review missed entirely. A full audit of 50 complex bioinformatics tasks costs under USD 15, making automated benchmark auditing a practical and valuable complement to human review. These findings point toward AI-assisted benchmark development, where frontier models serve not only as subjects of evaluation but as active participants in validating the evaluation infrastructure itself.

72. 【2604.24942】Independent-Component-Based Encoding Models of Brain Activity During Story Comprehension

Link: https://arxiv.org/abs/2604.24942

Authors: Kamya Hari, Taha Binhuraib, Jin Li, Cory Shain, Anna A. Ivanova

Categories: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)

Keywords: traditional voxelwise approaches, voxels encoding overlapping, traditional voxelwise, Encoding models provide, provide a powerful

Abstract:Encoding models provide a powerful framework for linking continuous stimulus features to neural activity; however, traditional voxelwise approaches are limited by measurement noise, inter-subject variability, and redundancy arising from spatially correlated voxels encoding overlapping neural signals. Here, we propose an independent component (IC)-based encoding framework that dissociates stimulus-driven and noise-driven signals in fMRI data. We decompose continuous fMRI data from naturalistic story listening into ICs using one subset of the data, and train encoding models on independent data to predict IC time series from large language model representations of linguistic input. Across subjects, a subset of ICs exhibited consistently high predictivity. These ICs were spatially and temporally consistent across subjects and included cognitive networks known to respond during story listening (auditory and language). Auditory component time series were strongly correlated with acoustic stimulus features, highlighting the interpretability of identified component time series. Components identified as noise or motion-related artifacts by ICA-AROMA showed uniformly poor predictive performance, confirming that highly predicted components reflect genuine stimulus-related neural signals rather than confounds. Overall, IC-based encoding models enable analyses at the level of functional networks, accommodating the variability in network locations across individuals and providing interpretable results that are easy to compare across subjects.

73. 【2604.24940】ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models

Link: https://arxiv.org/abs/2604.24940

Authors: Orhan Demirci, Sezer Aptourachman

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: creating representational bottlenecks, natural language processing, limiting semantic expressiveness, traditional approaches represent, creating representational

Comments: 13 pages (9 pages main text + 4 pages appendix), 6 tables, 1 algorithm

Abstract:Word embeddings are fundamental to natural language processing, yet traditional approaches represent each word with a single vector, creating representational bottlenecks for polysemous words and limiting semantic expressiveness. While multi-anchor representations have shown promise by representing words as combinations of multiple vectors, they have been limited to small-scale models due to computational inefficiency and lack of integration with modern transformer architectures. We introduce Adaptive Dictionary Embeddings (ADE), a framework that successfully scales multi-anchor word representations to large language models. ADE makes three key contributions: (1) Vocabulary Projection (VP), which transforms the costly two-stage anchor lookup into a single efficient matrix operation; (2) Grouped Positional Encoding (GPE), a novel positional encoding scheme where anchors of the same word share positional information, preserving semantic coherence while enabling anchor-level variation; and (3) context-aware anchor reweighting, which leverages self-attention to dynamically compose anchor contributions based on sequence context. We integrate these components into the Segment-Aware Transformer (SAT), which provides context-aware reweighting of anchor contributions at inference time. We evaluate ADE on AG News and DBpedia-14 text classification benchmarks. With 98.7% fewer trainable parameters than DeBERTa-v3-base, ADE surpasses DeBERTa on DBpedia-14 (98.06% vs. 97.80%) and approaches it on AG News (90.64% vs. 94.50%), while compressing the embedding layer over 40x -- demonstrating that multi-anchor representations are a practical and parameter-efficient alternative to single-vector embeddings in modern transformer architectures.

74. 【2604.24938】Rethinking Layer Redundancy in Large Language Models: Calibration Objectives and Search for Depth Pruning

Link: https://arxiv.org/abs/2604.24938

Authors: Minkyu Kim, Vincent-Daniel Yun, Youngrae Kim, Youngjin Heo, Suin Cho, Seong-hun Kim, Woosang Lim, Gaeul Kwon

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: removing Transformer blocks, Depth pruning improves, Transformer blocks, Depth pruning, removing Transformer

Comments: Preprint

Abstract:Depth pruning improves the inference efficiency of large language models by removing Transformer blocks. Prior work has focused on importance criteria and search algorithms, often treating layer redundancy as an inherent structural property of pretrained networks. In contrast, we adopt a functional perspective, where redundancy is jointly influenced by the model and the evaluation objective, suggesting that a universal ranking may not be sufficient. Through an empirical study across three LLM families, two calibration objectives, and seven search algorithms, we observe that different objectives yield qualitatively different redundant layers, and that perplexity and downstream accuracy rankings do not consistently align. Under a fixed objective, however, search algorithms tend to produce similar solutions. Overall, our results suggest that the calibration objective may play a more influential role than the choice of search algorithm, indicating that further attention to objective design could be beneficial.

75. 【2604.24929】GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation

Link: https://arxiv.org/abs/2604.24929

Authors: Yunsu Kim, Kaden Uhlig, Joern Wuebker

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: remain largely English-centric, largely English-centric, machine translation, limited post-editing, built with machine

Abstract:Agent benchmarks remain largely English-centric, while their multilingual versions are often built with machine translation (MT) and limited post-editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query-answer misalignment or culturally off-target context. We propose a refined workflow for adapting English benchmarks into multiple languages with explicit functional alignment, cultural alignment, and difficulty calibration using both automated checks and human review. Using this workflow, we introduce GAIA-v2-LILT, a re-audited multilingual extension of GAIA covering five non-English languages. In experiments, our workflow improves agent success rates by up to 32.7% over minimally translated versions, bringing the closest audited setting to within 3.1% of English performance while substantial gaps remain in many other cases. This indicates that a substantial share of the multilingual performance gap is benchmark-induced measurement error, motivating task-level alignment when adapting English benchmarks across languages. The data is available as part of the MAPS package at this https URL. We also release the code used in our experiments at this https URL.

76. 【2604.24927】Large Language Models Explore by Latent Distilling

Link: https://arxiv.org/abs/2604.24927

Authors: Yuanhao Zeng, Ao Lu, Lufei Li, Zheng Zhang, Yexin Li, Kan Ren

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Generating diverse responses, surface-level lexical variation, yields surface-level lexical, limiting semantic exploration, Generating diverse

Comments: 25 pages, 5 figures

Abstract:Generating diverse responses is crucial for test-time scaling of large language models (LLMs), yet standard stochastic sampling mostly yields surface-level lexical variation, limiting semantic exploration. In this paper, we propose Exploratory Sampling (ESamp), a decoding approach that explicitly encourages semantic diversity during generation. ESamp is motivated by the well-known observation that neural networks tend to make lower-error predictions on inputs similar to those encountered before, and incur higher prediction error on novel ones. Building on this property, we train a lightweight Distiller at test time to predict deep-layer hidden representations of the LLM from its shallow-layer representations to model the LLM's depth-wise representation transitions. During decoding, the Distiller continuously adapts to the mappings induced by the current generation context. ESamp uses the prediction error as a novelty signal to reweight candidate token extensions conditioned on the current prefix, thereby biasing decoding toward less-explored semantic patterns. ESamp is implemented with an asynchronous training-inference pipeline, with less than 5% worst-case overhead (1.2% in the optimized release). Empirical results show that ESamp significantly boosts the Pass@k efficiency of reasoning models, showing superior or comparable performance to strong stochastic and heuristic baselines. Notably, ESamp achieves robust generalization across mathematics, science, and code generation benchmarks and breaks the trade-off between diversity and coherence in creative writing. Our code has been released at: this https URL.

77. 【2604.24921】Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System

Link: https://arxiv.org/abs/2604.24921

Authors: Yifei Wei, Linqing Zhong, Yi Liu, Yuxiang Lu, Xindong He, Maoqing Yao, Guanghui Ren

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: executable physical actions, high-level semantic instructions, generalist robotic manipulation, instructions into executable, executable physical

Comments: Accepted to the Main Conference of ACL 2026. Project page: [this https URL](https://libra-vla.github.io/)

Abstract:Vision-Language-Action (VLA) models are a promising paradigm for generalist robotic manipulation by grounding high-level semantic instructions into executable physical actions. However, prevailing approaches typically adopt a monolithic generation paradigm, directly mapping visual-linguistic features to high-frequency motor commands in a flat, non-hierarchical fashion. This strategy overlooks the inherent hierarchy of robotic manipulation, where complex actions can be naturally modeled in a Hybrid Action Space, decomposing into discrete macro-directional reaching and continuous micro-pose alignment, severely widening the semantic-actuation gap and imposing a heavy representational burden on grounding high-level semantics to continuous actions. To address this, we introduce Libra-VLA, a novel Coarse-to-Fine Dual-System VLA architecture. We explicitly decouple the learning complexity into a coarse-to-fine hierarchy to strike a training equilibrium, while simultaneously leveraging this structural modularity to implement an asynchronous execution strategy. The Semantic Planner predicts discrete action tokens capturing macro-directional intent, while the Action Refiner conditions on coarse intent to generate high-frequency continuous actions for precise alignment. Crucially, our empirical analysis reveals that performance follows an inverted-U curve relative to action decomposition granularity, peaking exactly when the learning difficulty is balanced between the two sub-systems. With the asynchronous design, our approach offers a scalable, robust, and responsive solution for open-world manipulation.

78. 【2604.24804】Intrinsic Mutual Information as a Modulator for Preference Optimization

Link: https://arxiv.org/abs/2604.24804

Authors: Peng Liao, Peijia Zheng, Lingbo Li, Shangsong Liang, Lin Chen

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, aligning Large Language, Language Models, Large Language, offer significant advantages

Comments: ACL Findings 2026

Abstract:Offline preference optimization methods, such as Direct Preference Optimization (DPO), offer significant advantages in aligning Large Language Models (LLMs) with human values. However, achieving optimal performance with these methods typically involves additional hyperparameter tuning, resulting in substantial time overhead. Although prior work has proposed a range of improvements, these methods remain limited in effectiveness and have not fully eliminated reliance on hyperparameter tuning. In this work, we propose RMiPO, a lightweight and efficient framework for offline preference optimization. RMiPO leverages intrinsic Response-level Mutual information for Preference Optimization with hyperparameter modulation, dynamically decoupling preference contributions at negligible additional computational cost. Extensive experimental results demonstrate that RMiPO achieves consistently superior performance over existing methods while reducing training overhead by more than 15%. Our code is available at this https URL.

79. 【2604.24770】Elderly-Contextual Data Augmentation via Speech Synthesis for Elderly ASR

Link: https://arxiv.org/abs/2604.24770

Authors: Minsik Lee, Seoi Hong, Chongmin Lee, Sieun Choi, Jian Kim, Jua Han, Jihie Kim

Categories: Computation and Language (cs.CL); Sound (cs.SD)

Keywords: remains challenging due, automatic speech recognition, limited training data, elderly ASR, ASR

Comments: 5 pages, 2 figures, under review at IEEE Signal Processing Letters

Abstract:Despite recent progress in automatic speech recognition (ASR), elderly ASR (EASR) remains challenging due to limited training data and the distinct acoustic and linguistic characteristics of elderly speech. In this work, we address data scarcity in EASR through a data augmentation pipeline that combines large language model (LLM)-based transcript paraphrasing with text-to-speech (TTS) synthesis. Given an elderly speech dataset, the LLM first generates elderly-contextual paraphrases of the original transcripts, and the TTS model then synthesizes corresponding speech using elderly reference speakers. The resulting synthetic audio-text pairs are merged with the original data to fine-tune Whisper without architectural modification. We further analyze the effects of augmentation ratio and reference-speaker composition in low-resource EASR. Experiments on English and Korean elderly speech datasets from speakers aged 70 and above show that the proposed method consistently improves performance over conventional augmentation baselines, achieving up to a 58.2% reduction in word error rate (WER) compared with the Whisper baseline.
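For readers unfamiliar with the metric, the sketch below shows the standard word error rate (word-level edit distance divided by the number of reference words) and a relative-reduction helper. Reading the reported 58.2% as a relative reduction against the Whisper baseline is an assumption; the abstract does not say whether the figure is relative or absolute. The function names are illustrative and not from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the new WER is lower than the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer
```

For example, a drop from a WER of 0.50 to 0.25 is a 50% relative reduction but only a 25-point absolute reduction, which is why the distinction matters when comparing reported gains.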

80. 【2604.23698】Benchmarking Testing in Automated Theorem Proving

Link: https://arxiv.org/abs/2604.23698

Authors: Jongyoon Kim, Hojae Han, Seung-won Hwang

Categories: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)

Keywords: correctness remains challenging, Recent advances, remains challenging, formal theorem proving, evaluating semantic correctness

Comments: ACL 2026 Industry

Abstract:Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proof, or expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T , a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.

81. 【2604.25591】Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

Link: https://arxiv.org/abs/2604.25591

Authors: Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee

Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: Recent audio-aware large, overly confident outputs, audio-aware large language, demonstrated strong capabilities, frequently produce hallucinated

Comments: Manuscript in progress

Abstract:Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in text-only LLMs, it remains largely unexplored for ALLMs, where audio-conditioned generation introduces additional challenges such as perceptual ambiguity and cross-modal grounding. In this work, we present the first systematic empirical study of uncertainty estimation in ALLMs. We benchmark five representative methods, including predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True), across multiple models and diverse evaluation settings spanning general audio understanding, reasoning, hallucination detection, and unanswerable question answering. Our results reveal two key findings. First, semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks. Second, on trustworthiness-oriented benchmarks, the relative effectiveness of uncertainty methods becomes notably more model- and benchmark-dependent, indicating that conclusions drawn from general reasoning settings do not straightforwardly transfer to hallucination and unanswerable-question scenarios. We further explore uncertainty-based adaptive inference as a potential downstream application. We hope this study provides a foundation for future research on reliable, uncertainty-aware audio-language systems.
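Two of the token-level baselines named here, predictive entropy and length-normalized entropy, are commonly estimated from sampled generations. The sketch below uses the usual Monte Carlo form (the negative mean of sequence log-probabilities, optionally divided by sequence length). The exact estimator used in the paper may differ, and the input layout, one list of token log-probabilities per sampled response, is an assumption.

```python
def predictive_entropy(samples: list[list[float]]) -> float:
    """Monte Carlo estimate: negative mean of the sampled sequences' log-probabilities."""
    if not samples:
        raise ValueError("need at least one sampled response")
    seq_logprobs = [sum(token_logprobs) for token_logprobs in samples]
    return -sum(seq_logprobs) / len(seq_logprobs)


def length_normalized_entropy(samples: list[list[float]]) -> float:
    """Same estimate, with each sequence log-probability divided by its token count."""
    if not samples or any(len(s) == 0 for s in samples):
        raise ValueError("need non-empty sampled responses")
    normalized = [sum(s) / len(s) for s in samples]
    return -sum(normalized) / len(normalized)
```

Length normalization keeps long answers from looking uncertain simply because they contain more tokens. Semantic entropy goes further by clustering samples by meaning before computing entropy, which this sketch does not attempt.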

Information Retrieval

1. 【2604.25906】Make Any Collection Navigable: Methods for Constructing and Evaluating Hypergraph of Text

Link: https://arxiv.org/abs/2604.25906

Authors: Dean E. Alvarez, ChengXiang Zhai

Categories: Information Retrieval (cs.IR)

Keywords: hyperlinks enables flexible, enables flexible navigation, web page, enables flexible, reason the Web

Abstract:One reason the Web is more useful than a simple collection of documents is that the structure created by hyperlinks enables flexible navigation from one web page to another. However, hyperlinks are typically created manually and cannot fully capture a corpus' implicit semantic structures. Is there a general way to make an arbitrary collection navigable? Recent work has formalized this problem generally as constructing a Hypergraph of Text (HoT), which provides a formal mathematical structure for supporting navigation and browsing. However, how to construct and evaluate a Hypergraph of Text remains a challenge. In this paper, we propose and study several methods for constructing a HoT. We also propose a novel quantitative metric, effort ratio, for evaluating the structural quality of a constructed HoT. Experimental results show that even simple TF-IDF baselines can match LLM-based methods on our proposed effort ratio metric.

2. 【2604.25839】Break the Inaccessible Boundary: Distilling Post-Conversion Content for User Retention Modeling

Link: https://arxiv.org/abs/2604.25839

Authors: Tianbao Ma, Ruochen Yang, Chengen Li, Yuexin Shi, Jiangxia Cao, Linxun Chen, Zhaojie Liu, Yanan Niu, Han Li, Kun Gai

Categories: Information Retrieval (cs.IR)

Keywords: measure long-term engagement, modern platforms, key metric, metric to measure, measure long-term

Comments: Work in progress

Abstract:User retention is a key metric to measure long-term engagement in modern platforms. In real-time bidding (RTB) advertising system for user re-engagement, the retention model is required to predict future revisit probability at bidding time, before the user converts and consumes any content. Although post-conversion content, termed Onboarding Content, provides highly informative signals for retention prediction, directly using it in training causes severe feature leakage and creates a gap between training and serving. To address this issue, we propose OCARM, a two-stage distillation-aligned framework for Onboarding Content Augmented Retention Modeling, enabling the model to implicitly capture future content using only observable features during inference. In the first stage, we deliberately expose onboarding content to train a hierarchical encoder that produces teacher representations. In the second stage, a user encoder is aligned with the frozen teacher through distillation, allowing the model to approximate the inaccessible onboarding signals without leakage. Extensive offline experiments and online A/B tests demonstrate that our framework achieves consistent improvements in a real-world growth scenario.

3. 【2604.25834】Action-Aware Generative Sequence Modeling for Short Video Recommendation

Link: https://arxiv.org/abs/2604.25834

Authors: Wenhao Li, Zihan Lin, Zhengxiao Guo, Jie Zhou, Shukai Liu, Yongqi Liu, Chuan Luo, Chaoyi Ma, Ruiming Tang, Han Li

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: increasingly higher expectations, rapid development, increasingly higher, higher expectations, content consumption platforms

Comments: 11 pages, 8 figures, SIGIR 2026

Abstract:With the rapid development of the Internet, users have increasingly higher expectations for the recommendation accuracy of online content consumption platforms. However, short videos often contain diverse segments, and users may not hold the same attitude toward all of them. Traditional binary-classification recommendation models, which treat a video as a single holistic entity, face limitations in accurately capturing such nuanced preferences. Considering that user consumption is a temporal process, this paper demonstrates that the timing of user actions can represent diverse intentions through statistical analysis and examination of action patterns. Based on this insight, we propose a novel modeling paradigm: Action-Aware Generative Sequence Network (A2Gen), which refines user actions along the temporal dimension and chains them into sequences for unified processing and prediction. First, we introduce the Context-aware Attention Module (CAM) to model action sequences enriched with item-specific contextual features. Building upon this, we develop the Hierarchical Sequence Encoder (HSE) to learn temporal action patterns from users' historical actions. Finally, by leveraging CAM, we design a module for action sequence generation: the Action-seq Autoregressive Generator (AAG). Extensive offline experiments on Kuaishou's dataset and the Tmall public dataset demonstrate the superiority of our proposed model. Furthermore, through large-scale online A/B testing deployed on Kuaishou's platform, our model achieves significant improvements over baseline methods in multi-task prediction by leveraging sequential information. Specifically, it yields increases of 0.34% in user watch time, 8.1% in interaction rate, and 0.162% in overall user retention (LifeTime-7), leading to successful deployment across all traffic, serving over 400 million users every day.

4. 【2604.25787】Harmonizing Generative Retrieval and Ranking in Chain-of-Recommendation

Link: https://arxiv.org/abs/2604.25787

Authors: Yu Liu, Jiangxia Cao

Categories: Information Retrieval (cs.IR)

Keywords: OneRec series works, auto-regressive semantic IDs, formulating next-item prediction, Generative recommender systems, semantic IDs generation

Comments: Work in progress

Abstract:Generative recommender systems have recently emerged as a promising paradigm by formulating next-item prediction as auto-regressive semantic ID generation, as in the OneRec series of works. However, under this next-item-agnostic prediction paradigm, the model can beam out potential next items via semantic IDs but struggles to estimate which of them are better (e.g., selecting the top 10 from 256 beam candidates), leading to a gap between generation and ranking performance. To bridge this gap, we propose RecoChain, a unified generative retrieval and ranking framework that integrates candidate generation and ranking within a single Transformer backbone. Specifically, at inference, the model first generates candidate items via hierarchical semantic ID prediction, then continues with a SIM-based ranking process to estimate the click probability of each candidate item. Extensive experiments on large-scale real-world datasets demonstrate that our approach effectively bridges the gap between generative retrieval and ranking, achieving improved Top-K recommendation performance while maintaining strong generative capability.

5. 【2604.25778】Can Code Evaluation Metrics Detect Code Plagiarism?

Link: https://arxiv.org/abs/2604.25778

Authors: Fahad Ebrahim, Mike Joy (The University of Warwick)

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Source Code Plagiarism, Code Plagiarism Detection, Source Code, Code Evaluation Metrics, Code Plagiarism

Comments: 10 pages, 5 figures, accepted at LEARNER 2026 workshop (associated with EASE 2026)

Abstract:Source Code Plagiarism Detection (SCPD) plays an important role in maintaining fairness and academic integrity in software engineering education. Code Evaluation Metrics (CEMs) are developed for assessing code generation tasks. However, it remains unclear whether such metrics can reliably detect plagiarism across different levels of modification (L1-L6), increasing in complexity. In this paper, we perform a comparative empirical study using two open-source labelled datasets, ConPlag (raw and template-free versions) and IRPlag. We evaluate five CEMs, namely CodeBLEU, CrystalBLEU, RUBY, Tree Structured Edit Distance (TSED), and CodeBERTScore. The performance is evaluated using threshold-free ranking-based measures to assess overall, per dataset, and per-level plagiarism performance. The results are compared against state-of-the-art (SOTA) Source Code Plagiarism Detection Tools (SCPDTs), JPlag and Dolos. Our findings show that without preprocessing, Dolos achieves the highest overall ranking performance, while among the individual metrics, CrystalBLEU, CodeBLEU, and RUBY outperform JPlag. Performance is strongest at L1 and drops from L4 onward, while CrystalBLEU remains competitive on L6. With preprocessing, CrystalBLEU surpasses Dolos overall. Per dataset, Dolos achieved the best ranking on the ConPlag raw dataset, while CrystalBLEU was the best-performing metric on the remaining datasets. At the plagiarism levels, Dolos remains strongest on L4, while Crystal-BLEU leads most of the remaining difficult levels. These results indicate that CEMs are comparable to dedicated tools in terms of ranking metrics.

6. 【2604.25732】Personalized Multi-Interest Modeling for Cross-Domain Recommendation to Cold-Start Users

Link: https://arxiv.org/abs/2604.25732

Authors: Xiaodong Li, Jiawei Sheng, Jiangxia Cao, Xinghua Zhang, Wenyuan Zhang, Yong Sun, Shirui Pan, Zhihong Tian, Tingwen Liu

Categories: Information Retrieval (cs.IR)

Keywords: Cross-domain recommendation, CDR, preference, CDR approaches, user cold-start issue

Abstract:Cross-domain recommendation (CDR) has been shown to be an effective solution for alleviating the user cold-start issue. By leveraging the rich user-item interactions available in an informative source domain, CDR can improve recommendation performance for cold-start users in the target domain. Previous CDR approaches mostly adhere to the Embedding and Mapping (EMCDR) paradigm, which learns a user-shared mapping function to transfer users' preferences from the source domain to the target domain, neglecting users' personalized preferences. Recent CDR approaches further leverage the meta-learning paradigm, treating the CDR task for each user independently and learning user-specific mapping functions. However, they mostly learn representations for each user individually, which ignores the preferences shared across users and thus discards valuable information for CDR. In addition, all these approaches usually summarize a user's preferences into a single overall representation, which can hardly capture the user's multi-interest preferences. To this end, we propose a personalized multi-interest modeling framework for CDR to cold-start users, termed NF-NPCDR. Specifically, we propose a personalized preference encoder that enhances the neural process (NP) with a normalizing flow (NF) to convert the Gaussian (unimodal) distribution into a multimodal distribution, providing a novel way to capture the user's personalized multi-interest preferences. Then, we propose a common preference encoder with a preference pool to capture the preferences shared by different users. Furthermore, we introduce a stochastic adaptive decoder that incorporates both personalized and common preferences for cold-start users, adaptively modulating the two for better recommendation.

7. 【2604.25707】From Citation Selection to Citation Absorption: A Measurement Framework for Generative Engine Optimization Across AI Search Platforms

Link: https://arxiv.org/abs/2604.25707

Authors: Zhang Kai, Yao Jingang

Categories: Information Retrieval (cs.IR)

Keywords: Generative Engine Optimization, engines increasingly determine, Generative search engines, search engines increasingly, Generative Engine

Comments: 26 pages, 11 figures. Public dataset and analysis pipeline: [this https URL](https://github.com/yaojingang/geo-citation-lab)

Abstract:Generative search engines increasingly determine whether online information is merely discoverable, cited as a source, or actually absorbed into generated answers. This paper proposes a two-stage measurement framework for Generative Engine Optimization (GEO): citation selection, where a platform triggers search and chooses sources, and citation absorption, where a cited page contributes language, evidence, structure, or factual support to the final answer. We analyze the public geo-citation-lab dataset covering 602 controlled prompts across ChatGPT, Google AI Overview/Gemini, and Perplexity; 21,143 valid search-layer citations; 23,745 citation-level feature records; 18,151 successfully fetched pages; and 72 extracted features. The central descriptive finding is that citation breadth and citation depth diverge. Perplexity and Google cite more sources on average, while ChatGPT cites fewer sources but shows substantially higher average citation influence among fetched pages. High-influence pages tend to be longer, more structured, semantically aligned, and richer in extractable evidence such as definitions, numerical facts, comparisons, and procedural steps. The results suggest that GEO should be measured beyond citation counts, with answer-level absorption treated as a separate outcome.

8. 【2604.25683】K-CARE: Knowledge-driven Symmetrical Contextual Anchoring and Analogical Prototype Reasoning for E-commerce Relevance

Link: https://arxiv.org/abs/2604.25683

Authors: Chen Yifei, Tian Zhixing, Wang Chenyang, Cheng Ziguang

Categories: Information Retrieval (cs.IR)

Keywords: Large Language Models, paper targets e-commerce, targets e-commerce search, paper targets, Large Language

Comments: none

Click to view abstract

Abstract:This paper targets e-commerce search relevance. While Large Language Models (LLMs) have demonstrated significant potential in this field, they often encounter performance bottlenecks in persistent 'corner cases' within complex industrial scenarios. Existing research primarily focuses on optimizing reasoning trajectories via Reinforcement Learning. However, real-world observations suggest that the primary bottleneck stems from knowledge boundaries, where the absence of domain-specific intelligence in the model's parametric memory creates a contextual void. This void persists when interpreting idiosyncratic queries or niche products and cannot be resolved solely through reasoning-path optimization. To bridge this gap, we propose K-CARE, a framework that extends the model's cognitive reach by grounding reasoning in external knowledge. K-CARE comprises two synergistic components: (1) Symmetrical Contextual Anchoring (SCA), which fills the contextual void by anchoring queries and products with behavior-derived implicit knowledge; and (2) Analogical Prototype Reasoning (APR), which leverages expert-curated prototypical knowledge to calibrate decision boundaries through in-context analogy. Extensive offline evaluations and online A/B tests on a leading e-commerce platform demonstrate that K-CARE significantly outperforms state-of-the-art baselines, delivering substantial commercial impact by resolving knowledge-intensive relevance challenges.

9. 【2604.25665】LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

Link: https://arxiv.org/abs/2604.25665

Authors: Huyen Nguyen, Haoxuan Zhang, Yang Zhang, Junhua Ding, Haihua Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: generated summaries remains, large language model, Reliable evaluation, open challenge, large language

Comments: 15 pages, 3 figures, 5 tables

Click to view abstract

Abstract:Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released on GitHub.

10. 【2604.25605】Health System Scale Semantic Search Across Unstructured Clinical Notes

Link: https://arxiv.org/abs/2604.25605

Authors: Faith Wavinya Mutinda, Spandana Makeneni, Anna Lin, Shivaji Dutta, Irit R. Rasooly, Patrick Dibussolo, Shivani Kamath Belman, Hessam Shahriari, Kevin Murphy, Alex B. Ruan, Barbara H. Chaiyachati, Sanjay Chainani, Robert W. Grundmeier, Scott M. Haag, Jeffrey M. Miller, Heather M. Griffis, Ian M. Campbell

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)

Keywords: offers substantial advantages, retrieves documents based, Semantic search, keyword matching, offers substantial

Comments: For associated code, see [this https URL](https://github.com/Ian-Campbell-Lab/clinical-semantic-search)

Click to view abstract

Abstract:Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption. Methods: We deployed a semantic search system at a large children's hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework. We evaluated the system through three experiments: optimization of embedding model and chunking strategy using a physician-authored benchmark dataset, characterization of full-scale performance (cost, latency, retrieval quality), and clinical utility assessment via comparison of chart abstraction efficiency across three tasks. Results: The system delivers sub-second query latency (median 237 ms single-user, 451 ms 20-user concurrency) with monthly costs of approximately USD 4,000. Qwen3 embeddings with 300-token chunk size achieved 94.6% accuracy on a clinical question-answering benchmark. In clinical utility evaluation across three abstraction tasks, semantic search reduced time-to-completion by 24 to 89% compared to clinician-performed chart review while maintaining comparable inter-rater agreement. Conclusion: Health-system-scale semantic search is both technically and operationally feasible. The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise.

11. 【2604.25577】The Attention Market: Interpreting Online Fair Re-ranking as Manifold Optimization under Walrasian Equilibrium

Link: https://arxiv.org/abs/2604.25577

Authors: Chen Xu, Wei Chu, Wenyu Hu, Fengran Mo, Jun Xu, Maarten de Rijke

Categories: Information Retrieval (cs.IR)

Keywords: promote long-tail items, Fair re-ranking aims, Fair re-ranking, online fair re-ranking, information retrieval

Comments: Accepted in SIGIR'26

Click to view abstract

Abstract:Fair re-ranking aims to promote long-tail items and enhance diversity within groups in information retrieval. While previous research on online fairness-aware re-ranking has shown promising outcomes, our comprehensive evaluation of online fair re-ranking methods over 20 settings reveals significant performance disparities among existing methods. To uncover the root causes of these inconsistencies, we reformulate fair re-ranking within an attentional market framework governed by a Walrasian Equilibrium, where the fairness is treated as a taxation cost. This market-based formulation is then coupled with manifold optimization, demonstrating that seeking this equilibrium is equivalent to performing gradient descent on a specific ranking manifold constructed by the market. Different re-ranking settings induce distinct manifold geometries, and these intrinsic geometric differences dictate the gradient landscapes and optimization trajectories. We propose ManifoldRank, an efficient online fair re-ranking algorithm. ManifoldRank adjusts gradients to align with the ranking manifold, considering various contextual settings. On the supply side, it incorporates a gradient adjustment based on different fairness requirements, accounting for associated costs. On the demand side, it empirically predicts an additional gradient adjustment term derived from the ranking scores. By integrating these two gradient adjustments, ManifoldRank effectively balances fairness and accuracy. Experimental results across multiple datasets confirm ManifoldRank's effectiveness.

12. 【2604.25487】A contemporary science map through the lens of IEEE and ACM periodicals

Link: https://arxiv.org/abs/2604.25487

Authors: George Margaritis, Dionysios Kritsas, Dimitrios Katsaros, Yannis Manolopoulos

Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: serving these disciplines, computing and electrical, electronics engineering, engineering which publish, publish and organize

Comments: none

Click to view abstract

Abstract:ACM and IEEE are the two premier associations on computing and electrical/electronics engineering which publish and organize the great majority of periodicals and conferences, respectively, serving these disciplines. Science is a constantly evolving process, and these publication fora are expected to follow the trends. In this article, we focus on the periodicals published by the two associations and seek to detect and/or confirm any contemporary science trends as these are reflected in the titles of recently established periodicals. Our study is qualitative rather than quantitative, aiming at revealing patterns immediately comprehensible and validatable by the reader. Among the most notable patterns, we see a growing preference of both associations for the open access mode of publication; we also observe ACM's orientation toward AI-focused periodicals, and most importantly, a significant thematic overlap among periodicals of the same association, which holds for both ACM and IEEE.

13. 【2604.25390】GeoSearch: Augmenting Worldwide Geolocalization with Web-Scale Reverse Image Search and Image Matching

Link: https://arxiv.org/abs/2604.25390

Authors: Tung-Duong Le-Duc, Hoang-Quoc Nguyen-Son, Minh-Son Dao

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Worldwide image geolocalization, global visual diversity, remains challenging due, Large Multimodal Models, predict the GPS

Comments: Accepted to SIGIR 2026 Main Conference

Click to view abstract

Abstract:Worldwide image geolocalization, which aims to predict the GPS coordinates of any image on Earth, remains challenging due to global visual diversity. Recent generative approaches based on Retrieval-Augmented Generation (RAG) and Large Multimodal Models (LMMs) leverage candidates retrieved from fixed databases for reasoning, but often struggle with scenes that are absent from the reference set. In this work, we propose GeoSearch, an open-world geolocation framework that integrates web-scale reverse image search into the RAG pipeline. GeoSearch augments LMM prompts with database-retrieved coordinates and textual evidence extracted from web pages. To mitigate noise from irrelevant content, we introduce a two-layer filtering mechanism consisting of image matching, followed by confidence-based gating. Experiments on standard benchmarks Im2GPS3k and YFCC4k demonstrate the superiority of GeoSearch under leakage-aware evaluation. Our code and data are publicly available to support reproducibility.

14. 【2604.25349】Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research

Link: https://arxiv.org/abs/2604.25349

Authors: Julián Urbano

Categories: Information Retrieval (cs.IR); Applications (stat.AP); Methodology (stat.ME)

Keywords: Information Retrieval systems, Information Retrieval, Retrieval systems, benchmarking of Information, Wilcoxon signed-rank test

Comments: 11 pages, 5 tables, 2 figures, ACM SIGIR 2026

Click to view abstract

Abstract:In benchmarking of Information Retrieval systems, the Wilcoxon signed-rank test is often treated as a safer alternative to the t-test. This belief is fueled by textbooks and recommendations that portray Wilcoxon as the proper non-parametric alternative because metric scores are not normally distributed. We argue that this narrative is misleading and harmful. A careful review of Statistics textbooks reveals inconsistencies and omissions in how the assumptions underlying these tests are presented, fostering confusion that has propagated into IR research. As a result, Wilcoxon has been routinely misapplied for decades, creating a false sense of safety against a threat that was never there to begin with, while introducing another one so severe that it virtually guarantees the test will break down and mislead researchers. Through a combination of systematic literature review, analysis and empirical demonstrations with TREC data, we show how and why the Wilcoxon test easily loses control of its Type I error rate in IR settings. We conclude that the continued use of Wilcoxon in IR evaluation is unjustified and that abandoning it would improve the methodological soundness of our field.

15. 【2604.25291】From Local Indices to Global Identifiers: Generative Reranking for Recommender Systems via Global Action Space

Link: https://arxiv.org/abs/2604.25291

Authors: Pengyue Jia, Xiaobei Wang, Yingyi Zhang, Shuchang Liu, Yupeng Hou, Hailan Yang, Xu Gao, Xiaopeng Li, Yejing Wang, Julian McAuley, Xiang Li, Lantao Hu, Yongqi Liu, Kaiqiao Zhan, Han Li, Kun Gai, Xiangyu Zhao

Categories: Information Retrieval (cs.IR)

Keywords: modern recommender systems, impacting user satisfaction, modeling complex intra-list, intra-list item dependencies, complex intra-list item

Comments: none

Click to view abstract

Abstract:In modern recommender systems, list-wise reranking serves as a critical phase within the multi-stage pipeline, finalizing the exposed item sequence and directly impacting user satisfaction by modeling complex intra-list item dependencies. Existing methods typically formulate this task as selecting indices from the local input list. However, this approach suffers from a semantically inconsistent action space: the same output neuron (logits) represents different items across different samples, preventing the model from establishing a stable, intrinsic understanding of the items. To address this, we propose GloRank (Global Action Space Ranker), a generative framework that shifts reranking from selecting local indices to generating global identifiers. Specifically, we represent items as sequences of discrete tokens and reformulate reranking as a token generation task. This design effectively decouples the scoring mechanism from the variable input order, ensuring that items are evaluated against a consistent global standard. We further enhance this with a two-stage optimization pipeline: a supervised pre-training phase to initialize the model with high-quality demonstrations, followed by a reinforcement learning-based post-training phase to directly maximize list-wise utility. Extensive experiments on two public benchmarks and a large-scale industrial dataset, coupled with online A/B tests, demonstrate that GloRank consistently outperforms state-of-the-art baselines and achieves superior robustness in cold-start scenarios.

16. 【2604.25142】UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval

Link: https://arxiv.org/abs/2604.25142

Authors: Jongyoon Kim, Minseong Hwang, Seung-won Hwang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Unsupervised domain adaptation, generalizes neural retrievers, Unsupervised domain, adaptation generalizes neural, generating pseudo queries

Comments: ACL 2026 (Findings)

Click to view abstract

Abstract:Unsupervised domain adaptation generalizes neural retrievers to an unseen domain by generating pseudo queries on target domain documents. The quality and efficiency of this adaptation critically depend on which documents are selected for pseudo query generation. The existing document sampling method focuses on diversity but fails to capture model uncertainty. In contrast, we propose **Un**certainty-based **Ite**rative Document Sampling (UnIte) addressing these limitations by (1) filtering documents with high aleatoric uncertainty and (2) prioritizing those with high epistemic uncertainty, maximizing the learning utility of the current model. We conducted extensive experiments on a large corpus of BEIR with small and large models, showing significant gains of +2.45 and +3.49 nDCG@10 with a smaller training sample size, 4k on average.
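
The gains above are reported in nDCG@10. As a reminder of what that metric measures, here is a minimal sketch, assuming the linear-gain form of DCG with a log2 position discount and assuming the ranked list already contains every relevant document (so the ideal ordering can be obtained by sorting it). The function names are illustrative and are not taken from the paper or from BEIR.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking.

    Assumption: `ranked_relevances` holds the graded relevance of every
    judged document, so sorting it yields the ideal ordering.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(ranked_relevances, k) / ideal


# Toy rankings, invented for illustration only.
perfect = ndcg_at_k([3, 2, 1, 0])
swapped = ndcg_at_k([0, 1])
print(perfect, round(swapped, 4))
```

A ranking that places the only relevant document second instead of first scores about 0.63 under this definition, which shows how strongly the metric rewards getting the top positions right.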

17. 【2604.25057】CiteRadar: A Citation Intelligence Platform for Researcher Profiling and Geographic Visualization

Link: https://arxiv.org/abs/2604.25057

Authors: Chenxu Niu, Yiming Sun

Categories: Machine Learning (cs.LG); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: questions remain scarce, Understanding the geographic, grant applications, career development, collaboration discovery

Comments: none

Click to view abstract

Abstract:Understanding the geographic reach and community structure of one's scholarly citations is increasingly valuable for career development, grant applications, and collaboration discovery -- yet accessible tools for answering these questions remain scarce. Existing bibliometric platforms either require costly institutional subscriptions or expose only aggregate citation counts without granular per-author metadata. We present CiteRadar, an open-source system that accepts a single Google Scholar user identifier and automatically produces a structured output folder containing: the author's complete publication list, all retrieved citing papers with enriched author metadata, two ranked author tables (by citation frequency and by h-index), a plain-text statistical summary, and a self-contained interactive HTML world map -- all from a single command-line invocation. CiteRadar integrates five heterogeneous data sources -- Google Scholar, OpenAlex, CrossRef, Semantic Scholar, and OpenStreetMap Nominatim -- through a carefully engineered five-stage pipeline. Key technical contributions include: (1) a Scholar meta-string parser resilient to Unicode non-breaking-space separators, a pervasive but undocumented quirk in Scholar's HTML that silently corrupts venue and year fields when unhandled; (2) a two-stage author disambiguation system using stop-word-filtered institution name similarity to guard against the well-known same-name entity-merging failure mode in bibliometric databases, demonstrated to eliminate h-index attribution errors of up to 9x the correct value; (3) an OpenAlex web-URL to API-URL conversion fix that raises the fraction of author records with city-level location data from 0% to ~60%; and (4) a logarithmically-scaled interactive Folium world map with per-city researcher popups, rendered as a fully self-contained HTML file.

18. 【2604.25032】Offline Evaluation Measures of Fairness in Recommender Systems

Link: https://arxiv.org/abs/2604.25032

Authors: Theresia Veronika Rampisela

Categories: Information Retrieval (cs.IR)

Keywords: responsible artificial intelligence, evaluation, increasingly important, artificial intelligence, fairness evaluation measures

Comments: PhD thesis

Click to view abstract

Abstract:The evaluation of recommender system fairness has become increasingly important, especially with recent legislation that emphasises the development of fair and responsible artificial intelligence. This has led to the emergence of various fairness evaluation measures, which quantify fairness based on different definitions. However, many of such measures are simply proposed and used without further analysis on their robustness. As a result, there is insufficient understanding and awareness of the measures' limitations. Among other issues, it is not known what kind of model outputs produce the (un)fairest score, how the measure scores are empirically distributed, and whether there are cases where the measures cannot be computed (e.g., due to division by zero). These issues cause difficulty in interpreting the measure scores and confusion on which measure(s) should be used for a specific case. This thesis presents a series of papers that assess and overcome various theoretical, empirical, and conceptual limitations of existing recommender system fairness evaluation measures. We investigate a wide range of offline evaluation measures for different fairness notions, divided based on the evaluation subjects (users and items) and for different evaluation granularities (groups of subjects and individual subjects). Firstly, we perform theoretical and empirical analysis on the measures, exposing flaws that limit their interpretability, expressiveness, or applicability. Secondly, we contribute novel evaluation approaches and measures that overcome these limitations. Finally, considering the measures' limitations, we recommend guidelines for the appropriate measure usage, thereby allowing for more precise selection of fairness evaluation measures in practical scenarios. Overall, this thesis contributes to advancing the state-of-the-art offline evaluation of fairness in recommender systems.

19. 【2604.24806】Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale

Link: https://arxiv.org/abs/2604.24806

Authors: Liang Guo, Ge Song, Litao Deng, Jianhui Sun, Chufeng Hu, Lu Zhang, Zhen Ma, Shouwei Chen, Weiran Liu, Sarang Masti Sreeshylan, Xiaoxuan Meng

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB)

Keywords: User Interaction History, ultra-long User Interaction, Deep Learning Recommendation, Modern Deep Learning, Interaction History

Comments: none

Click to view abstract

Abstract:Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall where data infrastructure usage exceeds GPU training capacity due to data redundancy that is amplified in multi-tenant environments where models with vastly different sequence length requirements share a union dataset. We present a *versioned late materialization* paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains, serving as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.
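
One way to picture the late-materialization idea in the abstract is a toy, in-memory sketch: each user's history is stored once, and a training example carries only a small pointer from which the sequence is rebuilt on demand. Everything here is an assumption made for illustration. The names `HistoryStore`, `VersionedPointer`, and `materialize` are hypothetical, an integer timestamp stands in for whatever versioning scheme the authors use, and the strictly-before cutoff is just one simple way to avoid future leakage.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VersionedPointer:
    """What a training example stores instead of the full sequence (hypothetical)."""
    user_id: str
    as_of: int  # only events strictly before this time may be read


@dataclass
class HistoryStore:
    """Append-only per-user event log, stored once (hypothetical stand-in)."""
    _times: dict = field(default_factory=dict)
    _items: dict = field(default_factory=dict)

    def append(self, user_id: str, timestamp: int, item_id: str) -> None:
        times = self._times.setdefault(user_id, [])
        if times and timestamp < times[-1]:
            raise ValueError("events must be appended in time order")
        times.append(timestamp)
        self._items.setdefault(user_id, []).append(item_id)

    def materialize(self, pointer: VersionedPointer, max_len: int) -> list:
        """Rebuild the most recent `max_len` events visible at `pointer.as_of`."""
        times = self._times.get(pointer.user_id, [])
        items = self._items.get(pointer.user_id, [])
        end = bisect_left(times, pointer.as_of)  # excludes events at or after as_of
        start = max(0, end - max_len)
        return items[start:end]


store = HistoryStore()
for ts, item in [(1, "a"), (2, "b"), (3, "c"), (5, "d")]:
    store.append("u1", ts, item)

example = VersionedPointer(user_id="u1", as_of=4)
print(store.materialize(example, max_len=2))
```

Because the sequence length is chosen at read time, two models with different `max_len` values can share the same stored history, which is the redundancy argument the abstract makes against pre-materialized rows.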

Computer Vision

1. 【2604.25889】Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles

Link: https://arxiv.org/abs/2604.25889

Authors: Minh-Khoa Le-Phan, Minh-Hoang Le, Trong-Le Do, Minh-Triet Tran

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: severe lossy compression, pristine academic datasets, Current deepfake detection, detection models achieve, suffer severe spatial

Comments: 4th place (out of 94 teams) in the NTIRE 2026 Robust Deepfake Detection Challenge

Click to view abstract

Abstract:Current deepfake detection models achieve state-of-the-art performance on pristine academic datasets but suffer severe spatial attention drift under real-world compound degradations, such as blurring and severe lossy compression. To address this vulnerability, we propose a foundation-driven forensic framework that integrates an extreme compound degradation engine with a structurally constrained, multi-stream architecture. During training, our degradation pipeline systematically destroys high-frequency artifacts, optimizing the DINOv2-Giant backbone to extract invariant geometric and semantic priors. We then process images through three specialized pathways: a Global Texture stream, a Localized Facial stream, and a Hybrid Semantic Fusion stream incorporating CLIP. Through analyzing spatial attribution via Score-CAM and feature stability using Cosine Similarity, we quantitatively demonstrate that these streams extract non-redundant, complementary feature representations and stabilize attention entropy. By aggregating these predictions via a calibrated, discretized voting mechanism, our ensemble successfully suppresses background attention drift while acting as a robust geometric anchor. Our approach yields highly stable zero-shot generalization, achieving Fourth Place in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR. Code is available at this https URL.

2. 【2604.25887】No Pedestrian Left Behind: Real-Time Detection and Tracking of Vulnerable Road Users for Adaptive Traffic Signal Control

Link: https://arxiv.org/abs/2604.25887

Authors: Anas Gamal Aly, Hala ElAarag

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)

Keywords: vulnerable road users, leave vulnerable road, Current pedestrian crossing, distracted pedestrians stranded, Current pedestrian

Comments: © Anas Gamal Aly and Hala ElAarag, 2026. This is the authors' version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record will be published in Proceedings of the 2026 ACM Southeast Conference (ACMSE 2026)

Click to view abstract

Abstract:Current pedestrian crossing signals operate on fixed timing without adjustment to pedestrian behavior, which can leave vulnerable road users (VRUs) such as the elderly, disabled, or distracted pedestrians stranded when the light changes. We introduce No Pedestrian Left Behind (NPLB), a real-time adaptive traffic signal system that monitors VRUs in crosswalks and automatically extends signal timing when needed. We evaluated five state-of-the-art object detection models on the BGVP dataset, with YOLOv12 achieving the highest mean Average Precision at 50% (mAP@0.5) of 0.756. NPLB integrates our fine-tuned YOLOv12 with ByteTrack multi-object tracking and an adaptive controller that extends pedestrian phases when remaining time falls below a critical threshold. Through 10,000 Monte Carlo simulations, we demonstrate that NPLB improves VRU safety by 71.4%, reducing stranding rates from 9.10% to 2.60%, while requiring signal extensions in only 12.1% of crossing cycles.
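
The 71.4% safety improvement quoted above is consistent with a relative reduction in the stranding rate. A quick check, assuming that is how the figure was derived (the abstract does not say so explicitly); the helper name is illustrative:

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction of `after` with respect to `before`, as a fraction."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Stranding rates quoted in the abstract, in percent.
baseline_rate = 9.10
nplb_rate = 2.60

reduction = relative_reduction(baseline_rate, nplb_rate)
print(f"{reduction:.1%}")
```

The absolute drop is 6.5 percentage points; dividing by the 9.10% baseline gives the relative figure.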

3. 【2604.25855】SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

Link: https://arxiv.org/abs/2604.25855

Authors: Hector G. Rodriguez, Marcus Rohrbach

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal large language, achieve ever-stronger performance, Multimodal large, large language models, achieve ever-stronger

Comments: none

Click to view abstract

Abstract:Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. To enable reliable generalization, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all five tested OOD datasets and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation.
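
The selective-prediction setup described above (answer only when a confidence score clears a threshold, while keeping the error rate on answered inputs under a user-defined risk) can be sketched as follows. This is a generic illustration of coverage at a given risk, not the SIEVES selector itself; the function name and the toy scores are assumptions.

```python
from typing import Sequence


def coverage_at_risk(
    confidences: Sequence[float],
    correct: Sequence[bool],
    max_risk: float,
) -> float:
    """Largest share of inputs that can be answered with error rate <= max_risk.

    Inputs are answered in order of decreasing confidence, which is what a
    single confidence threshold does.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    best = 0
    errors = 0
    for k, (conf, ok) in enumerate(ranked, start=1):
        errors += 0 if ok else 1
        # A threshold cannot separate equal scores, so only evaluate
        # at the end of each group of tied confidences.
        at_boundary = k == n or ranked[k][0] < conf
        if at_boundary and errors / k <= max_risk:
            best = k
    return best / n


# Toy data, invented for illustration only.
scores = [0.95, 0.9, 0.8, 0.6, 0.4]
is_correct = [True, True, False, True, False]
print(coverage_at_risk(scores, is_correct, max_risk=0.25))
```

A selector whose scores rank correct answers above incorrect ones more reliably lets the threshold sit lower for the same risk, which is how better confidence estimation translates into higher coverage.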

4. 【2604.25819】Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

Link: https://arxiv.org/abs/2604.25819

Authors: Yupeng Zhou, Lianghua Huang, Zhifan Wu, Jiabao Wang, Yupeng Shi, Biao Jiang, Daquan Zhou, Yu Liu, Ming-Ming Cheng, Qibin Hou

Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: Mutual Forcing, propose Mutual Forcing, long-horizon audio-video synchronization, audio-video, propose Mutual

Comments: none

Click to view abstract

Abstract:In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at this https URL.

5. 【2604.25817】Magnification-Invariant Image Classification via Domain Generalization and Stable Sparse Embedding Signatures

Link: https://arxiv.org/abs/2604.25817

Authors: Ifeanyi Ezuma, Olusiji Medaiyese

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: robust histopathology classification, histopathology classification, major obstacle, obstacle to robust, robust histopathology

Notes: 12 pages, 7 figures, 3 tables. Preprint manuscript

Abstract:Magnification shift is a major obstacle to robust histopathology classification, because models trained on one imaging scale often generalize poorly to another. Here, we evaluated this problem on the BreaKHis dataset using a strict patient-disjoint leave-one-magnification-out protocol, comparing supervised baseline, baseline augmented with DCGAN-generated patches, and a gradient-reversal domain-general model designed to preserve discriminative information while suppressing magnification-specific variation. Across held-out magnifications, the domain-general model achieved the strongest overall discrimination and its clearest gain was observed when 200X was held out. By contrast, GAN augmentation produced inconsistent effects, improving some folds but degrading others, particularly at 400X. The domain-general model also yielded the lowest Brier score at 0.063 vs 0.089 at baseline. Sparse embedding analysis further revealed that domain-general training reduced average signature size more than three-fold (306 versus 1,074 dimensions) while preserving equivalent predictive performance (AUC: 0.967 vs 0.965; F1: 0.930 vs 0.931). It also increased cross-fold signature reproducibility from near-zero Jaccard overlap in the baseline to 0.99 between the 100X and 200X folds. These findings show that calibrated, compact, and transferable representations can be learned without added architectural complexity, with clear implications for the reliable deployment of computational pathology models across heterogeneous acquisition settings.
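The two summary metrics quoted in this abstract, the Brier score and the Jaccard overlap between sparse signatures, follow standard definitions. The sketch below shows how they are usually computed; the function names and the toy inputs are illustrative assumptions, not the authors' code or data.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and binary labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two sets of embedding indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy values (hypothetical): predicted malignancy probabilities and true labels.
probs = [0.9, 0.2, 0.7, 0.1]
labels = [1, 0, 1, 0]
print(brier_score(probs, labels))

# Toy signatures (hypothetical): selected embedding dimensions in two folds.
sig_100x = {3, 17, 42, 88}
sig_200x = {3, 17, 42, 91}
print(jaccard(sig_100x, sig_200x))
```

A lower Brier score indicates better-calibrated probabilities, and a Jaccard value near 1 means the two folds select almost the same embedding dimensions.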

6. 【2604.25809】Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

Link: https://arxiv.org/abs/2604.25809

Authors: Yashwant Pravinrao Bangde, Debaditya Roy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: frequently generate fluent, generate fluent outputs, exhibit strong performance, Vision-Language Models, exhibit strong

Abstract:Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior works have shown that instruction prompting further worsens this issue by amplifying language priors, especially when the visual signal is uncertain or ambiguous. To address this challenge, we propose a decoding framework that explicitly balances linguistic informativeness and visual faithfulness during generation. Our method, Instruction-Evidence Contrastive Dual-Stream Decoding (IECD2), maintains two parallel probability distributions of tokens at each decoding step: an instruction-driven stream that promotes expressive and informative responses, and an evidence-driven stream that enforces strict grounding in the image. These two streams are adaptively fused using a symmetric KL-based contrast-based gate, which suppresses tokens favored by language priors but unsupported by visual evidence, while preserving them when both distributions agree. We evaluate IECD2 on multiple datasets spanning various generative vision-language reasoning tasks such as captioning and visual question answering, including POPE, MME, VQAv2, AMBER, MS-COCO, and LLaVA-Bench. IECD2 demonstrates consistent improvements in task accuracy and reasoning performance, alongside a substantial reduction in hallucination across all evaluation metrics compared to state-of-the-art decoding approaches.
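The abstract does not give the exact form of the gate, so the following is only a sketch of one way a symmetric-KL gate could fuse two next-token distributions. The exponential gate, the `tau` parameter, and the function names are all assumptions made for illustration, not the IECD2 formulation.

```python
import math


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def fuse(p_instr, p_evid, tau=1.0):
    """Hypothetical gate: trust the evidence stream more as the streams disagree."""
    divergence = symmetric_kl(p_instr, p_evid)
    gate = 1.0 - math.exp(-divergence / tau)  # 0 when the streams agree
    mixed = [(1.0 - gate) * a + gate * b for a, b in zip(p_instr, p_evid)]
    total = sum(mixed)
    return [m / total for m in mixed], gate


# Toy distributions over three tokens (hypothetical values).
p_instr = [0.7, 0.2, 0.1]  # language prior favors token 0
p_evid = [0.1, 0.2, 0.7]   # visual evidence favors token 2
fused, gate = fuse(p_instr, p_evid)
print(gate, fused)
```

In this sketch, identical streams leave the instruction-driven distribution unchanged, while strong disagreement shifts probability mass toward the tokens supported by the evidence stream.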

7. 【2604.25795】Improving Diversity in Black-box Few-shot Knowledge Distillation

Link: https://arxiv.org/abs/2604.25795

Authors: Tri-Nhan Vo, Dang Nguyen, Kien Do, Sunil Gupta

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: sacrifice in performance, Knowledge distillation, well-known technique, technique to effectively, effectively compress

Abstract:Knowledge distillation (KD) is a well-known technique to effectively compress a large network (teacher) to a smaller network (student) with little sacrifice in performance. However, most KD methods require a large training set and internal access to the teacher, which are rarely available due to various restrictions. These challenges have originated a more practical setting known as black-box few-shot KD, where the student is trained with few images and a black-box teacher. Recent approaches typically generate additional synthetic images but lack an active strategy to promote their diversity, a crucial factor for student learning. To address these problems, we propose a novel training scheme for generative adversarial networks, where we adaptively select high-confidence images under the teacher's supervision and introduce them to the adversarial learning on-the-fly. Our approach helps expand and improve the diversity of the distillation set, significantly boosting student accuracy. Through extensive experiments, we achieve state-of-the-art results among other few-shot KD methods on seven image datasets. The code is available at this https URL.

8. 【2604.25794】Diverse Image Priors for Black-box Data-free Knowledge Distillation

Link: https://arxiv.org/abs/2604.25794

Authors: Tri-Nhan Vo, Dang Nguyen, Trung Le, Kien Do, Sunil Gupta

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex teacher networks, efficient student models, represents a vital, vital mechanism, mechanism to transfer

Abstract:Knowledge distillation (KD) represents a vital mechanism to transfer expertise from complex teacher networks to efficient student models. However, in decentralized or secure AI ecosystems, privacy regulations and proprietary interests often restrict access to the teacher's interface and original datasets. These constraints define a challenging black-box data-free KD scenario where only top-1 predictions and no training data are available. While recent approaches utilize synthetic data, they still face limitations in data diversity and distillation signals. We propose Diverse Image Priors Knowledge Distillation (DIP-KD), a framework that addresses these challenges through a three-phase collaborative pipeline: (1) Synthesis of image priors to capture diverse visual patterns and semantics; (2) Contrast to enhance the collective distinction between synthetic samples via contrastive learning; and (3) Distillation via a novel primer student that enables soft-probability KD. Our evaluation across 12 benchmarks shows that DIP-KD achieves state-of-the-art performance, with ablations confirming data diversity as critical for knowledge acquisition in restricted AI environments.

9. 【2604.25781】Sketch2Arti: Sketch-based Articulation Modeling of CAD Objects

Link: https://arxiv.org/abs/2604.25781

Authors: Yi Yang, Hao Pan, Yijing Cui, Alla Sheffer, Changjian Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: enabling interactive animation, Articulation modeling aims, interactive animation, shape editing, aims to infer

Notes: Project page: [this https URL](https://arlo-yang.github.io/Sketch2Arti)

Abstract:Articulation modeling aims to infer movable parts and their motion parameters for a 3D object, enabling interactive animation, simulation, and shape editing. In this paper, we present Sketch2Arti, the first sketch-based articulation modeling system for CAD objects. Our key observation is that designers naturally communicate articulation intent through lightweight sketches (e.g., arrows and strokes) that indicate how parts should move, yet translating such sketches into articulated 3D models remains largely manual. Sketch2Arti bridges this gap by enabling users to specify articulation through simple 2D sketches drawn from a chosen viewpoint. Given a CAD model and user sketches, our approach automatically discovers the corresponding movable parts and predicts their motion parameters, allowing iterative modeling of multiple articulations on complex objects with fine-grained control. Importantly, Sketch2Arti is trained in a category-agnostic manner without requiring object category information, leading to strong generalization to diverse objects beyond existing articulation datasets. Moreover, for shell models lacking interior structures, Sketch2Arti supports controllable internal completion guided by user sketches, generating plausible internal components consistent with the existing geometry and predicted motion constraints. Comprehensive experiments and user evaluations demonstrate the effectiveness, controllability, and generalization of Sketch2Arti. The code, dataset, and the prototype system are at this https URL.

10. 【2604.25720】Toward Multimodal Conversational AI for Age-Related Macular Degeneration

Link: https://arxiv.org/abs/2604.25720

Authors: Ran Gu, Benjamin Hou, Mélanie Hébert, Asmita Indurkar, Yifan Yang, Emily Y. Chew, Tiarnán D. L. Keenan, Zhiyong Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: retinal disease detection, deep learning models, systems produce static, disease detection, deep learning

Notes: 38 pages, 4 figures

Abstract:Despite strong performance of deep learning models in retinal disease detection, most systems produce static predictions without clinical reasoning or interactive explanation. Recent advances in multimodal large language models (MLLMs) integrate diagnostic predictions with clinically meaningful dialogue to support clinical decision-making and patient counseling. In this study, OcularChat, an MLLM, was fine-tuned from Qwen2.5-VL using simulated patient-physician dialogues to diagnose age-related macular degeneration (AMD) through visual question answering on color fundus photographs (CFPs). A total of 705,850 simulated dialogues paired with 46,167 CFPs were generated to train OcularChat to identify key AMD features and produce reasoned predictions. OcularChat demonstrated strong classification performance in AREDS, achieving accuracies of 0.954, 0.849, and 0.678 for the three diagnostic tasks: advanced AMD, pigmentary abnormalities, and drusen size, significantly outperforming existing MLLMs. On AREDS2, OcularChat remained the top-performing method on all tasks. Across three independent ophthalmologist graders, OcularChat achieved higher mean scores than a strong baseline model for advanced AMD (3.503 vs. 2.833), pigmentary abnormalities (3.272 vs. 2.828), drusen size (3.064 vs. 2.433), and overall impression (2.978 vs. 2.464) on a 5-point clinical grading rubric. Beyond strong objective performance in AMD severity classification, OcularChat demonstrated the ability to provide diagnostic reasoning, clinically relevant explanations, and interactive dialogue, with high performance in subjective ophthalmologist evaluation. These findings suggest that MLLMs may enable accurate, interpretable, and clinically useful image-based diagnosis and classification of AMD.

11. 【2604.25688】QB-LIF: Learnable-Scale Quantized Burst Neurons for Efficient SNNs

Link: https://arxiv.org/abs/2604.25688

Authors: Dewei Bai, Hongxiang Peng, Jiajun Mei, Yang Ren, Hong Qu, Dawen Xia, Zhang Yi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: representation fundamentally limits, limits information throughput, fundamentally limits information, spiking neural networks, Binary spike coding

Abstract:Binary spike coding enables sparse and event-driven computation in spiking neural networks (SNNs), yet its 1-bit-per-timestep representation fundamentally limits information throughput. This bottleneck becomes increasingly restrictive in deep architectures under short simulation horizons. We propose the Quantized Burst-LIF (QB-LIF) neuron, which reformulates burst spiking as a saturated uniform quantization of membrane potentials with a learnable scale. Instead of relying on predefined multi-threshold structures, QB-LIF treats the quantization scale as a trainable parameter, allowing each layer to autonomously adapt its spiking resolution to the underlying membrane-potential statistics. To preserve hardware efficiency, we introduce an absorbable scale strategy that folds the learned quantized scale into synaptic weights during inference, maintaining a strict accumulate-only (AC) execution paradigm. To enable stable optimization in the discrete multi-level space, we further design ReLSG-ET, a rectified-linear surrogate gradient with exponential tails that sustains gradient flow across burst intervals. Extensive experiments on static (CIFAR-10/100, ImageNet) and event-driven (CIFAR10-DVS, DVS128-Gesture) benchmarks demonstrate that QB-LIF consistently outperforms binary and fixed-burst SNNs, achieving higher accuracy under ultra-low latency while preserving neuromorphic compatibility.
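A minimal sketch of the two ideas named in this abstract: saturated uniform quantization of a membrane potential into an integer burst count, and folding the scale into the next layer's weights so inference only accumulates. The rounding rule, the function names, and the toy numbers are assumptions for illustration; the paper's neuron dynamics and training procedure are not reproduced here.

```python
import math


def burst_quantize(v, scale, max_level):
    """Map membrane potential v to an integer burst count in [0, max_level]."""
    level = math.floor(v / scale + 0.5)  # round to the nearest level
    return max(0, min(max_level, level))  # saturate at both ends


def fold_scale(weights, scale):
    """Absorb the quantization scale into the downstream synaptic weights."""
    return [w * scale for w in weights]


def accumulate_only(folded_weights, counts):
    """Add each folded weight once per spike in the burst; no multiplications."""
    total = 0.0
    for w, k in zip(folded_weights, counts):
        for _ in range(k):
            total += w
    return total


# Toy example (hypothetical values).
scale, max_level = 0.25, 3
potentials = [0.3, -0.3, 0.9]
counts = [burst_quantize(v, scale, max_level) for v in potentials]
weights = [0.5, -1.0, 2.0]

explicit = sum(w * (scale * k) for w, k in zip(weights, counts))
folded = accumulate_only(fold_scale(weights, scale), counts)
print(counts, explicit, folded)
```

Both paths give the same downstream input, which is why the scale can be learned during training and still disappear from the inference-time arithmetic.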

12. 【2604.25680】Exploring Remote Photoplethysmography for Neonatal Pain Detection from Facial Videos

Link: https://arxiv.org/abs/2604.25680

Authors: Ashutosh Dhamaniya, Anup Kumar Gupta, Trishna Saikia, Puneet Gupta

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: including delayed development, slower weight gain, reliable pain assessment, Unaddressed pain, adverse effects

Notes: 25 pages, 9 figures, 10 tables. Proposed rPPG-based method for neonatal pain detection from facial videos, with multimodal (rPPG + audio) analysis and extensive ablation studies on the iCOPEvid dataset

Abstract:Unaddressed pain in neonates can lead to adverse effects, including delayed development and slower weight gain, emphasising the need for more objective and reliable pain assessment methods. Hence, automated methods using behavioural and physiological pain indicators have been developed to aid healthcare professionals in the Neonatal ICU. Traditional contact-based methods for physiological parameter estimation are unsuitable for long-term monitoring and increase the risk of spreading diseases like COVID-19. We introduce a novel approach using remote photoplethysmography (rPPG) to estimate pulse signals in a non-contact manner and employ them for neonatal pain detection. The temporal signals acquired from regions-of-interest (ROIs) affected by skin deformations may exhibit lower quality and provide erroneous rPPG signals. Therefore, we incorporated a quality parameter to select the temporal signals obtained from ROIs that are least affected by skin deformations. Further, we employed signal-to-noise ratio as a fitness parameter to extract the rPPG signal corresponding to the clip that is least affected by noise. Experimental findings demonstrate that the rPPG signals provide useful information for neonatal pain detection, and signals extracted from the blue colour channel outperform those extracted from other colour channels. We also show that combining rPPG and audio features provides better results than individual modalities.

13. 【2604.25646】SAMe: A Semantic Anatomy Mapping Engine for Robotic Ultrasound

Link: https://arxiv.org/abs/2604.25646

Authors: Jing Zhang, Duojie Chen, Wentao Jiang, Zihan Lou, Jianxin Liu, Xinwu Cui, Qinghong Zhao, Bo Du, Christoph F. Dietrich, Dacheng Tao

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: advanced local image-driven, current systems lack, individual patient anatomy, anatomical understanding needed, local image-driven control

Notes: Supplementary information included. Code will be released at [this https URL](https://github.com/MiliLab/Echo-SAMe)

Abstract:Robotic ultrasound has advanced local image-driven control, contact regulation, and view optimization, yet current systems lack the anatomical understanding needed to determine what to scan, where to begin, and how to adapt to individual patient anatomy. These gaps make systems still reliant on expert intervention to initiate scanning. Here we present SAMe, a semantic anatomy mapping engine that provides robotic ultrasound with an explicit anatomical prior layer. SAMe addresses scan initiation as a target-to-anatomy-to-action process: it grounds under-specified clinical complaints into structured target organs, instantiates a patient-specific anatomical representation for the grounded targets from a single external body image, and translates this representation into control-facing 6-DoF probe initialization states without any additional registration using preoperative CT or MRI. The anatomical representation maintained by SAMe is explicit, lightweight (single-organ inference in 0.08s), and compatible with downstream control by design. Across semantic grounding, anatomical instantiation, and real-robot evaluation, SAMe shows strong performance across the full initialization pipeline. In real-robot experiments, SAMe achieved overall organ-hit rates of 97.3% for liver initialization and 81.7% for kidney initialization across the evaluated target sets. Even when restricted to the centroid target, SAMe outperformed the surface-heuristic baseline for both liver and kidney initialization. These results establish an explicit anatomical prior layer that addresses scan initialization and is designed to support broader downstream autonomous scanning pipelines, providing the anatomical foundation for complaint-driven, anatomically informed robotic ultrasonography.

14. 【2604.25642】Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

Link: https://arxiv.org/abs/2604.25642

Authors: Chengsheng Zhang, Chenghao Sun, Xinyan Jiang, Wei Li, Xinmei Tian

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Vision-Language Models, achieved remarkable progress, Large Vision-Language, Vision-Language Models, visual-textual understanding

Notes: Accepted by CVPR 2026

Abstract:Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual-textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses. While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs. To address this, we propose Prefill-Time Intervention (PTI), a novel steering paradigm that intervenes only once during the prefill stage, enhancing the initial Key-Value (KV) cache before error accumulation occurs. Specifically, PTI is modality-aware, deriving distinct directions for visual and textual representations. This intervention is decoupled to steer keys toward visually-grounded objects and values to filter background noise, correcting hallucination-prone representations at their source. Extensive experiments demonstrate PTI's significant performance in mitigating hallucinations and its generalizability across diverse decoding strategies, LVLMs, and benchmarks. Moreover, PTI is orthogonal to existing decoding-stage methods, enabling plug-and-play integration and further boosting performance. Code is available at: this https URL.

15. 【2604.25636】Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

Link: https://arxiv.org/abs/2604.25636

Authors: Jiayi Guo, Linqing Wang, Jiangshan Wang, Yang Yue, Zeyu Liu, Zhiyuan Zhao, Qinglin Lu, Gao Huang, Chunyu Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrate visual understanding, Unified multimodal models, multimodal models, integrate visual, visual understanding

Notes: GitHub: [this https URL](https://github.com/LeapLabTHU/RvR)

Abstract:Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, where UMMs produce editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment with a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, improving Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.

16. 【2604.25574】Control Your Queries: Heterogeneous Query Interaction for Camera-Radar Fusion

Link: https://arxiv.org/abs/2604.25574

Authors: Jialong Wu, Yihan Wang, Matthias Rottmann

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low deployment cost, offers complementary sensing, autonomous driving, deployment cost, camera-radar fusion offers

Abstract:In autonomous driving, camera-radar fusion offers complementary sensing and low deployment cost. Existing methods perform fusion through input mixing, feature map mixing, or query-based feature sampling. We propose a new fusion paradigm, termed heterogeneous query interaction, and present ConFusion, a camera-radar 3D object detector. ConFusion combines image queries, radar queries, and learnable world queries distributed in 3D space to improve query initialization and object coverage. To encourage cross-type interaction among heterogeneous queries, we introduce heterogeneous query mixing (QMix), which performs dedicated cross-type attention after feature sampling to consolidate complementary object evidence. We further propose interactive query swap sampling (QSwap), which improves feature sampling by allowing related queries to exchange informative feature tokens under attention and geometric constraints. Experiments on the nuScenes dataset show that ConFusion achieves state-of-the-art performance, reaching 59.1 mAP and 65.6 NDS on the validation set, and 61.6 mAP and 67.9 NDS on the test set.

17. 【2604.25570】Vision SmolMamba: Spike-Guided Token Pruning for Energy-Efficient Spiking State-Space Vision Models

Link: https://arxiv.org/abs/2604.25570

Authors: Dewei Bai, Hongxiang Peng, Yunyun Zeng, Ziyu Zhang, Hong Qu, Yi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: shown strong potential, shown strong, strong potential, Spiking, spike-driven self-attention

Abstract:Spiking Transformers have shown strong potential for long-range visual modeling through spike-driven self-attention. However, their quadratic token interactions remain fundamentally misaligned with the sparse and event-driven nature of spiking neural computation. To address this limitation, we propose Vision SmolMamba, an energy-efficient spiking state-space architecture that integrates spike-driven dynamics with linear-time selective recurrence. The key idea is a Spike-Guided Spatio-Temporal Token Pruner (SST-TP), which estimates token importance using both spike activation strength and first-spike latency. This mechanism progressively removes redundant tokens while preserving salient spatio-temporal information, enabling efficient scaling with token sparsity. Based on this mechanism, the proposed SmolMamba block incorporates spike events directly into bidirectional state-space recurrence, forming a spiking state-space vision backbone for efficient long-range modeling. Extensive experiments on both static and event-based benchmarks, including ImageNet-1K, CIFAR10/100, CIFAR10-DVS, and DVS128 Gesture, demonstrate that Vision SmolMamba consistently achieves superior accuracy-efficiency trade-offs. In particular, it reduces the estimated energy cost by at least 1.5x compared with prior spiking Transformer baselines and a Spiking Mamba variant while maintaining competitive or improved accuracy. These results demonstrate that combining spike-guided token sparsity with state-space modeling offers a scalable and energy-efficient paradigm for spiking vision systems.
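The abstract says token importance is estimated from spike activation strength and first-spike latency, but does not say how the two are combined. The sketch below assumes a simple weighted sum and top-k selection; the `alpha` weight, the scoring formula, and the function names are hypothetical and are not the SST-TP definition.

```python
def token_scores(spike_trains, alpha=0.5):
    """Score each token by firing rate plus a bonus for spiking early."""
    scores = []
    for train in spike_trains:
        steps = len(train)
        strength = sum(train) / steps  # fraction of timesteps with a spike
        # Index of the first spike; `steps` if the token never fires.
        first = next((t for t, s in enumerate(train) if s), steps)
        earliness = 1.0 - first / steps  # 1.0 if it spikes at t = 0
        scores.append(strength + alpha * earliness)
    return scores


def prune_tokens(spike_trains, keep):
    """Return the indices of the `keep` highest-scoring tokens, in order."""
    scores = token_scores(spike_trains)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy binary spike trains over 4 timesteps for 4 tokens (hypothetical).
trains = [
    [1, 1, 0, 1],  # active and early
    [0, 0, 0, 0],  # silent
    [0, 0, 1, 0],  # weak and late
    [1, 0, 0, 0],  # weak but early
]
print(token_scores(trains))
print(prune_tokens(trains, keep=2))
```

Applying such a pruner at several depths with a shrinking `keep` budget would give the progressive token removal the abstract describes.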

18. 【2604.25545】TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media

Link: https://arxiv.org/abs/2604.25545

Authors: Fuchen Zheng, Chengpei Xu, Long Ma, Weixuan Li, Junhua Zhou, Xuhang Chen, Weihuang Liu, Haolun Li, Quanjun Li, Zhenxi Zhang, Lei Zhao, Chi-Man Pun, Shoujun Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: amplify redundant responses, Visual state-space models, shown strong potential, naive multi-branch fusion, medical visual media

Notes: 15 pages, 9 figures

Abstract:Visual state-space models (SSMs) have shown strong potential for medical image segmentation, yet their effectiveness is often limited by two practical issues: axis-biased scan ordering weakens the modeling of oblique and curved structures, and naive multi-branch fusion tends to amplify redundant responses. We present TopoMamba, a topology-aware scan-and-fuse framework for segmenting heterogeneous medical visual media. The method combines a diagonal/anti-diagonal TopoA-Scan branch with the standard Cross-Scan branch to provide complementary structural priors, and introduces ScanCache, a device-aware caching mechanism that amortizes explicit scan-index construction across recurring resolutions. To fuse heterogeneous scan features efficiently, we further propose a lightweight HSIC Gate that regulates branch interaction using a dependence-aware scalar gating rule. We also instantiate a volumetric TopoMamba-3D for practical 3D clinical segmentation. Experiments on Synapse CT, ISIC 2017 dermoscopy, and CVC-ClinicDB endoscopy show that TopoMamba consistently improves segmentation quality over strong CNN, Transformer, and SSM baselines, with particularly clear gains on thin or curved targets such as the pancreas and gallbladder, while maintaining favorable deployment efficiency under dynamic input resolutions. These results suggest that topology-aware scan ordering and lightweight dependence-aware fusion form an effective and practical design for medical multimedia segmentation. The code will be made publicly available.

19. 【2604.25533】DualGeo: A Dual-View Framework for Worldwide Image Geo-localization

Link: https://arxiv.org/abs/2604.25533

Authors: Junchao Cui, Wenqi Shi, Shaoyong Du, Hang He, Xuanzi Ma, Hao Tang, Xiangyang Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Worldwide image geo-localization, spanning street, image geo-localization aims, continental scales, aims to infer

Notes: Accepted at ICME 2026

Abstract:Worldwide image geo-localization aims to infer the geographic location of an image captured anywhere on Earth, spanning street, city, regional, national, and continental scales. Existing methods rely on visual features that are sensitive to environmental variations (e.g., lighting, season, and weather) and lack effective post-processing to filter outlier candidates, limiting localization accuracy. To address these limitations, we propose DualGeo, a two-stage framework for worldwide image geo-localization. First, it establishes a geo-representational foundation by fusing image and semantic segmentation features via bidirectional cross-attention. The fused features are then aligned with GPS coordinates through dual-view contrastive learning to build a global retrieval database. Second, it performs geo-cognitive refinement by re-ranking retrieved candidates using geographic clustering. It then feeds them into large multimodal models (LMMs) for final coordinate prediction. Experiments on IM2GPS, IM2GPS3k, and YFCC4k show that DualGeo outperforms state-of-the-art methods, improving street-level (1 km) and city-level (25 km) localization accuracy by 3.6%-16.58% and 1.29%-8.77%, respectively. Our code and datasets are available : this https URL.

20. 【2604.25530】The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation

Link: https://arxiv.org/abs/2604.25530

Authors: Muhammad Ali, Kevin Alexander Laube, Madan Ravi Ganesh, Lukas Schott, Niclas Popp, Thomas Brox

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: fixed iteration schedules, introduce increasingly complex, semantic segmentation introduce, segmentation introduce increasingly, introduce increasingly

Notes: Presented at Efficient Computer Vision (ECV) Workshop, CVPR 2026 (non-archival). 5 pages, 3 figures

Abstract: Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per-iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration-based comparisons are misleading: when wall-clock compute is matched, canonical logit- and feature-based KD outperform recent segmentation-specific methods. Under extended training, feature-based distillation achieves state-of-the-art ResNet-18 performance on Cityscapes and ADE20K. A PSPNet ResNet-18 student closely approaches its ResNet-101 teacher despite using only one quarter of the parameters, reaching 99% of the teacher's mIoU on Cityscapes (79.0 vs. 79.8) and 92% on ADE20K. Our results challenge the prevailing assumption that KD for segmentation requires task-specific mechanisms and suggest that scaling, rather than complex hand-crafted objectives, should guide future method design.
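For readers unfamiliar with canonical logit-based KD, the sketch below shows the classic temperature-softened KL loss, averaged over pixels as one would do for segmentation. The abstract does not specify the exact loss, temperature, or weighting used in the paper, so the default `temperature=2.0`, the per-pixel averaging, and the function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    return temperature ** 2 * kl


def pixelwise_kd_loss(student_map, teacher_map, temperature=2.0):
    """Average the logit KD loss over all pixels of a segmentation output."""
    losses = [
        logit_kd_loss(s, t, temperature)
        for s, t in zip(student_map, teacher_map)
    ]
    return sum(losses) / len(losses)


# Toy example (hypothetical logits): 2 pixels, 3 classes each.
teacher = [[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]]
student = [[2.0, 1.5, 0.0], [0.5, 3.0, 0.2]]
print(pixelwise_kd_loss(student, teacher))
```

The loss is zero wherever the student already matches the teacher, so only the first pixel contributes in this toy example.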

21. 【2604.25491】The Forensic Cost of Watermark Removal

Link: https://arxiv.org/abs/2604.25491

Authors: Gautier Evennou, Ewa Kijak

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: watermark removal, removal, watermark, perceptual quality, Current watermark removal

Notes: preprint; accepted at IHMMSEC 2026, Special Session "Watermarking Across the Lifecycle of Generative Models"

Abstract:Current watermark removal methods are evaluated on two axes: attack success rate and perceptual quality. We show this is insufficient. While state-of-the-art attacks successfully degrade the watermark signal without visible distortion, they leave distinct statistical artifacts that betray the removal attempt. We name this overlooked axis Watermark Removal Detection (WRD) and demonstrate that a modern classifier trained on these artifacts achieves state-of-the-art detection rates at $10^{-3}$ FPR across every removal method tested. No existing attack accounts for this forensic leakage. We benchmark leading watermarking schemes against standard removal pipelines under the extended evaluation triple of attack success, perceptual quality, and forensic detectability, and find that no current method balances all three. Our results establish forensic stealthiness as a necessary requirement for watermark removal.

22. 【2604.25477】DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

Link: https://arxiv.org/abs/2604.25477

Authors: Hanqing Yang, Qiang Zhou, Yongchao Du, Sashuai Zhou, Zhibin Wang, Jun Song, Tiezheng Ge, Cheng Yu, Bo Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: tasks requiring complex, Recent image editing, requiring complex reasoning, achieved strong visual, strong visual fidelity

Abstract:Recent image editing models have achieved strong visual fidelity but often struggle with tasks requiring complex reasoning. To investigate and enhance the reasoning-grounded planning for image editing, we propose DDA-Thinker, a Thinker-centric framework designed for the independent optimization of a planning module (Thinker) over a fixed generative model (Editor). This decoupled Thinker-centric paradigm facilitates a controlled analysis of the planning module and makes its contribution under a fixed Editor easier to assess. To effectively guide this Thinker, we introduce a dual-atomic reinforcement learning framework. This framework decomposes feedback into two distinct atomic rewards implemented through verifiable checklists: a cognitive-atomic reward to directly assess the quality of the Thinker's executable plan, which serves as the actionable outcome of the Thinker's reasoning, and a visual-atomic reward to assess the final image quality. To improve checklist quality, our checklist synthesis is grounded not only in the source image and user instruction but also in a rational reference description of the ideal post-edit scene. To support this training, we further develop a two-stage data curation pipeline that first synthesizes a diverse and reasoning-focused dataset, then applies difficulty-aware refinement to curate an effective training curriculum for reinforcement learning. Extensive experiments on reasoning-driven image editing benchmarks, including RISE-Bench and KRIS-Bench, demonstrate that our approach substantially improves overall performance. Our method enables a community model to achieve results competitive with strong proprietary models, highlighting the practical potential of Thinker-centric optimization under a fixed-editor setting.

23. 【2604.25466】Generalizable Human Gaussian Splatting via Multi-view Semantic Consistency

Link: https://arxiv.org/abs/2604.25466

Authors: Jingi Kim, Wonjun Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human Gaussian splatting, generalizable human Gaussian, photorealistic human rendering, actively studied, human Gaussian

Comments: 10 pages, 8 figures, CVPR 2026 Findings

Click to view abstract

Abstract:Recently, generalizable human Gaussian splatting from sparse-view inputs has been actively studied for the photorealistic human rendering. Most existing methods rely on explicit geometric constraints or predefined structural representations to accurately position 3D Gaussians. Although these approaches have shown the remarkable progress in this field, they still suffer from inconsistent feature representations across multi-view inputs due to complex articulations of the human body and limited overlaps between different views. To address this problem, we propose a novel method to accurately localize 3D Gaussians and ultimately improve the quality of human rendering. The key idea is to unproject latent embeddings encoded from each viewpoint into a shared 3D space through predicted depth maps and recalibrate them belonging to the same body part based on cross-view attention. This helps the model resolve the spatial ambiguity occurring in highly textured regions as well as occluded body parts, thus leading to the accurate localization of 3D Gaussians. Experimental results on benchmark datasets show that the proposed method efficiently improves the performance of generalizable human Gaussian splatting from sparse-view inputs.

24. 【2604.25464】Image Compression with Bubble-Aware Frame Rate Adaptation for Energy-Efficient Video Capsule Endoscopy

Link: https://arxiv.org/abs/2604.25464

Authors: Oliver Bause, Jörg Gammerdinger, Julia Werner

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Capsule Endoscopy, Video Capsule, Capsule Endoscopy, frame rate adaptation, gastrointestinal tract

Comments: 7 pages, 8 figures, EMBC2026

Click to view abstract

Abstract:Video Capsule Endoscopy (VCE) is a promising method for improving the medical examination of the small intestine in the gastrointestinal tract. A key challenge is their limited size, resulting in a short battery lifetime which conflicts with high energy consumption for image capturing and transmission to an on-body device. Thus, we propose an image compression pipeline that substantially reduces the transmitted data while preserving diagnostic image quality. Furthermore, we exploit characteristics of the compression process to identify frames with low diagnostic value mainly caused by bubbles, without requiring additional image analysis. For low-visibility frames, a dynamic bubble-aware frame rate adaptation strategy reduces image acquisition and transmission during these phases while preserving sensitivity to potential anomalies. The proposed compression and frame rate adaptation are evaluated on a RISC-V platform using the Kvasir-Capsule and Galar datasets. The compression method achieves a compression ratio of 5.748 (82.6%) at a peak signal-to-noise ratio of 40.3 dB, indicating negligible loss of visual quality. The compression accomplished a mean energy reduction of the whole system by 20.58%. Additionally, the proposed bubble-aware frame rate adaptation reduced the energy consumption by up to 40%. These results demonstrate the potential of our method to increase the applicability of VCE.
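
The two compression figures in the abstract agree with each other if the ratio is read as original size over compressed size: a ratio of 5.748 leaves 1/5.748 of the data, so about 82.6% is saved. The sketch below checks that arithmetic and states the standard PSNR formula. The helper names are hypothetical, and the 8-bit peak value of 255 is an assumption, since the abstract does not give the bit depth.

```python
import math


def space_saving(compression_ratio):
    """Fraction of data removed, with ratio = original size / compressed size."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return 1.0 - 1.0 / compression_ratio


def psnr_db(mse, peak=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


saving_percent = round(space_saving(5.748) * 100, 1)
print(saving_percent)  # 82.6
```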

25. 【2604.25457】GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution

Link: https://arxiv.org/abs/2604.25457

Authors: Fabio D'Oronzio, Federico Putamorsi, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: single-image super-resolution, remains challenging, recent advances, real-world scenarios, scenarios with complex

Comments: Accepted at the 28th International Conference on Pattern Recognition

Click to view abstract

Abstract:Despite recent advances, single-image super-resolution (SR) remains challenging, especially in real-world scenarios with complex degradations. Diffusion-based SR methods, particularly those built on Stable Diffusion, leverage strong generative priors but commonly rely on text conditioning derived from semantic captioning. Such textual descriptions provide only high-level semantics and lack the spatially aligned visual information required for faithful restoration, leading to a representation gap between abstract semantics and spatially aligned visual details. To address this limitation, we propose GramSR, a one-step diffusion-based SR framework that replaces text conditioning with dense visual features extracted from the low-resolution input using a pre-trained DINOv3 encoder. GramSR adopts a three-stage LoRA architecture, where pixel-level, semantic-level, and texture-level LoRA modules are trained sequentially. The pixel-level module focuses on degradation removal using $\ell_2$ loss, the semantic-level module enhances perceptual details via LPIPS and CSD losses, and the texture-level module enforces feature correlation consistency through a Gram matrix loss computed from DINOv3 features. At inference, independent guidance scales enable flexible control over degradation removal, semantic enhancement, and texture preservation. Extensive experiments on standard SR benchmarks demonstrate that GramSR consistently outperforms existing one-step diffusion-based methods, achieving superior structural fidelity and texture realism. The code for this work is available at: this https URL.
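
To illustrate what a Gram matrix loss measures, here is a minimal sketch in the classic style-transfer form, where the Gram matrix holds channel-by-channel correlations averaged over tokens. This is an assumption: the abstract does not say whether GramSR uses this form or a patch-by-patch variant, and the normalization below is likewise illustrative.

```python
def gram_matrix(features):
    """Channel-by-channel correlation of a feature map.

    features: list of N token vectors, each with C channels.
    Returns a C x C matrix G with G[i][j] = (1/N) * sum_n f[n][i] * f[n][j].
    """
    n = len(features)
    c = len(features[0])
    return [[sum(f[i] * f[j] for f in features) / n for j in range(c)]
            for i in range(c)]


def gram_loss(pred_features, target_features):
    """Mean squared difference between the two Gram matrices."""
    g_pred = gram_matrix(pred_features)
    g_tgt = gram_matrix(target_features)
    c = len(g_pred)
    return sum((g_pred[i][j] - g_tgt[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)
```

Because the sum runs over tokens, this form ignores where a feature occurs and only compares which channels fire together, which is why it is commonly associated with texture rather than layout.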

26. 【2604.25432】SARU: A Shadow-Aware and Removal Unified Framework for Remote Sensing Images with New Benchmarks

Link: https://arxiv.org/abs/2604.25432

Authors: Zi-Yang Bo, Wei Lu, Hongruixuan Chen, Si-Bao Chen, Bin Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remote sensing imagery, degrading visual quality, shadow detection, sensing imagery, degrading visual

Comments: 17 pages, 14 figures

Click to view abstract

Abstract: Shadows are a prevalent problem in remote sensing imagery (RSI), degrading visual quality and severely limiting the performance of downstream tasks like object detection and semantic segmentation. Most prior works treat shadow detection and removal as separate, cascaded tasks, which can lead to cumbersome process and error accumulation. Furthermore, many deep learning methods rely on paired shadow and non-shadow images for training, which are often unavailable in practice. To address these challenges, we propose Shadow-Aware and Removal Unified (SARU) Framework, a cohesive two-stage framework. First, its dual-branch detection module (DBCSF-Net) fuses multi-color space and semantic features to generate high-fidelity shadow masks, effectively distinguishing shadows from dark objects. Then, leveraging these masks, a novel, training-free physical algorithm (N$^2$SGSR) restores illumination by transferring properties from adjacent non-shadow regions within the single input image. To facilitate rigorous evaluation and foster future work, we also introduce two new benchmark datasets: the RSI Shadow Detection (RSISD) dataset and the Single-image Shadow Removal Benchmark (SiSRB). Extensive experiments demonstrate that SARU achieves state-of-the-art performance on both the public AISD dataset and our newly introduced benchmarks. By holistically integrating shadow detection and removal to mitigate error propagation and eliminating the dependency on paired training data, SARU establishes a robust, practical framework for real-world RSI analysis. The source code and datasets are publicly available at: this https URL.

27. 【2604.25427】A Systematic Post-Train Framework for Video Generation

Link: https://arxiv.org/abs/2604.25427

Authors: Zeyue Xue, Siming Fu, Jie Huang, Shuai Lu, Haoran Li, Yijun Liu, Yuming Li, Xiaoxuan He, Mengzhao Chen, Haoyang Huang, Nan Duan, Ping Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantically rich content, demonstrated impressive capabilities, significant gap remains, large-scale video diffusion, Relative Policy Optimization

Comments: Tech report

Click to view abstract

Abstract:While large-scale video diffusion models have demonstrated impressive capabilities in generating high-resolution and semantically rich content, a significant gap remains between their pretraining performance and real-world deployment requirements due to critical issues such as prompt sensitivity, temporal inconsistency, and prohibitive inference costs. To bridge this gap, we propose a comprehensive post-training framework that systematically aligns pretrained models with user intentions through four synergistic stages: we first employ Supervised Fine-Tuning (SFT) to transform the base model into a stable instruction-following policy, followed by a Reinforcement Learning from Human Feedback (RLHF) stage that utilizes a novel Group Relative Policy Optimization (GRPO) method tailored for video diffusion to enhance perceptual quality and temporal coherence; subsequently, we integrate Prompt Enhancement via a specialized language model to refine user inputs, and finally address system efficiency through Inference Optimization. Together, these components provide a systematic approach to improving visual quality, temporal coherence, and instruction following, while preserving the controllability learned during pretraining. The result is a practical blueprint for building scalable post-training pipelines that are stable, adaptable, and effective in real-world deployment. Extensive experiments demonstrate that this unified pipeline effectively mitigates common artifacts and significantly improves controllability and visual aesthetics while adhering to strict sampling cost constraints.
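
The core idea behind Group Relative Policy Optimization is that each sample is scored against the other samples drawn for the same prompt, so no separate value network is needed. The sketch below shows that group-wise standardization in its commonly described form. It is an assumption that the video-diffusion variant in this report uses the same normalization; the function name and the `eps` stabilizer are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    if len(rewards) < 2:
        raise ValueError("need at least two samples per group")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    # Samples above the group mean get positive advantages, below get negative.
    return [(r - mean) / (std + eps) for r in rewards]
```

If every sample in a group receives the same reward, all advantages are zero and that group contributes no learning signal.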

28. 【2604.25408】Beyond Fidelity: Semantic Similarity Assessment in Low-Level Image Processing

Link: https://arxiv.org/abs/2604.25408

Authors: Runjie Wang, Weiling Chen, Tiesong Zhao, Chang Wen Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Low-level image processing, visual fidelity, Semantic Similarity, semantic, long been evaluated

Comments: none

Click to view abstract

Abstract: Low-level image processing has long been evaluated mainly from the perspective of visual fidelity. However, with the rise of deep learning and generative models, processed images may preserve perceptual quality while altering semantic content, making conventional Image Quality Assessment (IQA) insufficient for semantic-level assessment. In this paper, we formalize Semantic Similarity as a new evaluation task for low-level image processing, aimed at measuring whether semantic content is preserved after processing. We further present a structured formulation of image semantics based on semantic entities and their relations, and discuss the desired properties and constraints of a valid semantic similarity index. Based on this formulation, we propose Triplet-based Semantic Similarity Score (T3S), which models image semantics through foreground entities, background entities, and relations. T3S combines semantic entity extraction, foreground-background disentanglement, and open-world class/relation modeling. Experiments on COCO and SPA-Data show that T3S consistently outperforms existing fidelity-oriented metrics and representative semantic-level baselines, while better reflecting progressive semantic changes under diverse degradations. These results highlight the importance of semantic assessment in modern low-level vision.

29. 【2604.25405】Leveraging Previous-Traversal Point Cloud Map Priors for Camera-Based 3D Object Detection and Tracking

Link: https://arxiv.org/abs/2604.25405

Authors: Markus Käppeler, Özgün Çiçek, Yakov Miron, Abhinav Valada

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: remains fundamentally constrained, localization remains fundamentally, depth-rich online LiDAR, autonomous driving, detection and tracking

Comments: none

Click to view abstract

Abstract: Camera-based 3D object detection and tracking are central to autonomous driving, yet precise 3D object localization remains fundamentally constrained by depth ambiguity when no expensive, depth-rich online LiDAR is available at inference. In many deployments, however, vehicles repeatedly traverse the same environments, making static point cloud maps from prior traversals a practical source of geometric priors. We propose DualViewMapDet, a camera-only inference framework that retrieves such map priors online and leverages them to mitigate the absence of a LiDAR sensor during deployment. The key idea is a dual-space camera-map fusion strategy that avoids one-sided view conversion. Specifically, we (i) project the map into perspective view (PV) and encode multi-channel geometric cues to enrich image features and support BEV lifting, and (ii) encode the map directly in bird's-eye view (BEV) with a sparse voxel backbone and fuse it with lifted camera features in a shared metric space. Extensive evaluations on nuScenes and Argoverse 2 demonstrate consistent improvements over strong camera-only baselines, with particularly strong gains in object localization. Ablations further validate the contributions of PV/BEV fusion and prior-map coverage. We make the code and pre-trained models available at this https URL.

30. 【2604.25390】GeoSearch: Augmenting Worldwide Geolocalization with Web-Scale Reverse Image Search and Image Matching

Link: https://arxiv.org/abs/2604.25390

Authors: Tung-Duong Le-Duc, Hoang-Quoc Nguyen-Son, Minh-Son Dao

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Worldwide image geolocalization, global visual diversity, remains challenging due, Large Multimodal Models, predict the GPS

Comments: Accepted to SIGIR 2026 Main Conference

Click to view abstract

Abstract:Worldwide image geolocalization, which aims to predict the GPS coordinates of any image on Earth, remains challenging due to global visual diversity. Recent generative approaches based on Retrieval-Augmented Generation (RAG) and Large Multimodal Models (LMMs) leverage candidates retrieved from fixed databases for reasoning, but often struggle with scenes that are absent from the reference set. In this work, we propose GeoSearch, an open-world geolocation framework that integrates web-scale reverse image search into the RAG pipeline. GeoSearch augments LMM prompts with database-retrieved coordinates and textual evidence extracted from web pages. To mitigate noise from irrelevant content, we introduce a two-layer filtering mechanism consisting of image matching, followed by confidence-based gating. Experiments on standard benchmarks Im2GPS3k and YFCC4k demonstrate the superiority of GeoSearch under leakage-aware evaluation. Our code and data are publicly available to support reproducibility.

31. 【2604.25388】COMPASS: COmpact Multi-channel Prior-map And Scene Signature for Floor-Plan-Based Visual Localization

Link: https://arxiv.org/abs/2604.25388

Authors: Muhammad Shaheer, Miguel Fernandez-Cortizas, Asier Bikandi-Noya, Holger Voos, Jose Luis Sanchez-Lopez

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Architectural floor plans, existing localization methods, localization methods largely, methods largely ignore, semantic information

Comments: none

Click to view abstract

Abstract:Architectural floor plans are widely available priors which contain not only geometry but also the semantic information of the environment, yet existing localization methods largely ignore this semantic information. To address this, we present COMPASS, an algorithm that exploits both geometric and semantic priors from floor plans to estimate the pose of a robot equipped with dual fisheye cameras. Inspired by scan context descriptor from LiDAR-based place recognition, we design a multi-channel radial descriptor that encodes the geometric layout surrounding a position. From the floor plan, rays are cast in 360 azimuth bins and the results are encoded into five channels: normalized range, structural hit type (wall, window, or opening), range gradient, inverse range, and local range variance. From the image side, the same descriptor structure is populated by detecting structural elements in the fisheye imagery. As a first step toward full cross-modal matching, we present a window detection algorithm for fisheye images that uses a line segment detector to identify window frames via vertical edge clustering and brightness verification. Detected windows are projected to azimuthal bearings through the fisheye camera model, producing the hit-type channel of the visual descriptor. As a proof of concept, we generate both descriptors at a single known pose from the Hilti-Trimble SLAM Challenge 2026 dataset and demonstrate that the wall-window pattern extracted from the first frame of each camera closely matches the floor plan descriptor, validating the feasibility of cross-modal structural matching.
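
The five channels listed in the abstract can be sketched from the output of a single 360-bin ray cast. Everything specific below is an assumption made for illustration: the `max_range` clipping, the central-difference gradient, the five-bin variance window, and the integer coding of hit types. The abstract names the channels but not how each is computed.

```python
def radial_descriptor(ranges, hit_types, max_range=30.0, window=2):
    """Build five per-bin channels from one 360-degree ray cast.

    ranges: distance to the first structure hit in each azimuth bin (metres).
    hit_types: label per bin (hypothetical coding: 0 wall, 1 window, 2 opening).
    Returns one dict of channel values per azimuth bin.
    """
    n = len(ranges)
    if n == 0 or n != len(hit_types):
        raise ValueError("ranges and hit_types must be non-empty and equal in length")
    clipped = [min(max(r, 0.0), max_range) for r in ranges]
    descriptor = []
    for i in range(n):
        # Neighbours wrap around because azimuth is circular.
        prev_r = clipped[(i - 1) % n]
        next_r = clipped[(i + 1) % n]
        local = [clipped[(i + k) % n] for k in range(-window, window + 1)]
        mean = sum(local) / len(local)
        variance = sum((r - mean) ** 2 for r in local) / len(local)
        descriptor.append({
            "normalized_range": clipped[i] / max_range,
            "hit_type": hit_types[i],
            "range_gradient": (next_r - prev_r) / 2.0,
            "inverse_range": 1.0 / (clipped[i] + 1e-6),
            "local_range_variance": variance,
        })
    return descriptor
```

In a circular room every bin has the same range, so the gradient and variance channels are zero everywhere and only the hit-type channel can distinguish headings, which matches the abstract's focus on the wall-window pattern.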

32. 【2604.25380】Benchmarking and Improving GUI Agents in High-Dynamic Environments

Link: https://arxiv.org/abs/2604.25380

Authors: Enqi Liu, Liyuan Pan, Zhi Gao, Yan Yang, Chenrui Shi, Yang Liu, Jingrong Wu, Qing Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphical User Interface, Graphical User, Recent advancements, advancements in Graphical, User Interface

Comments: none

Click to view abstract

Abstract:Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process, where the key GUI state including important information for actions is often inadequately captured. To systematically explore this challenge, we introduce DynamicGUIBench, a comprehensive online GUI benchmark spanning ten applications and diverse interaction scenarios characterized by important interface changes between actions. Furthermore, we present DynamicUI, an agent designed for dynamic interfaces, which takes screen-recording videos of the interaction process as input and consists of three components: a dynamic perceiver, a refinement strategy, and a reflection. Specifically, the dynamic perceiver clusters frames of the GUI video, generates captions for the centroids, and iteratively selects the most informative frames as the salient dynamic context. Considering that there may be inconsistencies and noise between the selected frames and the textual context of the agent, the refinement strategy employs an action-conditioned filtering to refine thoughts to mitigate thought-action inconsistency and redundancy. Based on the refined agent trajectories, the reflection module provides effective and accurate guidance for further actions. Experiments on DynamicGUIBench demonstrate that DynamicUI significantly improves the performance in dynamic GUI environments, while maintaining competitive performance on other public benchmarks.

33. 【2604.25376】CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation

Link: https://arxiv.org/abs/2604.25376

Authors: Qianqian Chen, Anglin Liu, Jingyang Zhang, Yudong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Accurate brain lesion, employing Continual Learning, effective clinical diagnosis, Accurate brain, treatment planning

Comments: none

Click to view abstract

Abstract:Accurate brain lesion segmentation in MRI is vital for effective clinical diagnosis and treatment planning. Due to high annotation costs and strict data privacy regulations, universal models require employing Continual Learning (CL) to adapt to evolving clinical tasks without losing previously acquired knowledge. However, existing CL paradigms often suffer from capacity limits or redundant parameter growth, and even advanced dynamic methods rely mostly on image-perception strategies that struggle to handle the substantial pathological and multimodal heterogeneity inherent in brain imaging. To address this issue, we propose Concept-Reasoning Expansion (CoRE) framework, which establishes a joint decision-making mechanism by integrating visual features with structured concepts. Through the alignment of image tokens with a hierarchical concept library, CoRE simulates clinical reasoning to guide both interpretable expert routing and demand-based model growth. This collaborative process ensures model evolution is grounded in clinical priors, preventing redundant parameter expansion while maximizing knowledge reuse. Extensive evaluations across 12 sequential brain lesion MRI tasks demonstrate that CoRE achieves state-of-the-art performance and provides a high knowledge starting point for efficient future adaptation. Its superior few-shot transferability and clinical interpretability further validate its effectiveness in managing non-stationary clinical data streams. Our code will be released soon.

34. 【2604.25370】GPT-Image-2 in the Wild: A Twitter Dataset of Self-Reported AI-Generated Images from the First Week of Deployment

Link: https://arxiv.org/abs/2604.25370

Authors: Kidus Zewde, Simiao Ren, Xingyu Shen, Jenny Wu, Yuchen Zhou, Tommy Duong, Zikang Zhang, Ethan Traister

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: difficult to discern, OpenAI marks, marks a watershed, watershed moment, boundary between photographic

Comments: 11 pages; GPT-image-2 social media dataset; Twitter API collection and multilingual curation; C2PA watermark stripping on platform upload; browser-automated AI badge verification; CLIP semantic clustering; AI-generated image provenance and attribution

Click to view abstract

Abstract:The release of GPT-image-2 by OpenAI marks a watershed moment in AI-generated imagery: the boundary between photographic reality and synthetic content has never been more difficult to discern. We introduce the GPT-Image-2 Twitter Dataset, the first published dataset of GPT-image-2 generated images, sourced from publicly available Twitter/X posts in the immediate aftermath of the model's April 21, 2026 release. Leveraging the Twitter API v2 and a multi-stage curation pipeline spanning multilingual text heuristics (English, Japanese, and Chinese), browser-automated Twitter "Made with AI" badge verification, and model name variant matching, we curate 10,217 confirmed GPT-image-2 images from 27,662 collected records over a six-day window. We characterize the dataset across four analyses: CLIP-based zero-shot subject taxonomy, OCR text legibility (82.0% of images contain detectable text), face detection (59.2% of images, 22,583 total faces), and semantic clustering (137 CLIP ViT-L/14 clusters). A key negative result is that C2PA content credentials are systematically stripped by Twitter's CDN on upload, rendering cryptographic provenance verification infeasible for social-media-sourced AI images. The dataset and all curation code are released publicly.

35. 【2604.25367】Self-DACE++: Robust Low-Light Enhancement via Efficient Adaptive Curve Estimation

Link: https://arxiv.org/abs/2604.25367

Authors: Jianyu Wen, Jun Xie, Feng Chen, Zhepeng Wang, Chenhao Wu, Tong Zhang, Yixuan Yu, Piotr Swierczynski

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Self-Reference Deep Adaptive, Adaptive Curve Estimation, previous Self-Reference Deep, Deep Adaptive Curve, Adaptive Adjustment Curves

Comments: none

Click to view abstract

Abstract:In this paper, we present Self-DACE++, an improved unsupervised and lightweight framework for Low-Light Image Enhancement (LLIE), building upon our previous Self-Reference Deep Adaptive Curve Estimation (Self-DACE). To better address the trade-off between computational efficiency and restoration quality, Self-DACE++ introduces enhanced Adaptive Adjustment Curves (AACs). These curves, governed by minimal trainable parameters, flexibly adjust the dynamic range while preserving the color fidelity, structural integrity, and naturalness of the enhanced images. To achieve an extremely lightweight architecture without sacrificing performance, we propose a randomized order training strategy coupled with a network fusion mechanism, which compresses the model into an efficient iterative inference structure. Furthermore, we formulate a physics-grounded objective function based on Retinex theory and incorporate a dedicated denoising module to effectively estimate and suppress latent noise in dark regions. Extensive qualitative and quantitative evaluations on multiple real-world benchmark datasets demonstrate that Self-DACE++ outperforms existing state-of-the-art methods, delivering superior enhancement quality with real-time inference capability. The code is available at this https URL.
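
For readers unfamiliar with curve-based enhancement, the sketch below shows the general idea using the quadratic curve from the earlier Zero-DCE method, applied iteratively to one normalized intensity. This is a stand-in chosen because its form is simple and well known. It is not the Adaptive Adjustment Curve of Self-DACE++, whose exact form the abstract does not give.

```python
def apply_curve(pixel, alphas):
    """Iteratively apply x <- x + a * x * (1 - x) to one intensity in [0, 1].

    alphas: one curve parameter per iteration, each in [-1, 1]. In a learned
    model these would be predicted per pixel; here they are plain inputs.
    """
    if not 0.0 <= pixel <= 1.0:
        raise ValueError("pixel must be normalized to [0, 1]")
    x = pixel
    for a in alphas:
        if not -1.0 <= a <= 1.0:
            raise ValueError("each alpha must lie in [-1, 1]")
        # Positive a brightens mid-tones; 0 and 1 are fixed points of the curve.
        x = x + a * x * (1.0 - x)
    return x
```

Because 0 and 1 are fixed points and the map is monotone for parameters in [-1, 1], repeated application stretches dark regions without pushing values out of range.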

36. 【2604.25361】HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation

Link: https://arxiv.org/abs/2604.25361

Authors: Bingzi Zhang, Kaisi Guan, Ruihua Song

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generating natural human, recent years, pivotal role, developed rapidly, rapidly in recent

Comments: Accepted to the 2026 IEEE International Conference on Multimedia and Expo (ICME 2026)

Click to view abstract

Abstract:Video generation models have developed rapidly in recent years, where generating natural human motion plays a pivotal role. However, accurately evaluating the quality of generated human motion video remains a significant challenge. Existing evaluation metrics primarily focus on global scene statistics, often overlooking fine-grained human details and consequently failing to align with human subjective preference. To bridge this gap, we propose HuM-Eval, a novel human-centric evaluation framework that adopts a coarse-to-fine strategy. Specifically, our framework first utilizes a Vision Language Model to perform a coarse assessment of global video quality. It then proceeds to a fine-grained analysis, using 2D pose to verify anatomical correctness and 3D human motion to evaluate motion stability. Extensive experiments demonstrate that HuM-Eval achieves an average human correlation of 58.2%, outperforming state-of-the-art baselines. Furthermore, we introduce HuM-Bench, a comprehensive benchmark comprising 1,000 diverse prompts, and conduct a detailed evaluation of existing text-to-video models, paving the way for next-generation human motion generation.

37. 【2604.25358】Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings

Link: https://arxiv.org/abs/2604.25358

Authors: Luca Parolari, Nicla Faccioli, Lamberto Ballan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Evaluating layout-guided, fidelity to prescribed, Assessing layout alignment, layout alignment requires, models requires assessing

Comments: Accepted to CVPRF 2026

Click to view abstract

Abstract:Evaluating layout-guided text-to-image generative models requires assessing both semantic alignment with textual prompts and spatial fidelity to prescribed layouts. Assessing layout alignment requires collecting fine-grained annotations, which is costly and labor-intensive. Consequently, current benchmarks rarely provide comprehensive layout evaluation and often remain limited in scale or coverage, making model comparison, ranking, and interpretation difficult. In this work, we introduce a closed-set benchmark (C-Bench) designed to isolate key generative capabilities while providing varying levels of complexity in both prompt structure and layout. To complement this controlled setting, we propose an open-set benchmark (O-Bench) that evaluates models using real-world prompts and layouts, offering a measure of semantic and spatial alignment in the wild. We further develop a unified evaluation protocol that combines semantic and spatial accuracy into a single score, ensuring consistent model ranking. Using our benchmarks, we conduct a large-scale evaluation of six state-of-the-art layout-guided diffusion models, totaling 319,086 generated and evaluated images. We establish a model ranking based on their overall performance and provide detailed breakdowns for text and layout alignment to enhance interpretability. Fine-grained analyses across scenarios and prompt complexities highlight the strengths and limitations of current models. Code is available at this https URL.

38. 【2604.25322】Assessment of the quantitative impact of occlusal positioning splints on temporomandibular joint conditions

Link: https://arxiv.org/abs/2604.25322

Authors: Agnieszka Anna Tomaka, Krzysztof Domino, Dariusz Pojda, Michał Tarnawski

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: occlusal positioning splints, proposed and demonstrated, computational method, occlusal positioning, including CBCT

Comments: 27 pages, 9 figures

Click to view abstract

Abstract:A computational method for quantitative analysis of temporomandibular joint (TMJ) configuration using occlusal positioning splints is proposed and demonstrated. The method models a positioning splint as a physical realization of a predefined rigid transformation of the mandible, derived from multimodal data, including CBCT, facial motion acquisition, and dental scans integrated within a common coordinate system. Splints corresponding to selected mandibular positions are designed and fabricated, and their positioning accuracy is evaluated using repeated scans of plaster models. Discrepancies are represented as error transformations and analyzed statistically in the space of rigid motions. The estimated transformations are propagated to segmented TMJ structures, enabling simulation-based evaluation of joint space changes. Transformation-based error analysis and surface distance metrics are used to quantify differences between planned and achieved configurations. The method enables indirect assessment of TMJ configuration using a single anatomical model and transformation data, reducing the need for repeated imaging across multiple mandibular positions. This study is intended as a methodological demonstration, supported by a clear step-by-step graphical presentation, and does not aim to provide clinical validation.
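
Representing a discrepancy as an error transformation can be sketched as follows: given the planned and the achieved rigid motion of the mandible, the residual motion is the planned transform inverted and composed with the achieved one. The (rotation, translation) pair representation, the function names, and the choice of rotation angle plus translation length as summary values are illustrative assumptions rather than the paper's exact formulation.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def inverse(transform):
    """Invert a rigid transform given as (3x3 rotation, translation)."""
    rot, trans = transform
    rot_t = [[rot[j][i] for j in range(3)] for i in range(3)]
    return rot_t, [-x for x in matvec(rot_t, trans)]


def compose(first, second):
    """Transform equal to applying `second`, then `first`."""
    r1, t1 = first
    r2, t2 = second
    return matmul(r1, r2), [a + b for a, b in zip(matvec(r1, t2), t1)]


def error_transform(planned, achieved):
    """Residual E such that compose(planned, E) equals achieved."""
    return compose(inverse(planned), achieved)


def error_magnitude(error):
    """Rotation angle in degrees and translation length of a rigid transform."""
    rot, trans = error
    cos_angle = (rot[0][0] + rot[1][1] + rot[2][2] - 1.0) / 2.0
    angle_deg = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    return angle_deg, math.sqrt(sum(x * x for x in trans))
```

The translation length is in whatever unit the input translations use, so scans and plans must share a coordinate system and unit before the comparison is meaningful.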

39. 【2604.25319】Edge-Cloud Collaborative Reconstruction via Structure-Aware Latent Diffusion for Downstream Remote Sensing Perception

Link: https://arxiv.org/abs/2604.25319

Authors: Yun Li, Xianju Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-resolution remote sensing, remote sensing data, sensing data faces, exponential surge, surge in high-resolution

Comments: 6 pages, 3 figures

Abstract:The exponential surge in high-resolution remote sensing data faces a severe bottleneck in satellite-to-ground transmission. Limited downlink bandwidth forces the use of extreme high-ratio compression, which irreversibly destroys high-frequency structural details essential for downstream machine perception tasks like object detection. While current super-resolution techniques attempt to recover these details, regression-based methods often yield over-smoothed textures, and generative diffusion models frequently introduce structural hallucinations that mislead detection systems. To address this trade-off, we propose the Structure-Aware Latent Diffusion (SALD) framework, an asymmetric edge-cloud collaborative SR system. At the resource-constrained edge, the system decouples imagery into a highly compressed low-frequency payload and a lightweight soft structural prior. Transmitting this decoupled representation minimizes bandwidth consumption. On the powerful cloud side, we introduce a Structure-Gated Large Kernel (SGLK) module and a Semantic-Guidance Engine (SGE) within the diffusion backbone. These modules leverage the transmitted structural priors to gate large-kernel convolutions, effectively capturing long-range dependencies inherent in aerial scenes while actively suppressing generative hallucinations. Extensive experiments on both the MSCM and UCMerced datasets demonstrate that, even under extreme bandwidth constraints, SALD achieves superior perceptual quality (LPIPS) and significantly enhances downstream performance in both scene classification and small-target detection.

40. 【2604.25316】Towards Robust Deep Learning-based Rumex Obtusifolius Detection from Drone Images

Link: https://arxiv.org/abs/2604.25316

Authors: Fabian Dionys Schrag, Mehmet Ozgur Turkoglu, Konrad Schindler, Ralph Lukas Stoop

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: machine learning model, learning model trained, addresses the challenge, challenge of transferring, transferring a machine

Comments: under review

Abstract:Domain adaptation (DA) addresses the challenge of transferring a machine learning model trained on a source domain to a target domain with a different data distribution. In this work, we study DA for the task of Rumex obtusifolius (Rumex) image classification. We train models on a published, ground vehicle-based dataset (source) and evaluate their performance on a custom target dataset acquired by unmanned aerial vehicles (UAVs). We find that Convolutional Neural Network (CNN) models, specifically ResNets, generalize poorly to the target domain, even after fine-tuning on the source data. Applying moment-matching and maximum classifier discrepancy, two established DA techniques, substantially improves target-domain performance. However, Vision Transformer (ViT) models pretrained with self-supervised objectives (DINOv2, DINOv3) handle domain shifts intrinsically well, surpassing even moment-matching-trained ResNets, likely due to the rich, general-purpose representations acquired during large-scale pretraining. Using ViTs fine-tuned on the source dataset, we demonstrate high classification performances in the range of F1=0.8 on our target dataset. To support further research on DA for weed detection in grassland systems, we publicly release our UAV-based target dataset AGSMultiRumex, comprising data from 15 flights over Swiss meadows.

41. 【2604.25315】SaliencyDecor: Enhancing Neural Network Interpretability through Feature Decorrelation

Link: https://arxiv.org/abs/2604.25315

Authors: Ali Karkehabadi, Jamshid Hassanpour, Houman Homayoun, Avesta Sasan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep neural networks, interpret deep neural, semantically meaningful input, meaningful input features, neural networks

Comments: Accepted for publication at the International Joint Conference on Neural Networks (IJCNN 2026)

Abstract:Gradient-based saliency methods are widely used to interpret deep neural networks, yet they often produce noisy and unstable explanations that poorly align with semantically meaningful input features. We argue that a fundamental cause of this behavior lies in the geometry of learned representations: correlated feature dimensions diffuse attribution gradients across redundant directions, resulting in blurred and unreliable saliency maps. To address this issue, we identify feature correlation as a structural limitation of gradient-based interpretability and propose SaliencyDecor, a training framework that enforces feature decorrelation to improve attribution fidelity without modifying saliency methods or model architectures. By reshaping the feature space toward orthogonality, our approach promotes more concentrated gradient flow and improves the fidelity of saliency-based explanations. SaliencyDecor jointly optimizes classification, prediction consistency under feature masking, and a decorrelation regularizer, requiring no architectural changes or inference-time overhead. Extensive experiments across multiple benchmarks and architectures demonstrate that our method produces substantially sharper and more object-focused saliency maps while simultaneously improving predictive performance, achieving accuracy gains across the datasets. These results establish our method as a principled mechanism for enhancing both interpretability and accuracy, challenging the conventional trade-off between explanation quality and model performance.
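
To make the idea of a decorrelation regularizer more concrete, here is a minimal sketch of one common form: the sum of squared off-diagonal entries of the feature correlation matrix. The abstract does not give the exact regularizer, so the function name `decorrelation_penalty` and this particular formula are assumptions for illustration, not the authors' implementation.

```python
import math

def decorrelation_penalty(features):
    """Sum of squared off-diagonal feature correlations.

    features: list of N samples, each a list of D floats.
    Illustrative form only; the paper's regularizer may differ.
    """
    n = len(features)
    d = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    # Population standard deviation per dimension; fall back to 1.0 for constants.
    stds = [math.sqrt(sum(r[j] ** 2 for r in centered) / n) or 1.0 for j in range(d)]
    penalty = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                corr = sum(r[i] * r[j] for r in centered) / (n * stds[i] * stds[j])
                penalty += corr ** 2
    return penalty

# Two perfectly correlated dimensions versus two uncorrelated ones.
redundant = decorrelation_penalty([[1, 2], [2, 4], [3, 6]])
orthogonal = decorrelation_penalty([[1, 1], [1, -1], [-1, 1], [-1, -1]])
```

In a training loop, a term like this would be added to the classification loss with a weight, so that redundant feature directions are penalized while uncorrelated ones are left alone.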

42. 【2604.25314】Golden RPG: Confidence-Adaptive Region-Aware Noise for Compositional Text-to-Image Generation

Link: https://arxiv.org/abs/2604.25314

Authors: Hao Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: honour multiple sub-prompts, distinct image regions, generation requires, describe distinct image, honour multiple

Comments: 13 pages

Abstract:Compositional text-to-image (T2I) generation requires a model to honour multiple sub-prompts that describe distinct image regions. Recent work shows that the starting noise of a diffusion model carries significant semantic information: "golden" noise predicted from text can substantially raise prompt fidelity. We observe that this noise prediction is, however, fundamentally global: the same network is asked to summarise a long, multi-region prompt with a single text embedding, which becomes the bottleneck whenever the prompt describes scenes with spatially-separated entities. We introduce Golden RPG, a region-aware noise predictor that extends a frozen NPNet with two trainable additions: (i) a per-region FiLM adapter that reshapes the predicted noise according to each sub-prompt; and (ii) a Region Cross-Attention layer injected between two stages of the Swin backbone, allowing different spatial locations to attend to different sub-prompt tokens. To prevent the regional conditioning from degrading samples whose prompts are already easy, we further propose a Confidence-Adaptive Blending head that dynamically predicts, per sample, how strongly the regional signal should override the global signal. We evaluate on the original RPG benchmark (20 prompts, 100 samples) and on four multi-region categories of T2I-CompBench (1,200 images, six competing methods). Golden RPG achieves the highest Cross-Region-Coherence score on every category, while matching the strongest baselines on absolute CLIP-Score and CLIP-IQA. A paired user study further shows a ~67% preference over the strongest baseline. The adapter contains ~2M trainable parameters and adds only 0.6 s of inference overhead on top of SDXL.

43. 【2604.25310】Rapid tracking through strongly scattering media with physics-informed neuromorphic speckle analysis

Link: https://arxiv.org/abs/2604.25310

Authors: Yuqing Cao, Shuo Zhu, Rongzhou Chen, Jingyan Chen, Ni Chen, Edmund Y. Lam

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: tracking fast-moving objects, work addresses, addresses the critical, critical problem, fast-moving objects

Abstract:This work addresses the critical problem of tracking fast-moving objects through strongly scattering media in a low-light environment. Different from existing approaches that use frame-based cameras with fixed exposure times, which trade off signal-to-noise ratio for temporal resolution, we introduce computational neuromorphic tracking (CNT), a physics-informed framework that combines asynchronous event sensing with task-driven speckle analysis for robust motion estimation. We formulate the neuromorphic speckle aggregation as a spatiotemporal speckle representation, jointly optimizing the temporal and spatial parameters to maximize tracking stability under extreme conditions. Extensive experiments demonstrate that our method enables robust motion tracking of 10x faster motion and under 10x dimmer illumination compared to conventional systems. These improvements significantly broaden the operational regime for tracking through scattering media, providing an efficient and scalable solution for demanding scenarios involving rapid motion and low-light conditions.

44. 【2604.25300】DenseScout: Algorithm-System Co-design for Budgeted Tiny Object Selection on Edge Platforms

Link: https://arxiv.org/abs/2604.25300

Authors: Xiong Zhouzhi, Zimo Zeng, Yi Chen, Shuqi Xu, Yunfeng Yan, Donglian Qi

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: strict compute budgets, Deploying tiny object, latency constraints, Deploying tiny, challenging because practical

Comments: 19 pages, 8 figures

Abstract:Deploying tiny object perception on edge platforms is challenging because practical systems must satisfy both strict compute budgets and end-to-end latency constraints. A common strategy is to first select a small number of candidate patches from a high-resolution image and then apply downstream processing only to the selected regions. However, existing detector-based frontends are not well aligned with this setting: strong offline detection accuracy does not necessarily yield effective low-budget patch prioritization, nor does it guarantee usable performance once transport and inference delays are considered. In this work, we study budgeted tiny object selection on edge platforms from a joint algorithm--system perspective. We present DenseScout, a lightweight dense-response selector with only 1.01M parameters, which directly ranks candidate patch locations from a high-resolution scene via a lightweight proxy input and is better aligned with low-budget tiny-object prioritization than detector-style frontends. To bridge offline selector quality and deployable utility, we further develop a transport-aware runtime realization on heterogeneous edge devices and adopt QoS-constrained recall, which counts a target as successfully perceived only if it is covered by the selected regions and the end-to-end processing finishes before the deadline. Experiments show that DenseScout consistently outperforms detector-based baselines in offline budgeted patch-selection evaluation, especially in low-budget regimes, while cross-platform results on RK3588 and Jetson Orin NX show that deployable performance depends jointly on selector quality and runtime realization efficiency. These results suggest that edge tiny object perception should be optimized as an algorithm--system co-design problem rather than as isolated model selection.
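
The QoS-constrained recall described in this abstract can be sketched in a few lines. This is a hedged, per-frame reading of the definition: a target counts only if a selected patch covers it and the end-to-end latency meets the deadline. The full-containment coverage test, the function names, and the numbers are assumptions for illustration; the paper may define coverage and timing differently.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def covers(patch: Box, target: Box) -> bool:
    """True if the target box lies fully inside the patch (assumed criterion)."""
    px0, py0, px1, py1 = patch
    tx0, ty0, tx1, ty1 = target
    return px0 <= tx0 and py0 <= ty0 and tx1 <= px1 and ty1 <= py1

def qos_recall(targets: List[Box], patches: List[Box],
               latency_ms: float, deadline_ms: float) -> float:
    """Fraction of targets covered by selected patches, zero if the deadline is missed."""
    if not targets or latency_ms > deadline_ms:
        return 0.0
    hits = sum(1 for t in targets if any(covers(p, t) for p in patches))
    return hits / len(targets)

# Hypothetical frame: two tiny targets, one selected 64x64 patch.
targets = [(10, 10, 20, 20), (100, 100, 110, 110)]
patches = [(0, 0, 64, 64)]
on_time = qos_recall(targets, patches, latency_ms=30, deadline_ms=50)
too_late = qos_recall(targets, patches, latency_ms=60, deadline_ms=50)
```

The second call illustrates why the metric couples selector quality with runtime efficiency: the same patch selection scores zero once processing overruns the deadline.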

45. 【2604.25299】The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

Link: https://arxiv.org/abs/2604.25299

Authors: Yuwei Sun, Yuxuan Yao, Hui Li, Siyu Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: high-fidelity data synthesis, tasks remains constrained, data synthesis, remains constrained, achieved success

Abstract:Diffusion models have achieved success in high-fidelity data synthesis, yet their capacity for more complex, structured reasoning like text following tasks remains constrained. While advances in language models have leveraged strategies such as latent reasoning and recursion to enhance text understanding capabilities, extending these to multimodal text-to-image generation tasks is challenging due to the continuous and non-discrete nature of visual tokens. To tackle this problem, we draw inspiration from modular human cognition and propose a recursive, sparse mixture-of-experts framework integrated into conventional diffusion models. Our approach introduces a recursive component within joint attention layers that iteratively refines visual tokens over multiple latent steps while efficiently sharing parameters via sparse selection of neural modules. At each step, a gating network is devised to dynamically select specialized neural modules, conditioned on the current visual tokens, the diffusion timestep, and the conditioning information. Comprehensive evaluation on class-conditioned ImageNet image generation tasks and additional studies on the GenEval and DPG benchmark demonstrate the superiority of the proposed method in enhancing model image generation performance.

46. 【2604.25289】Exploring Time Conditioning in Diffusion Generative Models from Disjoint Noisy Data Manifolds

Link: https://arxiv.org/abs/2604.25289

Authors: Liuzhuozheng Li, Zhiyuan Zhan, Shuhong Liu, Dengyang Jiang, Zanyi Wang, Guang Dai, Jingdong Wang, Mengmeng Wang

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: models typically requires, typically requires explicit, time conditioning, typically requires, guide the network

Abstract:Practically, training diffusion models typically requires explicit time conditioning to guide the network through the denoising sampling process. Especially in deterministic methods like DDIM, the absence of time conditioning leads to significant performance degradation. However, other deterministic sampling approaches, such as flow matching, can generate high-quality content without this conditioning, raising the question of its necessity. In this work, we revisit the role of time conditioning from a geometric perspective. We analyze the evolution of noisy data distributions under the forward diffusion process and demonstrate that, in high-dimensional spaces, these distributions concentrate on low-dimensional hyper-cylinder-like manifolds embedded within the input space. Successful generation, we argue, stems from the disentanglement of these manifolds in high-dimensional space. Based on this insight, we modify the forward process of DDIM to align the noisy data manifold with the flow-matching approach, proving that DDIM can generate high-quality content without time conditioning, provided the noisy manifold evolves according to the flow-matching method. Additionally, we extend our framework to class-conditioned generation by decoupling classes into distinct time spaces, enabling class-conditioned synthesis with a class-unconditional denoising model. Extensive experiments validate our theoretical analysis and show that high-quality generation is achievable without explicit conditional embeddings.

47. 【2604.25276】OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

Link: https://arxiv.org/abs/2604.25276

Authors: Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Video Temporal Grounding, open-world settings due, localizing video segments

Comments: CVPR 2026

Abstract:Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). Our OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than direct grounding and design a caption-centric data engine to prompt MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised finetuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs' video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction Chain-of-Thought (CoT) training paradigm. We train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT finetuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding in our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks. Code is available at this https URL.

48. 【2604.25273】Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval

Link: https://arxiv.org/abs/2604.25273

Authors: Guosheng Zhang, Linkai Liu, Keyao Wang, Haixiao Yue, Zhiwen Tan, Xiao Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unified Multimodal Retrieval, Large Multimodal Models, powered by Large, progress in Unified, Unified Multimodal

Abstract:Despite significant progress in Unified Multimodal Retrieval (UMR) powered by Large Multimodal Models (LMMs), existing embedding methods primarily focus on sample-level objectives via contrastive learning while overlooking the crucial subject-level semantics. This limitation hinders the model's ability to group semantically coherent subjects in complex multimodal queries, manifesting as semantic alignment deviation--where models fail to accurately localize salient text-referred regions in visual content. Moreover, without explicit guidance to model salient visual subjects, LMMs tend to over-rely on textual cues, resulting in visual modality neglect and suboptimal utilization of visual knowledge. To this end, we propose Salient Subject-Aware Multimodal Embedding (SSA-ME), a novel framework designed to enhance fine-grained representation learning through saliency-aware modeling. SSA-ME leverages LMMs and visual experts to identify and emphasize salient visual concepts in image-text pairs, and introduces a saliency-guided objective to better align cross-modal attention with semantically meaningful regions. Additionally, a feature regeneration module recalibrates visual features based on the derived saliency maps, ensuring a balanced and semantically coherent integration across modalities. Extensive experiments show that our method achieves state-of-the-art performance on the MMEB benchmark, demonstrating that incorporating subject-level modeling substantially improves multimodal retrieval. Comprehensive qualitative analyses further illustrate the interpretability and effectiveness of our approach.

49. 【2604.25255】Personalized Cross-Modal Emotional Correlation Learning for Speech-Preserving Facial Expression Manipulation

Link: https://arxiv.org/abs/2604.25255

Authors: Tianshui Chen, Yujie Zhu, Jianman Lin, Zhijing Yang, Chunmei Qing, Feng Gao, Liang Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Speech-preserving facial expression, enhance human expressiveness, altering mouth movements, mouth movements tied, Speech-preserving facial

Abstract:Speech-preserving facial expression manipulation (SPFEM) aims to enhance human expressiveness without altering mouth movements tied to the original speech. A primary challenge in this domain is the scarcity of paired data, namely aligned frames of the same individual with identical speech but different expressions, which impedes direct supervision for emotional manipulation. While current Visual-Language Models (VLMs) can extract aligned visual and semantic features, making them a promising source of supervision, their direct application is limited. To this end, we propose a Personalized Cross-Modal Emotional Correlation Learning (PCMECL) algorithm that refines VLM-based supervision through two major improvements. First, standard VLMs rely on a single generic prompt for each emotion, failing to capture expressive variations among individuals. PCMECL addresses this limitation by conditioning on individual visual information to learn personalized prompts, thereby establishing more fine-grained visual-semantic correlations. Second, even with personalization, inherent discrepancies persist between the visual and semantic feature distributions. To bridge this modality gap, PCMECL employs feature differencing to correlate the modalities, providing more precisely aligned supervision by matching the change in visual features to the change in semantic features. As a plug-and-play module, PCMECL can be seamlessly integrated into existing SPFEM models. Extensive experiments across various datasets demonstrate the superior efficacy of our algorithm.
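
The feature-differencing idea, matching the change in visual features to the change in semantic features, can be illustrated with a small sketch. The abstract does not specify the loss, so the cosine-based form below, the function names, and the toy vectors are all assumptions rather than the PCMECL formulation.

```python
import math

def difference(a, b):
    """Element-wise a - b for two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def differencing_loss(vis_src, vis_tgt, sem_src, sem_tgt):
    """1 - cos(delta_visual, delta_semantic): zero when both changes point the same way."""
    delta_vis = difference(vis_tgt, vis_src)
    delta_sem = difference(sem_tgt, sem_src)
    return 1.0 - cosine(delta_vis, delta_sem)

# Toy 2-D features: the visual change is parallel to the semantic change in
# the first case and opposite to it in the second.
aligned = differencing_loss([0, 0], [1, 0], [2, 2], [4, 2])
opposed = differencing_loss([0, 0], [1, 0], [2, 2], [0, 2])
```

Comparing differences rather than raw features means a constant offset between the two feature distributions cancels out, which is one plausible reading of how differencing helps with the modality gap.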

50. 【2604.25235】VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

Link: https://arxiv.org/abs/2604.25235

Authors: Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, Amit Ranjan Trivedi

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: Vision-language models, provide no indication, Vision-language, multimodal systems, automated judges

Abstract:Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: this https URL
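
As a rough illustration of the conformal step, the sketch below builds a split-conformal prediction interval around a judge's point score. It is not the paper's procedure: the abstract says the method works from score-token log-probabilities, whereas this sketch assumes a simpler absolute-residual nonconformity score, a hypothetical 1 to 10 scoring scale, and made-up calibration numbers.

```python
import math

def conformal_quantile(cal_pred, cal_true, coverage=0.8):
    """Finite-sample split-conformal quantile of absolute residuals."""
    residuals = sorted(abs(p - t) for p, t in zip(cal_pred, cal_true))
    n = len(residuals)
    k = math.ceil((n + 1) * coverage)  # rank of the calibrated quantile
    if k > n:
        return math.inf  # too few calibration points for this coverage
    return residuals[k - 1]

def predict_interval(score, q, lo=1.0, hi=10.0):
    """Interval around a new judge score, clipped to the assumed scale."""
    return max(lo, score - q), min(hi, score + q)

# Hypothetical calibration set: judge scores vs. human reference scores.
judge = [3, 5, 7, 4, 6, 8, 2, 5, 9]
human = [4, 5, 5, 4, 7, 8, 4, 6, 9]
q = conformal_quantile(judge, human, coverage=0.8)
interval = predict_interval(9, q)
```

The width of such an interval relative to the score range is the kind of quantity the abstract reports per task category; a judge can rank responses well and still produce intervals too wide to be useful as absolute scores.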

51. 【2604.25231】DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams

Link: https://arxiv.org/abs/2604.25231

Authors: Anirudh Iyengar Kaniyar Narayana Iyengar, Tampu Ravi Kumar, Gaurav Najpande, Manan Suri, Dinesh Manocha, Puneet Mathur, Vivek Gupta

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: interpret structured visual, structured visual representations, circuit schematics, DQA, interpret structured

Comments: 22 Pages, 14 Figures

Abstract:Diagram question answering (DQA) requires models to interpret structured visual representations such as charts, maps, infographics, circuit schematics, and scientific diagrams. Recent vision-language models (VLMs) often achieve high answer accuracy on these tasks, yet correct answers do not guarantee that models ground their reasoning in the diagram regions that support the prediction. Models may instead rely on textual correlations or dataset artifacts without identifying the visual evidence required to verify the answer. This limitation prevents reliable evaluation of diagram reasoning and reduces interpretability. We introduce DRAGON, a benchmark for evaluating evidence-grounded visual reasoning in diagrams. Given a diagram, a question, and the correct answer, a model must predict bounding boxes that correspond to the visual elements required to justify the answer. These evidence regions may include answer-bearing components, textual labels, legends, axes, connectors, and other supporting structures involved in the reasoning process. The DRAGON dataset contains 11,664 annotated question instances collected from six diagram QA datasets: ChartQA, Circuit-VQA, InfographicsVQA, MapIQ, MapWise, and AI2D. We release a 2,445-instance benchmark test set with human-verified reasoning evidence annotations and a standardized evaluation framework. We evaluate eight recent VLMs and analyze their ability to localize reasoning evidence across diverse diagram domains. DRAGON enables systematic evaluation of diagram reasoning and supports future research on models that ground their predictions in visual evidence.

52. 【2604.25213】When the Forger Is the Judge: GPT-Image-2 Cannot Recognize Its Own Faked Documents

Link: https://arxiv.org/abs/2604.25213

Authors: Jiaqi Wu, Yuchen Zhou, Dennis Tsang Ng, Xingyu Shen, Kidus Zewde, Ankit Raj, Tommy Duong, Simiao Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: AI-edited document images, effectively erased, erased the visual, visual boundary, single number

Abstract:OpenAI's GPT-Image-2 has effectively erased the visual boundary between authentic and AI-edited document images: a single number on a receipt can be replaced in under a second for a few cents. We release AIForge-Doc v2, a paired dataset of 3,066 GPT-Image-2 document forgeries with pixel-precise masks in DocTamper-compatible format, and benchmark four lines of defence: human inspectors (N=120, n=365 pair-votes via the public 2AFC site this http URL), TruFor (generic forensic), DocTamper (qcf-568, document-specific), and the same GPT-Image-2 model as a zero-shot self-judge -- asked, to avoid the trivial "image is mostly real" reading, whether any region was generated or edited by an AI image model. Human 2AFC accuracy is 0.501, indistinguishable from chance: even side-by-side, inspectors cannot tell GPT-Image-2 receipt forgeries from authentic counterparts. The three computational judges sit only modestly above (TruFor 0.599, DocTamper 0.585, self-judge 0.532). The self-judge fails consistently, not by chance: across five prompt strategies and four policies for handling ambiguous responses, AUC never rises above 0.59. To rule out the possibility that the two forensic detectors are broken on our source domain rather than blind to AI inpainting, we calibrate each on a same-domain traditional-tampering set built for its training distribution: TruFor reaches AUC 0.962 on cross-camera splicing of our dataset, DocTamper reaches 0.852 on cross-document OCR-token splicing with two-pass JPEG re-encoding. Both retain near-published performance on traditional tampering; switching to GPT-Image-2 inpainting drops AUC by 0.27 to 0.36 (from 0.962 to 0.599 for TruFor; from 0.852 to 0.585 for DocTamper), isolating a detection gap specific to GPT-Image-2 inpainting. We release the dataset, pipeline, four-judge protocol, and calibration sets.

53. 【2604.25208】Towards Seamless Lunar Mosaics: Deep Radiometric Normalization for Cross-Sensor Orbital Imagery Using Chandrayaan-2 TMC Data

Link: https://arxiv.org/abs/2604.25208

Authors: Pratincha Singh, Jai Gopal Singla, Prashant Hemrajani, Nitant Dube, Amithabh, Hinal Patel

Categories: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM)

Keywords: Radiometric inconsistencies remain, Terrain Mapping Camera, Wide Angle Camera, sensor characteristics, seamless lunar mosaics

Abstract:Radiometric inconsistencies remain a major challenge in generating seamless lunar mosaics from multi-mission orbital imagery due to variability in illumination geometry, sensor characteristics, and acquisition conditions. This paper presents a deep learning-based radiometric normalization framework for multi-mission lunar mosaics constructed primarily from ISRO's Chandrayaan-2 Terrain Mapping Camera (TMC) data, supplemented with auxiliary imagery from the SELENE (Kaguya) mission. The proposed approach employs a conditional generative adversarial network (cGAN) comprising a U-Net-based generator and a PatchGAN discriminator to learn a nonlinear radiometric mapping from conventionally mosaicked lunar imagery to a photometrically consistent reference derived from LROC Wide Angle Camera (WAC) data. A patch-based training strategy with overlap-aware inference is adopted to enable scalable processing of large-area mosaics while preserving structural continuity across tile boundaries. Quantitative evaluation using Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Root Mean Square Error (RMSE) demonstrates consistent improvements over traditional histogram-based normalization techniques. The proposed framework achieves enhanced tonal uniformity, reduced seam artifacts, and improved structural coherence across multi-source lunar datasets. These results highlight the effectiveness of learning-based radiometric normalization for large-scale planetary mosaicking and demonstrate its potential for generating high-fidelity lunar surface maps from heterogeneous orbital imagery.
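
Two of the metrics named in this abstract, RMSE and PSNR, have standard definitions that are easy to sketch. The snippet below is a minimal pure-Python version over flattened pixel lists, assuming 8-bit imagery (peak value 255) and toy pixel values; SSIM is omitted because it needs windowed statistics, and none of this reflects the authors' evaluation code.

```python
import math

def rmse(reference, output):
    """Root mean square error between two equal-length pixel sequences."""
    squared = [(r - o) ** 2 for r, o in zip(reference, output)]
    return math.sqrt(sum(squared) / len(squared))

def psnr(reference, output, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = rmse(reference, output)
    if err == 0.0:
        return math.inf
    return 20.0 * math.log10(max_val / err)

# Toy 2x2 patch flattened to a list: one pixel differs by 10 grey levels.
reference = [0, 0, 0, 0]
normalized = [0, 0, 0, 10]
patch_rmse = rmse(reference, normalized)
patch_psnr = psnr(reference, normalized)
```

Lower RMSE and higher PSNR against the LROC WAC-derived reference would indicate closer radiometric agreement, which is the direction of improvement the abstract reports over histogram-based normalization.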

Submission history: v1 submitted by Jai Singla on Tue, 28 Apr 2026 04:26:03 UTC (1,808 KB)

54. 【2604.25188】Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation

Link: https://arxiv.org/abs/2604.25188

Authors: Wentao Jiang, Yuanchan Xu, Heng Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Random Dilated Convolution, Image classification remains, fine-grained feature extraction, Image Classification Network, Fine-Grained Feature Enhancement

Abstract:Image classification remains a fundamental yet challenging task in computer vision, particularly when fine-grained feature extraction and background noise suppression are required simultaneously. Conventional convolutional neural networks, despite their remarkable success in hierarchical feature learning, often struggle with capturing multi-scale contextual information and are susceptible to overfitting when confronted with noisy or irrelevant image regions. In this paper, we propose RDCNet (Image Classification Network with Random Dilated Convolution), a novel architecture built upon ResNet-34 that integrates three synergistic innovations to address these limitations: (1) a Multi-Branch Random Dilated Convolution (MRDC) module that employs parallel branches with varying dilation rates combined with a stochastic masking mechanism to capture fine-grained features across multiple scales while enhancing robustness against noise and overfitting; (2) a Fine-Grained Feature Enhancement (FGFE) module embedded within MRDC that bridges global contextual information with local feature representations through adaptive pooling and bilinear interpolation, thereby amplifying sensitivity to subtle visual patterns; and (3) a Context Excitation (CE) module that leverages softmax-based spatial attention and channel recalibration to dynamically emphasize task-relevant features while suppressing background interference. Extensive experiments conducted on five benchmark datasets -- CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof -- demonstrate that RDCNet consistently achieves state-of-the-art classification accuracy, outperforming the second-best competing methods by margins of 0.02%, 1.12%, 0.18%, 4.73%, and 3.56%, respectively, thereby validating the effectiveness and generalizability of the proposed approach across diverse visual recognition scenarios.

55. 【2604.25186】FCMBench-Video: Benchmarking Document Video Intelligence

Link: https://arxiv.org/abs/2604.25186

Authors: Runze Cui,Fangxin Shang,Yehui Yang,Qing Yang,Tao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Multimedia (cs.MM)

Keywords: financial credit review, evidence traceability matter, remote verification, traceability matter, financial credit

Abstract:Document understanding is a critical capability in financial credit review, onboarding, and remote verification, where both decision accuracy and evidence traceability matter. Compared with static document images, document videos present a temporally redundant and sequentially unfolding evidence stream, require evidence integration across frames, and preserve acquisition-process cues relevant to authenticity-sensitive and anti-fraud review. We introduce FCMBench-Video, a benchmark for document-video intelligence that evaluates document perception, temporal grounding, and evidence-grounded reasoning under realistic capture conditions. For privacy-compliant yet realistic data at scale, we organize construction as an atomic-acquisition and composition workflow that records reusable single-document clips, applies controlled degradations, and assembles long-form multi-document videos with prescribed temporal spans. FCMBench-Video is built from 495 atomic videos composed into 1,200 long-form videos paired with 11,322 expert-annotated question--answer instances, covering 28 document types over 20s--60s duration tiers and 5,960 Chinese / 5,362 English instances. Evaluations on nine recent Video-MLLMs show that FCMBench-Video provides meaningful separation across systems and capabilities: counting is the most duration-sensitive task, Cross-Document Validation and Evidence-Grounded Selection probe higher-level evidence integration, and Visual Prompt Injection provides a complementary robustness dimension. The overall score distribution is broad and approximately bell-shaped, indicating a benchmark that is neither saturated nor dominated by trivial cases. Together, these results position FCMBench-Video as a reproducible benchmark for tracking Video-MLLM progress on document-video understanding and probing capability boundaries in authenticity-sensitive credit-domain applications.

56. 【2604.25178】Lightweight Real-Time Rendering Parameter Optimization via XGBoost-Driven Lookup Tables

Link: https://arxiv.org/abs/2604.25178

Authors: Baijun Tan,Francesco Moretti

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: resource-constrained mobile devices, Achieving a desirable, rendering time, rendering, rendering parameter optimization

Abstract:Achieving a desirable balance between rendering quality and real-time performance is a long-standing challenge in modern game and rendering engines, particularly on resource-constrained mobile devices such as laptops, tablets, and smartphones. Existing approaches to automatic rendering parameter optimization either depend on exhaustive per-scene pre-computation that spans several days, suffer from the prohibitive inference overhead of neural networks that prevents per-frame adaptation, or lack generalizability across heterogeneous hardware and diverse scenes. In this paper, we propose LUT-Opt, a lightweight, general-purpose framework for adaptive per-frame rendering parameter optimization. Our method decomposes the joint optimization of rendering time and image quality into a tractable two-stage pipeline. In the offline stage, we train a pair of XGBoost regressors to predict rendering time and image quality from rendering parameters, hardware state, and scene complexity descriptors. The trained ensemble models are then distilled into compact lookup tables (LUTs) through systematic discretization and a two-phase linear search that first constrains rendering time and subsequently maximizes structural similarity (SSIM). During runtime, the pre-computed LUT is queried every frame in sub-millisecond time, enabling truly adaptive parameter selection with negligible computational overhead. We validate LUT-Opt on two representative rendering techniques -- subsurface scattering (SSS) and hybrid-pipeline ambient occlusion (AO) -- implemented within Unreal Engine 5. Extensive experiments across multiple scenes and GPU configurations demonstrate that LUT-Opt reduces subsurface scattering rendering time by approximately 40% and ambient occlusion rendering time by roughly 70%, while incurring only about 2% increase in image quality error, with per-frame inference latency below 0.1 ms.
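
The two-phase search described in this abstract can be sketched as follows. This is a minimal illustration, assuming the predicted time and SSIM for each parameter setting are already available as plain numbers; `Candidate`, `select_params`, `build_lut`, and the use of time budgets as table keys are hypothetical names and choices, not the authors' implementation.

```python
from typing import Dict, NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    params: tuple         # hypothetical rendering parameter setting
    pred_time_ms: float   # predicted rendering time (e.g. from a time regressor)
    pred_ssim: float      # predicted SSIM (e.g. from a quality regressor)


def select_params(candidates: Sequence[Candidate],
                  budget_ms: float) -> Optional[Candidate]:
    """Two-phase linear search over a discretized parameter grid."""
    # Phase 1: keep only settings whose predicted time fits the budget.
    feasible = [c for c in candidates if c.pred_time_ms <= budget_ms]
    if not feasible:
        return None  # no setting meets the time constraint
    # Phase 2: among feasible settings, take the highest predicted SSIM.
    return max(feasible, key=lambda c: c.pred_ssim)


def build_lut(candidates: Sequence[Candidate],
              budgets_ms: Sequence[float]) -> Dict[float, Optional[Candidate]]:
    """Precompute one entry per discretized time budget (offline stage)."""
    return {b: select_params(candidates, b) for b in budgets_ms}
```

At runtime a table like this is read with a single dictionary lookup per frame, so no model inference is needed in the render loop.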

57. 【2604.25176】Benchmarking OCR Pipelines with Adaptive Enhancement for Multi-Domain Retail Bill Digitization

Link: https://arxiv.org/abs/2604.25176

Authors: Vijaysinh Gaikwad

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: billing documents remains, challenging task due, retail billing documents, Optical Character Recognition, Convolutional Neural Network

Abstract:The digitization of multi-domain retail billing documents remains a challenging task due to variability in scan quality, layout heterogeneity, and domain diversity across commercial sectors. This paper proposes and benchmarks an intelligent, quality-aware adaptive Optical Character Recognition (OCR) pipeline for retail bill digitization spanning five domains: grocery stores, restaurants, hardware shops, footwear outlets, and clothing retailers. The proposed system integrates a Convolutional Neural Network (CNN)-based image enhancement module trained via self-supervised denoising, a Laplacian variance-based image quality analyzer with three-tier routing, a confidence-driven adaptive feedback loop with iterative retry, and an NLP-based post-OCR correction layer. Experiments were conducted on a real-world dataset of 360 heterogeneous retail bill images. Ground truth for quantitative evaluation was generated using an OCR ensemble majority voting strategy, a validated approach for scenarios without manual annotation. The proposed pipeline achieves a Character Error Rate (CER) of 18.4% and Word Error Rate (WER) of 27.6%, representing improvements of 26.4% and 31.2% respectively over the Raw Tesseract baseline. The pipeline additionally achieves a text density of 108.3 words per image, a noise ratio of 2.3%, and a processing time of 3.64 seconds per image - a 6.4x speed advantage over EasyOCR. Image quality PSNR analysis on enhanced MEDIUM and LOW quality images yields an average of 28.7 dB, confirming meaningful enhancement. These results establish a reproducible benchmark for multi-domain retail bill OCR research.
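
The CER and WER figures above follow the standard definitions: the edit distance between reference and OCR output, divided by the reference length, counted over characters for CER and over words for WER. A minimal sketch, assuming plain-string inputs and whitespace tokenization; the function names are illustrative and not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character edits per reference character."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word edits per reference word."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)
```

Because the paper's ground truth comes from an OCR ensemble vote rather than manual annotation, `ref` in such a setup would be the voted transcript.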

58. 【2604.25164】IAM: Identity-Aware Human Motion and Shape Joint Generation

Link: https://arxiv.org/abs/2604.25164

Authors: Wenqi Jia,Zekun Li,Abhay Mittal,Chengcheng Tang,Chuan Guo,Lezi Wang,James Matthew Rehg,Lingling Tao,Size An

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, text-driven human motion, motion, advances in text-driven, text-driven human

Abstract:Recent advances in text-driven human motion generation enable models to synthesize realistic motion sequences from natural language descriptions. However, most existing approaches assume identity-neutral motion and generate movements using a canonical body representation, ignoring the strong influence of body morphology on motion dynamics. In practice, attributes such as body proportions, mass distribution, and age significantly affect how actions are performed, and neglecting this coupling often leads to physically inconsistent motions. We propose an identity-aware motion generation framework that explicitly models the relationship between body morphology and motion dynamics. Instead of relying on explicit geometric measurements, identity is represented using multimodal signals, including natural language descriptions and visual cues. We further introduce a joint motion-shape generation paradigm that simultaneously synthesizes motion sequences and body shape parameters, allowing identity cues to directly modulate motion dynamics. Extensive experiments on motion capture datasets and large-scale in-the-wild videos demonstrate improved motion realism and motion-identity consistency while maintaining high motion quality. Project page: this https URL

59. 【2604.25129】8DNA: 8D Neural Asset Light Transport by Distribution Learning

Link: https://arxiv.org/abs/2604.25129

Authors: Liwen Wu,Haolin Lu,Bing Xu,Miloš Hašan,Ravi Ramamoorthi

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: exhibit intriguing global, long scattering paths, fine-scale fiber scatterings, involve long scattering, glossy interreflections

Abstract:High-fidelity 3D assets exhibit intriguing global illumination effects like subsurface scattering, glossy interreflections, and fine-scale fiber scatterings, which often involve long scattering paths that are expensive to simulate. We introduce 8D neural assets (8DNA) to pre-bake these light transport effects into neural representations. Unlike prior methods that assume far-field lighting and precompute light transport into 6D functions, 8DNA learns the full 8D light transport, enabling accurate rendering under near-field illumination. Our training leverages a distribution-learning formulation that learns light transport from forward path-traced samples, which produces less optimization variance with lower training budget than the prior regression-based approaches. Experiments show our 8DNA rendering closely matches path-traced results under various scene configurations, yet it achieves improved variance reduction and fast inference speeds on challenging assets.

60. 【2604.25128】ResetEdit: Precise Text-guided Editing of Generated Image via Resettable Starting Latent

Link: https://arxiv.org/abs/2604.25128

Authors: Hanyi Wang,Han Fang,Zheng Wang,Shilin Wang,Ee-Chien Chang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modifies local regions, Recent advances, preserving global structure, enabled high-quality image, leading to increasing

Abstract:Recent advances in diffusion models have enabled high-quality image generation, leading to increasing demand for post-generation editing that modifies local regions while preserving global structure. Achieving such flexible and precise editing requires a high-quality starting point, a latent representation that provides both the freedom needed for diverse modifications and the precision required for fine-grained, region-specific control. However, existing inversion-based approaches such as DDIM inversion often yield unsatisfactory starting latents, resulting in degraded edit fidelity and structural inconsistency. Ideally, the most suitable editing anchor should be the original latent used during the generation process, as it inherently captures the scene's structure and semantics. Yet, storing this latent for every generated image is impractical due to massive storage and retrieval costs. To address this challenge, we propose ResetEdit, a proactive diffusion editing framework that embeds recoverable latent information directly into the generation process. By injecting the discrepancy between the clean and diffused latents into the diffusion trajectory and extracting it during inversion, ResetEdit reconstructs a resettable latent that closely approximates the true starting state. Additionally, a lightweight latent optimization module compensates for reconstruction bias caused by VAE asymmetry. Built upon Stable Diffusion, ResetEdit integrates seamlessly with existing tuning-free editing methods and consistently outperforms state-of-the-art baselines in both controllability and visual fidelity.

61. 【2604.25122】M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering

Link: https://arxiv.org/abs/2604.25122

Authors: Jiatong Ma,Longteng Guo,Yuchen Liu,Zijia Zhao,Dongze Hao,Xuanxu Lin,Jing Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visual Question Answering, knowledge-based Visual Question, Question Answering, large language models, multimodal large language

Abstract:We present M$^3$-VQA, a novel knowledge-based Visual Question Answering (VQA) benchmark, to enhance the evaluation of multimodal large language models (MLLMs) in fine-grained multimodal entity understanding and complex multi-hop reasoning. Unlike existing VQA datasets that focus on coarse-grained categories and simple reasoning over single entities, M$^3$-VQA introduces diverse multi-entity questions involving multiple distinct entities from both visual and textual sources. It requires models to perform both sequential and parallel multi-hop reasoning across multiple documents, supported by traceable, detailed evidence and a curated multimodal knowledge base. We evaluate 16 leading MLLMs under three settings: without external knowledge, with gold evidence, and with retrieval-augmented input. The poor results reveal significant challenges for MLLMs in knowledge acquisition and reasoning. Models perform poorly without external information but improve markedly when provided with precise evidence. Furthermore, reasoning-aware agentic retrieval surpasses heuristic methods, highlighting the importance of structured reasoning for complex multimodal understanding. M$^3$-VQA presents a more challenging evaluation for advancing the multimodal reasoning capabilities of MLLMs. Our code and dataset are available at this https URL.

62. 【2604.25102】One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

Link: https://arxiv.org/abs/2604.25102

Authors: Ravikumar Balakrishnan,Sanket Mendapara

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Typographic prompt injection, vision language models', power autonomous agents, prompt injection exploits, injection exploits vision

Abstract:Typographic prompt injection exploits vision language models' (VLMs) ability to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain why certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs including GPT-4o and Claude, twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR ($r{=}{-}0.71$ to ${-}0.93$, $p{<}0.01$), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we directly maximize image-text embedding similarity under bounded $\ell_\infty$ perturbations via CWA-SSA across four surrogate embedding models, stress testing both factors without access to the target model. Experiments across five degradation settings on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL confirm that optimization recovers readability and reduces safety-aligned refusals as two co-occurring effects, with the dominant mechanism depending on the model's safety filter strength and the degree of visual degradation.
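
The correlation reported in this abstract is a Pearson coefficient between embedding distance and attack success rate. A minimal sketch of that statistic, assuming the two quantities have been collected as paired lists, one pair per rendering condition; `pearson_r` is an illustrative helper and the paper's measurement pipeline is not reproduced here.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two paired samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

A strongly negative value, as reported, means that renderings closer to the target text in embedding space tend to succeed more often.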

63. 【2604.25072】Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

Link: https://arxiv.org/abs/2604.25072

Authors: Weixing Wang,Liudvikas Zekas,Anton Hackl,Constantin Alexander Auga,Parisa Shahabinejad,Jona Otholt,Antonio Rueda-Toicen,Gerard de Melo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aim to support, shared representation, Unified Multimodal Models, Unified Multimodal, Unified

Abstract:Unified Multimodal Models (uMMs) aim to support both visual understanding and visual generation within a shared representation. However, existing evaluation protocols assess these two capabilities independently and do not examine whether they are semantically aligned. As a result, it remains unclear whether current uMMs learn coherent unified representations that remain consistent across tasks given a visual concept. We introduce XTC-Bench, a scene-graph-grounded evaluation framework that measures cross-task visual semantic consistency. By deriving both generation prompts and understanding queries from a structured scene graph, our framework enables fact-level alignment analysis across objects, attributes, and relations. We propose Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts, isolating internal consistency from standalone task accuracy. Extensive experiments on eight open-source and one commercial unified models reveal that high generation or understanding performance does not imply strong cross-task alignment, and architectural analysis shows consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone. XTC-Bench provides a reproducible and model-agnostic framework for diagnosing representation-level misalignment, offering a concrete direction for advancing unified multimodal modeling beyond isolated task performance.

64. 【2604.25071】Scalable Secure Biometric Authentication without Auxiliary Identifiers

Link: https://arxiv.org/abs/2604.25071

Authors: Alexander Bienstock,Daniel Escudero,Antigoni Polychroniadou,Zhen Zeng,Pranav Bhat,Ashok Singal,Prashant Sharma,Manuela Veloso

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: biometric authentication, biometric authentication systems, weak passwords, elimination of weak, authentication

Abstract:The prevalence of biometric authentication has been on the rise due to its ease of use and elimination of weak passwords. To date, most biometric authentication systems have been designed for on-device authentication of the device owner (e.g., smartphones and laptops). Recently, biometric authentication systems have started to emerge that are designed to authenticate users against cloud databases storing representations of biometrics for large numbers of users (potentially millions), such as those facilitating biometric payments. However, the use of a large cloud database introduces a significant attack vector, as a breach of the database could lead to the compromise of all enrolled users' sensitive biometric data. Indeed, all such existing systems either do not adequately protect against such a breach, or are impractical to deploy and use due to their high computational overhead. In this work, we present a new biometric authentication system that provides provable security guarantees against data breaches, while remaining scalable and performant. To do so, we marry artificial intelligence with advanced cryptographic techniques in a novel fashion, providing several optimizations along the way. Our work is the first to show that real-world scalable privacy-preserving biometric authentication without auxiliary identifiers is feasible, and we believe that it will spur widespread industrial adoption and further research in this area.

65. 【2604.25065】ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching

Link: https://arxiv.org/abs/2604.25065

Authors: Jong Woo Nam,Amanda S. Rios,Bartlett W. Mel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: humans relies heavily, relies heavily, ability to recognize, humans relies, recognize objects

Abstract:Object recognition (OR) in humans relies heavily on shape cues and the ability to recognize objects across varying 3D viewpoints. Unlike humans, deep networks often rely on non-shape cues such as texture and background, leading to vulnerabilities in generalization and robustness. To address this gap, we introduce ShapeY, a novel and principled benchmarking framework designed to evaluate shape-based recognition capability in OR systems. ShapeY comprises 68,200 grayscale images of 200 3D objects rendered from multiple viewpoints and optionally subjected to non-shape "appearance" changes. Using a nearest-neighbor matching task, ShapeY specifically probes the fine-grained structure of an OR system's embedding space by evaluating whether object views are clustered by 3D shape similarity across varying 3D viewpoints and other non-shape changes. ShapeY provides a suite of quantitative and qualitative performance readouts, including error rate graphs, viewpoint tuning curves, histograms of positive and negative matching scores, and grids showing ordered best matches, which together offer a comprehensive evaluation of an OR system's shape understanding capability. Testing of 321 pre-trained networks with diverse architectures reveals significant challenges in achieving robust shape-based recognition: even state-of-the-art models struggle to generalize consistently across 3D viewpoint and appearance changes, and are prone to infrequent but egregious matches of objects of obviously completely different shape. ShapeY establishes a principled framework for advancing artificial vision systems toward human-like shape recognition capabilities, emphasizing the importance of disentangled and invariant object encodings.
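
The nearest-neighbor matching task in this abstract can be illustrated with a simplified sketch: each view's embedding is matched to its most similar other view, and a match counts as an error when it belongs to a different object. Cosine similarity, the function names, and the leave-one-out protocol are assumptions made here for illustration; ShapeY's actual protocol (for example, which views are excluded as match candidates) is not reproduced.

```python
import math
from typing import Hashable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nn_error_rate(embeddings: Sequence[Sequence[float]],
                  object_ids: Sequence[Hashable]) -> float:
    """Fraction of views whose best match (excluding itself) is another object."""
    errors = 0
    for i, query in enumerate(embeddings):
        # Best match among all other views in the set.
        best_j = max((j for j in range(len(embeddings)) if j != i),
                     key=lambda j: cosine(query, embeddings[j]))
        if object_ids[best_j] != object_ids[i]:
            errors += 1
    return errors / len(embeddings)
```

An embedding space that clusters views by 3D shape would score near zero on this kind of readout.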

66. 【2604.24999】BifDet: A 3D Bifurcation Detection Dataset for Airway-Tree Modeling

Link: https://arxiv.org/abs/2604.24999

Authors: Ali Keshavarzi,Quentin Bouniot,Benjamin M. Smith,Elsa Angelini

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Thoracic Computed Tomography, Thoracic Computed, Computed Tomography, intricate branching network, offer detailed insights

Comments: This manuscript is currently in preparation for submission

Abstract:Thoracic Computed Tomography (CT) scans offer detailed insights into the intricate branching network of the airway tree, which is essential for understanding various respiratory diseases. Airway bifurcations, where airway branches split, are crucial landmarks for understanding lung physiology, disease mechanisms and lesion localization. Despite the significance of bifurcation analysis, a notable lack of datasets annotated for this task hinders the development of advanced automated specialized detection or segmentation tools. In this paper, we introduce BifDet, the first publicly available dataset specialized for 3D airway bifurcation detection, filling a critical gap in existing resources. Our dataset comprises carefully annotated CT scans from the ATM22 open-access cohort with bifurcation bounding boxes covering the parent and daughter branches. As a use-case for demonstrating the potential of BifDet, we fine-tune and evaluate RetinaNet and DETR for 3D airway bifurcation detection on CT scans. We provide detailed pipelines, including preprocessing steps and specific implementation design choices. Results are detailed over various categories of minimal bounding box sizes to serve as a baseline to benchmark future research.

67. 【2604.24997】DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

Link: https://arxiv.org/abs/2604.24997

Authors: Mohamad Zamini,Diksha Shukla

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Open-vocabulary semantic segmentation, assigning pixel-level semantic, pixel-level semantic labels, Open-vocabulary semantic, semantic segmentation requires

Abstract:Open-vocabulary semantic segmentation requires assigning pixel-level semantic labels while supporting an open and unrestricted set of categories. Training-free CLIP-based approaches preserve strong zero-shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves patch-level reliability via lightweight, inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure-aware patch interactions to jointly influence final predictions, with optional instance-aware correction applied as post-processing. DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP's zero-shot generalization. Extensive experiments across eight benchmarks and multiple CLIP backbones demonstrate that DouC consistently outperforms prior training-free methods and scales favorably with model capacity.

68. 【2604.24994】Power Foam: Unifying Real-Time Differentiable Ray Tracing and Rasterization

Link: https://arxiv.org/abs/2604.24994

Authors: Shrisudhan Govindarajan,Daniel Rebain,Dor Verbin,Kwang Moo Yi,Anish Prabhu,Andrea Tagliasacchi

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: modern rasterization pipelines, capabilities of foam-based, ray tracing capabilities, foam-based ray tracing, rasterization pipelines

Abstract:We introduce a differentiable 3D representation that unifies the ray tracing capabilities of foam-based ray tracing with the efficiency of modern rasterization pipelines. While prior foam representations enable constant-time ray traversal through an explicit volumetric partition of space, their potentially unbounded cells hinder efficient tile-based rasterization. We address this limitation by generalizing Voronoi foams to bounded power diagrams with controllable cell extents, enabling spatially bounded primitives without requiring expensive Delaunay triangulations during training. We further introduce an oriented surface formulation that explicitly models interfaces between interior and exterior regions, and decouple geometry from appearance by embedding differentiable texture directly on these surfaces. Together, these contributions yield a representation that preserves state-of-the-art ray tracing efficiency while achieving rasterization performance competitive with current generation 3DGS, providing a practical path toward unified real-time differentiable rendering.
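
For readers unfamiliar with power diagrams: each site carries a center and a weight (written here as a squared radius), a point belongs to the site with the smallest power distance, and equal weights recover an ordinary Voronoi diagram. A minimal sketch of that textbook definition follows; the function names are illustrative, and the paper's bounded cells and oriented surface formulation are not modeled.

```python
from typing import Sequence, Tuple

Site = Tuple[Sequence[float], float]  # (center, radius)


def power_distance(x: Sequence[float], center: Sequence[float],
                   radius: float) -> float:
    """Power distance of x to a weighted site: ||x - c||^2 - r^2."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, center)) - radius ** 2


def power_cell(x: Sequence[float], sites: Sequence[Site]) -> int:
    """Index of the site whose power cell contains x (smallest power distance)."""
    return min(range(len(sites)),
               key=lambda i: power_distance(x, sites[i][0], sites[i][1]))
```

Increasing one site's radius moves the cell boundary toward its neighbors, which is the sense in which the weights control cell extents.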

69. 【2604.24990】A New Kind of Network? Review and Reference Implementation of Neural Cellular Automata

Link: https://arxiv.org/abs/2604.24990

Authors: Martin Spitznagel,Janis Keuper

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Stephen Wolfram proclaimed, Neural Cellular Automata, simple recursive programs, Kind Of Science, Cellular Automata

Abstract:Stephen Wolfram proclaimed in his 2003 seminal work "A New Kind Of Science" that simple recursive programs in the form of Cellular Automata (CA) are a promising approach to replace currently used mathematical formalizations, e.g. differential equations, to improve the modeling of complex systems. Over two decades later, while Cellular Automata have still been waiting for a substantial breakthrough in scientific applications, recent research showed new and promising approaches which combine Wolfram's ideas with learnable Artificial Neural Networks: So-called Neural Cellular Automata (NCA) are able to learn the complex update rules of CA from data samples, allowing them to model complex, self-organizing generative systems. The aim of this paper is to review the existing work on NCA and provide a unified modular framework and notation, as well as a reference implementation in the open-source library NCAtorch.
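
As background for the classical side of this abstract, the sketch below implements one update step of an elementary (1D, binary, radius-1) cellular automaton using Wolfram's rule numbering. It is a generic illustration and is not taken from NCAtorch; the periodic boundary is a choice made here. A Neural Cellular Automaton replaces the fixed rule table with a learned update function.

```python
from typing import List, Sequence


def eca_step(cells: Sequence[int], rule: int) -> List[int]:
    """Apply one step of an elementary cellular automaton (periodic boundary)."""
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        # The 3-cell neighborhood forms a 3-bit index into the 8-bit rule number.
        idx = (left << 2) | (center << 1) | right
        out.append((rule >> idx) & 1)
    return out
```

Every cell applies the same local rule to its neighborhood, which is the property NCA keep while learning the rule from data.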

70. 【2604.24954】Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

Link: https://arxiv.org/abs/2604.24954

Authors: NVIDIA: Amala Sanjay Deshmukh,Kateryna Chumachenko,Tuomas Rintamaki,Matthieu Le,Tyler Poon,Danial Mohseni Taheri,Ilia Karmanov,Guilin Liu,Jarno Seppanen,Arushi Goel,Mike Ranzinger,Greg Heinrich,Guo Chen,Lukas Voegtle,Philipp Fischer,Timo Roman,Karan Sapra,Collin McCarthy,Shaokun Zhang,Fuxiao Liu,Hanrong Ye,Yi Dong,Mingjie Liu,Yifan Peng,Piotr Zelasko,Zhehuai Chen,Nithin Rao Koluguri,Nune Tadevosyan,Lilit Grigoryan,Ehsan Hosseini Asl,Pritam Biswas,Leili Tavabi,Yuanhang Su,Zhiding Yu,Peter Jin,Alexandre Milesi,Netanel Haber,Yao Xu,Sarah Amiraslani,Nabin Mulepati,Eric Tramel,Jaehun Jung,Ximing Lu,Brandon Cui,Jin Xu,Zhiqi Li,Shihao Wang,Yuanguo Kuang,Shaokun Zhang,Huck Yang,Boyi Li,Hongxu Yin,Song Han,Pavlo Molchanov,Adi Renduchintala,Charles Wang,David Mosallanezhad,Soumye Singhal,Luis Vega,Katherine Cheung,Sreyan Ghosh,Yian Zhang,Alexander Bukharin,Venkat Srinivasan,Johnny Greco,Andre Manoel,Maarten Van Segbroeck,Suseella Panguliri,Rohit Watve,Divyanshu Kakwani,Shubham Pachori,Jeffrey Glick,Radha Sri-Tharan,Aileen Zaman,Khanh Nguyen,Shi Chen,Jiaheng Fang,Qing Miao,Wenfei Zhou,Yu Wang,Zaid Pervaiz Bhat,Varun Praveen,Arihant Jain,Ramanathan Arunachalam,Tomasz Kornuta,Ashton Sharabiani,Amy Shen,Wei Huang,Yi-Fu Wu,Ali Roshan Ghias,Huiying Li,Brian Yu,Nima Tajbakhsh,Chen Cui,Wenwen Gao,Li Ding,Terry Kong,Manoj Kilaru,Anahita Bhiwandiwalla

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: inputs alongside text, natively support audio, support audio inputs, audio inputs alongside, Nano Omni

Abstract:We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.

71. 【2604.24953】ViPO: Visual Preference Optimization at Scale

Link: https://arxiv.org/abs/2604.24953

Authors: Ming Li,Jie Wu,Justin Cui,Xiaojie Li,Rui Wang,Chen Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remains largely unexplored, paradigm remains largely, largely unexplored, crucial for improving, effectively scale

Comments: Project Page: [this https URL](https://liming-ai.github.io/ViPO); Code: [this https URL](https://github.com/liming-ai/ViPO)

Abstract:While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm remains largely unexplored. Current open-source preference datasets contain conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn preferences, hindering effective scaling. To enhance robustness against noise, we propose Poly-DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence based on dataset characteristics, enabling effective learning across diverse data distributions. Beyond biased patterns, existing datasets suffer from low resolution, limited prompt diversity, and imbalanced distributions. To facilitate large-scale visual preference optimization by tackling data bottlenecks, we construct ViPO, a massive-scale preference dataset with 1M image pairs at 1024px across five categories and 300K video pairs at 720p+ across three categories. State-of-the-art generative models and diverse prompts ensure reliable preference signals with balanced distributions. Remarkably, when applying Poly-DPO to our high-quality dataset, the optimal configuration converges to standard DPO. This convergence validates dataset quality and Poly-DPO's adaptive nature: sophisticated optimization becomes unnecessary with sufficient data quality, yet remains valuable for imperfect datasets. We validate our approach across visual generation models. On noisy datasets like Pick-a-Pic V2, Poly-DPO achieves 6.87 and 2.32 gains over Diffusion-DPO on GenEval for SD1.5 and SDXL, respectively. For ViPO, models achieve performance far exceeding those trained on existing open-source preference datasets. These results confirm that addressing both algorithmic adaptability and data quality is essential for scaling visual preference optimization.
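
For reference, the standard DPO objective that Poly-DPO extends can be sketched per preference pair as below. The inputs are assumed to be scalar log-likelihoods of the preferred (winner) and rejected (loser) samples under the trained and reference models; Diffusion-DPO substitutes denoising-error terms for these, and the additional polynomial term of Poly-DPO is not specified in the abstract, so neither is reproduced here. The default `beta` is an arbitrary illustrative value.

```python
import math


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy favors the winner than the reference does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy matches the reference and shrinks as the policy shifts probability toward the preferred sample.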

72. 【2604.24952】Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Link: https://arxiv.org/abs/2604.24952

Authors: Xinxin Liu, Ming Li, Zonglin Lyu, Yuzhang Shang, Chen Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: encompassing aesthetics, detail fidelity, Direct Preference Optimization, Diffusion Direct Preference, Human visual preferences

Abstract:Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data. Our method starts by training on a consensus-filtered clean subset, then uses this model as an implicit classifier to generate pseudo-labels for the noisy set for iterative refinement. Experimental results demonstrate that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training. We will release our code and models at: this https URL

73. 【2604.24947】Subjective Portrait Region Cropping in Landscape Videos with Temporal Annotation Smoothing

Link: https://arxiv.org/abs/2604.24947

Authors: Cheng-Han Lee, Maniratnam Mandal, Neil Birkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse handheld display, handheld display resolutions, mobile video consumption, ratios poses challenges, video

Comments: Under review at IEEE Transactions on Image Processing. The code, models and dataset will be available at: [this https URL](https://github.com/steven413d/LIVE-YT-VideoCropping)

Abstract:With the rise of mobile video consumption across diverse handheld display resolutions and orientation modes, altering videos to fit different aspect ratios poses challenges. Static cropping and border padding often compromise visual quality, while warping may distort a video's intended meaning. Here we advocate for a more effective approach: cropping significant regions within video frames in a temporal manner, while minimizing distortion and preserving essential content. One barrier to solving this problem is the lack of a sufficiently large-scale database devoted to informing these tasks. Towards filling this gap, we introduce the LIVE-YouTube Video Cropping (LIVE-YT VC) database, featuring 1800 videos annotated by 90 human subjects. Using videos sourced from the YouTube-UGC and LSVQ databases, this new resource is the largest publicly available subjective video portrait region cropping database. We also introduce a post-processed version of the database, called LIVE-YT VC++, in which a novel intra-frame temporal filter was deployed to smooth subjective annotations within each video. We demonstrate the usefulness of this new data resource using the SmartVidCrop algorithm and state-of-the-art video grounding models, in hopes of establishing our subjective dataset as a benchmark for future research. Our contributions offer a resource for advancing video aspect ratio transformation models towards ensuring that reshaped mobile-friendly video content retains its quality and meaning. Since our labels bear resemblances to video saliency annotations, we also conducted an additional analysis to explore the similarity between our labels and video saliency predictions. Finally, we repurposed state-of-the-art video grounding models for aspect ratio change tasks, and fine-tuned them on our dataset. As a service to the research community, we plan to open source the project.

74. 【2604.24921】Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System

Link: https://arxiv.org/abs/2604.24921

Authors: Yifei Wei, Linqing Zhong, Yi Liu, Yuxiang Lu, Xindong He, Maoqing Yao, Guanghui Ren

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: executable physical actions, high-level semantic instructions, generalist robotic manipulation, instructions into executable, executable physical

Comments: Accepted to the Main Conference of ACL 2026. Project page: [this https URL](https://libra-vla.github.io/)

Abstract:Vision-Language-Action (VLA) models are a promising paradigm for generalist robotic manipulation, grounding high-level semantic instructions into executable physical actions. However, prevailing approaches typically adopt a monolithic generation paradigm, directly mapping visual-linguistic features to high-frequency motor commands in a flat, non-hierarchical fashion. This strategy overlooks the inherent hierarchy of robotic manipulation, where complex actions can be naturally modeled in a Hybrid Action Space that decomposes into discrete macro-directional reaching and continuous micro-pose alignment. Ignoring this hierarchy severely widens the semantic-actuation gap and imposes a heavy representational burden on grounding high-level semantics to continuous actions. To address this, we introduce Libra-VLA, a novel Coarse-to-Fine Dual-System VLA architecture. We explicitly decouple the learning complexity into a coarse-to-fine hierarchy to strike a training equilibrium, while simultaneously leveraging this structural modularity to implement an asynchronous execution strategy. The Semantic Planner predicts discrete action tokens capturing macro-directional intent, while the Action Refiner conditions on coarse intent to generate high-frequency continuous actions for precise alignment. Crucially, our empirical analysis reveals that performance follows an inverted-U curve relative to action decomposition granularity, peaking exactly when the learning difficulty is balanced between the two sub-systems. With the asynchronous design, our approach offers a scalable, robust, and responsive solution for open-world manipulation.

75. 【2604.24919】Agentic AI for Remote Sensing: Technical Challenges and Research Directions

Link: https://arxiv.org/abs/2604.24919

Authors: Muhammad Akhtar Munir, Muhammad Umer Sheikh, Akashah Shabbir, Muhammad Haris Khan, Fahad Khan, Xiao Xiang Zhu, Begum Demir, Salman Khan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Earth Observation, require coordinated reasoning, moving beyond static, static prediction, multi-step analytical workflows

Comments: 31 pages. Position Paper

Abstract:Earth Observation (EO) is moving beyond static prediction toward multi-step analytical workflows that require coordinated reasoning over data, tools, and geospatial state. While foundation models and vision-language models have expanded representation learning and language-grounded interaction for remote sensing, and agentic AI has demonstrated long-horizon reasoning and external tool use, EO is not a straightforward extension of generic agentic AI. EO workflows operate over georeferenced, multi-modal, and temporally structured data, where operations such as reprojection, resampling, compositing, and aggregation actively transform the underlying state and can constrain subsequent analysis. As a result, errors may propagate silently across steps, and correctness depends not only on internal coherence, but also on geospatial consistency, temporally valid comparisons, and physical validity. This position paper argues that these challenges are structural rather than incidental. We identify the implicit assumptions commonly made in generic agentic models, analyze how they break in geospatial workflows, and characterize the resulting failure modes in multi-step EO pipelines. We then outline design principles for EO-native agents centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and learning objectives aligned with geospatial and physical validity. Finally, we present research directions spanning EO-specific benchmarks, hybrid supervised and reinforcement learning, constrained self-improvement, and trajectory-level evaluation beyond final-answer accuracy. Building reliable geospatial agents therefore requires rethinking agent design around the physical, geospatial, and workflow constraints that govern EO analysis.

76. 【2604.24894】VISION-SLS: Safe Perception-Based Control from Learned Visual Representations via System Level Synthesis

Link: https://arxiv.org/abs/2604.24894

Authors: Antoine P. Leeman, Shuyu Zhan, Melanie N. Zeilinger, Glen Chou

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)

Keywords: high-resolution RGB images, System Level Synthesis, nonlinear output-feedback control, high-resolution RGB, propose VISION-SLS

Comments: Extended version; conference version to appear in Robotics: Science and Systems XXII (RSS 2026)

Abstract:We propose VISION-SLS, a method for nonlinear output-feedback control from high-resolution RGB images which provides robust constraint satisfaction guarantees under calibrated uncertainty bounds despite partial observability, sensor noise, and nonlinear dynamics. To enable scalability while retaining guarantees, we propose: (i) a learned low-dimensional observation map from pretrained visual features with state-dependent error bounds, and (ii) a causal affine time-varying output-feedback policy optimized via System Level Synthesis (SLS). We develop a scalable, novel solver for the resulting nonconvex program that leverages sequential convex programming coupled with efficient Riccati recursions. On two simulated visuomotor tasks (a 4D car and a 10D quadrotor) with = 512 x 512 pixels and a 59D humanoid task with partial observability, our method enables safe, information-gathering behavior that reduces uncertainty while guaranteeing constraint satisfaction with empirically-calibrated error bounds. We also validate our method on hardware, safely controlling a ground vehicle from onboard images, outperforming baselines in safety rate and solve times. Together, these results show that learned visual abstractions coupled with an efficient solver make SLS-based safe visuomotor output-feedback practical at scale. The code implementation of our method is available at this https URL.

77. 【2604.24893】Interactive Episodic Memory with User Feedback

Link: https://arxiv.org/abs/2604.24893

Authors: Nikesh Subedi, Loris Bazzani, Ziad Al-Halah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long egocentric video, natural language queries, egocentric video, episodic memory, natural language

Comments: Accepted to CVPR 2026. Project Page: [this https URL](https://nsubedi11.github.io/refocus)

Abstract:In episodic memory with natural language queries (EM-NLQ), a user may ask a question (e.g., "Where did I place the mug?") that requires searching a long egocentric video, captured from the user's perspective, to find the moment that answers it. However, queries can be ambiguous or incomplete, leading to incorrect responses. Current methods ignore this key aspect and address EM-NLQ in a one-shot setup, limiting their applicability in real-world scenarios. In this work, we address this gap and introduce the Episodic Memory with Questions and Feedback task (EM-QnF). Here, the user can provide feedback on the model's initial prediction or add more information (e.g., "Before this. I'm looking for the big blue mug not the white one"), helping the model refine its predictions interactively. To this end, we collect datasets for feedback-based interaction and propose a lightweight training scheme that avoids expensive sequential optimization. We also introduce a plug-and-play Feedback ALignment Module (FALM) that enables existing EM-NLQ models to incorporate user feedback effectively. Our approach significantly improves over the state of the art on three challenging benchmarks and is better than or competitive with commercial large vision-language models while remaining efficient. Evaluation with human-generated feedback shows that it generalizes well to real-world scenarios.

78. 【2604.24885】VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

Link: https://arxiv.org/abs/2604.24885

Authors: Maitreya Patel, Jingtao Li, Weiming Zhuang, Yezhou Yang, Lingjuan Lv

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: image synthesis approach, aspect ratios, narrowing the gap, synthesis approach, approach that generalizes

Comments: Accepted at CVPR'26 | Project Page: [this https URL](https://github.com/SonyResearch/VibeToken)

Abstract:We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32-256 tokens, achieving a state-of-the-art efficiency and performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024x1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen -- whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024x1024) -- VibeToken-Gen maintains a constant 179G FLOPs (63.4x more efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.
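
The efficiency figure can be sanity-checked from the rounded numbers quoted in the abstract. The sketch below is only an arithmetic check: the variable names are illustrative, and because the inputs are the abstract's rounded values, only approximate agreement with the stated 63.4x should be expected.

```python
# Figures as rounded in the abstract.
llamagen_flops = 11e12    # "11T FLOPs at 1024x1024"
vibetoken_flops = 179e9   # "constant 179G FLOPs"
stated_ratio = 63.4       # "63.4x"

# Ratio obtained directly from the rounded figures.
ratio_from_rounded = llamagen_flops / vibetoken_flops

# Baseline cost that the stated ratio would imply.
implied_baseline = stated_ratio * vibetoken_flops

print(f"ratio from rounded figures: {ratio_from_rounded:.1f}x")
print(f"baseline implied by the stated ratio: {implied_baseline / 1e12:.2f}T FLOPs")
```

The rounded inputs give a ratio of roughly 61x, and the stated 63.4x implies a baseline of about 11.3T FLOPs, which rounds to the quoted 11T, so the figures are mutually consistent up to rounding.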

79. 【2604.24877】Learning Illumination Control in Diffusion Models

Link: https://arxiv.org/abs/2604.24877

Authors: Nishit Anand, Manan Suri, Christopher Metzler, Dinesh Manocha, Ramani Duraiswami

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: visual content creation, Controlling illumination, content creation, essential for photography, photography and visual

Comments: Accepted to ICLR 2026 ReALM-GEN Workshop on Diffusion Models. Project Website: [this https URL](https://nishitanand.github.io/relighting-diffusion-website)

Abstract:Controlling illumination in images is essential for photography and visual content creation. While closed-source models have demonstrated impressive illumination control, open-source alternatives either require heavy control inputs like depth maps or do not release their data and code. We present a fully open-source and reproducible pipeline for learning illumination control in diffusion models. Our approach builds a data engine that transforms well-lit images into supervised training triplets consisting of a poorly-illuminated input image, a natural language lighting instruction, and a well-illuminated output image. We finetune a diffusion model on this data and demonstrate significant improvements over baseline SD 1.5, SDXL, and FLUX.1-dev models in perceptual similarity, structural similarity, and identity preservation. Our work provides a reproducible solution built entirely with open-source tools and publicly available data. We release all our code, data, and model weights publicly.

80. 【2604.24876】ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation

Link: https://arxiv.org/abs/2604.24876

Authors: Yu Xin, Gorkem Can Ates, Jun Ma, Sumin Kim, Ying Zhang, Kaleb E Smith, Kuang Gong, Wei Shao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: spatial prompt based, prompt based models, medical image segmentation, image segmentation offers, medical image

Abstract:Text-guided 3D medical image segmentation offers a flexible alternative to class-based and spatial-prompt-based models by allowing users to specify regions of interest directly in natural language. This paradigm avoids reliance on predefined label sets, reduces ambiguous outputs, and aligns more naturally with clinical workflows. However, existing text-guided frameworks are often computationally expensive, exhibit weak text-volume feature alignment, and fail to capture fine anatomical details. We propose ESICA, a lightweight and scalable framework that addresses these challenges through three innovations: (1) a similarity-matrix-based mask prediction formulation that enhances semantic alignment, (2) an efficient decomposed decoder with adapter modules for accurate volumetric decoding, and (3) a two-pass refinement strategy that sharpens boundaries and resolves uncertain regions. To improve training stability and generalization, ESICA adopts a two-stage scheme consisting of positive-only pretraining followed by balanced fine-tuning. On the CVPR BiomedSegFM benchmark spanning five imaging modalities (CT, MRI, PET, ultrasound, and microscopy), ESICA achieves state-of-the-art segmentation accuracy, while the compact ESICA4 Lite variant attains similar segmentation performance with substantially fewer parameters, yielding a superior efficiency-accuracy trade-off. Our framework advances text-guided segmentation toward efficient, scalable, and clinically deployable systems. Code will be made publicly available at this https URL.

81. 【2604.24767】Automated detection of pediatric congenital heart disease from phonocardiograms using deep and handcrafted feature fusion

Link: https://arxiv.org/abs/2604.24767

Authors: Abdul Jabbar, Ethan Grooby, Yang Yi Poh, Khawza I. Ahmad, Md Hassanuzzaman, Raqibul Mostafa, Ahsan H. Khandoker, Faezeh Marzbanrad

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Congenital heart disease, live births worldwide, Congenital heart, birth defect, births worldwide

Comments: 9 Pages, 5 figures. Computers in Biology and Medicine, 2025

Abstract:Congenital heart disease (CHD) is the most common type of birth defect, impacting about 1% of live births worldwide. Echocardiography, the gold-standard diagnostic method, is costly and inaccessible in low-resource settings. Diagnosis is delayed by the limited availability of skilled experts, whose ability to interpret pathological patterns varies significantly, causing inter- and intra-clinician variability. Therefore, we present a new method for a more accessible diagnostic modality, the digital stethoscope, to detect CHDs. Our method is based on deep feature fusion, integrating deep and handcrafted features for the automated early detection of CHDs. For this work, phonocardiography (PCG) recordings were obtained from 751 pediatric subjects (age: 1 month to 16 years) in Bangladesh, ranging from infants to adults, at four auscultation locations: mitral valve (MV), aortic valve (AV), pulmonary valve (PV), and tricuspid valve (TV). These recordings were labeled as either CHD or non-CHD based on diagnoses confirmed by cardiologists. The results demonstrated that our proposed model achieved an accuracy of 92%, a sensitivity of 91%, and a specificity of 91%, based on a patient-wise split of 70% training, 20% validation, and 10% testing. Furthermore, it achieved an Area Under the Receiver Operating Characteristic curve (AUROC) of 96% and an F1-score of 92%. This model promises efficient real-time remote detection of CHDs as a cost-effective screening tool for low-resource settings.
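
A patient-wise split of the kind mentioned above keeps every recording from a given subject in a single partition, so that the same child never appears in both training and test data. The sketch below illustrates the idea under stated assumptions: the function name, the fixed seed, the rounding rule, and the synthetic recording list (four locations for every subject) are all hypothetical and are not taken from the paper.

```python
import random


def patient_wise_split(subject_ids, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split unique subject IDs into train/val/test sets.

    Splitting on subjects (not recordings) prevents leakage between
    partitions when one subject contributes several recordings.
    """
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    train = set(ids[:n_train])
    val = set(ids[n_train:n_train + n_val])
    test = set(ids[n_train + n_val:])  # remainder goes to the test set
    return train, val, test


# Hypothetical data: 751 subjects, one recording per auscultation location.
recordings = [(s, loc) for s in range(751) for loc in ("MV", "AV", "PV", "TV")]
train, val, test = patient_wise_split(s for s, _ in recordings)

# Assign each recording to the partition of its subject.
train_recs = [r for r in recordings if r[0] in train]
print(len(train), len(val), len(test))
```

Splitting at the recording level instead would let recordings from one subject land on both sides of the split, which tends to inflate reported accuracy.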

82. 【2604.25884】QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

Link: https://arxiv.org/abs/2604.25884

Authors: Shuxiang Cao, Zijian Zhang, Abhishek Agarwal, Grace Bratrud, Niyaz R. Beysengulov, Daniel C. Cole, Alejandro Gómez Frieiro, Elena O. Glen, Hao Hsu, Gang Huang, Raymond Jow, Greshma Shaji, Tom Lubowe, Ligeng Zhu, Luis Mantilla Calderón, Nicola Pancotti, Joel Pendleton, Brandon Severin, Charles Etienne Staub, Sara Sussman, Antti Vepsäläinen, Neel Rajeshbhai Vora, Yilun Xu, Varinia Bernales, Daniel Bowring, Elica Kyoseva, Ivan Rungger, Giulia Semeghini, Sam Stanwyck, Timothy Costa, Alán Aspuru-Guzik, Krysta Svore

Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)

Keywords: interpreting experimental data, universal human-readable representation, systematic evaluation exists, computing calibration depends, Quantum computing calibration

Abstract:Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches 74.7 zero-shot average score.

83. 【2604.25755】Quantum-Inspired Robust and Scalable SAR Object Classification

Link: https://arxiv.org/abs/2604.25755

Authors: Maximilian Scharf, Marco Trenti, Felix Bock, Padraig Davidson, Tobias Brosch, Benjamin Rodrigues de Miranda, Sigurd Huber, Timo Felser

Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)

Keywords: high dynamic range, image classification naturally, requiring robust classification, SAR image classification, deal with huge

Comments: 6 pages, 6 figures, EUSAR 2026 conference

Abstract:SAR image classification must contend with heavy noise and a high dynamic range, and therefore requires particularly robust classification models. Additionally, the deployment of these models on edge devices, such as drones and military aircraft, requires a careful balance between model size and classification accuracy. This study explores the potential of tensor networks to meet these robustness requirements, specifically evaluating their resilience to data poisoning. Unlike previous works that concentrated on conventional neural networks for SAR object detection, this research focuses on the robustness and model reduction capabilities of tensor networks in object classification. Our findings indicate that tensor networks are adept at addressing both the challenges of robustness and the need for model efficiency, thereby contributing valuable insights to the ongoing discourse in radar applications and deep learning methodologies in general.

84. 【2604.25685】Robustness Evaluation of a Foundation Segmentation Model Under Simulated Domain Shifts in Abdominal CT: Implications for Health Digital Twin Deployment

Link: https://arxiv.org/abs/2604.25685

Authors: Sanghati Basu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains insufficiently quantified, demonstrated strong generalization, clinically realistic medical, shifts remains insufficiently, insufficiently quantified

Comments: 8 Pages, 5 Tables, 2 Figures

Abstract:Foundation segmentation models such as the Segment Anything Model (SAM) have demonstrated strong generalization across natural images; however, their robustness under clinically realistic medical imaging domain shifts remains insufficiently quantified. We present a systematic slice-level robustness audit of SAM (ViT-B) for spleen segmentation in abdominal CT using 1,051 nonempty slices from 41 volumes in the Medical Segmentation Decathlon. A standardized ground-truth-derived bounding-box protocol was used to isolate encoder robustness from prompt uncertainty. Controlled perturbations simulating inter-scanner variability, including Gaussian noise, blur, contrast scaling, gamma correction, and resolution mismatch, were applied across ten conditions. The clean baseline achieved a mean Dice score of 0.9145 (95% CI: [0.909, 0.919]) with a failure rate of 0.67%. Across all perturbations, the absolute mean $\Delta$Dice remained below 0.01. Paired Wilcoxon signed-rank tests with Benjamini-Hochberg false discovery rate correction identified statistically significant but small-magnitude changes under selected conditions, while McNemar analysis showed no significant increase in failure probability. These findings indicate that SAM exhibits stable segmentation behavior under moderate CT domain shifts, supporting its role as a robust foundation baseline for medical image segmentation research. As health digital twins increasingly incorporate foundation segmentation models for anatomical modeling and organ-level monitoring, formal characterization of robustness under real-world imaging variability is a necessary step toward trustworthy deployment.
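
The two ingredients of the evaluation protocol described above, a box prompt derived from the ground-truth mask and the Dice score, can be sketched in a few lines. This is a minimal illustration on a toy 3x3 slice using plain Python lists; the function names and the box convention `(x0, y0, x1, y1)` are assumptions for the example, and the paper's actual preprocessing and failure threshold are not reproduced.

```python
def dice(pred, gt):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for binary masks (lists of 0/1 rows)."""
    inter = sum(p and g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    total = sum(map(sum, pred)) + sum(map(sum, gt))
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * inter / total


def bbox_from_mask(mask):
    """Tight bounding box (x0, y0, x1, y1) around a nonempty binary mask."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for row in mask for j, v in enumerate(row) if v]
    return (min(cols), min(rows), max(cols), max(rows))


# Toy slice: 4 foreground pixels in the ground truth, 3 of them predicted.
gt = [[0, 1, 1],
      [0, 1, 1],
      [0, 0, 0]]
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]

print(bbox_from_mask(gt))  # box prompt derived from the ground truth
print(dice(pred, gt))      # 2*3 / (3 + 4)
```

Deriving the box from the ground truth removes prompt placement as a source of variation, so changes in Dice under perturbation can be attributed to the model rather than to the prompt.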

85. 【2604.25371】PhyloSDF: Phylogenetically-Conditioned Neural Generation of 3D Skull Morphology via Residual Flow Matching

Link: https://arxiv.org/abs/2604.25371

Authors: Kaikwan Lau, Gary P. T. Choi

Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)

Keywords: extreme data scarcity, computational evolutionary biology, shapes respect phylogenetic, respect phylogenetic relationships, three-dimensional morphological structures

Abstract:Generating novel, biologically plausible three-dimensional morphological structures is a fundamental challenge in computational evolutionary biology, hampered by extreme data scarcity and the requirement that generated shapes respect phylogenetic relationships among species. In this work, we present PhyloSDF, a phylogenetically-conditioned neural generative model for 3D biological morphology that integrates two innovations: (1) a DeepSDF auto-decoder regularized by a novel Phylogenetic Consistency Loss that structures the latent space to correlate with evolutionary distances (Pearson $r=0.993$); (2) a Residual Conditional Flow Matching (Residual CFM) architecture that factorizes generation into analytic species-centroid lookup and learned residual prediction, enabling generation from as few as ~4 specimens per species. We evaluate PhyloSDF on 100 micro-CT-scanned skulls of Darwin's Finches and their relatives across 24 species. The model generates novel meshes achieving 88-129% of real intra-species variation at the code level, with all 180 generated meshes verified as non-memorized. Residual CFM surpasses denoising diffusion (which fails entirely at this scale), standard flow matching (which mode-collapses to 3-6% variation), and a Gaussian mixture baseline in both fidelity (Chamfer Distance 0.00181 vs. 0.00190) and morphometric Fréchet distance (10,641 vs. 13,322). Leave-one-species-out experiments across 18 species demonstrate phylogenetic extrapolation capability, and smooth latent interpolations produce biologically plausible ancestral skull reconstructions.

86. 【2604.24793】CRC-SAM: SAM-Based Multi-Modal Segmentation and Quantification of Colorectal Cancer in CT, Colonoscopy, and Histology Images

Link: https://arxiv.org/abs/2604.24793

Authors: Daniel Lao

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: histopathology images, present CRC-SAM, unified framework, CRC-SAM

Comments: 4 pages, 3 figures, ISBI 2026 oral presentation

Abstract:We present CRC-SAM, a unified framework for colorectal cancer segmentation across colonoscopy, CT, and histopathology images. Unlike prior single-modality methods, CRC-SAM provides consistent, modality-agnostic segmentation throughout the clinical workflow. Built on MedSAM, it incorporates low-rank adaptation (LoRA) layers into a frozen encoder, enabling efficient domain transfer to underrepresented modalities with minimal trainable parameters. Experiments on MSD-Colon, CVC-ClinicDB, and EBHI-Seg demonstrate superior performance across modalities, outperforming state-of-the-art baselines and highlighting the effectiveness of lightweight LoRA adaptation for foundation-model-based colorectal cancer analysis.
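
The low-rank adaptation idea mentioned above can be illustrated with a single linear layer: the pretrained weight `W` stays frozen and only two small matrices `A` and `B` are trained, with the adapted output computed as `W x + (alpha / r) * B (A x)`. The sketch below uses plain Python lists; the function names, the scaling `alpha / r`, and the 768-dimensional, rank-4 parameter count are illustrative assumptions and do not describe CRC-SAM's actual configuration.

```python
def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha=1.0):
    """Frozen base layer plus a low-rank update.

    W: frozen weight, shape (d_out, d_in).
    A: trainable down-projection, shape (r, d_in).
    B: trainable up-projection, shape (d_out, r).
    """
    r = len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen 2x2 weight
A = [[0.5, -0.5]]             # rank r = 1
B_zero = [[0.0], [0.0]]       # zero-initialised B leaves the layer unchanged
x = [1.0, 1.0]

print(lora_forward(W, A, B_zero, x))  # identical to the frozen layer's output

# Trainable parameters for a hypothetical 768x768 layer at rank 4.
d, r = 768, 4
full_params = d * d
lora_params = r * (d + d)
print(full_params, lora_params)
```

Initialising `B` to zero means fine-tuning starts from the pretrained model's behaviour, and the parameter count shows why the adapted layers add little to the trainable budget.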