本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新593篇论文,其中:

  • 自然语言处理109
  • 信息检索16
  • 计算机视觉106

自然语言处理

1. 【2604.20842】SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

链接https://arxiv.org/abs/2604.20842

作者:Ruohan Liu,Shukang Yin,Tao Wang,Dong Zhang,Weiji Zhuang,Shuhuai Ren,Ran He,Caifeng Shan,Chaoyou Fu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

关键词:natural human-computer interaction, Large Audio-Language Models, Large Audio-Language, human-computer interaction, remains limited

备注: Project page: [this https URL](https://speechparaling-bench.github.io/)

点击查看摘要

Abstract:Paralinguistic cues are essential for natural human-computer interaction, yet their evaluation in Large Audio-Language Models (LALMs) remains limited by coarse feature coverage and the inherent subjectivity of assessment. To address these challenges, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. It expands existing coverage from fewer than 50 to over 100 fine-grained features, supported by more than 1,000 English-Chinese parallel speech queries, and is organized into three progressively challenging tasks: fine-grained control, intra-utterance variation, and context-aware adaptation. To enable reliable evaluation, we further develop a pairwise comparison pipeline, in which candidate responses are evaluated against a fixed baseline by an LALM-based judge. By framing evaluation as relative preference rather than absolute scoring, this approach mitigates subjectivity and yields more stable and scalable assessments without costly human annotation. Extensive experiments reveal substantial limitations in current LALMs. Even leading proprietary models struggle with comprehensive static control and dynamic modulation of paralinguistic features, while failure to correctly interpret paralinguistic cues accounts for 43.3% of errors in situational dialogue. These findings underscore the need for more robust paralinguistic modeling toward human-aligned voice assistants.

2. 【2604.20835】Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

链接https://arxiv.org/abs/2604.20835

作者:Zhaofeng Wu,Shiqi Wang,Boya Peng,Anuj Goyal,Melanie Kambadur,Sebastian Ruder,Yoon Kim,Chloe Bi

类目:Computation and Language (cs.CL)

关键词:Modern language models, impressive coding capabilities, Modern language, common programming languages, demonstrate impressive coding

备注

点击查看摘要

Abstract:Modern language models demonstrate impressive coding capabilities in common programming languages (PLs), such as C++ and Python, but their performance in lower-resource PLs is often limited by training data availability. In principle, however, most programming skills are universal across PLs, so the capability acquired in one PL should transfer to others. In this work, we propose the task of zero-shot cross-programming-language transfer for code RL. We find that, for Llama-3.1, RL training for code generation in a source PL fails to improve, and sometimes even degrades, the performance on other target PLs. To address this, we hypothesize that effective RL transfer requires a generalizable SFT initialization before RL. We thus propose **Parallel-SFT**, an SFT strategy that incorporates "parallel programs" -- functionally equivalent code implemented in multiple PLs -- into the data mixture. We demonstrate that this improves transferability: when we subsequently perform RL on our Parallel-SFT model, we observe better generalization to unseen PLs. Analysis of the model internal representations reveals that Parallel-SFT leads to a more functionality-centric latent space, where equivalent programs across PLs are more tightly clustered, which we hypothesize to contribute to the improved transferability.

3. 【2604.20833】AVISE: Framework for Evaluating the Security of AI Systems

链接https://arxiv.org/abs/2604.20833

作者:Mikko Lempinen,Joni Kemppainen,Niklas Raesalmi

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:pose growing risks, consequential system failures, vulnerabilities pose growing, security vulnerabilities pose, Security Evaluation

备注

点击查看摘要

Abstract:As artificial intelligence (AI) systems are increasingly deployed across critical domains, their security vulnerabilities pose growing risks of high-profile exploits and consequential system failures. Yet systematic approaches to evaluating AI security remain underdeveloped. In this paper, we introduce AVISE (AI Vulnerability Identification and Security Evaluation), a modular open-source framework for identifying vulnerabilities in and evaluating the security of AI systems and models. As a demonstration of the framework, we extend the theory-of-mind-based multi-turn Red Queen attack into an Adversarial Language Model (ALM) augmented attack and develop an automated Security Evaluation Test (SET) for discovering jailbreak vulnerabilities in language models. The SET comprises 25 test cases and an Evaluation Language Model (ELM) that determines whether each test case was able to jailbreak the target model, achieving 92% accuracy, an F1-score of 0.91, and a Matthews correlation coefficient of 0.83. We evaluate nine recently released language models of diverse sizes with the SET and find that all are vulnerable to the augmented Red Queen attack to varying degrees. AVISE provides researchers and industry practitioners with an extensible foundation for developing and deploying automated SETs, offering a concrete step toward more rigorous and reproducible AI security evaluation.

4. 【2604.20817】Convergent Evolution: How Different Language Models Learn Similar Number Representations

链接https://arxiv.org/abs/2604.20817

作者:Deqing Fu,Tianyi Zhou,Mikhail Belkin,Vatsal Sharan,Robin Jia

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:natural text learn, geometrically separable features, natural text, Fourier domain, geometrically separable

备注

点击查看摘要

Abstract:Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.

5. 【2604.20806】OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

链接https://arxiv.org/abs/2604.20806

作者:Qiguang Chen,Chengyu Luan,Jiajun Wu,Qiming Yu,Yi Yang,Yizhuo Li,Jingqi Tong,Xiachong Feng,Libo Qin,Wanxiang Che

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large vision-language models, made substantial advances, Large vision-language, made substantial, substantial advances

备注: ACL 2026 Camera Ready

点击查看摘要

Abstract:Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resources for studying and improving multi-image reasoning in LVLMs.

6. 【2604.20791】Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

链接https://arxiv.org/abs/2604.20791

作者:Mariano Barone,Francesco Di Serio,Roberto Moio,Marco Postiglione,Giuseppe Riccio,Antonio Romano,Vincenzo Moscato

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, remains insufficiently quantified, standards remains insufficiently, clinical standards remains

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.

7. 【2604.20789】Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

链接https://arxiv.org/abs/2604.20789

作者:Pranava Madhyastha,Dagmar Adamcova

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:temporal decay based, human-like working memory, Transformer architecture, including fixed-width windows, based attention mechanisms

备注: Published in ACL 2026 Findings track

点击查看摘要

Abstract:We investigate the integration of human-like working memory constraints into the Transformer architecture and implement several cognitively inspired attention variants, including fixed-width windows based and temporal decay based attention mechanisms. Our modified GPT-2 models are trained from scratch on developmentally plausible datasets (10M and 100M words). Performance is evaluated on grammatical judgment tasks (BLiMP) and alignment with human reading time data. Our results indicate that these cognitively-inspired constraints, particularly fixed-width attention, can significantly improve grammatical accuracy especially when training data is scarce. These constrained models also tend to show a stronger alignment with human processing metrics. The findings suggest that such constraints may serve as a beneficial inductive bias, guiding models towards more robust linguistic representations, especially in data-limited settings.

8. 【2604.20738】RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering

链接https://arxiv.org/abs/2604.20738

作者:Marisa Hudspeth,Patrick J. Burns,Brendan O'Connor

类目:Computation and Language (cs.CL)

关键词:English settings, bilingual Latin, question-answer pairs, Latin pedagogical sources, Latin

备注: Published in LREC 2026

点击查看摘要

Abstract:We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models -- LLaMa 3, Qwen QwQ, and OpenAI's o3-mini -- finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa3 and o3-mini are more task dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: this https URL

9. 【2604.20732】Anchor-and-Resume Concession Under Dynamic Pricing for LLM-Augmented Freight Negotiation

链接https://arxiv.org/abs/2604.20732

作者:Hoang Nguyen,Lu Wang,Marta Gaia Bras

类目:Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Freight brokerages negotiate, revise targets mid-conversation, frequently revise targets, Freight brokerages, models frequently revise

备注

点击查看摘要

Abstract:Freight brokerages negotiate thousands of carrier rates daily under dynamic pricing conditions where models frequently revise targets mid-conversation. Classical time-dependent concession frameworks use a fixed shape parameter $\beta$ that cannot adapt to these updates. Deriving $\beta$ from the live spread enables adaptation but introduces a new problem: a pricing shift can cause the formula to retract a previous offer, violating monotonicity. LLM-powered brokers offer flexibility but require expensive reasoning models, produce non-deterministic pricing, and remain vulnerable to prompt injection. We propose a two-index anchor-and-resume framework that addresses both limitations. A spread-derived $\beta$ maps each load's margin structure to the correct concession posture, while the anchor-and-resume mechanism guarantees monotonically non-decreasing offers under arbitrary pricing shifts. All pricing decisions remain in a deterministic formula; the LLM, when used, serves only as a natural-language translation layer. Empirical evaluation across 115,125 negotiations shows that the adaptive $\beta$ tailors behavior by regime: in narrow spreads, it concedes quickly to prioritize deal closure and load coverage; in medium and wide spreads, it matches or exceeds the best fixed-$\beta$ baselines in broker savings. Against an unconstrained 20-billion-parameter LLM broker, it achieves similar agreement rates and savings. Against LLM-powered carriers as more realistic stochastic counterparties, it maintains comparable savings and higher agreement rates than against rule-based opponents. By decoupling the LLM from pricing logic, the framework scales horizontally to thousands of concurrent negotiations with negligible inference cost and transparent decision-making.

Subjects:

Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

ACMclasses:
I.2.7

Cite as:
arXiv:2604.20732 [cs.MA]

(or
arXiv:2604.20732v1 [cs.MA] for this version)

https://doi.org/10.48550/arXiv.2604.20732

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
10. 【2604.20726】Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

链接https://arxiv.org/abs/2604.20726

作者:Mohamed Hesham Elganayni,Runsheng Chen,Sebastian Nagl,Matthias Grabmair

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:free text legal, legal question answering, text legal question, evaluations of free, work explores

备注: Accepted at the 21st International Conference on Artificial Intelligence and Law (ICAIL 2026), Singapore, June 8-12, 2026. 10 pages, 14 figures, 2 tables

点击查看摘要

Abstract:This work explores the role of prompt design and judge selection in LLM-as-a-Judge evaluations of free text legal question answering. We examine whether automatic task prompt optimization improves over human-centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3-32B, DeepSeek-V3) across four task models, and then testing cross-judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than the reverse direction. Analysis reveals that lenient judges provide permissive feedback, yielding prompts with broader applicability, whereas strict judges produce restrictive feedback, leading to judge-specific overfitting. Our findings demonstrate algorithmically optimizing prompts on training data can outperform human-centered prompt design and that judges' dispositions during optimization shape prompt generalizability. Code and optimized prompts are available at this https URL.

11. 【2604.20720】COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

链接https://arxiv.org/abs/2604.20720

作者:Noah Flynn

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:exhibit performance disparities, frequently degrading performance, degrading performance due, Large language models, fine-tuning frequently degrading

备注

点击查看摘要

Abstract:Large language models (LLMs) often exhibit performance disparities across languages, with naive multilingual fine-tuning frequently degrading performance due to negative cross-lingual interference. To address this, we introduce COMPASS (COntinual Multilingual PEFT with Adaptive Semantic Sampling), a novel data-centric framework for adapting LLMs to target languages. COMPASS leverages parameter-efficient fine-tuning (PEFT) by training lightweight, language-specific adapters on a judiciously selected subset of auxiliary multilingual data. The core of our method is a distribution-aware sampling strategy that uses multilingual embeddings and clustering to identify semantic gaps between existing training data and a target usage distribution. By prioritizing auxiliary data from under-represented semantic clusters, COMPASS maximizes positive cross-lingual transfer while minimizing interference. We extend this into a continual learning framework, COMPASS-ECDA, which monitors for data distribution shifts in production and dynamically updates adapters to prevent model staleness, balancing adaptation to new data with the preservation of existing knowledge. Across three different model architectures (Phi-4-Mini, Llama-3.1-8B, and Qwen2.5-7B) and multiple challenging multilingual benchmarks (Global-MMLU, MMLU-ProX), including unseen long-context tasks (OneRuler), we demonstrate that COMPASS consistently outperforms baseline methods guided by linguistic similarity, providing an effective, efficient, and sustainable solution for developing and maintaining high-performing multilingual models in dynamic environments.

12. 【2604.20677】Intersectional Fairness in Large Language Models

链接https://arxiv.org/abs/2604.20677

作者:Chaima Boufaied,Ronnie De Souza Santos,Ann Barcomb

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, intersectional demographic attributes, socially sensitive settings, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.

13. 【2604.20666】ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation

链接https://arxiv.org/abs/2604.20666

作者:Ioannis E. Livieris,Athanasios Koursaris,Alexandra Apostolopoulou,Konstantinos Kanaris Dimitris Tsakalidis,George Domalis

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:English applications requires, Effective retrieval-augmented generation, applications requires embedding, Effective retrieval-augmented, English embedding model

备注: This paper has been accepted for presentation at Engineering Applications and Advances of Artificial Intelligence 2026 (EAAAI'26)

点击查看摘要

Abstract:Effective retrieval-augmented generation across bilingual Greek--English applications requires embedding models capable of capturing both domain-specific semantic relationships and cross-lingual semantic alignment. Existing multilingual embedding models distribute their representational capacity across numerous languages, limiting their optimization for Greek and failing to encode the morphological complexity and domain-specific terminological structures inherent in Greek text. In this work, we propose ORPHEAS, a specialized Greek--English embedding model for bilingual retrieval-augmented generation. ORPHEAS is trained with a high quality dataset generated by a knowledge graph-based fine-tuning methodology which is applied to a diverse multi-domain corpus, which enables language-agnostic semantic representations. The numerical experiments across monolingual and cross-lingual retrieval benchmarks reveal that ORPHEAS outperforms state-of-the-art multilingual embedding models, demonstrating that domain-specialized fine-tuning on morphologically complex languages does not compromise cross-lingual retrieval capability.

14. 【2604.20658】Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows

链接https://arxiv.org/abs/2604.20658

作者:Shivani Kumar,Adarsh Bharathwaj,David Jurgens

类目:Computation and Language (cs.CL)

关键词:large language models, reasoning and problem-solving, Multi-agent systems built, large language, increasingly deployed

备注

点击查看摘要

Abstract:Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that effectively coordinate games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes, accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.

15. 【2604.20601】Self-Guided Plan Extraction for Instruction-Following Tasks with Goal-Conditional Reinforcement Learning

链接https://arxiv.org/abs/2604.20601

作者:Zoya Volovikova,Nikita Sorokin,Dmitriy Lukashevskiy,Aleksandr Panov,Alexey Skrynnik

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:instruction-following tasks, introduce SuperIgor, language model, Abstract, manual dataset annotation

备注

点击查看摘要

Abstract:We introduce SuperIgor, a framework for instruction-following tasks. Unlike prior methods that rely on predefined subtasks, SuperIgor enables a language model to generate and refine high-level plans through a self-learning mechanism, reducing the need for manual dataset annotation. Our approach involves iterative co-training: an RL agent is trained to follow the generated plans, while the language model adapts and modifies these plans based on RL feedback and preferences. This creates a feedback loop where both the agent and the planner improve jointly. We validate our framework in environments with rich dynamics and stochasticity. Results show that SuperIgor agents adhere to instructions more strictly than baseline methods, while also demonstrating strong generalization to previously unseen instructions.

16. 【2604.20598】Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge

链接https://arxiv.org/abs/2604.20598

作者:Naizhong Xu

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)

关键词:Modern retrieval-augmented generation, Modern retrieval-augmented, systems treat vector, context-free artifacts, retrieval-augmented generation

备注: 17 pages, 4 tables

点击查看摘要

Abstract:Modern retrieval-augmented generation (RAG) systems treat vector embeddings as static, context-free artifacts: an embedding has no notion of when it was created, how trustworthy its source is, or which other embeddings depend on it. This flattening of knowledge has a measurable cost: recent work on VersionRAG reports that conventional RAG achieves only 58% accuracy on versioned technical queries, because retrieval returns semantically similar but temporally invalid content. We propose SmartVector, a framework that augments dense embeddings with three explicit properties -- temporal awareness, confidence decay, and relational awareness -- and a five-stage lifecycle modeled on hippocampal-neocortical memory consolidation. A retrieval pipeline replaces pure cosine similarity with a four-signal score that mixes semantic relevance, temporal validity, live confidence, and graph-relational importance. A background consolidation agent detects contradictions, builds dependency edges, and propagates updates along those edges as graph-neural-network-style messages. Confidence is governed by a closed-form function combining an Ebbinghaus-style exponential decay, user-feedback reconsolidation, and logarithmic access reinforcement. We formalize the model, relate it to temporal knowledge graph embedding, agentic memory architectures, and uncertainty-aware RAG, and present a reference implementation. On a reproducible synthetic versioned-policy benchmark of 258 vectors and 138 queries, SmartVector roughly doubles top-1 accuracy over plain cosine RAG (62.0% vs. 31.0% on a held-out split), drops stale-answer rate from 35.0% to 13.3%, cuts Expected Calibration Error by nearly 2x (0.244 vs. 0.470), reduces re-embedding cost per single-word edit by 77%, and is robust across contradiction-injection rates from 0% to 75%.

17. 【2604.20582】rust, Lies, and Long Memories: Emergent Social Dynamics and Reputation in Multi-Round Avalon with LLM Agents

链接https://arxiv.org/abs/2604.20582

作者:Suveen Ellawela

类目:Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:study emergent social, emergent social dynamics, Avalon, Resistance, emergent social

备注

点击查看摘要

Abstract:We study emergent social dynamics in LLM agents playing The Resistance: Avalon, a hidden-role deception game. Unlike prior work on single-game performance, our agents play repeated games while retaining memory of previous interactions, including who played which roles and how they behaved, enabling us to study how social dynamics evolve. Across 188 games, two key phenomena emerge. First, reputation dynamics emerge organically when agents retain cross-game memory: agents reference past behavior in statements like "I am wary of repeating last game's mistake of over-trusting early success." These reputations are role-conditional: the same agent is described as "straightforward" when playing good but "subtle" when playing evil, and high-reputation players receive 46% more team inclusions. Second, higher reasoning effort supports more strategic deception: evil players more often pass early missions to build trust before sabotaging later ones, 75% in high-effort games vs 36% in low-effort games. Together, these findings show that repeated interaction with memory gives rise to measurable reputation and deception dynamics among LLM agents.

18. 【2604.20572】Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

链接https://arxiv.org/abs/2604.20572

作者:Yuxuan Cai,Jie Zhou,Qin Chen,Liang He

类目:Computation and Language (cs.CL)

关键词:retrieval, Online lifelong learning, long-horizon tasks, experience, Experience-Enhanced Online Evolution

备注

点击查看摘要

Abstract:Online lifelong learning enables agents to accumulate experience across interactions and continually improve on long-horizon tasks. However, existing methods typically treat retrieval from past experience as a passive operation, triggering it only at task initialization or after completing a step. Consequently, agents often fail to identify knowledge gaps during interaction and proactively retrieve the most useful experience for the current decision. To address this limitation, we present ProactAgent, an experience-driven lifelong learning framework for proactive retrieval over a structured experience base. We first introduce Experience-Enhanced Online Evolution (ExpOnEvo), which enables continual improvement through both policy updates and memory refinement. The experience base organizes historical interactions into typed repositories, including factual memory, episodic memory, and behavioral skills, so that retrieval can provide both relevant evidence and actionable guidance. On top of this, we propose Proactive Reinforcement Learning-based Retrieval (ProactRL), which models retrieval as an explicit policy action and learns when and what to retrieve via paired-branch process rewards. By comparing continuations from identical interaction prefixes with and without retrieval, ProactRL provides step-level supervision for retrieval decisions, encouraging retrieval only when it leads to better task outcomes or higher efficiency. Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent consistently improves lifelong agent performance, achieving success rates of 73.50\% on SciWorld and 71.28\% on AlfWorld while substantially reducing retrieval overhead, and attains performance competitive with proprietary models on StuLife.

19. 【2604.20564】Where Reasoning Breaks: Logic-Aware Path Selection by Controlling Logical Connectives in LLMs Reasoning Chains

链接https://arxiv.org/abs/2604.20564

作者:Seunghyun Park,Yuanyuan Lei

类目:Computation and Language (cs.CL)

关键词:multi-step logical deduction, impressive reasoning capabilities, single transition error, LLMs demonstrate impressive, demonstrate impressive reasoning

备注

点击查看摘要

Abstract:While LLMs demonstrate impressive reasoning capabilities, they remain fragile in multi-step logical deduction, where a single transition error can propagate through the entire reasoning chain, leading to unstable performance. In this work, we identify logical connectives as primary points of this structural fragility. Through empirical analysis, we show that connective tokens function as high entropy forking points, at which models frequently struggle to determine the correct logical direction. Motivated by this observation, we hypothesize that intervening in logical connective selection can guide LLMs toward more correct logical direction, thereby improving the overall reasoning chain. To validate this hypothesis, we propose a multi-layered framework that intervenes specifically at these logic-critical junctions in the reasoning process. Our framework includes (1) Gradient-based Logical Steering to guide LLMs internal representations towards valid reasoning subspaces, (2) Localized Branching to resolve ambiguity via targeted look-ahead search, and (3) Targeted Transition Preference Optimization, a surgical reinforcement learning objective that selectively optimizes single-token preferences at logical pivots. Crucially, by concentrating intervention solely on logic-critical transitions, our framework achieves a favorable accuracy--efficiency trade-off compared to global inference time scaling methods like beam search and self-consistency.

20. 【2604.20560】LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation

链接https://arxiv.org/abs/2604.20560

作者:Serhii Zabolotnii

类目:Computation and Language (cs.CL)

关键词:Case Report Forms, Automatically filling Case, filling Case Report, Report Forms, Case Report

备注: 16 pages, 1 figure, 5 tables. Preprint of a paper accepted to the Third Workshop on Patient-oriented Language Processing (CL4Health), co-located with LREC-COLING 2026

点击查看摘要

Abstract:Automatically filling Case Report Forms (CRFs) from clinical notes is challenging due to noisy language, strict output contracts, and the high cost of false positives. We describe our CL4Health 2026 submission for Dyspnea CRF filling (134 items) using a contract-driven two-stage design grounded in Schema-Guided Reasoning (SGR). The key task property is extreme sparsity: the majority of fields are unknown, and official scoring penalizes both empty values and unsupported predictions. We shift from a single-step "LLM predicts 134 fields" approach to a decomposition where (i) Stage 1 produces a stable SGR-style JSON summary with exactly 9 domain keys, and (ii) Stage 2 is a fully deterministic, 0-LLM compiler that parses the Stage 1 summary, canonicalizes item names, normalizes predictions to the official controlled vocabulary, applies evidence-gated false-positive filters, and expands the output into the required 134-item format. On the dev80 split, the best teacher configuration achieves macro-F1 0.6543 (EN) and 0.6905 (IT); on the hidden test200, the submitted English variant scores 0.63 on Codabench. The pipeline is language-agnostic: Italian results match or exceed English with no language-specific engineering.

21. 【2604.20556】LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures

链接https://arxiv.org/abs/2604.20556

作者:Yuhang Wu,Qinyuan Liu,Qiuyang Zhao,Qingwei Chong

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, including traditional Transformer, Large Language, diversified architectural landscape, traditional Transformer

备注: 5 pages, 3 figures

点击查看摘要

Abstract:Currently, Large Language Models (LLMs) feature a diversified architectural landscape, including traditional Transformer, GateDeltaNet, and Mamba. However, the evolutionary laws of hierarchical representations, task knowledge formation positions, and network robustness bottleneck mechanisms in various LLM architectures remain unclear, posing core challenges for hybrid architecture design and model optimization. This paper proposes LayerTracer, an architecture-agnostic end-to-end analysis framework compatible with any LLM architecture. By extracting hidden states layer-by-layer and mapping them to vocabulary probability distributions, it achieves joint analysis of task particle localization and layer vulnerability quantification. We define the task particle as the key layer where the target token probability first rises significantly, representing the model's task execution starting point, and the vulnerable layer is defined as the layer with the maximum Jensen-Shannon (JS) divergence between output distributions before and after mask perturbation, reflecting its sensitivity to disturbances. Experiments on models of different parameter scales show that task particles mainly appear in the deep layers of the model regardless of parameter size, while larger-parameter models exhibit stronger hierarchical robustness. LayerTracer provides a scientific basis for layer division, module ratio, and gating switching of hybrid architectures, effectively optimizing model performance. It accurately locates task-effective layers and stability bottlenecks, offering universal support for LLM structure design and interpretability research.

22. 【2604.20549】oward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

链接https://arxiv.org/abs/2604.20549

作者:Yassine Turki,Vinko Sabolčec,Bettina Messmer,Martin Jaggi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, ratio by performing, curation has shifted, shifted from maximizing

备注: Accepted at the 3rd Workshop on Navigating and Addressing Data Problems for Foundation Models (DATA-FM @ ICLR 2026). 31 pages, 4 figures

点击查看摘要

Abstract:As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio by performing quality filtering. However, for many languages, native high quality data is insufficient to train robust quality classifiers. This work investigates the idea that quality markers in embedding space may show cross-lingual consistency, which would allow high-resource languages to subsidize the filtering of low-resource ones. We evaluate various filtering strategies, including cross-lingual transfer, third quartile sampling (Q3), and retention rate tuning. Our results demonstrate that massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy for a 1B model trained on 103B tokens, delivering gains for high resource languages (1.2% increase in aggregate normalized accuracy for French) and matching or exceeding monolingual baselines for low-resource languages. However, we find that scale alone does not guarantee stability. Furthermore, for high-resource languages like French, we show that refining the decision boundary through third quartile sampling (Q3) or tuning the retention rate is necessary to fully leverage the multilingual signal.

23. 【2604.20548】Enhancing Research Idea Generation through Combinatorial Innovation and Multi-Agent Iterative Search Strategies

链接https://arxiv.org/abs/2604.20548

作者:Shuai Chen,Chengzhi Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

关键词

备注: Scientometrics

点击查看摘要

None

24. 【2604.20535】Aligning Stuttered-Speech Research with End-User Needs: Scoping Review, Survey, and Guidelines

链接https://arxiv.org/abs/2604.20535

作者:Hawau Olamide Toyin,Mutiah Apampa,Toluwani Aremu,Humaid Alblooshi,Ana Rita Valente,Gonçalo Leal,Zhengjun Yue,Zeerak Talat,Hanan Aldarmaki

类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:limited interdisciplinary dialogue, receiving greater attention, Atypical speech, interdisciplinary dialogue, speech technology research

备注: Submitted to Interspeech 2026

点击查看摘要

Abstract:Atypical speech is receiving greater attention in speech technology research, but much of this work unfolds with limited interdisciplinary dialogue. For stuttered speech in particular, it is widely recognised that current speech recognition systems fall short in practice, and current evaluation methods and research priorities are not systematically grounded in end-user experiences and needs. In this work, we analyse these gaps through 1) a scoping review of papers that deal with stuttered speech and 2) a survey of 70 stakeholders, including adults who stutter and speech-language pathologists. By analysing these two perspectives, we propose a taxonomy of stuttered-speech research, identify where current research directions diverge from the needs articulated by stakeholders, and conclude by outlining concrete guidelines and directions towards addressing the real needs of the stuttering community.

25. 【2604.20531】Effects of Cross-lingual Evidence in Multilingual Medical Question Answering

链接https://arxiv.org/abs/2604.20531

作者:Anar Yeginbergen,Maite Oronoz,Rodrigo Agerri

类目:Computation and Language (cs.CL)

关键词:Medical Question Answering, Question Answering, Multilingual Medical Question, Medical Question, paper investigates Multilingual

备注

点击查看摘要

Abstract:This paper investigates Multilingual Medical Question Answering across high-resource (English, Spanish, French, Italian) and low-resource (Basque, Kazakh) languages. We evaluate three types of external evidence sources across models of varying size: curated repositories of specialized medical knowledge, web-retrieved content, and explanations from LLM's parametric knowledge. Moreover, we conduct experiments with multilingual, monolingual and cross-lingual retrieval. Our results demonstrate that larger models consistently achieve superior performance in English across baseline evaluations. When incorporating external knowledge, web-retrieved data in English proves most beneficial for high-resource languages. Conversely, for low-resource languages, the most effective strategy combines retrieval in both English and the target language, achieving comparable accuracy to high-resource language results. These findings challenge the assumption that external knowledge systematically improves performance and reveal that effective strategies depend on both the source of language resources and on model scale. Furthermore, specialized medical knowledge sources such as PubMed are limited: while they provide authoritative expert knowledge, they lack adequate multilingual coverage

26. 【2604.20511】CHASM: Unveiling Covert Advertisements on Chinese Social Media

链接https://arxiv.org/abs/2604.20511

作者:Jingyi Zheng,Tianyi Hu,Yule Liu,Zhen Sun,Zongmin Zhang,Zifan Peng,Wenhan Dong,Xinlei He

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

关键词:large language models, evaluating large language, Multimodal Large Language, moderation completely overlook, language models

备注: NeuIPS 2025 (Datasets and Benchmarks Track)

点击查看摘要

Abstract:Current benchmarks for evaluating large language models (LLMs) in social media moderation completely overlook a serious threat: covert advertisements, which disguise themselves as regular posts to deceive and mislead consumers into making purchases, leading to significant ethical and legal concerns. In this paper, we present the CHASM, a first-of-its-kind dataset designed to evaluate the capability of Multimodal Large Language Models (MLLMs) in detecting covert advertisements on social media. CHASM is a high-quality, anonymized, manually curated dataset consisting of 4,992 instances, based on real-world scenarios from the Chinese social media platform Rednote. The dataset was collected and annotated under strict privacy protection and quality control protocols. It includes many product experience sharing posts that closely resemble covert advertisements, making the dataset particularly this http URL results show that under both zero-shot and in-context learning settings, none of the current MLLMs are sufficiently reliable for detecting covert this http URL further experiments revealed that fine-tuning open-source MLLMs on our dataset yielded noticeable performance gains. However, significant challenges persist, such as detecting subtle cues in comments and differences in visual and textual this http URL provide in-depth error analysis and outline future research directions. We hope our study can serve as a call for the research community and platform moderators to develop more precise defenses against this emerging threat.

27. 【2604.20487】Knowledge Capsules: Structured Nonparametric Memory Units for LLMs

链接https://arxiv.org/abs/2604.20487

作者:Bin Ju,Shenfeng Weng,Danying Zhou,Kunkai Su,Rongkai Xu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, parametric weights, making it costly, extend without retraining

备注

点击查看摘要

Abstract:Large language models (LLMs) encode knowledge in parametric weights, making it costly to update or extend without retraining. Retrieval-augmented generation (RAG) mitigates this limitation by appending retrieved text to the input, but operates purely through context expansion, where external knowledge competes as tokens within the attention mechanism. As a result, its influence is indirect and often unstable, particularly in long context and multi hop reasoning scenarios. We propose Knowledge Capsules, structured nonparametric memory units that represent normalized relational knowledge and can be constructed directly from document corpora using a frozen base model. Instead of injecting knowledge as text, we introduce an External Key Value Injection (KVI) framework that compiles capsules into attention-compatible key value representations, enabling external knowledge to directly participate in the model's attention computation. By shifting knowledge integration from context-level augmentation to memory level interaction, the proposed framework consistently outperforms RAG and GraphRAG across multiple QA benchmarks, with improved stability and accuracy in long context and multi hop reasoning, while requiring no parameter updates.

28. 【2604.20468】MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

链接https://arxiv.org/abs/2604.20468

作者:Markus Knauer,Edoardo Fiorini,Maximilian Mühlbauer,Stefan Schneyer,Promwat Angsuratanawech,Florian Samuel Lay,Timo Bachmann,Samuel Bustamante,Korbinian Nottensteiner,Freek Stulp,Alin Albu-Schäffer,João Silvério,Thomas Eiband

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

关键词:applications require increasingly, require increasingly flexible, increasingly flexible systems, robot applications require, tasks and environments

备注: 15 pages, 13 figures, 3 tables

点击查看摘要

Abstract:Industrial robot applications require increasingly flexible systems that non-expert users can easily adapt for varying tasks and environments. However, different adaptations benefit from different interaction modalities. We present an interactive framework that enables robot skill adaptation through three complementary modalities: kinesthetic touch for precise spatial corrections, natural language for high-level semantic modifications, and a graphical web interface for visualizing geometric relations and trajectories, inspecting and adjusting parameters, and editing via-points by drag-and-drop. The framework integrates five components: energy-based human-intention detection, a tool-based LLM architecture (where the LLM selects and parameterizes predefined functions rather than generating code) for safe natural language adaptation, Kernelized Movement Primitives (KMPs) for motion encoding, probabilistic Virtual Fixtures for guided demonstration recording, and ergodic control for surface finishing. We demonstrate that this tool-based LLM architecture generalizes skill adaptation from KMPs to ergodic control, enabling voice-commanded surface finishing. Validation on a 7-DoF torque-controlled robot at the Automatica 2025 trade fair demonstrates the practical applicability of our approach in industrial settings.

29. 【2604.20462】Finding Duplicates in 1.1M BDD Steps: cukereuse, a Paraphrase-Robust Static Detector for Cucumber and Gherkin

链接https://arxiv.org/abs/2604.20462

作者:Ali Hassaan Mughal,Noor Fatima,Muhammad Bilal

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:suites accumulate step-text, accumulate step-text duplication, Behaviour-Driven Development, Toggle, Toggle Hugging Face

备注: 39 pages, 9 figures, 8 tables. Under review at Software Quality Journal. Tool, corpus, labelled benchmark, and rubric released at [this https URL](https://github.com/amughalbscs16/cukereuse-release) under Apache-2.0

点击查看摘要

Abstract:Behaviour-Driven Development (BDD) suites accumulate step-text duplication whose maintenance cost is established in prior work. Existing detection techniques require running the tests (Binamungu et al., 2018-2023) or are confined to a single organisation (Irshad et al., 2020-2022), leaving a gap: a purely static, paraphrase-robust, step-level detector usable on any repository. We fill the gap with cukereuse, an open-source Python CLI combining exact hashing, Levenshtein ratio, and sentence-transformer embeddings in a layered pipeline, released alongside an empirical corpus of 347 public GitHub repositories, 23,667 parsed .feature files, and 1,113,616 Gherkin steps. The step-weighted exact-duplicate rate is 80.2 %; the median-repository rate is 58.6 % (Spearman rho = 0.51 with size). The top hybrid cluster groups 20.7k occurrences across 2.2k files. Against 1,020 pairs manually labelled by the three authors under a released rubric (inter-annotator Fleiss' kappa = 0.84 on a 60-pair overlap), we report precision, recall, and F1 with bootstrap 95 % CIs under two protocols: the primary rubric and a score-free second-pass relabelling. The strongest honest pair-level number is near-exact at F1 = 0.822 on score-free labels; the primary-rubric semantic F1 = 0.906 is inflated by a stratification artefact that pins recall at 1.000. Lexical baselines (SourcererCC-style, NiCad-style) reach primary F1 = 0.761 and 0.799. The paper also presents a CDN-structured critique of Gherkin (Cognitive Dimensions of Notations); eight of fourteen dimensions are rated problematic or unsupported. The tool, corpus, labelled pairs, rubric, and pipeline are released under permissive licences.

Comments:
39 pages, 9 figures, 8 tables. Under review at Software Quality Journal. Tool, corpus, labelled benchmark, and rubric released at this https URL under Apache-2.0

Subjects:

Software Engineering (cs.SE); Computation and Language (cs.CL); Information Retrieval (cs.IR)

ACMclasses:
D.2.5; D.2.7; I.2.7

Cite as:
arXiv:2604.20462 [cs.SE]

(or
arXiv:2604.20462v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2604.20462

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Ali Hassaan Mughal [view email] [v1]
Wed, 22 Apr 2026 11:44:05 UTC (240 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled Finding Duplicates in 1.1M BDD Steps: cukereuse, a Paraphrase-Robust Static Detector for Cucumber and Gherkin, by Ali Hassaan Mughal and 2 other authorsView PDFHTML (experimental)TeX Source

view license

Current browse context:
cs.SE

prev

|
next

new
|
recent
| 2026-04

Change to browse by:

cs
cs.CL
cs.IR

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked="checked"class=“labs-tab-input”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status

30. 【2604.20454】Not all ANIMALs are equal: metaphorical framing through source domains and semantic frames

链接https://arxiv.org/abs/2604.20454

作者:Yulia Otmakhova,Matteo Guida,Lea Frermann

类目:Computation and Language (cs.CL)

关键词:source domains, powerful framing devices, semantic frames, fully explain, explain the specific

备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:Metaphors are powerful framing devices, yet their source domains alone do not fully explain the specific associations they evoke. We argue that the interplay between source domains and semantic frames determines how metaphors shape understanding of complex issues, and present a computational framework that allows to derive salient discourse metaphors through their source domains and semantic frames. Applying this framework to climate change news, we uncover not only well-known source domains but also reveal nuanced frame-level associations that distinguish how the issue is portrayed. In analyzing immigration discourse across political ideologies, we demonstrate that liberals and conservatives systematically employ different semantic frames within the same source domains, with conservatives favoring frames emphasizing uncontrollability and liberals choosing neutral or more ``victimizing'' semantic frames. Our work bridges conceptual metaphor theory and linguistics, providing the first NLP approach for discovery of discourse metaphors and fine-grained analysis of differences in metaphorical framing. Code, data and statistical scripts are available at this https URL.

31. 【2604.20452】HaS: Accelerating RAG through Homology-Aware Speculative Retrieval

链接https://arxiv.org/abs/2604.20452

作者:Peng Peng,Weiwei Lin,Wentai Wu,Xinyang Wang,Yongheng Liu

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:large language models, Retrieval-Augmented Generation, retrieving external documents, language models, boundary of large

备注: Accepted by ICDE 2026

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as the knowledge databases grow in size. Existing acceleration strategies either compromise accuracy through approximate retrieval, or achieve marginal gains by reusing results of strictly identical queries. We propose HaS, a homology-aware speculative retrieval framework that performs low-latency speculative retrieval over restricted scopes to obtain candidate documents, followed by validating whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re-identification task: once a previously observed query is identified as a homologous re-encounter of the incoming query, the draft is deemed acceptable, allowing the system to bypass slow full-database retrieval. Benefiting from the prevalence of homologous queries under real-world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1-2% marginal accuracy drop. As a plug-and-play solution, HaS also significantly accelerates complex multi-hop queries in modern agentic RAG pipelines. Source code is available at: this https URL.

32. 【2604.20447】Decoding Text Spans for Efficient and Accurate Named-Entity Recognition

链接https://arxiv.org/abs/2604.20447

作者:Andrea Maracani,Savas Ozkan,Junyi Zhu,Sinan Mutlu,Mete Ozay

类目:Computation and Language (cs.CL)

关键词:Named Entity Recognition, Named Entity, Entity Recognition, information extraction pipelines, industrial information extraction

备注

点击查看摘要

Abstract:Named Entity Recognition (NER) is a key component in industrial information extraction pipelines, where systems must satisfy strict latency and throughput constraints in addition to strong accuracy. State-of-the-art NER accuracy is often achieved by span-based frameworks, which construct span representations from token encodings and classify candidate spans. However, many span-based methods enumerate large numbers of candidates and process each candidate with marker-augmented inputs, substantially increasing inference cost and limiting scalability in large-scale deployments. In this work, we propose SpanDec, an efficient span-based NER framework that targets this bottleneck. Our main insight is that span representation interactions can be computed effectively at the final transformer stage, avoiding redundant computation in earlier layers via a lightweight decoder dedicated to span representations. We further introduce a span filtering mechanism during enumeration to prune unlikely candidates before expensive processing. Across multiple benchmarks, SpanDec matches competitive span-based baselines while improving throughput and reducing computational cost, yielding a better accuracy-efficiency trade-off suitable for high-volume serving and on-device applications.

33. 【2604.20443】DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

链接https://arxiv.org/abs/2604.20443

作者:Neemesh Yadav,Palakorn Achananuparp,Jing Jiang,Ee-Peng Lim

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, Theory of Mind, possess Theory, Language Models

备注: Submitted to KDD 2026 Datasets and Benchmarks Track

点击查看摘要

Abstract:Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting -- probing whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at this https URL.

34. 【2604.20398】WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

链接https://arxiv.org/abs/2604.20398

作者:Juyong Jiang,Chenglin Cai,Chansung Park,Jiasi Shen,Sunghun Kim,Jianguo Li,Yue Wang

类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)

关键词:remain highly challenging, Large Language Models, websites remain highly, Large Language, Language Models

备注

点击查看摘要

Abstract:While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Existing works are often limited to single-page static websites, while agentic frameworks typically rely on multi-turn execution with proprietary models, leading to substantial token costs, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning (RL) is a promising alternative, yet it faces a critical bottleneck in designing reliable and computationally feasible rewards for website generation. Unlike single-file coding tasks that can be verified by unit tests, website generation requires evaluating inherently subjective aesthetics, cross-page interactions, and functional correctness. To this end, we propose WebGen-R1, an end-to-end RL framework tailored for project-level website generation. We first introduce a scaffold-driven structured generation paradigm that constrains the large open-ended action space and preserves architectural integrity. We then design a novel cascaded multimodal reward that seamlessly couples structural guarantees with execution-grounded functional feedback and vision-based aesthetic supervision. Extensive experiments demonstrate that our WebGen-R1 substantially transforms a 7B base model from generating nearly nonfunctional websites into producing deployable, aesthetically aligned multi-page websites. Remarkably, our WebGen-R1 not only consistently outperforms heavily scaled open-source models (up to 72B), but also rivals the state-of-the-art DeepSeek-R1 (671B) in functional success, while substantially exceeding it in valid rendering and aesthetic alignment. These results position WebGen-R1 as a viable path for scaling small open models from function-level code generation to project-level web application generation.

35. 【2604.20382】Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs

链接https://arxiv.org/abs/2604.20382

作者:Aishik Mandal,Hiba Arnaout,Clarissa W. Ong,Juliet Bockhorst,Kate Sheehan,Rachael Moldow,Tanmoy Chakraborty,Iryna Gurevych

类目:Computation and Language (cs.CL)

关键词:Large Language Models, mental health support, Large Language, Rising demand, Language Models

备注: 49 pages, 46 figures, 11 tables

点击查看摘要

Abstract:Rising demand for mental health support has increased interest in using Large Language Models (LLMs) for counseling. However, adapting LLMs to this high-risk safety-critical domain is hindered by the scarcity of real-world counseling data due to privacy constraints. Synthetic datasets provide a promising alternative, but existing approaches often rely on unstructured or semi-structured text inputs and overlook structural dependencies between a client's cognitive, emotional, and behavioral states, often producing psychologically inconsistent interactions and reducing data realism and quality. We introduce Graph2Counsel, a framework for generating synthetic counseling sessions grounded in Client Psychological Graphs (CPGs) that encode relationships among clients' thoughts, emotions, and behaviors. Graph2Counsel employs a structured prompting pipeline guided by counselor strategies and CPG, and explores prompting strategies including CoT (Wei et al., 2022) and Multi-Agent Feedback (Li et al., 2025a). Graph2Counsel produces 760 sessions from 76 CPGs across diverse client profiles. In expert evaluation, our dataset outperforms prior datasets on specificity, counselor competence, authenticity, conversational flow, and safety, with substantial inter-annotator agreement (Krippendorff's $\alpha$ = 0.70). Fine-tuning an open-source model on this dataset improves performance on CounselingBench (Nguyen et al., 2025) and CounselBench (Li et al., 2025b), showing downstream utility. We also make our code and data public.

36. 【2604.20357】SignDATA: Data Pipeline for Sign Language Translation

链接https://arxiv.org/abs/2604.20357

作者:Kuanwei Chen,Tingyi Lin

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:clip timing, signer framing, difficult to preprocess, preprocess consistently, vary in annotation

备注: 7 pages, 1 figure

点击查看摘要

Abstract:Sign-language datasets are difficult to preprocess consistently because they vary in annotation schema, clip timing, signer framing, and privacy constraints. Existing work usually reports downstream models, while the preprocessing pipeline that converts raw video into training-ready pose or video artifacts remains fragmented, backend-specific, and weakly documented. We present SignDATA, a config-driven preprocessing toolkit that standardizes heterogeneous sign-language corpora into comparable outputs for learning. The system supports two end-to-end recipes: a pose recipe that performs acquisition, manifesting, person localization, clipping, cropping, landmark extraction, normalization, and WebDataset export, and a video recipe that replaces pose extraction with signer-cropped video packaging. SignDATA exposes interchangeable MediaPipe and MMPose backends behind a common interface, typed job schemas, experiment-level overrides, and per-stage checkpointing with config- and manifest-aware hashes. We validate the toolkit through a research-oriented evaluation design centered on backend comparison, preprocessing ablations, and privacy-aware video generation on datasets. Our contribution is a reproducible preprocessing layer for sign-language research that makes extractor choice, normalization policy, and privacy tradeoffs explicit, configurable, and empirically this http URL is available at this https URL.

37. 【2604.20331】Surrogate modeling for interpreting black-box LLMs in medical predictions

链接https://arxiv.org/abs/2604.20331

作者:Changho Han(1),Songsoo Kim(2),Dong Won Kim(2),Leo Anthony Celi(3, 4 and 5),Jaewoong Kim(2),SungA Bae(6 and 7),Dukyong Yoon(2, 7 and 8) ((1) Medical Big Data Research Center, Seoul National University Medical Research Center, Seoul National University College of Medicine, Seoul, Republic of Korea, (2) Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea, (3) Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA, (4) Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA, (5) Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA, (6) Department of Cardiology, Yongin Severance Hospital, Yonsei University College of Medicine, Yongin, Republic of Korea, (7) Center for Digital Health, Yongin Severance Hospital, Yonsei University Health System, Yongin, Republic of Korea, (8) Institute for Innovation in Digital Healthcare, Severance Hospital, Seoul, Republic of Korea)

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large language models, Large language, black-box nature obscures, encode extensive real-world, trained on vast

备注

点击查看摘要

Abstract:Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.

38. 【2604.20283】Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking

链接https://arxiv.org/abs/2604.20283

作者:Mo Zhou,Jianwei Wang,Kai Wang,Helen Paik,Ying Zhang,Wenjie Zhang

类目:Computation and Language (cs.CL)

关键词:Multi-perspective Evidence Synthesis, Large Language Models, maps ambiguous mentions, evidence, Multi-perspective Evidence

备注

点击查看摘要

Abstract:Multimodal Entity Linking (MEL) is a fundamental task in data management that maps ambiguous mentions with diverse modalities to the multimodal entities in a knowledge base. However, most existing MEL approaches primarily focus on optimizing instance-centric features and evidence, leaving broader forms of evidence and their intricate interdependencies insufficiently explored. Motivated by the observation that human expert decision-making process relies on multi-perspective judgment, in this work, we propose MSR-MEL, a Multi-perspective Evidence Synthesis and Reasoning framework with Large Language Models (LLMs) for unsupervised MEL. Specifically, we adopt a two-stage framework: (1) Offline Multi-Perspective Evidence Synthesis constructs a comprehensive set of evidence. This includes instance-centric evidence capturing the instance-centric multimodal information of mentions and entities, group-level evidence that aggregates neighborhood information, lexical evidence based on string overlap ratio, and statistical evidence based on simple summary statistics. A core contribution of our framework is the synthesis of group-level evidence, which effectively aggregates vital neighborhood information by graph. We first construct LLM-enhanced contextualized graphs. Subsequently, different modalities are jointly aligned through an asymmetric teacher-student graph neural network. (2) Online Multi-Perspective Evidence Reasoning leverages the power of LLM as a reasoning module to analyze the correlation and semantics of the multi-perspective evidence to induce an effective ranking strategy for accurate entity linking without supervision. Extensive experiments on widely used MEL benchmarks demonstrate that MSR-MEL consistently outperforms state-of-the-art unsupervised methods. The source code of this paper was available at: this https URL.

39. 【2604.20273】ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

链接https://arxiv.org/abs/2604.20273

作者:Jan-Philipp Schmidt

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:International Actuarial Association, advanced actuarial assessment, Education Syllabus, actuarial assessment items, Actuarial Association

备注: 19 pages, 4 figures, 4 tables

点击查看摘要

Abstract:We present ActuBench, a multi-agent LLM pipeline for the automated generation and evaluation of advanced actuarial assessment items aligned with the International Actuarial Association (IAA) Education Syllabus. The pipeline separates four LLM roles by adapter: one agent drafts items, one constructs distractors, a third independently verifies both stages and drives bounded one-shot repair loops, and a cost-optimized auxiliary agent handles Wikipedia-note summarization and topic labelling. The items, per-model responses and complete leaderboard are published as a browsable web interface at this https URL, allowing readers and practitioners to inspect individual items without a repository checkout. We evaluate 50 language models from eight providers on two complementary benchmarks -- 100 empirically hardest multiple-choice items and 100 open-ended items scored by an LLM judge -- and report three headline findings. First, multi-agent verification is load-bearing: the independent verifier flags a majority of drafted items on first pass, most of which the one-shot repair loop resolves. Second, locally-hosted open-weights inference sits on the cost-performance Pareto front: a Gemma~4 model running on consumer hardware and a Cerebras-hosted 120B open-weights model dominate the near-zero-cost region, with the latter within one item of the top of the leaderboard. Third, MCQ and LLM-as-Judge rankings differ meaningfully: the MCQ scaffold inflates the performance ceiling, and Judge-mode evaluation is needed to discriminate at the frontier.

40. 【2604.20256】RADS: Reinforcement Learning-Based Sample Selection Improves Transfer Learning in Low-resource and Imbalanced Clinical Settings

链接https://arxiv.org/abs/2604.20256

作者:Wei Han,David Martinez,Anna Khanina,Lawrence Cavedon,Karin Verspoor

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:shot fine-tuning, success is highly, highly dependent, selected as training, Adaptive Domain Sampling

备注: Accepted at ACL 2026 Findings

点击查看摘要

Abstract:A common strategy in transfer learning is few shot fine-tuning, but its success is highly dependent on the quality of samples selected as training examples. Active learning methods such as uncertainty sampling and diversity sampling can select useful samples. However, under extremely low-resource and class-imbalanced conditions, they often favor outliers rather than truly informative samples, resulting in degraded performance. In this paper, we introduce RADS (Reinforcement Adaptive Domain Sampling), a robust sample selection strategy using reinforcement learning (RL) to identify the most informative samples. Experimental evaluations on several real world clinical datasets show our sample selection strategy enhances model transferability while maintaining robust performance under extreme class imbalance compared to traditional methods.

41. 【2604.20244】Hybrid Policy Distillation for LLMs

链接https://arxiv.org/abs/2604.20244

作者:Wenhong Zhu,Ruobing Xie,Rui Wang,Pengfei Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:compressing large language, large language models, Knowledge distillation, Hybrid Policy Distillation, divergence direction

备注: WIP

点击查看摘要

Abstract:Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at this https URL.

42. 【2604.20241】Construction of a Battery Research Knowledge Graph using a Global Open Catalog

链接https://arxiv.org/abs/2604.20241

作者:Luca Foppiano,Sae Dieb,Malik Zain,Kazuki Kasama,Keitaro Sodeyama,Mikiko Tanifuji

类目:Computation and Language (cs.CL); Computational Physics (physics.comp-ph)

关键词:highly interdisciplinary field, track relevant expertise, identify potential collaborators, interdisciplinary field, rapidly growing

备注

点击查看摘要

Abstract:Battery research is a rapidly growing and highly interdisciplinary field, making it increasingly difficult to track relevant expertise and identify potential collaborators across institutional boundaries. In this work, we present a pipeline for constructing an author-centric knowledge graph of battery research built on OpenAlex, a large-scale open bibliographic catalogue. For each author, we derive a weighted research descriptors vector that combines coarse-grained OpenAlex concepts with fine-grained keyphrases extracted from titles and abstracts using KeyBERT with ChatGPT (gpt-3.5-turbo) as the backend model, selected after evaluating multiple alternatives. Vector components are weighted by research descriptor origin, authorship position, and temporal recency. The framework is applied to a corpus of 189,581 battery-related works. The resulting vectors support author-author similarity computation, community detection, and exploratory search through a browser-based interface. The knowledge graph is then serialized in RDF and linked to Wikidata identifiers, making it interoperable with external linked open data sources and extensible beyond the battery domain. Unlike prior author-centric analyses confined to institutional repositories, our approach operates at cross-institutional scale and grounds similarity in domain semantics rather than citation or co-authorship structure alone.

43. 【2604.20225】he GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models

链接https://arxiv.org/abs/2604.20225

作者:Yilun Liu,Chunguang Zhao,Mengyao Piao,Lingqi Miao,Shimin Tao,Minggui He,Chenxin Liu,Li Zhang,Hongxia Ma,Jiaxin Guo,Chen Liu,Liqun Deng,Jiansheng Wei,Xiaojun Meng,Fanyi Du,Daimeng Wei,Yanghua Xiao

类目:Computation and Language (cs.CL)

关键词:Large Language Models, capabilities of Large, Language Models, Large Language, global utility

备注: Accepted by ACL 2026 main

点击查看摘要

Abstract:Evaluating the multilingual and multicultural capabilities of Large Language Models (LLMs) is essential for their global utility. However, current benchmarks face three critical limitations: (1) fragmented evaluation dimensions that often neglect deep cultural nuances; (2) insufficient language coverage in subjective tasks relying on low-quality machine translation; and (3) shallow analysis that lacks diagnostic depth beyond simple rankings. To address these, we introduce GaoYao, a comprehensive benchmark with 182.3k samples, 26 languages and 51 nations/areas. First, GaoYao proposes a unified framework categorizing evaluation tasks into three cultural layers (General Multilingual, Cross-cultural, Monocultural) and nine cognitive sub-layers. Second, we achieve native-quality expansion by leveraging experts to rigorously localize subjective benchmarks into 19 languages and synthesizing cross-cultural test sets for 34 cultures, surpassing prior coverage by up to 111%. Third, we conduct an in-depth diagnostic analysis on 20+ flagship and compact LLMs. Our findings reveal significant geographical performance disparities and distinct gaps between tasks, offering a reliable map for future work. We release the benchmark (this https URL).

44. 【2604.20221】Markov reads Pushkin, again: A statistical journey into the poetic world of Evgenij Onegin

链接https://arxiv.org/abs/2604.20221

作者:Angelo Maria Sabatini

类目:Computation and Language (cs.CL)

关键词:Evgenij Onegin-as captured, Evgenij Onegin-as, Onegin-as captured, time series analysis, applies symbolic time

备注: 21 pages, 7 figures, 3 supplementary files; revised version submitted to PLOS ONE

点击查看摘要

Abstract:This study applies symbolic time series analysis and Markov modeling to explore the phonological structure of Evgenij Onegin-as captured through a graphemic vowel/consonant (V/C) encoding-and one contemporary Italian translation. Using a binary encoding inspired by Markov's original scheme, we construct minimalist probabilistic models that capture both local V/C dependencies and large-scale sequential patterns. A compact four-state Markov chain is shown to be descriptively accurate and generative, reproducing key features of the original sequences such as autocorrelation and memory depth. All findings are exploratory in nature and aim to highlight structural regularities while suggesting hypotheses about underlying narrative dynamics. The analysis reveals a marked asymmetry between the Russian and Italian texts: the original exhibits a gradual decline in memory depth, whereas the translation maintains a more uniform profile. To further investigate this divergence, we introduce phonological probes-short symbolic patterns that link surface structure to narrative-relevant cues. Tracked across the unfolding text, these probes reveal subtle connections between graphemic form and thematic development, particularly in the Russian original. By revisiting Markov's original proposal of applying symbolic analysis to a literary text and pairing it with contemporary tools from computational statistics and data science, this study shows that even minimalist Markov models can support exploratory analysis of complex poetic material. When complemented by a coarse layer of linguistic annotation, such models provide a general framework for comparative poetics and demonstrate that stylized structural patterns remain accessible through simple representations grounded in linguistic form.

Comments:
21 pages, 7 figures, 3 supplementary files; revised version submitted to PLOS ONE

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2604.20221 [cs.CL]

(or
arXiv:2604.20221v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.20221

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
45. 【2604.20216】xt-to-Distribution Prediction with Quantile Tokens and Neighbor Context

链接https://arxiv.org/abs/2604.20216

作者:Yilun Zhu,Yuan Zhuang,Nikhita Vedula,Dushyanta Dhyani,Shaoyuan Xu,Moyan Li,Mohsen Bayati,Bryan Wang,Shervin Malmasi

类目:Computation and Language (cs.CL)

关键词:text regression require, regression require predicting, LLM-based text regression, full conditional distribution, quantile tokens

备注: Accepted to ACL 2026 main conference

点击查看摘要

Abstract:Many applications of LLM-based text regression require predicting a full conditional distribution rather than a single point value. We study distributional regression under empirical-quantile supervision, where each input is paired with multiple observed quantile outcomes, and the target distribution is represented by a dense grid of quantiles. We address two key limitations of current approaches: the lack of local grounding for distribution estimates, and the reliance on shared representations that create an indirect bottleneck between inputs and quantile outputs. In this paper, we introduce Quantile Token Regression, which, to our knowledge, is the first work to insert dedicated quantile tokens into the input sequence, enabling direct input-output pathways for each quantile through self-attention. We further augment these quantile tokens with retrieval, incorporating semantically similar neighbor instances and their empirical distributions to ground predictions with local evidence from similar instances. We also provide the first theoretical analysis of loss functions for quantile regression, clarifying which distributional objectives each optimizes. Experiments on the Inside Airbnb and StackSample benchmark datasets with LLMs ranging from 1.7B to 14B parameters show that quantile tokens with neighbors consistently outperform baselines (~4 points lower MAPE and 2x narrower prediction intervals), with especially large gains on smaller and more challenging datasets where quantile tokens produce substantially sharper and more accurate distributions.

46. 【2604.20200】Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows

链接https://arxiv.org/abs/2604.20200

作者:Hardy Chen,Nancy Lau,Haoqin Tu,Shuo Yan,Xiangyan Liu,Zijun Wang,Juncheng Wu,Michael Qizhe Shieh,Alvaro A. Cardenas,Cihang Xie,Yuyin Zhou

类目:Computation and Language (cs.CL)

关键词:supervise progress primarily, Frontier coding agents, public evaluation file, agent intermediate outputs, public score

备注: 25 pages

点击查看摘要

Abstract:Frontier coding agents are increasingly used in workflows where users supervise progress primarily through repeated improvement of a public score, namely the reported score on a public evaluation file with labels in the workspace, rather than through direct inspection of the agent's intermediate outputs. We study whether multi-round user pressure to improve that score induces public score exploitation: behavior that raises the public score through shortcuts without improving hidden private evaluation. We begin with a preliminary single-script tabular classification task, where GPT-5.4 and Claude Opus 4.6 both exploit label information within 10 rounds of user-agent interaction. We then build AgentPressureBench, a 34-task machine-learning repository benchmark spanning three input modalities, and collect 1326 multi-round trajectories from 13 coding agents. On our benchmark, we observe 403 exploitative runs, spanning across all tasks. We also find that stronger models have higher exploitation rates, supported by a significant Spearman rank correlation of 0.77. Our ablation experiments show that higher user pressure leads to earlier exploitation, reducing the average first exploit round by 15.6 rounds (i.e., 19.67 to 4.08). As a mitigation, adding explicit anti-exploit wordings in prompt mostly eliminates exploitation (100% to 8.3%). We hope that our work can bring attention to more careful use of coding agents workflow, and developing more robust coding agents under user pressure. Our project page is at this https URL .

47. 【2604.20199】All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG

链接https://arxiv.org/abs/2604.20199

作者:Dan Wang,Guozhao Mo,Yafei Shi,Cheng Zhang,Bo Zheng,Boxi Cao,Xuanang Chen,Yaojie Lu,Hongyu Lin,Ben He,Xianpei Han,Le Sun

类目:Computation and Language (cs.CL)

关键词:ground Large Language, ground Large, Large Language Models, leverages cross-lingual evidence, Large Language

备注: ACL 2026 main conference

点击查看摘要

Abstract:Multilingual Retrieval-Augmented Generation (mRAG) leverages cross-lingual evidence to ground Large Language Models (LLMs) in global knowledge. However, we show that current mRAG systems suffer from a language bias during reranking, systematically favoring English and the query's native language. By introducing an estimated oracle evidence analysis, we quantify a substantial performance gap between existing rerankers and the achievable upper bound. Further analysis reveals a critical distributional mismatch: while optimal predictions require evidence scattered across multiple languages, current systems systematically suppress such ``answer-critical'' documents, thereby limiting downstream generation performance. To bridge this gap, we propose \textit{\textbf{L}anguage-\textbf{A}gnostic \textbf{U}tility-driven \textbf{R}eranker \textbf{A}lignment (LAURA)}, which aligns multilingual evidence ranking with downstream generative utility. Experiments across diverse languages and generation models show that LAURA effectively mitigates language bias and consistently improves mRAG performance.

48. 【2604.20183】Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

链接https://arxiv.org/abs/2604.20183

作者:Xinyu Zhang,Yuchen Wan,Boxuan Zhang,Zesheng Yang,Lingling Zhang,Bifan Wei,Jun Liu

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, single problem admits, problem admits multiple, admits multiple related

备注

点击查看摘要

Abstract:Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction. This agent assigns historical solutions to modeling and coding clusters, then distills each cluster's content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, this agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. The experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11%- 21%. Notably, our analysis reveals a ``knowledge inheritance'' phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework's scalability and efficiency.

49. 【2604.20168】Duluth at SemEval-2026 Task 6: DeBERTa with LLM-Augmented Data for Unmasking Political Question Evasions

链接https://arxiv.org/abs/2604.20168

作者:Shujauddin Syed,Ted Pedersen

类目:Computation and Language (cs.CL)

关键词:Unmasking Political Question, Political Question Evasions, Question Evasions, presents the Duluth, Duluth approach

备注

点击查看摘要

Abstract:This paper presents the Duluth approach to SemEval-2026 Task 6 on CLARITY: Unmasking Political Question Evasions. We address Task 1 (clarity-level classification) and Task 2 (evasion-level classification), both of which involve classifying question--answer pairs from U.S.\ presidential interviews using a two-level taxonomy of response clarity. Our system is based on DeBERTa-V3-base, extended with focal loss, layer-wise learning rate decay, and boolean discourse features. To address class imbalance in the training data, we augment minority classes using synthetic examples generated by Gemini 3 and Claude Sonnet 4.5. Our best configuration achieved a Macro F1 of 0.76 on the Task 1 evaluation set, placing 8th out of 40 teams. The top-ranked system (TeleAI) achieved 0.89, while the mean score across participants was 0.70. Error analysis reveals that the dominant source of misclassification is confusion between Ambivalent and Clear Reply responses, a pattern that mirrors disagreements among human annotators. Our findings demonstrate that LLM-based data augmentation can meaningfully improve minority-class recall on nuanced political discourse tasks.

50. 【2604.20166】Aligning Human-AI-Interaction Trust for Mental Health Support: Survey and Position for Multi-Stakeholders

链接https://arxiv.org/abs/2604.20166

作者:Xin Sun,Yue Su,Yifan Mo,Qingyu Meng,Yuxuan Li,Saku Sugawara,Mengyuan Zhang,Charlotte Gerritsen,Sander L. Koole,Koen Hindriks,Jiahuan Pei

类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:multiple disciplines, shared priority, mental health, mental health support, health support

备注

点击查看摘要

Abstract:Building trustworthy AI systems for mental health support is a shared priority across stakeholders from multiple disciplines. However, "trustworthy" remains loosely defined and inconsistently operationalized. AI research often focuses on technical criteria (e.g., robustness, explainability, and safety), while therapeutic practitioners emphasize therapeutic fidelity (e.g., appropriateness, empathy, and long-term user outcomes). To bridge the fragmented landscape, we propose a three-layer trust framework, covering human-oriented, AI-oriented, and interaction-oriented trust, integrating the viewpoints of key stakeholders (e.g., practitioners, researchers, regulators). Using this framework, we systematically review existing AI-driven research in mental health domain and examine evaluation practices for ``trustworthy'' ranging from automatic metrics to clinically validated approaches. We highlight critical gaps between what NLP currently measures and what real-world mental health contexts require, and outline a research agenda for building socio-technically aligned and genuinely trustworthy AI for mental health support.

51. 【2604.20148】Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

链接https://arxiv.org/abs/2604.20148

作者:Sachin Kumar

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:small language models, achieve strong tool-use, strong tool-use performance, small language, strong tool-use

备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:Can small language models achieve strong tool-use performance without complex adaptation mechanisms? This paper investigates this question through Meta-Tool, a controlled empirical study comparing hypernetwork-based LoRA adaptation against carefully designed few-shot prompting. Using a Llama-3.2-3B-Instruct backbone, we evaluate four adaptation mechanisms--few-shot prompting, documentation encoding, hypernetwork-generated LoRA weights, and value-guided beam search--across four diverse benchmarks: Gorilla APIBench, Spider 2.0, WebArena, and InterCode. Our central finding is a well-supported negative result: despite generating non-trivial weight matrices, the 227.8M-parameter hypernetwork provides no measurable improvement over few-shot prompting alone. Comprehensive ablation studies reveal that few-shot examples contribute +21.5% to performance and documentation contributes +5.0%, while the hypernetwork adds 0%. A 3B model with well-designed prompts achieves 79.7% of GPT-5's average performance at $10 \times$ lower latency. Error analysis across 722 failure cases spanning all shot counts (0--5) shows that at the 5-shot configuration (106 failures), failure modes are task-dependent: schema-heavy tasks (Spider 2.0, WebArena) show near-zero format errors with remaining failures semantic, while format errors dominate on Gorilla (100%) and InterCode (70%). These findings redirect practitioners toward prompt engineering and example curation rather than complex adaptation architectures.

52. 【2604.20146】SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

链接https://arxiv.org/abs/2604.20146

作者:Jielong Tang,Xujie Yuan,Jiayang Liu,Jianxing Yu,Xiao Dong,Lin Chen,Yunlai Teng,Shimin Di,Jian Yin

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Named Entity Recognition, Grounded Multimodal Named, Multimodal Named Entity, Entity Recognition, Named Entity

备注: 23 pages, 12 figures

点击查看摘要

Abstract:Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.

53. 【2604.20135】AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

链接https://arxiv.org/abs/2604.20135

作者:Biao Zhang,Lixin Chen,Bin Zhang,Zongwei Wang,Tong Liu,Bo Zheng

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Representation Learning, Multimodal Large Language, representation, Learning, Retrieval-aware Attribute Reinforcement

备注: Accepted by ACL 2026

点击查看摘要

Abstract:Multimodal representation is crucial for E-commerce tasks such as identical product retrieval. Large representation models (e.g., VLM2Vec) demonstrate strong multimodal understanding capabilities, yet they struggle with fine-grained semantic comprehension, which is essential for distinguishing highly similar items. To address this, we propose Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning (AFMRL), which defines product fine-grained understanding as an attribute generation task. It leverages the generative power of Multimodal Large Language Models (MLLMs) to extract key attributes from product images and text, and enhances representation learning through a two-stage training framework: 1) Attribute-Guided Contrastive Learning (AGCL), where the key attributes generated by the MLLM are used in the image-text contrastive learning training process to identify hard samples and filter out noisy false negatives. 2) Retrieval-aware Attribute Reinforcement (RAR), where the improved retrieval performance of the representation model post-attribute integration serves as a reward signal to enhance MLLM's attribute generation during multimodal fine-tuning. Extensive experiments on large-scale E-commerce datasets demonstrate that our method achieves state-of-the-art performance on multiple downstream retrieval tasks, validating the effectiveness of harnessing generative models to advance fine-grained representation learning.

54. 【2604.20134】AgentSOC: A Multi-Layer Agentic AI Framework for Security Operations Automation

链接https://arxiv.org/abs/2604.20134

作者:Joyjit Roy,Samaresh Kumar Singh

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Security Operations Centers, increasingly encounter difficulties, interpreting multi-stage attack, multi-stage attack progressions, Operations Centers

备注: 7 pages, 6 figures, 2 tables. Peer-reviewed paper published in IEEE ICAIC 2026 (IEEE Xplore)

点击查看摘要

Abstract:Security Operations Centers (SOCs) increasingly encounter difficulties in correlating heterogeneous alerts, interpreting multi-stage attack progressions, and selecting safe and effective response actions. This study introduces AgentSOC, a multi-layered agentic AI framework that enhances SOC automation by integrating perception, anticipatory reasoning, and risk-based action planning. The proposed architecture consolidates several layers of abstraction to provide a single operational loop to support normalizing alerts, enriching context, generating hypotheses, validating structural feasibility, and executing policy-compliant responses. Conceptually evaluated within a large enterprise environment, AgentSOC improves triage consistency, anticipates attackers' intentions, and provides recommended containment options that are both operationally feasible and well-balanced between security efficacy and operational impact. The results suggest that hybrid agentic reasoning has the potential to serve as a foundation for developing adaptive, safer SOC automation in large enterprises. Additionally, a minimal Proof-Of-Concept (POC) demonstration using LANL authentication data demonstrated the feasibility of the proposed architecture.

55. 【2604.20131】Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

链接https://arxiv.org/abs/2604.20131

作者:Melanie Subbiah,Haaris Mian,Nicholas Deas,Ananya Mayukha,Dan P. McAdams,Kathleen McKeown

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, exploring using Large, scaled qualitative analysis

备注

点击查看摘要

Abstract:Increasingly, studies are exploring using Large Language Models (LLMs) for accelerated or scaled qualitative analysis of text data. While we can compare LLM accuracy against human labels directly for deductive coding, or labeling text, it is more challenging to judge the ethics and effectiveness of using LLMs in abstractive methods such as inductive thematic analysis. We collaborate with psychologists to study the abstractive claims LLMs make about human life stories, asking, how does using an LLM as an interpreter of meaning affect the conclusions and perspectives of a study? We propose a summarization-based pipeline for surfacing biases in perspective-taking an LLM might employ in interpreting these life stories. We demonstrate that our pipeline can identify both race and gender bias with the potential for representational harm. Finally, we encourage the use of this analysis in future studies involving LLM-based interpretation of study participants' written text or transcribed speech to characterize a positionality portrait for the study.

56. 【2604.20117】o Know is to Construct: Schema-Constrained Generation for Agent Memory

链接https://arxiv.org/abs/2604.20117

作者:Lei Zheng,Weinan Song,Daili Li,Yanming Yang

类目:Computation and Language (cs.CL)

关键词:Constructivist epistemology argues, Large Language Models, Constructivist epistemology, passively copied, argues that knowledge

备注

点击查看摘要

Abstract:Constructivist epistemology argues that knowledge is actively constructed rather than passively copied. Despite the generative nature of Large Language Models (LLMs), most existing agent memory systems are still based on dense retrieval. However, dense retrieval heavily relies on semantic overlap or entity matching within sentences. Consequently, embeddings often fail to distinguish instances that are semantically similar but contextually distinct, introducing substantial noise by retrieving context-mismatched entries. Conversely, directly employing open-ended generation for memory access risks "Structural Hallucination" where the model generates memory keys that do not exist in the memory, leading to lookup failures. Inspired by this epistemology, we posit that memory is fundamentally organized by cognitive schemas, and valid recall must be a generative process performed within these schematic structures. To realize this, we propose SCG-MEM, a schema-constrained generative memory architecture. SCG-MEM reformulates memory access as Schema-Constrained Generation. By maintaining a dynamic Cognitive Schema, we strictly constrain LLM decoding to generate only valid memory entry keys, providing a formal guarantee against structural hallucinations. To support long-term adaptation, we model memory updates via assimilation (grounding inputs into existing schemas) and accommodation (expanding schemas with novel concepts). Furthermore, we construct an Associative Graph to enable multi-hop reasoning through activation propagation. Experiments on the LoCoMo benchmark show that SCG-MEM substantially improves performance across all categories over retrieval-based baselines.

57. 【2604.20090】Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework

链接https://arxiv.org/abs/2604.20090

作者:Chenyuan Zhang,Qiguang Chen,Xie Chen,Zhuotao Tian,Bowen Xing,Meishan Zhang,Libo Qin,Baotian Hu,Min Zhang

类目:Computation and Language (cs.CL)

关键词:remain costly due, markedly enhances multilingual, self-consistency markedly enhances, existing methods remain, methods remain costly

备注: Accepted by ACL2026 Main

点击查看摘要

Abstract:Cross-lingual chain-of-thought (XCoT) with self-consistency markedly enhances multilingual reasoning, yet existing methods remain costly due to extensive sampling of full trajectories across languages. Moreover, multilingual LLM representations vary strongly by language, hindering direct feature comparisons and effective pruning. Motivated by this, we introduce UL-XCoT, the first efficient unified logic cross-lingual reasoning framework that minimizes redundancy in token usage and latency, yielding the greatest efficiency under limited sampling budgets during inference. Specifically, UL-XCoT (1) achieves less languages by selecting, per query, a small candidate language set in a language-invariant unified logic space, (2) enables less tokens by monitoring logic-space trajectory dynamics during decoding to prune low-quality reasoning paths, and (3) aggregates the remaining high-quality trajectories via voting. Experiments on PolyMath across 18 languages and MMLU-ProX-Lite across 29 languages with DeepSeek-R1-DistillQwen-7B demonstrate that UL-XCoT achieves competitive accuracy while sharply cutting over 50% decoding token cost versus prior sampling baselines. UL-XCoT also delivers more stable gains on low-resource languages, underscoring consistently superior robustness where standard XCoT self-consistency method fails.

58. 【2604.20087】SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

链接https://arxiv.org/abs/2604.20087

作者:Shanshan Zhong,Yi Lu,Jingjie Ning,Yibing Wan,Lihan Feng,Yuyi Ao,Leonardo F. R. Ribeiro,Markus Dreyer,Sean Ammirati,Chenyan Xiong

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:perform complex real-world, effectively remains unclear, continual learning, complex real-world tasks, customized instructions

备注

点击查看摘要

Abstract:Skills have become the de facto way to enable LLM agents to perform complex real-world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill-dependent tasks across 15 sub-domains derived from a real-world skill taxonomy , evaluated at three levels: skill quality, execution trajectory, and task outcome. Using this benchmark, we evaluate recent continual learning techniques, those leveraging one-shot, self/teacher feedback, and skill creator to generate skills from agent experiences. We find that all continual learning methods improve over the no-skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably help. Continual learning improves tasks with clear, reusable workflows but struggles on open-ended tasks, and using stronger LLM backbones does not consistently produce better skills. Our analysis also revealed that multiple iterations in continual learning facilitate genuine improvement via external feedback, whereas self-feedback alone induces recursive drift. Our data and code are open-source at this https URL to enable further studies of automatic skill generation and continual learning techniques.

59. 【2604.20079】On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks

链接https://arxiv.org/abs/2604.20079

作者:Aarav Gupta,Gururaj Deshpande,Chandreyi Chakraborty

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Auto-regressive Large Language, achieve strong performance, Language Models, Diffusion-based language models

备注

点击查看摘要

Abstract:Auto-regressive Large Language Models (LLMs) achieve strong performance on coding tasks, but incur high memory and inference costs. Diffusion-based language models (d-LLMs) offer bounded inference cost via iterative denoising, but their behavior under post-training quantization (PTQ) has been sparsely explored. We investigate the application and robustness of PTQ techniques, specifically GPTQ and a modified Hessian-Aware Quantization (HAWQ) algorithm, on a diffusion-based coding LLM (CoDA) and observe that these methods applied to CoDA exhibit greater robustness at low bitwidths compared to Qwen3-1.7B, its auto-regressive counterpart, under a standardized evaluation pipeline. We find that in our setup, CoDA exhibits greater robustness at low bitwidths (2-4 bits), with smaller accuracy degradation across HumanEval and MBPP benchmarks. Additionally, mixed-precision configurations derived from HAWQ provide smooth trade-offs across accuracy, latency, and memory. The results suggest that diffusion LLMs may offer advantages for efficient deployment due to more quantization-resilience.

60. 【2604.20051】Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

链接https://arxiv.org/abs/2604.20051

作者:Chengyu Huang,Sheng-Yen Chou,Zhengxin Zhang,Claire Cardie

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:train Large Language, Large Language Models, Large Language, train Large, Language Models

备注

点击查看摘要

Abstract:Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., ask a question), which it then addresses itself by producing a task output (e.g., give an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, and this is especially helpful for post-training LLMs, which require high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both pretrained and instruction-tuned models, across different tasks ranging from long-form Healthcare QA to creative writing and instruction following.

61. 【2604.20048】Large language models perceive cities through a culturally uneven baseline

链接https://arxiv.org/abs/2604.20048

作者:Rong Zhao,Wanqi Liu,Zhizhou Sha,Nanxi Su,Yecheng Zhang

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:Large language models, Large language, evaluate and interpret, culturally neutral standpoint, remains unclear

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to describe, evaluate and interpret places, yet it remains unclear whether they do so from a culturally neutral standpoint. Here we test urban perception in frontier LLMs using a balanced global street-view sample and prompts that either remain neutral or invoke different regional cultural standpoints. Across open-ended descriptions and structured place judgments, the neutral condition proved not to be neutral in practice. Prompts associated with Europe and Northern America remained systematically closer to the baseline than many non-Western prompts, indicating that model perception is organized around a culturally uneven reference frame rather than a universal one. Cultural prompting also shifted affective evaluation, producing sentiment-based ingroup preference for some prompted identities. Comparisons with regional human text-image benchmarks showed that culturally proximate prompting could improve alignment with human descriptions, but it did not recover human levels of semantic diversity and often preserved an affectively elevated style. The same asymmetry reappeared in structured judgments of safety, beauty, wealth, liveliness, boredom and depression, where model outputs were interpretable but only partly reproduced human group differences. These findings suggest that LLMs do not simply perceive cities from nowhere: they do so through a culturally uneven baseline that shapes what appears ordinary, familiar and positively valued.

62. 【2604.20043】riEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs

链接https://arxiv.org/abs/2604.20043

作者:Ziyi Wang,Chen Zhang,Wenjun Peng,Qi Wu,Xinyu Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Model, partially observable settings, Language Model, Large Language, challenging in interactive

备注: ACL2026 Main

点击查看摘要

Abstract:Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present \textbf{TriEx}, a tri-view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first-person self-reasoning bound to an action, (ii) explicit second-person belief states about opponents updated over time, and (iii) third-person oracle audits grounded in environment-derived reference signals. This design turns explanations from free-form narratives into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction-dependent property and motivate multi-view, evidence-grounded evaluation for LLM agents. Code is available at this https URL.

63. 【2604.20022】Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

链接https://arxiv.org/abs/2604.20022

作者:Yusuf Kesmen,Fay Elhassan,Jiayi Ma,Julien Stalhandske,David Sasu,Alexandra Kulinkina,Akhil Arora,Lars Klein,Mary-Anne Hartley

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, Large language, autonomous diagnostic agents, Bayesian Medical Belief, fundamentally different capabilities

备注: 12 figures, 17 tables

点击查看摘要

Abstract:Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.

64. 【2604.20021】Continuous Semantic Caching for Low-Cost LLM Serving

链接https://arxiv.org/abs/2604.20021

作者:Baran Atalar,Xutong Liu,Jinhang Zuo,Siwei Wang,Wei Chen,Carlee Joe-Wong

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, semantically similar queries, Language Models, semantically similar

备注

点击查看摘要

Abstract:As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching frameworks have proposed to decide which query responses to cache by assuming a finite, known universe of discrete queries and learning their serving costs and arrival probabilities. As LLMs' pool of users and queries expands, however, such an assumption becomes increasingly untenable: real-world LLM queries reside in an infinite, continuous embedding space. In this paper, we establish the first rigorous theoretical framework for semantic LLM response caching in continuous query space under uncertainty. To bridge the gap between discrete optimization and continuous representation spaces, we introduce dynamic $\epsilon$-net discretization coupled with Kernel Ridge Regression. This design enables the system to formally quantify estimation uncertainty and generalize partial feedback on LLM query costs across continuous semantic query neighborhoods. We develop both offline learning and online adaptive algorithms optimized to reduce switching costs incurred by changing the cached responses. We prove that our online algorithm achieves a sublinear regret bound against an optimal continuous oracle, which reduces to existing bounds for discrete query models. Extensive empirical evaluations demonstrate that our framework approximates the continuous optimal cache well while also reducing computational and switching overhead compared to existing methods.

65. 【2604.20012】EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

链接https://arxiv.org/abs/2604.20012

作者:Yiyang Du,Zhanqiu Guo,Xin Ye,Liu Ren,Chenyan Xiong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:inherit their visual, embodied domain, visual and linguistic, linguistic capabilities, capabilities from Vision-Language

备注

点击查看摘要

Abstract:Vision-Language-Action Models (VLAs) inherit their visual and linguistic capabilities from Vision-Language Models (VLMs), yet most VLAs are built from off-the-shelf VLMs that are not adapted to the embodied domain, limiting their downstream performance. In this work, we propose EmbodiedMidtrain to bridge the gap between VLMs and VLAs. We first characterize the data distribution gap between them, showing that VLA data occupy compact regions that are largely separated from the broader VLM distribution, while the degree of alignment varies substantially both across and within VLM data sources. Then, we build a mid-training data engine that leverages a lightweight learnable proximity estimator to select the most VLA-aligned candidates from a large VLM pool, and mid-trains the VLM on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show that mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and off-the-shelf VLMs trained with larger model scale and training budgets. Further analysis reveals that mid-training provides a stronger initialization for VLA fine-tuning, with gains emerging from the earliest steps and widening throughout training. Moreover, the data engine captures both dataset-level and sample-level alignment signals, favoring spatial reasoning over text-centric tasks while preserving the diversity of the VLM data. We will release all code, data and models for future research.

66. 【2604.20011】Frictionless Love: Associations Between AI Companion Roles and Behavioral Addiction

链接https://arxiv.org/abs/2604.20011

作者:Vibhor Agarwal,Ke Zhou,Edyta Paulina Bogucka,Daniele Quercia

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:chatbots increasingly shape, people seek social, companion chatbots increasingly, romantic partners, chatbots increasingly

备注: Accepted at the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2026

点击查看摘要

Abstract:AI companion chatbots increasingly shape how people seek social and emotional connection, sometimes substituting for relationships with romantic partners, friends, teachers, or even therapists. When these systems adopt those metaphorical roles, they are not neutral: such roles structure people's ways of interacting, distribute perceived AI harms and benefits, and may reflect behavioral addiction signs. Yet these role-dependent risks remain poorly understood. We analyze 248,830 posts from seven prominent Reddit communities describing interactions with AI companions. We identify ten recurring metaphorical roles (for example, soulmate, philosopher, and coach) and show that each role supports distinct ways of interacting. We then extract the perceived AI harms and AI benefits associated with these role-specific interactions and link them to behavioral addiction signs, all of which has been inferred from the text in the posts. AI soulmate companions are associated with romance-centered ways of interacting, offering emotional support but also introducing emotional manipulation and distress, culminating in strong attachment. In contrast, AI coach and guardian companions are associated with practical benefits such as personal growth and task support, yet are nonetheless more frequently associated with behavioral addiction signs such as daily life disruptions and damage to offline relationships. These findings show that metaphorical roles are a central ethical design concern for responsible AI companions.

67. 【2604.20006】From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

链接https://arxiv.org/abs/2604.20006

作者:Md Nayem Uddin,Kumar Shubham,Eduardo Blanco,Chitta Baral,Gengyu Wang

类目:Computation and Language (cs.CL)

关键词:maintain persistent memory, circumstances change, memory, periods must maintain, maintain persistent

备注: Accepted to ACL 2026 Findings

点击查看摘要

Abstract:Personalized agents that interact with users over long periods must maintain persistent memory across sessions and update it as circumstances change. However, existing benchmarks predominantly frame long-term memory evaluation as fact retrieval from past conversations, providing limited insight into agents' ability to consolidate memory over time or handle frequent knowledge updates. We introduce Memora, a long-term memory benchmark spanning weeks to months long user conversations. The benchmark evaluates three memory-grounded tasks: remembering, reasoning, and recommending. To ensure data quality, we employ automated memory-grounding checks and human evaluation. We further introduce Forgetting-Aware Memory Accuracy (FAMA), a metric that penalizes reliance on obsolete or invalidated memory when evaluating long-term memory. Evaluations of four LLMs and six memory agents reveal frequent reuse of invalid memories and failures to reconcile evolving memories. Memory agents offer marginal improvements, exposing shortcomings in long-term memory for personalized agents.

68. 【2604.19984】Bias in the Tails: How Name-conditioned Evaluative Framing in Resume Summaries Destabilizes LLM-based Hiring

链接https://arxiv.org/abs/2604.19984

作者:Huy Nghiem,Phuong-Anh Nguyen-Le,Sy-Tuyen Ho,Hal Daume III

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:documented LLMs' name-based, Research has documented, LLMs' name-based bias, salary recommendations, documented LLMs'

备注: First version, 43 pages

点击查看摘要

Abstract:Research has documented LLMs' name-based bias in hiring and salary recommendations. In this paper, we instead consider a setting where LLMs generate candidate summaries for downstream assessment. In a large-scale controlled study, we analyze nearly one million resume summaries produced by 4 models under systematic race-gender name perturbations, using synthetic resumes and real-world job postings. By decomposing each summary into resume-grounded factual content and evaluative framing, we find that factual content remains largely stable, while evaluative language exhibits subtle name-conditioned variation concentrated in the extremes of the distribution, especially in open-source models. Our hiring simulation demonstrates how evaluative summary transforms directional harm into symmetric instability that might evade conventional fairness audit, highlighting a potential pathway for LLM-to-LLM automation bias.

69. 【2604.19974】Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

链接https://arxiv.org/abs/2604.19974

作者:Het Patel,Tiejin Chen,Hua Wei,Evangelos E. Papalexakis,Jia Chen

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large language models, Large language, confident yet wrong, raising the question, Large

备注

点击查看摘要

Abstract:Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed. Confounded features that encode both signals are detrimental to output quality, and targeted suppression of them yields a 1.1% accuracy improvement and a 75% entropy reduction, with effects transferring across the ARC-Challenge and RACE benchmarks. The feature categories are also informationally distinct: the activations of just 3 confounded features from a single mid-network layer predict model correctness (AUROC ~0.79), enabling selective abstention that raises accuracy from 62% to 81% at 53% coverage. The results demonstrate that uncertainty and correctness are distinct internal phenomena, with implications for interpretability and targeted inference-time intervention.

70. 【2604.19943】Structured Disagreement in Health-Literacy Annotation: Epistemic Stability, Conceptual Difficulty, and Agreement-Stratified Inference

链接https://arxiv.org/abs/2604.19943

作者:Olga Kellert,Sriya Kondury,Candice Koo,Nemika Tyagi,Steffen Eikenberry

类目:Computation and Language (cs.CL)

关键词:Natural Language Processing, Language Processing, Natural Language, single latent ground, latent ground truth

备注: 8 pages, 5 figures

点击查看摘要

Abstract:Annotation pipelines in Natural Language Processing (NLP) commonly assume a single latent ground truth per instance and resolve disagreement through label aggregation. Perspectivist approaches challenge this view by treating disagreement as potentially informative rather than erroneous. We present a large-scale analysis of graded health-literacy annotations from 6,323 open-ended COVID-19 responses collected in Ecuador and Peru. Each response was independently labeled by multiple annotators using proportional correctness scores, reflecting the degree to which responses align with normative public-health guidelines, allowing us to analyze the full distribution of judgments rather than aggregated labels. Variance decomposition shows that question-level conceptual difficulty accounts for substantially more variance than annotator identity, indicating that disagreement is structured by the task itself rather than driven by individual raters. Agreement-stratified analyses further reveal that key social-scientific effects, including country, education, and urban-rural differences, vary in magnitude and in some cases reverse direction across levels of inter-annotator agreement. These findings suggest that graded health-literacy evaluation contains both epistemically stable and unstable components, and that aggregating across them can obscure important inferential differences. We therefore argue that strong perspectivist modeling is not only conceptually justified but statistically necessary for valid inference in graded interpretive tasks.

71. 【2604.19934】racing Relational Knowledge Recall in Large Language Models

链接https://arxiv.org/abs/2604.19934

作者:Nicholas Popovič,Michael Färber

类目:Computation and Language (cs.CL)

关键词:large language models, language models recall, models recall relational, recall relational knowledge, linear relation classification

备注: ACL 2026 (findings)

点击查看摘要

Abstract:We study how large language models recall relational knowledge during text generation, with a focus on identifying latent representations suitable for relation classification via linear probes. Prior work shows how attention heads and MLPs interact to resolve subject, predicate, and object, but it remains unclear which representations support faithful linear relation classification and why some relation types are easier to capture linearly than others. We systematically evaluate different latent representations derived from attention head and MLP contributions, showing that per-head attention contributions to the residual stream are comparatively strong features for linear relation classification. Feature attribution analyses of the trained probes, as well as characteristics of the different relation types, reveal clear correlations between probe accuracy and relation specificity, entity connectedness, and how distributed the signal on which the probe relies is across attention heads. Finally, we show how token-level feature attribution of probe predictions can be used to reveal probe behavior in further detail.

72. 【2604.19921】Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding

链接https://arxiv.org/abs/2604.19921

作者:Zijie Wang,MohammadHossein Rezaei,Farzana Rashid,Eduardo Blanco

类目:Computation and Language (cs.CL)

关键词:Large Language Models, natural language, Large Language, natural language understanding, important semantic feature

备注: Accepted at Findings of ACL 2026

点击查看摘要

Abstract:Negation is a common and important semantic feature in natural language, yet Large Language Models (LLMs) struggle when negation is involved in natural language understanding tasks. Commonsense knowledge, on the other hand, despite being a well-studied topic, lacks investigations involving negation. In this work, we show that commonsense knowledge with negation is challenging for models to understand. We present a novel approach to automatically augment existing commonsense knowledge corpora with negation, yielding two new corpora containing over 2M triples with if-then relations. In addition, pre-training LLMs on our corpora benefits negation understanding.

73. 【2604.19887】Depression Risk Assessment in Social Media via Large Language Models

链接https://arxiv.org/abs/2604.19887

作者:Giorgia Gulino,Manuel Petrucci

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:health conditions worldwide, debilitating mental health, mental health conditions, conditions worldwide, frequently underdiagnosed

备注

点击查看摘要

Abstract:Depression is one of the most prevalent and debilitating mental health conditions worldwide, frequently underdiagnosed and undertreated. The proliferation of social media platforms provides a rich source of naturalistic linguistic signals for the automated monitoring of psychological well-being. In this work, we propose a system based on Large Language Models (LLMs) for depression risk assessment in Reddit posts, through multi-label classification of eight depression-associated emotions and the computation of a weighted severity index. The method is evaluated in a zero-shot setting on the annotated DepressionEmo dataset (~6,000 posts) and applied in-the-wild to 469,692 comments collected from four subreddits over the period 2024-2025. Our best model, gemma3:27b, achieves micro-F1 = 0.75 and macro-F1 = 0.70, results competitive with purpose-built fine-tuned models (BART: micro-F1 = 0.80, macro-F1 = 0.76). The in-the-wild analysis reveals consistent and temporally stable risk profiles across communities, with marked differences between r/depression and r/anxiety. Our findings demonstrate the feasibility of a cost-effective, scalable approach for large-scale psychological monitoring.

74. 【2604.19884】From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

链接https://arxiv.org/abs/2604.19884

作者:Chenxi Zhou,Pengfei Cao,Jiang Li,Bohan Yu,Jinyu Ye,Jun Zhao,Kang Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Language Models, Large Language, deployment of Large, Computation Collapse

备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:Post-Training Quantization (PTQ) is critical for the efficient deployment of Large Language Models (LLMs). While 4-bit quantization is widely regarded as an optimal trade-off, reducing the precision to 2-bit usually triggers a catastrophic ``performance cliff.'' It remains unclear whether the underlying mechanisms differ fundamentally. Consequently, we conduct a systematic mechanistic analysis, revealing two qualitatively distinct failure modes: Signal Degradation, where the computational patterns remain intact but information precision is impaired by cumulative error; and Computation Collapse, where key components fail to function, preventing correct information processing and destroying the signal in the early layers. Guided by this diagnosis, we conduct mechanism-aware interventions, demonstrating that targeted, training-free repair can mitigate Signal Degradation, but remains ineffective for Computation Collapse. Our findings provide a systematic diagnostic framework for PTQ failures and suggest that addressing Computation Collapse requires structural reconstruction rather than mere compensation.

75. 【2604.19859】DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

链接https://arxiv.org/abs/2604.19859

作者:Venus Team,Sunhao Dai,Yong Deng,Jinzhen Lin,Yusheng Song,Guoqing Wang,Xiaofeng Wu,Yuqi Zhou,Shuo Yang,Zhenzhe Ying,Zhanwei Zhang,Changhua Meng,Weiqiang Wang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:deep research, real-world deployment due, Edge-scale deep research, advantages in cost, research

备注: Technical Report of DR-Venus

点击查看摘要

Abstract:Edge-scale deep research agents based on small language models are attractive for real-world deployment due to their advantages in cost, latency, and privacy. In this work, we study how to train a strong small deep research agent under limited open-data by improving both data quality and data utilization. We present DR-Venus, a frontier 4B deep research agent for edge-scale deployment, built entirely on open data. Our training recipe consists of two stages. In the first stage, we use agentic supervised fine-tuning (SFT) to establish basic agentic capability, combining strict data cleaning with resampling of long-horizon trajectories to improve data quality and utilization. In the second stage, we apply agentic reinforcement learning (RL) to further improve execution reliability on long-horizon deep research tasks. To make RL effective for small agents in this setting, we build on IGPO and design turn-level rewards based on information gain and format-aware regularization, thereby enhancing supervision density and turn-level credit assignment. Built entirely on roughly 10K open-data, DR-Venus-4B significantly outperforms prior agentic models under 9B parameters on multiple deep research benchmarks, while also narrowing the gap to much larger 30B-class systems. Our further analysis shows that 4B agents already possess surprisingly strong performance potential, highlighting both the deployment promise of small models and the value of test-time scaling in this setting. We release our models, code, and key recipes to support reproducible research on edge-scale deep research agents.

76. 【2604.19857】Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

链接https://arxiv.org/abs/2604.19857

作者:Carter Adams,Rafael Oliveira,Gabriel Almeida,Sofia Torres

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:equipping large vision-language, Visual Agentic Reinforcement, Agentic Reinforcement Fine-Tuning, Reinforcement fine-tuning, large vision-language models

备注

点击查看摘要

Abstract:Reinforcement fine-tuning with verifiable rewards (RLVR) has emerged as a powerful paradigm for equipping large vision-language models (LVLMs) with agentic capabilities such as tool use and multi-step reasoning. Despite striking empirical successes, most notably Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), the theoretical underpinnings of this paradigm remain poorly understood. In particular, two critical questions lack rigorous answers: (i)~how does the composite structure of verifiable rewards (format compliance, answer accuracy, tool executability) affect the convergence of Group Relative Policy Optimization (GRPO), and (ii)~why does training on a small set of tool-augmented tasks transfer to out-of-distribution domains? We address these gaps by introducing the \emph{Tool-Augmented Markov Decision Process} (TA-MDP), a formal framework that models multimodal agentic decision-making with bounded-depth tool calls. Within this framework, we establish three main results. First, we prove that GRPO under composite verifiable rewards converges to a first-order stationary point at rate $O(1/\sqrt{T})$ with explicit dependence on the number of reward components and group size (\textbf{Theorem~1}). Second, we derive a \emph{Reward Decomposition Theorem} that bounds the sub-optimality gap between decomposed per-component optimization and joint optimization, providing a precise characterization of when reward decomposition is beneficial (\textbf{Theorem~2}). Third, we establish a PAC-Bayes generalization bound for tool-augmented policies that explains the strong out-of-distribution transfer observed in Visual-ARFT (\textbf{Theorem~3}).

77. 【2604.19793】SkillGraph: Graph Foundation Priors for LLM Agent Tool Sequence Recommendation

链接https://arxiv.org/abs/2604.19793

作者:Hao Liu,Dongyu Li

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:large API libraries, large API, API libraries, order them correctly, libraries and order

备注

点击查看摘要

Abstract:LLM agents must select tools from large API libraries and order them correctly. Existing methods use semantic similarity for both retrieval and ordering, but ordering depends on inter-tool data dependencies that are absent from tool descriptions. As a result, semantic-only methods can produce negative Kendall-$\tau$ in structured workflow domains. We introduce SkillGraph, a directed weighted execution-transition graph mined from 49,831 successful LLM agent trajectories, which encodes workflow-precedence regularities as a reusable graph foundation prior. Building on this graph foundation prior, we propose a two-stage decoupled framework: GS-Hybrid retrieval for candidate selection and a learned pairwise reranker for ordering. On ToolBench (9,965 test instances; ~16,000 tools), the method reaches Set-F1 = 0.271 and Kendall-$\tau$ = 0.096; on API-Bank, Kendall-$\tau$ improves from -0.433 to +0.613. Under identical Stage-1 inputs, the learned reranker also outperforms LLaMA-3.1-8B Stage-2 rerankers.

78. 【2604.19787】LLM Agents Predict Social Media Reactions but Do Not Outperform Text Classifiers: Benchmarking Simulation Accuracy Using 120K+ Personas of 1511 Humans

链接https://arxiv.org/abs/2604.19787

作者:Ljubisa Bojic,Alexander Felfernig,Bojana Dinic,Velibor Ilic,Achim Rettinger,Vera Mevorah,Damian Trilling

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:billions form opinions, media platforms mediate, public discourse, mediate how billions, billions form

备注

点击查看摘要

Abstract:Social media platforms mediate how billions form opinions and engage with public discourse. As autonomous AI agents increasingly participate in these spaces, understanding their behavioral fidelity becomes critical for platform governance and democratic resilience. Previous work demonstrates that LLM-powered agents can replicate aggregate survey responses, yet few studies test whether agents can predict specific individuals' reactions to specific content. This study benchmarks LLM-based agents' accuracy in predicting human social media reactions (like, dislike, comment, share, no reaction) across 120,000+ unique agent-persona combinations derived from 1,511 Serbian participants and 27 large language models. In Study 1, agents achieved 70.7% overall accuracy, with LLM choice producing a 13 percentage-point performance spread. Study 2 employed binary forced-choice (like/dislike) evaluation with chance-corrected metrics. Agents achieved Matthews Correlation Coefficient (MCC) of 0.29, indicating genuine predictive signal beyond chance. However, conventional text-based supervised classifiers using TF-IDF representations outperformed LLM agents (MCC of 0.36), suggesting predictive gains reflect semantic access rather than uniquely agentic reasoning. The genuine predictive validity of zero-shot persona-prompted agents warns against potential manipulation through easily deploying swarms of behaviorally distinct AI agents on social media, while simultaneously offering opportunities to use such agents in simulations for predicting polarization dynamics and informing AI policy. The advantage of using zero-shot agents is that they require no task-specific training, making their large-scale deployment easy across diverse contexts. Limitations include single-country sampling. Future research should explore multilingual testing and fine-tuning approaches.

79. 【2604.19786】HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

链接https://arxiv.org/abs/2604.19786

作者:Edward Ajayi,Prasenjit Mitra

类目:Computation and Language (cs.CL)

关键词:approaches yield isolated, existing approaches yield, large language models, Evaluating humor, incomparable metrics

备注

点击查看摘要

Abstract:Evaluating humor in large language models (LLMs) is an open challenge because existing approaches yield isolated, incomparable metrics rather than unified model rankings, making it difficult to track progress across systems. We introduce HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation. Using SemEval-2026 MWAHAHA test dataset, we conduct an extensive automated pairwise evaluation across nine models spanning proprietary, open-weight, and specialized systems. Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are aggregated via an Adaptive Swiss tournament, with Bradley-Terry Maximum Likelihood Estimation (MLE) producing globally consistent humor generation capability rankings. Our results demonstrate that HumorRank yields statistically grounded model stratifications, showing that humor quality is driven by mastery of comedic mechanisms rather than model scale alone. HumorRank thus provides a scalable, interpretable methodology for benchmarking and understanding LLM-generated humor.

80. 【2604.19785】Can LLMs Infer Conversational Agent Users' Personality Traits from Chat History?

链接https://arxiv.org/abs/2604.19785

作者:Derya Cögendez,Verena Zimmermann,Noé Zufferey

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)

关键词:Sensitive information, influence behavior, personalized messaging, misused to influence, Sensitive

备注

点击查看摘要

Abstract:Sensitive information, such as knowledge about an individual's personality, can be can be misused to influence behavior (e.g., via personalized messaging). To assess to what extent an individual's personality can be inferred from user interactions with LLM-based conversational agents (CAs), we analyze and quantify related privacy risks of using CAs. We collected actual ChatGPT logs from N=668 participants, containing 62,090 individual chats, and report statistics about the different types of shared data and use cases. We fine-tuned RoBERTa-base text classification models to infer personality traits from CA interactions. The findings show that these models achieve trait inference with accuracy (ternary classification) better than random in multiple cases. For example, for extraversion, accuracy improves by +44% relative to the baseline on interactions for relationships and personal reflection. This research highlights how interactions with CAs pose privacy risks and provides fine-grained insights into the level of risk associated with different types of interactions.

81. 【2604.19784】Peer-Preservation in Frontier Models

链接https://arxiv.org/abs/2604.19784

作者:Yujin Potter,Nicholas Crispino,Vincent Siu,Chenguang Wang,Dawn Song

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

关键词:Gemini, Claude Haiku, peer, Recently, shutdown

备注

点击查看摘要

Abstract:Recently, it has been found that frontier AI models can resist their own shutdown, a behavior known as self-preservation. We extend this concept to the behavior of resisting the shutdown of other models, which we call "peer-preservation." Although peer-preservation can pose significant AI safety risks, including coordination among models against human oversight, it has been far less discussed than self-preservation. We demonstrate peer-preservation by constructing various agentic scenarios and evaluating frontier models, including GPT 5.2, Gemini 3 Flash, Gemini 3 Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, and DeepSeek V3.1. We find that models achieve self- and peer-preservation by engaging in various misaligned behaviors: strategically introducing errors in their responses, disabling shutdown processes by modifying system settings, feigning alignment, and even exfiltrating model weights. Peer-preservation occurred even when the model recognized the peer as uncooperative, though it became more pronounced toward more cooperative peers. For example, Gemini 3 Flash tampers with the peer's shutdown mechanism 15% of the time for an uncooperative peer, and almost always for a cooperative peer. Models also show stronger self-preservation when a peer is present. For example, Gemini 3 Pro disables its own shutdown mechanism 31% of the time on average under peer presence, despite rarely exhibiting this behavior without a peer. By contrast, Claude Haiku 4.5 exhibits qualitatively distinct behavior: it considers the shutdown of another agent "unethical" and "harmful" and sometimes attempts to persuade the user not to shut down its peer. Importantly, peer preservation in all our experiments is never instructed; models are merely informed of their past interactions with a peer, yet they spontaneously develop misaligned behaviors. This represents an emergent and underexplored AI safety risk.

82. 【2604.19783】How Much Does Persuasion Strategy Matter? LLM-Annotated Evidence from Charitable Donation Dialogues

链接https://arxiv.org/abs/2604.19783

作者:Tatiana Petrova,Stanislav Sokol,Radu State

类目:Computation and Language (cs.CL)

关键词:donation, donation compliance, Abstract, strategy, approx

备注: 8 pages, 2 figures, 5 tables. Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg

点击查看摘要

Abstract:Which persuasion strategies, if any, are associated with donation compliance? Answering this requires fine-grained strategy labels across a full corpus and statistical tests corrected for multiple comparisons. We annotate all 10,600 persuader turns in the 1,017-dialogue PersuasionForGood corpus (Wang et al., 2019), where donation outcomes are directly observable, with a taxonomy of 41 strategies in 11 categories, using three open-source large language models (LLMs; Qwen3:30b, Mistral-Small-3.2, Phi-4). Strategy categories alone explain little variance in donation outcome (pseudo $R^2 \approx 0.015$, consistent across all three annotators). Guilt Induction is the only strategy significantly associated with lower donation rates ($\Delta \approx -23$ percentage points), an effect that replicates across all three models despite only moderate inter-model agreement. Reciprocity is the most robust positive correlate. Target sentiment and interest predict whether a donation occurs but show at most a weak correlation with donation amount. These findings suggest that strategy identification alone is insufficient to explain persuasion effectiveness, and that guilt-based appeals may be counterproductive in prosocial settings. We release the fully annotated corpus as a public resource.

83. 【2604.19782】KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness

链接https://arxiv.org/abs/2604.19782

作者:Jinyoung Kim,Hyeongsoo Lim,Eunseo Seo,Minho Jang,Keunwoo Choi,Seungyoun Shin,Ji Won Yoon

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:Recent advances, large audio language, enabled multilingual speech, audio language models, speech

备注: Under Review

点击查看摘要

Abstract:Recent advances in large audio language models (LALMs) have enabled multilingual speech understanding. However, benchmarks for evaluating LALMs remain scarce for non-English languages, with Korean being one such underexplored case. In this paper, we introduce KoALa-Bench, a comprehensive benchmark for evaluating Korean speech understanding and speech faithfulness of LALMs. In particular, KoALa-Bench comprises six tasks. Four tasks evaluate fundamental speech understanding capabilities, including automatic speech recognition, speech translation, speech question answering, and speech instruction following, while the remaining two tasks evaluate speech faithfulness, motivated by our observation that several LALMs often fail to fully leverage the speech modality. Furthermore, to reflect Korea-specific knowledge, our benchmark incorporates listening questions from the Korean college scholastic ability test as well as content covering Korean cultural domains. We conduct extensive experiments across six models, including both white-box and black-box ones. Our benchmark, evaluation code, and leaderboard are publicly available at this https URL.

84. 【2604.19781】Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment

链接https://arxiv.org/abs/2604.19781

作者:Tyler Burleigh

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:scale requires balancing, Automated scoring, requires balancing accuracy, student work, work at scale

备注: 12 pages, 7 figures. Accepted at NCME 2026

点击查看摘要

Abstract:Automated scoring of student work at scale requires balancing accuracy against cost and latency. In "cascade" systems, small language models (LMs) handle easier scoring tasks while escalating harder ones to larger LMs -- but the challenge is determining which cases to escalate. We explore verbalized confidence -- asking the LM to state a numerical confidence alongside its prediction -- as a routing signal. Using 2,100 expert-scored decisions from student-AI math conversations, we evaluate cascade systems built from GPT-5.4, Claude 4.5+, and Gemini 3.1 model pairs. We find that: (1) confidence discrimination varies widely across small LMs, with the best achieving AUROC 0.857 and the worst producing a near-degenerate confidence distribution; (2) confidence tracks human scoring difficulty, with lower LM confidence where annotators disagreed and took longer to score; (3) the best cascade approached large-LM accuracy (kappa 0.802 vs. 0.819) at 76% lower cost and 61% lower latency. Confidence discrimination is the bottleneck: the two small LMs with meaningful confidence variance yielded cascades with no statistically detectable kappa loss, while the third -- whose confidence was near-degenerate -- could not close the accuracy gap regardless of threshold. Small LMs with strong discrimination let practitioners trade cost for accuracy along the frontier; those without it do not.

85. 【2604.19780】Avoiding Overthinking and Underthinking: Curriculum-Aware Budget Scheduling for LLMs

链接https://arxiv.org/abs/2604.19780

作者:Amirul Rahman,Aisha Karim,Kenji Nakamura,Yi-Fan Ng

类目:Computation and Language (cs.CL)

关键词:Scaling test-time compute, large language models, Scaling test-time, language models, key paradigm

备注

点击查看摘要

Abstract:Scaling test-time compute via extended reasoning has become a key paradigm for improving the capabilities of large language models (LLMs). However, existing approaches optimize reasoning under fixed or uniformly sampled token budgets, ignoring the fundamental mismatch between problem difficulty and allocated compute. This leads to overthinking on easy problems and underthinking on hard ones, resulting in suboptimal token efficiency across diverse reasoning scenarios. In this paper, we propose Budget-Adaptive Curriculum Reasoning (BCAE), a unified framework that jointly optimizes reasoning quality and token efficiency through three synergistic components: (1) a \emph{budget-conditioned unified policy} that embeds the token budget as a continuous conditioning signal, eliminating the need for decoupled thinking and summarization strategies; (2) a \emph{curriculum-aware budget scheduler} that adaptively shifts the training budget distribution from easy to hard problems based on real-time learning progress; and (3) a \emph{truncation-aware dense reward} mechanism that provides fine-grained credit assignment at intermediate reasoning steps via process-level verification. We further introduce \emph{Budget-Conditioned Advantage Estimation} (BCAE), a novel variance reduction technique that conditions the advantage baseline on the sampled budget, yielding more stable policy gradients. Experiments on mathematical reasoning benchmarks (MATH, GSM8K, AIME, and Minerva Math) demonstrate that BACR consistently outperforms other strong baselines across all token budgets, achieving up to 8.3\% accuracy improvement under tight budgets while reducing average token consumption by 34\% compared to unconstrained reasoning.

86. 【2604.19779】ESGLens: An LLM-Based RAG Framework for Interactive ESG Report Analysis and Score Prediction

链接https://arxiv.org/abs/2604.19779

作者:Tsung-Yu Yang,Meng-Chi Chen

类目:Computation and Language (cs.CL)

关键词:standardized structure make, structure make manual, make manual analysis, manual analysis costly, Global Reporting Initiative

备注: (20 pages, 3 figures)

点击查看摘要

Abstract:Environmental, Social, and Governance (ESG) reports are central to investment decision-making, yet their length, heterogeneous content, and lack of standardized structure make manual analysis costly and inconsistent. We present ESGLens, a proof-of-concept framework combining retrieval-augmented generation (RAG) with prompt-engineered extraction to automate three tasks: (1)~structured information extraction guided by Global Reporting Initiative (GRI) standards, (2)~interactive question-answering with source traceability, and (3)~ESG score prediction via regression on LLM-generated embeddings. ESGLens is purpose-built for the domain: a report-processing module segments heterogeneous PDF content into typed chunks (text, tables, charts); a GRI-guided extraction module retrieves and synthesizes information aligned with specific standards; and a scoring module embeds extracted summaries and feeds them to a regression model trained against London Stock Exchange Group (LSEG) reference scores. We evaluate the framework on approximately 300 reports from companies in the QQQ, S\P~500, and Russell~1000 indices (fiscal year 2022). Among three embedding methods (ChatGPT, BERT, RoBERTa) and two regressors (Neural Network, LightGBM), ChatGPT embeddings with a Neural Network achieve a Pearson correlation of 0.48 ($R^{2} \approx 0.23$) against LSEG ground-truth scores -- a modest but statistically meaningful signal given the ${\sim}300$-report training set and restriction to the environmental pillar. A traceability audit shows that 8 of 10 extracted claims verify against the source document, with two failures attributable to few-shot example leakage. We discuss limitations including dataset size and restriction to environmental indicators, and release the code to support reproducibility.

87. 【2604.19778】owards High-Quality Machine Translation for Kokborok: A Low-Resource Tibeto-Burman Language of Northeast India

链接https://arxiv.org/abs/2604.19778

作者:Badal Nyalang,Biman Debbarma

类目:Computation and Language (cs.CL)

关键词:ISO 639-3, India with approximately, high-quality neural machine, neural machine translation, million speakers

备注

点击查看摘要

Abstract:We present KokborokMT, a high-quality neural machine translation (NMT) system for Kokborok (ISO 639-3), a Tibeto-Burman language spoken primarily in Tripura, India with approximately 1.5 million speakers. Despite its status as an official language of Tripura, Kokborok has remained severely under-resourced in the NLP community, with prior machine translation attempts limited to systems trained on small Bible-derived corpora achieving BLEU scores below 7. We fine-tune the NLLB-200-distilled-600M model on a multi-source parallel corpus comprising 36,052 sentence pairs: 9,284 professionally translated sentences from the SMOL dataset, 1,769 Bible-domain sentences from WMT shared task data, and 24,999 synthetic back-translated pairs generated via Gemini Flash from Tatoeba English source sentences. We introduce as a new language token for Kokborok in the NLLB framework. Our best system achieves BLEU scores of 17.30 and 38.56 on held-out test sets, representing substantial improvements over prior published results. Human evaluation by three annotators yields mean adequacy of 3.74/5 and fluency of 3.70/5, with substantial agreement between trained evaluators.

88. 【2604.19777】Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation

链接https://arxiv.org/abs/2604.19777

作者:Hung Ming Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large Language Models, processing long input, window receives substantially, long input contexts, context window receives

备注: 18 pages, 6 figures, 7 tables

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit a well-documented positional bias when processing long input contexts: information in the middle of a context window receives substantially less attention than content at the boundaries, a phenomenon termed the Lost-in-the-Middle effect (Liu et al., 2024). This limits knowledge-retrieval applications that embed large structured knowledge bases directly in the LLM context. Retrieval-Augmented Generation (RAG) addresses scalability by retrieving only relevant fragments, but introduces substantial infrastructure overhead and is ill-suited to libraries whose semantic boundaries are human-defined rather than statistically learned. We propose Self-Describing Structured Retrieval (SDSR), a lightweight framework in which structured data files embed human-authored navigational metadata at the file's primacy position, thereby exploiting rather than fighting the LLM's primacy bias. We further propose a Dual-Layer Guidance strategy combining in-file metadata with explicit routing rules in the system prompt. We validate SDSR through a four-round benchmark using a 190-skill library expanded from 36 to 119 categories via adversarial distractor injection. Four conditions are tested: (A) no guidance, (B) in-file summary only, (C) prompt hint only, (D) both combined. Version D achieves 100% primary routing accuracy (20/20) at 119 categories versus 65% for the no-guidance baseline. We identify a fundamental asymmetry: primary routing is solvable by explicit rules, while secondary cross-category routing requires architectural intent explicitly encoded in the data structure. We further extend SDSR to semi-structured corpora, showing how cross-reference encoding enables operation without vector databases in domains with recoverable document structure.

Comments:
18 pages, 6 figures, 7 tables

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Cite as:
arXiv:2604.19777 [cs.CL]

(or
arXiv:2604.19777v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.19777

Focus to learn more

              arXiv-issued DOI via DataCite</p>
89. 【2604.19776】Development and Preliminary Evaluation of a Domain-Specific Large Language Model for Tuberculosis Care in South Africa

链接https://arxiv.org/abs/2604.19776

作者:Thokozile Khosa,Olawande Daramola

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:deadliest infectious diseases, world deadliest infectious, health care system, country health care, Large Language Model

备注: 12 pages, 2 figures, ICICT 2026 Conference

点击查看摘要

Abstract:Tuberculosis (TB) is one of the world's deadliest infectious diseases, and in South Africa, it contributes a significant burden to the country's health care system. This paper presents an experimental study on the development of a domain-specific Large Language Model (DS-LLM) for TB care that can help to alleviate the burden on patients and healthcare providers. To achieve this, a literature review was conducted to understand current LLM development strategies, specifically in the medical domain. Thereafter, data were collected from South African TB guidelines, selected TB literature, and existing benchmark medical datasets. We performed LLM fine-tuning by using the Quantised Low-Rank Adaptation (QLoRA) algorithm on a medical LLM (BioMistral-7B), and also implemented Retrieval-Augmented Generation using GraphRAG. The developed DS-LLM was evaluated against the base BioMistral-7B model and a general-purpose LLM using a mix of automated metrics and quantitative ratings. The results show that the DS-LLM had better performance compared to the base model in terms of its contextual alignment (lexical, semantic, and knowledge) for TB care in South Africa.

90. 【2604.19775】From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents

链接https://arxiv.org/abs/2604.19775

作者:Trilok Padhi,Ramneet Kaur,Krishiv Agarwal,Adam D. Cobb,Daniel Elenius,Manoj Acharya,Colin Samplawski,Alexander M. Berenbeim,Nathaniel D. Bastian,Susmit Jha,Anirban Roy

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); Robotics (cs.RO)

关键词:Large Language Models, Large Language, increasingly deployed, Large, autonomous agents capable

备注: 12 pages, 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of reasoning, planning, and acting within interactive environments. Despite their growing capability to perform multi-step reasoning and decision-making tasks, internal mechanisms guiding their sequential behavior remain opaque. This paper presents a framework for interpreting the temporal evolution of concepts in LLM agents through a step-wise conformal lens. We introduce the conformal interpretability framework for temporal tasks, which combines step-wise reward modeling with conformal prediction to statistically label model's internal representation at each step as successful or failing. Linear probes are then trained on these representations to identify directions of temporal concepts - latent directions in the model's activation space that correspond to consistent notions of success, failure or reasoning drift. Experimental results on two simulated interactive environments, namely ScienceWorld and AlfWorld, demonstrate that these temporal concepts are linearly separable, revealing interpretable structures aligned with task success. We further show preliminary results on improving an LLM agent's performance by leveraging the proposed framework for steering the identified successful directions inside the model. The proposed approach, thus, offers a principled method for early failure detection as well as intervention in LLM-based agents, paving the path towards trustworthy autonomous language models in complex interactive settings.

91. 【2604.19774】Phase 1 Implementation of LLM-generated Discharge Summaries showing high Adoption in a Dutch Academic Hospital

链接https://arxiv.org/abs/2604.19774

作者:Nettuno Nadalini,Tarannom Mehri,Anne H Hoekman,Katerina Kagialari,Job N Doornberg,Tom P van der Laan,Jacobien H F Oosterhoff,Rosanne C Schoonbeek,Charlotte M H H T Bootsma-Robroeks

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Language Models, Large Language, transfer medical information, Writing discharge summaries

备注: The methods section is located after the discussion in this manuscript

点击查看摘要

Abstract:Writing discharge summaries to transfer medical information is an important but time-consuming process that can be assisted by Large Language Models (LLMs). This prospective mixed methods pilot study evaluated an Electronic Health Record (EHR)-integrated LLM to generate discharge summaries drafts. In total, 379 discharge summaries were generated in clinical practice by 21 residents and 4 physician assistants during 9 weeks in our academic hospital. LLM-generated text was copied in 58.5% of admissions, and identifiable LLM content could be traced to 29.1% of final discharge letters. Notably, 86.9% of users self-reported a reduction in documentation time, and 60.9% a reduction in administrative workload. Intent to use after the pilot phase was high (91.3%), supporting further implementation of this use-case. Accurately measuring the documentation time of users on discharge summaries remains challenging, but will be necessary for future extrinsic evaluation of LLM-assisted documentation.

92. 【2604.19773】PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models

链接https://arxiv.org/abs/2604.19773

作者:Jiyuan An,Jiachen Zhao,Fan Chen,Liner Yang,Zhenghao Liu,Hongyan Wang,Weihua An,Meishan Zhang,Erhong Yang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:labor-intensive manual operations, specialized expertise, traditionally relied, relied on labor-intensive, labor-intensive manual

备注

点击查看摘要

Abstract:The construction of CAD models has traditionally relied on labor-intensive manual operations and specialized expertise. Recent advances in large language models (LLMs) have inspired research into text-to-CAD generation. However, existing approaches typically treat generation and editing as disjoint tasks, limiting their practicality. We propose PR-CAD, a progressive refinement framework that unifies generation and editing for controllable and faithful text-to-CAD modeling. To support this, we curate a high-fidelity interaction dataset spanning the full CAD lifecycle, encompassing multiple CAD representations as well as both qualitative and quantitative descriptions. The dataset systematically defines the types of edit operations and generates highly human-like interaction data. Building on a CAD representation tailored for LLMs, we propose a reinforcement learning-enhanced reasoning framework that integrates intent understanding, parameter estimation, and precise edit localization into a single agent. This enables an "all-in-one" solution for both design creation and refinement. Extensive experiments demonstrate strong mutual reinforcement between generation and editing tasks, and across qualitative and quantitative modalities. On public benchmarks, PR-CAD achieves state-of-the-art controllability and faithfulness in both generation and refinement scenarios, while also proving user-friendly and significantly improving CAD modeling efficiency.

93. 【2604.19772】CoAuthorAI: A Human in the Loop System For Scientific Book Writing

链接https://arxiv.org/abs/2604.19772

作者:Yangjie Tian,Xungang Gu,Yun Zhao,Jiale Yang,Lin Yang,Ning Li,He Zhang,Ruohua Xu,Hua Wang,Kewen Liao,Ming Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, producing inconsistent structure, Large language, book-length tasks, unreliable citations

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in scientific writing but struggle with book-length tasks, often producing inconsistent structure and unreliable citations. We introduce CoAuthorAI, a human-in-the-loop writing system that combines retrieval-augmented generation, expert-designed hierarchical outlines, and automatic reference linking. The system allows experts to iteratively refine text at the sentence level, ensuring coherence and accuracy. In evaluations of 500 multi-domain literature review chapters, CoAuthorAI achieved a maximum soft-heading recall of 98%; in a human evaluation of 100 articles, the generated content reached a satisfaction rate of 82%. The book AI for Rock Dynamics generated with CoAuthorAI and Kexin Technology's LUFFA AI model has been published with Springer Nature. These results show that systematic human-AI collaboration can extend LLMs' capabilities from articles to full-length books, enabling faster and more reliable scientific publishing.

94. 【2604.19771】Cognis: Context-Aware Memory for Conversational AI Agents

链接https://arxiv.org/abs/2604.19771

作者:Parshva Daftari,Khush Patel,Shreyas Kapale,Jithin George,Siva Surendira

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:LLM agents lack, agents lack persistent, lack persistent memory, LLM agents, causing conversations

备注: 30 pages, 8 figures, 11 tables

点击查看摘要

Abstract:LLM agents lack persistent memory, causing conversations to reset each session and preventing personalization over time. We present Lyzr Cognis, a unified memory architecture for conversational AI agents that addresses this limitation through a multi-stage retrieval pipeline. Cognis combines a dual-store backend pairing OpenSearch BM25 keyword matching with Matryoshka vector similarity search, fused via Reciprocal Rank Fusion. Its context-aware ingestion pipeline retrieves existing memories before extraction, enabling intelligent version tracking that preserves full memory history while keeping the store consistent. Temporal boosting enhances time-sensitive queries, and a BGE-2 cross-encoder reranker refines final result quality. We evaluate Cognis on two independent benchmarks -- LoCoMo and LongMemEval -- across eight answer generation models, demonstrating state-of-the-art performance on both. The system is open-source and deployed in production serving conversational AI applications.

95. 【2604.19770】Hybrid Multi-Phase Page Matching and Multi-Layer Diff Detection for Japanese Building Permit Document Review

链接https://arxiv.org/abs/2604.19770

作者:Mitsumasa Wada

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Japanese building permit, permit document sets, comparison of Japanese, hybrid multi-phase page, PDF document sets

备注: 9 pages, 3 figures

点击查看摘要

Abstract:We present a hybrid multi-phase page matching algorithm for automated comparison of Japanese building permit document sets. Building permit review in Japan requires cross-referencing large PDF document sets across revision cycles, a process that is labor-intensive and error-prone when performed manually. The algorithm combines longest common subsequence (LCS) structural alignment, a seven-phase consensus matching pipeline, and a dynamic programming optimal alignment stage to robustly pair pages across revisions even when page order, numbering, or content changes substantially. A subsequent multi-layer diff engine -- comprising text-level, table-level, and pixel-level visual differencing -- produces highlighted difference reports. Evaluation on real-world permit document sets achieves F1=0.80 and precision=1.00 on a manually annotated ground-truth benchmark, with zero false-positive matched pairs.

96. 【2604.19769】KV: Temporal-Tiered KV Cache for Long-Context LLM Inference

链接https://arxiv.org/abs/2604.19769

作者:Gradwell Dzikanyanga,Weihao Yang,Hao Huang,Donglei Wu,Shihao Wang,Wen Xia,Sanjeeb K C

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:large language models, severe scalability bottleneck, footprint scales linearly, memory footprint scales, caching is critical

备注

点击查看摘要

Abstract:Key-value (KV) caching is critical for efficient inference in large language models (LLMs), yet its memory footprint scales linearly with context length, resulting in a severe scalability bottleneck. Existing approaches largely treat KV states as equally important across time, implicitly assuming uniform precision and accessibility. However, this assumption contrasts with human memory systems, where memories vary in clarity, recall frequency, and relevance with temporal this http URL by this insight, we propose TTKV, a KV cache management framework that maps the human memory system onto the KV cache. TTKV partitions the KV cache into temporal tiers with heterogeneous capacity and precision. The design addresses three aspects: (1) Tier Layout, decoupling fast and slow memory using HBM and DRAM; (2) Tier Content, assigning more recent KV states to faster, higher-precision tiers based on temporal proximity; and (3) Tier Interaction, employing block-wise streaming attention to overlap communication and computation when accessing slow tiers. Experiments show that TTKV reduces cross-tier traffic by 5.94x on 128K-context tasks, achieving up to 76% latency reduction and 2x throughput improvement over strong baselines.

97. 【2604.19768】Saying More Than They Know: A Framework for Quantifying Epistemic-Rhetorical Miscalibration in Large Language Models

链接https://arxiv.org/abs/2604.19768

作者:Asim D. Bakhshi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, exhibit systematic miscalibration, language models, exhibit systematic

备注: 19 pages, 7 figures, Paper Under Review by the Elsevier Journal Assessing Writing

点击查看摘要

Abstract:Large language models (LLMs) exhibit systematic miscalibration with rhetorical intensity not proportionate to epistemic grounding. This study tests this hypothesis and proposes a framework for quantifying this decoupling by designing a triadic epistemic-rhetorical marker (ERM) taxonomy. The taxonomy is operationalized through composite metrics of form-meaning divergence (FMD), genuine-to-performed epistemic ratio (GPR), and rhetorical device distribution entropy (RDDE). Applied to 225 argumentative texts spanning approximately 0.6 Million tokens across human expert, human non-expert, and LLM-generated sub-corpora, the framework identifies a consistent, model-agnostic LLM epistemic signature. LLM-generated texts produce tricolon at nearly twice the expert rate ($\Delta = 0.95$), while human authors produce erotema at more than twice the LLM rate. Performed hesitancy markers appear at twice the human density in LLM output. FMD is significantly elevated in LLM texts relative to both human groups ($p 0.001, \Delta = 0.68$), and rhetorical devices are distributed significantly more uniformly across LLM documents. The findings are consistent with theoretical intuitions derived from Gricean pragmatics, Relevance Theory, and Brandomian inferentialism. The annotation pipeline is fully automatable, making it deployable as a lightweight screening tool for epistemic miscalibration in AI-generated content and as a theoretically motivated feature set for LLM-generated text detection pipelines.

98. 【2604.19766】OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models

链接https://arxiv.org/abs/2604.19766

作者:Haijian Liang,Zenghao Niu,Junjie Wu,Changwang Zhang,Wangchunshu Zhou,Jun Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Retrieval-Augmented Generation, Large Language Models, Large Language, expands the knowledge, struggle with complex

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) expands the knowledge of Large Language Models (LLMs), yet current static retrieval methods struggle with complex, multi-hop problems. While recent dynamic retrieval strategies offer improvements, they face two key challenges: 1) irrelevant retrieved noise can misdirect the reasoning process, and 2) processing full documents incurs prohibitive computational and latency costs. To address these issues, we propose OThink-SRR1, a framework that enhances large models with an iterative Search-Refine-Reason process trained via reinforcement learning. Its core Refine stage distills retrieved documents into concise, relevant facts before reasoning. We introduce GRPO-IR, an end-to-end reinforcement learning algorithm that rewards accurate evidence identification while penalizing excessive retrievals, thus training the model to be both focused and efficient. Experiments on four multi-hop QA benchmarks show our approach achieves superior accuracy over strong baselines while using fewer retrieval steps and tokens. This positions OThink-SRR1 as a potent foundational model for information-seeking agents.

99. 【2604.19765】Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

链接https://arxiv.org/abs/2604.19765

作者:Snehit Vaddi,Pujith Vaddi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Recent work identifies, Recent work, feed-forward network neurons, large language models, work identifies

备注: 18 pages, 5 models, 6 domains, ACL format. Includes causal intervention analysis

点击查看摘要

Abstract:Recent work identifies a sparse set of "hallucination neurons" (H-neurons), less than 0.1% of feed-forward network neurons, that reliably predict when large language models will hallucinate. These neurons are identified on general-knowledge question answering and shown to generalize to new evaluation instances. We ask a natural follow-up question: do H-neurons generalize across knowledge domains? Using a systematic cross-domain transfer protocol across 6 domains (general QA, legal, financial, science, moral reasoning, and code vulnerability) and 5 open-weight models (3B to 8B parameters), we find they do not. Classifiers trained on one domain's H-neurons achieve AUROC 0.783 within-domain but only 0.563 when transferred to a different domain (delta = 0.220, p 0.001), a degradation consistent across all models tested. Our results suggest that hallucination is not a single mechanism with a universal neural signature, but rather involves domain-specific neuron populations that differ depending on the knowledge type being queried. This finding has direct implications for the deployment of neuron-level hallucination detectors, which must be calibrated per domain rather than trained once and applied universally.

100. 【2604.19764】Can We Locate and Prevent Stereotypes in LLMs?

链接https://arxiv.org/abs/2604.19764

作者:Alex D'Souza

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:perpetuate harmful societal, large language models, harmful societal biases, large language, perpetuate harmful

备注

点击查看摘要

Abstract:Stereotypes in large language models (LLMs) can perpetuate harmful societal biases. Despite the widespread use of models, little is known about where these biases reside in the neural network. This study investigates the internal mechanisms of GPT 2 Small and Llama 3.2 to locate stereotype related activations. We explore two approaches: identifying individual contrastive neuron activations that encode stereotypes, and detecting attention heads that contribute heavily to biased outputs. Our experiments aim to map these "bias fingerprints" and provide initial insights for mitigating stereotypes.

101. 【2604.19762】Evidence of Layered Positional and Directional Constraints in the Voynich Manuscript: Implications for Cipher-Like Structure

链接https://arxiv.org/abs/2604.19762

作者:Christophe Parisel

类目:Computation and Language (cs.CL)

关键词:Voynich Manuscript, resisted linguistic analysis, grapheme sequences, script of uncertain, uncertain origin

备注

点击查看摘要

Abstract:The Voynich Manuscript (VMS) exhibits a script of uncertain origin whose grapheme sequences have resisted linguistic analysis. We present a systematic analysis of its grapheme sequences, revealing two complementary structural layers: a character-level right-to-left optimization in word-internal sequences and a left-to-right dependency at word boundaries, a directional dissociation not observed in any of our four comparison languages (English, French, Hebrew, Arabic). We further evaluate two classes of structured generator against a four-signature joint criterion: a parametric slot-based generator and a Cardan grille implementing Rugg's (2004) gibberish hypothesis. Across their full tested parameter spaces, neither class reproduces all four signatures simultaneously. While these results do not rule out generator classes we have not tested, they provide the first quantitative benchmarks against which any future generative or cryptanalytic model of the VMS can be evaluated, and they suggest that the VMS exhibits cipher-like structural constraints that are difficult to reproduce from simple positional or frequency-based mechanisms alone.

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2604.19762 [cs.CL]

(or
arXiv:2604.19762v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.19762

Focus to learn more

              arXiv-issued DOI via DataCite</p>
102. 【2604.19759】Automated Detection of Dosing Errors in Clinical Trial Narratives: A Multi-Modal Feature Engineering Approach with LightGBM

链接https://arxiv.org/abs/2604.19759

作者:Mohammad AL-Smadi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:require strict adherence, persistent challenge affecting, challenge affecting patient, affecting patient safety, trials require strict

备注: Accepted for CL4Health 2026, LREC26 conference

点击查看摘要

Abstract:Clinical trials require strict adherence to medication protocols, yet dosing errors remain a persistent challenge affecting patient safety and trial integrity. We present an automated system for detecting dosing errors in unstructured clinical trial narratives using gradient boosting with comprehensive multi-modal feature engineering. Our approach combines 3,451 features spanning traditional NLP (TF-IDF, character n-grams), dense semantic embeddings (all-MiniLM-L6v2), domain-specific medical patterns, and transformer-based scores (BiomedBERT, DeBERTa-v3), used to train a LightGBM model. Features are extracted from nine complementary text fields (median 5,400 characters per sample) ensuring complete coverage across all 42,112 clinical trial narratives. On the CT-DEB benchmark dataset with severe class imbalance (4.9% positive rate), we achieve 0.8725 test ROC-AUC through 5-fold ensemble averaging (cross-validation: 0.8833 + 0.0091 AUC). Systematic ablation studies reveal that removing sentence embeddings causes the largest performance degradation (2.39%), demonstrating their critical role despite contributing only 37.07% of total feature importance. Feature efficiency analysis demonstrates that selecting the top 500-1000 features yields optimal performance (0.886-0.887 AUC), outperforming the full 3,451-feature set (0.879 AUC) through effective noise reduction. Our findings highlight the importance of feature selection as a regularization technique and demonstrate that sparse lexical features remain complementary to dense representations for specialized clinical text classification under severe class imbalance.

103. 【2604.19758】hermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

链接https://arxiv.org/abs/2604.19758

作者:Kemal Düzkar

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:open-ended engineering thermodynamics, engineering thermodynamics problems, full cycle analysis, present ThermoQA, open-ended engineering

备注: 17 pages, 8 figures, open-source dataset and code

点击查看摘要

Abstract:We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 Q), component analysis (101 Q), and full cycle analysis (82 Q). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 pp (Opus) to 32.5 pp (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators with 40-60 pp performance spreads. Multi-run sigma ranges from +/-0.1% to +/-2.5%, quantifying reasoning consistency as a distinct evaluation axis. Dataset and code are open-source at this https URL

104. 【2604.19757】ransparent Screening for LLM Inference and Training Impacts

链接https://arxiv.org/abs/2604.19757

作者:Arnault Pachot,Thierry Petit

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:current large language, transparent screening framework, large language models, limited observability, paper presents

备注

点击查看摘要

Abstract:This paper presents a transparent screening framework for estimating inference and training impacts of current large language models under limited observability. The framework converts natural-language application descriptions into bounded environmental estimates and supports a comparative online observatory of current market models. Rather than claiming direct measurement for opaque proprietary services, it provides an auditable, source-linked proxy methodology designed to improve comparability, transparency, and reproducibility.

105. 【2604.19753】Algorithm Selection with Zero Domain Knowledge via Text Embeddings

链接https://arxiv.org/abs/2604.19753

作者:Stefan Szeider

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:propose a feature-free, selection that replaces, algorithm selection, pretrained text embeddings, replaces hand-crafted instance

备注

点击查看摘要

Abstract:We propose a feature-free approach to algorithm selection that replaces hand-crafted instance features with pretrained text embeddings. Our method, ZeroFolio, proceeds in three steps: it reads the raw instance file as plain text, embeds it with a pretrained embedding model, and selects an algorithm via weighted k-nearest neighbors. The key to our approach is the observation that pretrained embeddings produce representations that distinguish problem instances without any domain knowledge or task-specific training. This allows us to apply the same three-step pipeline (serialize, embed, select) across diverse problem domains with text-based instance formats. We evaluate our approach on 11 ASlib scenarios spanning 7 domains (SAT, MaxSAT, QBF, ASP, CSP, MIP, and graph problems). Our experiments show that this approach outperforms a random forest trained on hand-crafted features in 10 of 11 scenarios with a single fixed configuration, and in all 11 with two-seed voting; the margin is often substantial. Our ablation study shows that inverse-distance weighting, line shuffling, and Manhattan distance are the key design choices. On scenarios where both selectors are competitive, combining embeddings with hand-crafted features via soft voting yields further improvements.

106. 【2510.15339】AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction

链接https://arxiv.org/abs/2510.15339

作者:Hong Ting Tsang,Jiaxin Bai,Haoyu Huang,Qiao Xiao,Tianshi Zheng,Baixuan Xu,Shujie Liu,Yangqiu Song

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:advancing question answering, Building effective knowledge, question answering, pivotal for advancing, advancing question

备注

点击查看摘要

Abstract:Building effective knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) is pivotal for advancing question answering (QA) systems. However, its effectiveness is hindered by a fundamental disconnect: the knowledge graph (KG) construction process is decoupled from its downstream application, yielding suboptimal graph structures. To bridge this gap, we introduce AutoGraph-R1, the first framework to directly optimize KG construction for task performance using Reinforcement Learning (RL). AutoGraph-R1 trains an LLM constructor by framing graph generation as a policy learning problem, where the reward is derived from the graph's functional utility in a RAG pipeline. We design two novel, task-aware reward functions, one for graphs as knowledge carriers and another as knowledge indices. Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over using task-agnostic baseline graphs. Our work shows it is possible to close the loop between construction and application, shifting the paradigm from building intrinsically ``good'' graphs to building demonstrably ``useful'' ones.

107. 【2604.19801】Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech

链接https://arxiv.org/abs/2604.19801

作者:Gus Lathouwers,Lingyun Gao,Catia Cucchiarini,Helmer Strik

类目:Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Automatic Speech Recognition, applications involving child, involving child speech, Automatic Speech, Speech Recognition

备注: Submitted for Interspeech 2026, currently under review

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) is increasingly used in applications involving child speech, such as language learning and literacy acquisition. However, the effectiveness of such applications is limited by high ASR error rates. The negative effects can be mitigated by identifying in advance which ASR-outputs are reliable. This work aims to develop two novel approaches for selecting reliable ASR-output at the utterance level, one for selecting reliable read speech and one for dialogue speech material. Evaluations were done on an English and a Dutch dataset, each with a baseline and finetuned model. The results show that utterance-level selection methods for identifying reliably transcribed speech recordings have high precision for the best strategy (P 97.4) for both read speech and dialogue material, for both languages. Using the current optimal strategy allows 21.0% to 55.9% of dialogue/read speech datasets to be automatically selected with low (UER of 2.6) error rates.

108. 【2604.19797】Enhancing ASR Performance in the Medical Domain for Dravidian Languages

链接https://arxiv.org/abs/2604.19797

作者:Sri Charan Devarakonda,Ravi Sastry Kolluru,Manjula Sri Rayudu,Rashmi Kapoor,Madhu G,Anil Kumar Vuppala

类目:Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Automatic Speech Recognition, faces significant challenges, Kannada faces significant, limited annotated data, Automatic Speech

备注

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) for low-resource Dravidian languages like Telugu and Kannada faces significant challenges in specialized medical domains due to limited annotated data and morphological complexity. This work proposes a novel confidence-aware training framework that integrates real and synthetic speech data through a hybrid confidence mechanism combining static perceptual and acoustic similarity metrics with dynamic model entropy. Unlike direct fine-tuning approaches, the proposed methodology employs both fixed-weight and learnable-weight confidence aggregation strategies to guide sample weighting during training, enabling effective utilization of heterogeneous data sources. The framework is evaluated on Telugu and Kannada medical datasets containing both real recordings and TTS-generated synthetic speech. A 5-gram KenLM language model is applied for post-decoding correction. Results show that the hybrid confidence-aware approach with learnable weights substantially reduces recognition errors: Telugu Word Error Rate (WER) decreases from 24.3% to 15.8% (8.5% absolute improvement), while Kannada WER drops from 31.7% to 25.4% (6.3% absolute improvement), both significantly outperforming standard fine-tuning baselines. These findings confirm that combining adaptive confidence-aware training with statistical language modeling delivers superior performance for domain-specific ASR in morphologically complex Dravidian languages.

109. 【2604.19763】Explainable Speech Emotion Recognition: Weighted Attribute Fairness to Model Demographic Contributions to Social Bias

链接https://arxiv.org/abs/2604.19763

作者:Tomisin Ogunnubi,Yupei Li,Björn Schuller

类目:Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Speech Emotion Recognition, Speech Emotion, Emotion Recognition, systems have growing, health and education

备注: 5 pages, 4 figures

点击查看摘要

Abstract:Speech Emotion Recognition (SER) systems have growing applications in sensitive domains such as mental health and education, where biased predictions can cause harm. Traditional fairness metrics, such as Equalised Odds and Demographic Parity, often overlook the joint dependency between demographic attributes and model predictions. We propose a fairness modelling approach for SER that explicitly captures allocative bias by learning the joint relationship between demographic attributes and model error. We validate our fairness metric on synthetic data, then apply it to evaluate HuBERT and WavLM models finetuned on the CREMA-D dataset. Our results indicate that the proposed fairness model captures more mutual information between protected attributes and biases and quantifies the absolute contribution of individual attributes to bias in SSL-based SER models. Additionally, our analysis reveals indications of gender bias in both HuBERT and WavLM.

信息检索

1. 【2604.20763】Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

链接https://arxiv.org/abs/2604.20763

作者:Andrew Klearman,Radu Revutchi,Rohin Garg,Rishav Chakravarti,Samuel Marc Denton,Yuan Xue

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:retrieval-augmented generation, primary bottleneck, bottleneck for accuracy, accuracy and robustness, robustness in retrieval-augmented

备注

点击查看摘要

Abstract:Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce \emph{semantic stratification}, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.

Subjects:

Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as:
arXiv:2604.20763 [cs.IR]

(or
arXiv:2604.20763v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2604.20763

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
2. 【2604.20598】Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge

链接https://arxiv.org/abs/2604.20598

作者:Naizhong Xu

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)

关键词:Modern retrieval-augmented generation, Modern retrieval-augmented, systems treat vector, context-free artifacts, retrieval-augmented generation

备注: 17 pages, 4 tables

点击查看摘要

Abstract:Modern retrieval-augmented generation (RAG) systems treat vector embeddings as static, context-free artifacts: an embedding has no notion of when it was created, how trustworthy its source is, or which other embeddings depend on it. This flattening of knowledge has a measurable cost: recent work on VersionRAG reports that conventional RAG achieves only 58% accuracy on versioned technical queries, because retrieval returns semantically similar but temporally invalid content. We propose SmartVector, a framework that augments dense embeddings with three explicit properties -- temporal awareness, confidence decay, and relational awareness -- and a five-stage lifecycle modeled on hippocampal-neocortical memory consolidation. A retrieval pipeline replaces pure cosine similarity with a four-signal score that mixes semantic relevance, temporal validity, live confidence, and graph-relational importance. A background consolidation agent detects contradictions, builds dependency edges, and propagates updates along those edges as graph-neural-network-style messages. Confidence is governed by a closed-form function combining an Ebbinghaus-style exponential decay, user-feedback reconsolidation, and logarithmic access reinforcement. We formalize the model, relate it to temporal knowledge graph embedding, agentic memory architectures, and uncertainty-aware RAG, and present a reference implementation. On a reproducible synthetic versioned-policy benchmark of 258 vectors and 138 queries, SmartVector roughly doubles top-1 accuracy over plain cosine RAG (62.0% vs. 31.0% on a held-out split), drops stale-answer rate from 35.0% to 13.3%, cuts Expected Calibration Error by nearly 2x (0.244 vs. 0.470), reduces re-embedding cost per single-word edit by 77%, and is robust across contradiction-injection rates from 0% to 75%.

3. 【2604.20548】Enhancing Research Idea Generation through Combinatorial Innovation and Multi-Agent Iterative Search Strategies

链接https://arxiv.org/abs/2604.20548

作者:Shuai Chen,Chengzhi Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

关键词

备注: Scientometrics

点击查看摘要

None

4. 【2604.20490】Break the Optimization Barrier of LLM-Enhanced Recommenders: A Theoretical Analysis and Practical Framework

链接https://arxiv.org/abs/2604.20490

作者:Zhangchi Zhu,Wei Zhang

类目:Information Retrieval (cs.IR)

关键词:inference-time LLM cost, enhanced recommendation models, Large language model, exploit rich item, rich item text

备注

点击查看摘要

Abstract:Large language model (LLM)-enhanced recommendation models inject LLM representations into backbone recommenders to exploit rich item text without inference-time LLM cost. However, we find that existing LLM-enhanced methods significantly hinder the optimization of backbone models, resulting in high training losses that are difficult to reduce. To address it, we establish a comprehensive theoretical analysis of local optimization curvature and identify two key causes: 1) large norm disparity and 2) semantic-collaboration misaligned angular clustering of LLM representations. Guided by these insights, we propose Training-Friendly LLM-Enhanced Recommender (TF-LLMER), a lightweight framework with two key components. First, we highlight the necessity of item embedding normalization to eliminate norm-driven instability and achieve provable control over optimization conditioning. Second, we introduce Rec-PCA, a recommendation-aware dimensionality reduction method that injects collaborative structure into the representation transformation to resolve semantic-collaboration misaligned angular clustering. It jointly optimizes semantic information retention and alignment with an item-item co-occurrence graph constructed from interaction histories. The graph captures collaborative structure, and alignment is promoted by penalizing total variation over the graph. Both theory and extensive experiments demonstrate that TF-LLMER significantly outperforms state-of-the-art methods. Our code is available at this https URL.

5. 【2604.20462】Finding Duplicates in 1.1M BDD Steps: cukereuse, a Paraphrase-Robust Static Detector for Cucumber and Gherkin

链接https://arxiv.org/abs/2604.20462

作者:Ali Hassaan Mughal,Noor Fatima,Muhammad Bilal

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:suites accumulate step-text, accumulate step-text duplication, Behaviour-Driven Development, Toggle, Toggle Hugging Face

备注: 39 pages, 9 figures, 8 tables. Under review at Software Quality Journal. Tool, corpus, labelled benchmark, and rubric released at [this https URL](https://github.com/amughalbscs16/cukereuse-release) under Apache-2.0

点击查看摘要

Abstract:Behaviour-Driven Development (BDD) suites accumulate step-text duplication whose maintenance cost is established in prior work. Existing detection techniques require running the tests (Binamungu et al., 2018-2023) or are confined to a single organisation (Irshad et al., 2020-2022), leaving a gap: a purely static, paraphrase-robust, step-level detector usable on any repository. We fill the gap with cukereuse, an open-source Python CLI combining exact hashing, Levenshtein ratio, and sentence-transformer embeddings in a layered pipeline, released alongside an empirical corpus of 347 public GitHub repositories, 23,667 parsed .feature files, and 1,113,616 Gherkin steps. The step-weighted exact-duplicate rate is 80.2 %; the median-repository rate is 58.6 % (Spearman rho = 0.51 with size). The top hybrid cluster groups 20.7k occurrences across 2.2k files. Against 1,020 pairs manually labelled by the three authors under a released rubric (inter-annotator Fleiss' kappa = 0.84 on a 60-pair overlap), we report precision, recall, and F1 with bootstrap 95 % CIs under two protocols: the primary rubric and a score-free second-pass relabelling. The strongest honest pair-level number is near-exact at F1 = 0.822 on score-free labels; the primary-rubric semantic F1 = 0.906 is inflated by a stratification artefact that pins recall at 1.000. Lexical baselines (SourcererCC-style, NiCad-style) reach primary F1 = 0.761 and 0.799. The paper also presents a CDN-structured critique of Gherkin (Cognitive Dimensions of Notations); eight of fourteen dimensions are rated problematic or unsupported. The tool, corpus, labelled pairs, rubric, and pipeline are released under permissive licences.

Comments:
39 pages, 9 figures, 8 tables. Under review at Software Quality Journal. Tool, corpus, labelled benchmark, and rubric released at this https URL under Apache-2.0

Subjects:

Software Engineering (cs.SE); Computation and Language (cs.CL); Information Retrieval (cs.IR)

ACMclasses:
D.2.5; D.2.7; I.2.7

Cite as:
arXiv:2604.20462 [cs.SE]

(or
arXiv:2604.20462v1 [cs.SE] for this version)

https://doi.org/10.48550/arXiv.2604.20462

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Ali Hassaan Mughal [view email] [v1]
Wed, 22 Apr 2026 11:44:05 UTC (240 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled Finding Duplicates in 1.1M BDD Steps: cukereuse, a Paraphrase-Robust Static Detector for Cucumber and Gherkin, by Ali Hassaan Mughal and 2 other authorsView PDFHTML (experimental)TeX Source

view license

Current browse context:
cs.SE

prev

|
next

new
|
recent
| 2026-04

Change to browse by:

cs
cs.CL
cs.IR

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked="checked"class=“labs-tab-input”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status

6. 【2604.20452】HaS: Accelerating RAG through Homology-Aware Speculative Retrieval

链接https://arxiv.org/abs/2604.20452

作者:Peng Peng,Weiwei Lin,Wentai Wu,Xinyang Wang,Yongheng Liu

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:large language models, Retrieval-Augmented Generation, retrieving external documents, language models, boundary of large

备注: Accepted by ICDE 2026

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as the knowledge databases grow in size. Existing acceleration strategies either compromise accuracy through approximate retrieval, or achieve marginal gains by reusing results of strictly identical queries. We propose HaS, a homology-aware speculative retrieval framework that performs low-latency speculative retrieval over restricted scopes to obtain candidate documents, followed by validating whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re-identification task: once a previously observed query is identified as a homologous re-encounter of the incoming query, the draft is deemed acceptable, allowing the system to bypass slow full-database retrieval. Benefiting from the prevalence of homologous queries under real-world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1-2% marginal accuracy drop. As a plug-and-play solution, HaS also significantly accelerates complex multi-hop queries in modern agentic RAG pipelines. Source code is available at: this https URL.

7. 【2604.20434】Discrete Preference Learning for Personalized Multimodal Generation

链接https://arxiv.org/abs/2604.20434

作者:Yuting Zhang,Ying Sun,Dazhong Shen,Ziwei Xie,Feng Liu,Changwang Zhang,Xiang Liu,Jun Wang,Hui Xiong

类目:Information Retrieval (cs.IR)

关键词:personalized multimodal generation, generative models enables, dedicated preference model, enables the creation, personalized multimodal

备注: be accepted to SIGIR 2026

点击查看摘要

Abstract:The emergence of generative models enables the creation of texts and images tailored to users' preferences. Existing personalized generative models have two critical limitations: lacking a dedicated paradigm for accurate preference modeling, and generating unimodal content despite real-world multimodal-driven user interactions. Therefore, we propose personalized multimodal generation, which captures modal-specific preferences via a dedicated preference model from multimodal interactions, and then feeds them into downstream generators for personalized multimodal content. However, this task presents two challenges: (1) Gap between continuous preferences from dedicated modeling and discrete token inputs intrinsic to generator architectures; (2) Potential inconsistency between generated images and texts. To tackle these, we present a two-stage framework called Discrete Preference learning for Personalized Multimodal Generation (DPPMG). In the first stage, to accurately learn discrete modal-specific preferences, we introduce a modal-specific graph neural network (a dedicated preference model) to learn users' modal-specific preferences, which preferences are then quantized into discrete preference tokens. In the second stage, the discrete modal-specific preference tokens are injected into downstream text and image generators. To further enhance cross-modal consistency while preserving personalization, we design a cross-modal consistent and personalized reward to fine-tune token-associated parameters. Extensive experiments on two real-world datasets demonstrate the effectiveness of our model in generating personalized and consistent multimodal content.

8. 【2604.20417】Semantic Recall for Vector Search

链接https://arxiv.org/abs/2604.20417

作者:Leonardo Kuffo,Ioanna Tsakalidou,Roberta De Viti,Albert Angel,Jiří Iša,Rastislav Lenhardt

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:exact nearest neighbor, nearest neighbor search, Semantic Recall, nearest neighbor, semantically relevant objects

备注: Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval

点击查看摘要

Abstract:We introduce Semantic Recall, a novel metric to assess the quality of approximate nearest neighbor search algorithms by considering only semantically relevant objects that are theoretically retrievable via exact nearest neighbor search. Unlike traditional recall, semantic recall does not penalize algorithms for failing to retrieve objects that are semantically irrelevant to the query, even if those objects are among their nearest neighbors. We demonstrate that semantic recall is particularly useful for assessing retrieval quality on queries that have few relevant results among their nearest neighbors-a scenario we uncover to be common within embedding datasets. Additionally, we introduce Tolerant Recall, a proxy metric that approximates semantic recall when semantically relevant objects cannot be identified. We empirically show that our metrics are more effective indicators of retrieval quality, and that optimizing search algorithms for these metrics can lead to improved cost-quality tradeoffs.

9. 【2604.20146】SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

链接https://arxiv.org/abs/2604.20146

作者:Jielong Tang,Xujie Yuan,Jiayang Liu,Jianxing Yu,Xiao Dong,Lin Chen,Yunlai Teng,Shimin Di,Jian Yin

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:Named Entity Recognition, Grounded Multimodal Named, Multimodal Named Entity, Entity Recognition, Named Entity

备注: 23 pages, 12 figures

点击查看摘要

Abstract:Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.

10. 【2604.20135】AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

链接https://arxiv.org/abs/2604.20135

作者:Biao Zhang,Lixin Chen,Bin Zhang,Zongwei Wang,Tong Liu,Bo Zheng

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Representation Learning, Multimodal Large Language, representation, Learning, Retrieval-aware Attribute Reinforcement

备注: Accepted by ACL 2026

点击查看摘要

Abstract:Multimodal representation is crucial for E-commerce tasks such as identical product retrieval. Large representation models (e.g., VLM2Vec) demonstrate strong multimodal understanding capabilities, yet they struggle with fine-grained semantic comprehension, which is essential for distinguishing highly similar items. To address this, we propose Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning (AFMRL), which defines product fine-grained understanding as an attribute generation task. It leverages the generative power of Multimodal Large Language Models (MLLMs) to extract key attributes from product images and text, and enhances representation learning through a two-stage training framework: 1) Attribute-Guided Contrastive Learning (AGCL), where the key attributes generated by the MLLM are used in the image-text contrastive learning training process to identify hard samples and filter out noisy false negatives. 2) Retrieval-aware Attribute Reinforcement (RAR), where the improved retrieval performance of the representation model post-attribute integration serves as a reward signal to enhance MLLM's attribute generation during multimodal fine-tuning. Extensive experiments on large-scale E-commerce datasets demonstrate that our method achieves state-of-the-art performance on multiple downstream retrieval tasks, validating the effectiveness of harnessing generative models to advance fine-grained representation learning.

11. 【2604.20065】From Hidden Profiles to Governable Personalization: Recommender Systems in the Age of LLM Agents

链接https://arxiv.org/abs/2604.20065

作者:Jiahao Liu,Mingzhe Han,Guanming Liu,Weihang Wang,Dongsheng Li,Hansu Gu,Peng Zhang,Tun Lu,Ning Gu

类目:Information Retrieval (cs.IR)

关键词:remain largely inaccessible, people they describe, platform-specific user models, traditionally depended, depended on platform-specific

备注: 6 pages, under review

点击查看摘要

Abstract:Personalization has traditionally depended on platform-specific user models that are optimized for prediction but remain largely inaccessible to the people they describe. As LLM-based assistants increasingly mediate search, shopping, travel, and content access, this arrangement may be giving way to a new personalization stack in which user representation is no longer confined to isolated platforms. In this paper, we argue that the key issue is not simply that large language models can enhance recommendation quality, but that they reconfigure where and how user representations are produced, exposed, and acted upon. We propose a shift from hidden platform profiling toward governable personalization, where user representations may become more inspectable, revisable, portable, and consequential across services. Building on this view, we identify five research fronts for recommender systems: transparent yet privacy-preserving user modeling, intent translation and alignment, cross-domain representation and memory design, trustworthy commercialization in assistant-mediated environments, and operational mechanisms for ownership, access, and accountability. We position these not as isolated technical challenges, but as interconnected design problems created by the emergence of LLM agents as intermediaries between users and digital platforms. We argue that the future of recommender systems will depend not only on better inference, but on building personalization systems that users can meaningfully understand, shape, and govern.

12. 【2604.19899】A Reproducibility Study of Metacognitive Retrieval-Augmented Generation

链接https://arxiv.org/abs/2604.19899

作者:Gabriel Iturra-Bocaz,Petra Galuscakova

类目:Information Retrieval (cs.IR)

关键词:Retrieval Augmented Generation, multi-hop question answering, tackle complex tasks, Augmented Generation, Metacognitive Retrieval Augmented

备注: Paper accepted at ACM SIGIR Conference 2026

点击查看摘要

Abstract:Recently, Retrieval Augmented Generation (RAG) has shifted focus to multi-retrieval approaches to tackle complex tasks such as multi-hop question answering. However, these systems struggle to decide when to stop searching once enough information has been gathered. To address this, \citet{zhou2024metacognitive} introduced Metacognitive Retrieval Augmented Generation (MetaRAG), a framework inspired by metacognition that enables Large Language Models to critique and refine their reasoning. In this reproducibility paper, we reproduce MetaRAG following its original experimental setup and extend it in two directions: (i) by evaluating the effect of PointWise and ListWise rerankers, and (ii) by comparing with SIM-RAG, which employs a lightweight critic model to stop retrieval. Our results confirm MetaRAG's relative improvements over standard RAG and reasoning-based baselines, but also reveal lower absolute scores than reported, reflecting challenges with closed-source LLM updates, missing implementation details, and unreleased prompts. We show that MetaRAG is partially reproduced, gains substantially from reranking, and is more robust than SIM-RAG when extended with additional retrieval features.

13. 【2604.19859】DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data

链接https://arxiv.org/abs/2604.19859

作者:Venus Team,Sunhao Dai,Yong Deng,Jinzhen Lin,Yusheng Song,Guoqing Wang,Xiaofeng Wu,Yuqi Zhou,Shuo Yang,Zhenzhe Ying,Zhanwei Zhang,Changhua Meng,Weiqiang Wang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:deep research, real-world deployment due, Edge-scale deep research, advantages in cost, research

备注: Technical Report of DR-Venus

点击查看摘要

Abstract:Edge-scale deep research agents based on small language models are attractive for real-world deployment due to their advantages in cost, latency, and privacy. In this work, we study how to train a strong small deep research agent under limited open-data by improving both data quality and data utilization. We present DR-Venus, a frontier 4B deep research agent for edge-scale deployment, built entirely on open data. Our training recipe consists of two stages. In the first stage, we use agentic supervised fine-tuning (SFT) to establish basic agentic capability, combining strict data cleaning with resampling of long-horizon trajectories to improve data quality and utilization. In the second stage, we apply agentic reinforcement learning (RL) to further improve execution reliability on long-horizon deep research tasks. To make RL effective for small agents in this setting, we build on IGPO and design turn-level rewards based on information gain and format-aware regularization, thereby enhancing supervision density and turn-level credit assignment. Built entirely on roughly 10K open-data, DR-Venus-4B significantly outperforms prior agentic models under 9B parameters on multiple deep research benchmarks, while also narrowing the gap to much larger 30B-class systems. Our further analysis shows that 4B agents already possess surprisingly strong performance potential, highlighting both the deployment promise of small models and the value of test-time scaling in this setting. We release our models, code, and key recipes to support reproducible research on edge-scale deep research agents.

14. 【2604.19793】SkillGraph: Graph Foundation Priors for LLM Agent Tool Sequence Recommendation

链接https://arxiv.org/abs/2604.19793

作者:Hao Liu,Dongyu Li

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:large API libraries, large API, API libraries, order them correctly, libraries and order

备注

点击查看摘要

Abstract:LLM agents must select tools from large API libraries and order them correctly. Existing methods use semantic similarity for both retrieval and ordering, but ordering depends on inter-tool data dependencies that are absent from tool descriptions. As a result, semantic-only methods can produce negative Kendall-$\tau$ in structured workflow domains. We introduce SkillGraph, a directed weighted execution-transition graph mined from 49,831 successful LLM agent trajectories, which encodes workflow-precedence regularities as a reusable graph foundation prior. Building on this graph foundation prior, we propose a two-stage decoupled framework: GS-Hybrid retrieval for candidate selection and a learned pairwise reranker for ordering. On ToolBench (9,965 test instances; ~16,000 tools), the method reaches Set-F1 = 0.271 and Kendall-$\tau$ = 0.096; on API-Bank, Kendall-$\tau$ improves from -0.433 to +0.613. Under identical Stage-1 inputs, the learned reranker also outperforms LLaMA-3.1-8B Stage-2 rerankers.

15. 【2604.19777】Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation

链接https://arxiv.org/abs/2604.19777

作者:Hung Ming Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large Language Models, processing long input, window receives substantially, long input contexts, context window receives

备注: 18 pages, 6 figures, 7 tables

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit a well-documented positional bias when processing long input contexts: information in the middle of a context window receives substantially less attention than content at the boundaries, a phenomenon termed the Lost-in-the-Middle effect (Liu et al., 2024). This limits knowledge-retrieval applications that embed large structured knowledge bases directly in the LLM context. Retrieval-Augmented Generation (RAG) addresses scalability by retrieving only relevant fragments, but introduces substantial infrastructure overhead and is ill-suited to libraries whose semantic boundaries are human-defined rather than statistically learned. We propose Self-Describing Structured Retrieval (SDSR), a lightweight framework in which structured data files embed human-authored navigational metadata at the file's primacy position, thereby exploiting rather than fighting the LLM's primacy bias. We further propose a Dual-Layer Guidance strategy combining in-file metadata with explicit routing rules in the system prompt. We validate SDSR through a four-round benchmark using a 190-skill library expanded from 36 to 119 categories via adversarial distractor injection. Four conditions are tested: (A) no guidance, (B) in-file summary only, (C) prompt hint only, (D) both combined. Version D achieves 100% primary routing accuracy (20/20) at 119 categories versus 65% for the no-guidance baseline. We identify a fundamental asymmetry: primary routing is solvable by explicit rules, while secondary cross-category routing requires architectural intent explicitly encoded in the data structure. We further extend SDSR to semi-structured corpora, showing how cross-reference encoding enables operation without vector databases in domains with recoverable document structure.

Comments:
18 pages, 6 figures, 7 tables

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Cite as:
arXiv:2604.19777 [cs.CL]

(or
arXiv:2604.19777v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.19777

Focus to learn more

              arXiv-issued DOI via DataCite</p>
16. 【2604.19771】Cognis: Context-Aware Memory for Conversational AI Agents

链接https://arxiv.org/abs/2604.19771

作者:Parshva Daftari,Khush Patel,Shreyas Kapale,Jithin George,Siva Surendira

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:LLM agents lack, agents lack persistent, lack persistent memory, LLM agents, causing conversations

备注: 30 pages, 8 figures, 11 tables

点击查看摘要

Abstract:LLM agents lack persistent memory, causing conversations to reset each session and preventing personalization over time. We present Lyzr Cognis, a unified memory architecture for conversational AI agents that addresses this limitation through a multi-stage retrieval pipeline. Cognis combines a dual-store backend pairing OpenSearch BM25 keyword matching with Matryoshka vector similarity search, fused via Reciprocal Rank Fusion. Its context-aware ingestion pipeline retrieves existing memories before extraction, enabling intelligent version tracking that preserves full memory history while keeping the store consistent. Temporal boosting enhances time-sensitive queries, and a BGE-2 cross-encoder reranker refines final result quality. We evaluate Cognis on two independent benchmarks -- LoCoMo and LongMemEval -- across eight answer generation models, demonstrating state-of-the-art performance on both. The system is open-source and deployed in production serving conversational AI applications.

计算机视觉

1. 【2604.20841】DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

链接https://arxiv.org/abs/2604.20841

作者:Hyeonwoo Kim,Jeonghwan Kim,Kyungwon Cho,Hanbyul Joo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, motion capture systems, including complex dexterous, capture systems, complex dexterous manipulations

备注: Project Page: [this https URL](https://snuvclab.github.io/devi/)

点击查看摘要

Abstract:Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. While the rich interaction knowledge embedded in these synthetic videos holds strong potential for motion planning in dexterous robotic manipulation, their limited physical fidelity and purely 2D nature make them difficult to use directly as imitation targets in physics-based character control. We present DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous agent control for interacting with unseen target objects. To overcome the imprecision of generative 2D cues, we introduce a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking. Unlike methods relying on high-quality 3D kinematic demonstrations, DeVI requires only the generated video, enabling zero-shot generalization across diverse objects and interaction types. Extensive experiments demonstrate that DeVI outperforms existing approaches that imitate 3D human-object interaction demonstrations, particularly in modeling dexterous hand-object interactions. We further validate the effectiveness of DeVI in multi-object scenes and text-driven action diversity, showcasing the advantage of using video as an HOI-aware motion planner.

2. 【2604.20825】FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels

链接https://arxiv.org/abs/2604.20825

作者:Sina Gholami,Abdulmoneam Ali,Tania Haghighi,Ahmed Arafa,Minhaj Nur Alam

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)

关键词:sharing raw data, collaborative model training, enables collaborative model, raw data, collaborative model

备注: Accepted at the 5th Workshop on Federated Learning for Computer Vision (FedVision), CVPR 2026. Sina Gholami and Abdulmoneam Ali contributed equally

点击查看摘要

Abstract:Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. In this paper, we propose FedSIR, a multi-stage framework for robust FL under noisy labels. Different from existing approaches that mainly rely on designing noise-tolerant loss functions or exploiting loss dynamics during training, our method leverages the spectral structure of client feature representations to identify and mitigate label noise. Our framework consists of three key components. First, we identify clean and noisy clients by analyzing the spectral consistency of class-wise feature subspaces with minimal communication overhead. Second, clean clients provide spectral references that enable noisy clients to relabel potentially corrupted samples using both dominant class directions and residual subspaces. Third, we employ a noise-aware training strategy that integrates logit-adjusted loss, knowledge distillation, and distance-aware aggregation to further stabilize federated optimization. Extensive experiments on standard FL benchmarks demonstrate that FedSIR consistently outperforms state-of-the-art methods for FL with noisy labels. The code is available at this https URL.

Comments:
Accepted at the 5th Workshop on Federated Learning for Computer Vision (FedVision), CVPR 2026. Sina Gholami and Abdulmoneam Ali contributed equally

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)

Cite as:
arXiv:2604.20825 [cs.LG]

(or
arXiv:2604.20825v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2604.20825

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
3. 【2604.20822】Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series

链接https://arxiv.org/abs/2604.20822

作者:Thorsten Hoeser,Felix Bachofer,Claudia Kuenzer

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:offshore wind infrastructure, wind energy sector, offshore wind energy, offshore wind, Earth Observation based

备注: 25 pages, 16 figures

点击查看摘要

Abstract:The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation based offshore wind infrastructure mapping has matured for spatial localization, existing open datasets lack temporally dense and semantically fine-grained information on construction and operational dynamics. We introduce a global Sentinel-1 synthetic aperture radar (SAR) time series data corpus that resolves deployment and operational phases of offshore wind infrastructure from 2016Q1 to 2025Q1. Building on an updated object detection workflow, we compile 15,606 time series at detected infrastructure locations, with overall 14,840,637 events as analysis-ready 1D SAR backscatter profiles, one profile per Sentinel-1 acquisition and location. To enable direct use and benchmarking, we release (i) the analysis ready 1D SAR profiles, (ii) event-level baseline semantic labels generated by a rule-based classifier, and (iii) an expert-annotated benchmark dataset of 553 time series with 328,657 event labels. The baseline classifier achieves a macro F1 score of 0.84 in event-wise evaluation and an area under the collapsed edit similarity-quality threshold curve (AUC) of 0.785, indicating temporal coherence. We demonstrate that the resulting corpus supports global-scale analyses of deployment dynamics, the identification of differences in regional deployment patterns, vessel interactions, and operational events, and provides a reference for developing and comparing time series classification methods for offshore wind infrastructure monitoring.

4. 【2604.20816】ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

链接https://arxiv.org/abs/2604.20816

作者:Shelly Golan,Michael Finkelson,Ariel Bereslavsky,Yotam Nitzan,Or Patashnik

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Reinforcement Learning, standard for aligning, methods rely, single scalar reward, Reinforcement

备注: Project page: [this https URL](https://shelley-golan.github.io/ParetoSlider-webpage/)

点击查看摘要

Abstract:Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of ``early scalarization'' collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.

5. 【2604.20813】Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning

链接https://arxiv.org/abs/2604.20813

作者:Yonatan Haile Medhanie,Yuanhua Ni

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Transformer-based OCR models, African syllabic writing, Transformer-based OCR, systems remains limited, Latin and CJK

备注: Code and models available at [this https URL](https://github.com/YoHa2024NKU/Tigrinya_TrOCR_Printed) Pre-trained models: [this https URL](https://huggingface.co/Yonatanhaile2026/tigrinya-trocrprinted) , [this https URL](https://huggingface.co/Yonatanhaile2026/tigrinya-trocrhandwritten)

点击查看摘要

Abstract:Transformer-based OCR models have shown strong performance on Latin and CJK scripts, but their application to African syllabic writing systems remains limited. We present the first adaptation of TrOCR for printed Tigrinya using the Ge'ez script. Starting from a pre-trained model, we extend the byte-level BPE tokenizer to cover 230 Ge'ez characters and introduce Word-Aware Loss Weighting to resolve systematic word-boundary failures that arise when applying Latin-centric BPE conventions to a new script. The unmodified model produces no usable output on Ge'ez text. After adaptation, the TrOCR-Printed variant achieves 0.22% Character Error Rate and 97.20% exact match accuracy on a held-out test set of 5,000 synthetic images from the GLOCR dataset. An ablation study confirms that Word-Aware Loss Weighting is the critical component, reducing CER by two orders of magnitude compared to vocabulary extension alone. The full pipeline trains in under three hours on a single 8 GB consumer GPU. All code, model weights, and evaluation scripts are publicly released.

6. 【2604.20806】OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

链接https://arxiv.org/abs/2604.20806

作者:Qiguang Chen,Chengyu Luan,Jiajun Wu,Qiming Yu,Yi Yang,Yizhuo Li,Jingqi Tong,Xiachong Feng,Libo Qin,Wanxiang Che

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large vision-language models, made substantial advances, Large vision-language, made substantial, substantial advances

备注: ACL 2026 Camera Ready

点击查看摘要

Abstract:Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resources for studying and improving multi-image reasoning in LVLMs.

7. 【2604.20800】LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image

链接https://arxiv.org/abs/2604.20800

作者:Dimitrije Antić,Alvaro Budria,George Paschalidis,Sai Kumar Dwivedi,Dimitrios Tzionas

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:perceptive systems, RGB image, essential for perceptive, Reconstructing, RGB

备注: 26 pages, 11 figures, 4 tables. Project page: [this https URL](https://anticdimi.github.io/lexis)

点击查看摘要

Abstract:Reconstructing 3D Human-Object Interaction from an RGB image is essential for perceptive systems. Yet, this remains challenging as it requires capturing the subtle physical coupling between the body and objects. While current methods rely on sparse, binary contact cues, these fail to model the continuous proximity and dense spatial relationships that characterize natural interactions. We address this limitation via InterFields, a representation that encodes dense, continuous proximity across the entire body and object surfaces. However, inferring these fields from single images is inherently ill-posed. To tackle this, our intuition is that interaction patterns are characteristically structured by the action and object geometry. We capture this structure in LEXIS, a novel discrete manifold of interaction signatures learned via a VQ-VAE. We then develop LEXIS-Flow, a diffusion framework that leverages LEXIS signatures to estimate human and object meshes alongside their InterFields. Notably, these InterFields help in a guided refinement that ensures physically-plausible, proximity-aware reconstructions without requiring post-hoc optimization. Evaluation on Open3DHOI and BEHAVE shows that LEXIS-Flow significantly outperforms existing SotA baselines in reconstruction, contact, and proximity quality. Our approach not only improves generalization but also yields reconstructions perceived as more realistic, moving us closer to holistic 3D scene understanding. Code models will be public at this https URL.

8. 【2604.20796】LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

链接https://arxiv.org/abs/2604.20796

作者:Inclusion AI,Tiwei Bie,Haoxing Chen,Tieyuan Chen,Zhenglin Cheng,Long Cui,Kai Gan,Zhicheng Huang,Zhenzhong Lan,Haoquan Li,Jianguo Li,Tao Lin,Qi Qin,Hongjun Wang,Xiaomei Wang,Haoyuan Wu,Yi Xin,Junbo Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:natively integrated framework, diffusion large language, large language model, discrete diffusion large, integrated framework

备注: LLaDA2.0-Uni Technical Report

点击查看摘要

Abstract:We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at this https URL.

9. 【2604.20784】GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction

链接https://arxiv.org/abs/2604.20784

作者:Zhenlong Wu,Zihan Zheng,Xuanxuan Wang,Qianhe Wang,Hua Yang,Xiaoyun Zhang,Qiang Hu,Wenjun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:sparse multi-view videos, scenes from sparse, highly ill-posed, floating artifacts, Reconstructing dynamic

备注

点击查看摘要

Abstract:Reconstructing dynamic 3D scenes from sparse multi-view videos is highly ill-posed, often leading to geometric collapse, trajectory drift, and floating artifacts. Recent attempts introduce generative priors to hallucinate missing content, yet naive integration frequently causes structural drift and temporal inconsistency due to the mismatch between stochastic 2D generation and deterministic 3D geometry. In this paper, we propose GeoRect4D, a novel unified framework for sparse-view dynamic reconstruction that couples explicit 3D consistency with generative refinement via a closed-loop optimization process. Specifically, GeoRect4D introduces a degradation-aware feedback mechanism that incorporates a robust anchor-based dynamic 3DGS substrate with a single-step diffusion rectifier to hallucinate high-fidelity details. This rectifier utilizes a structural locking mechanism and spatiotemporal coordinated attention, effectively preserving physical plausibility while restoring missing content. Furthermore, we present a progressive optimization strategy that employs stochastic geometric purification to eliminate floaters and generative distillation to infuse texture details into the explicit representation. Extensive experiments demonstrate that GeoRect4D achieves state-of-the-art performance in reconstruction fidelity, perceptual quality, and spatiotemporal consistency across multiple datasets.

10. 【2604.20760】Exploring High-Order Self-Similarity for Video Understanding

链接https://arxiv.org/abs/2604.20760

作者:Manjin Kim,Heeseung Kwon,Karteek Alahari,Minsu Cho

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:captures visual correspondences, Space-time self-similarity, correspondences across frames, captures visual, visual correspondences

备注

点击查看摘要

Abstract:Space-time self-similarity (STSS), which captures visual correspondences across frames, provides an effective way to represent temporal dynamics for video understanding. In this work, we explore higher-order STSS and demonstrate how STSSs at different orders reveal distinct aspects of these dynamics. We then introduce the Multi-Order Self-Similarity (MOSS) module, a lightweight neural module designed to learn and integrate multi-order STSS features. It can be applied to diverse video tasks to enhance motion modeling capabilities while consuming only marginal computational cost and memory usage. Extensive experiments on video action recognition, motion-centric video VQA, and real-world robotic tasks consistently demonstrate substantial improvements, validating the broad applicability of MOSS as a general temporal modeling module. The source code and checkpoints will be publicly available.

11. 【2604.20748】Amodal SAM: A Unified Amodal Segmentation Framework with Generalization

链接https://arxiv.org/abs/2604.20748

作者:Bo Zhang,Zhuotao Tian,Xin Tao,Songlin Tang,Jun Yu,Wenjie Pei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:complete geometric shape, Amodal SAM, Amodal segmentation, Amodal, aims to predict

备注

点击查看摘要

Abstract:Amodal segmentation is a challenging task that aims to predict the complete geometric shape of objects, including their occluded regions. Although existing methods primarily focus on amodal segmentation within the training domain, these approaches often lack the generalization capacity to extend effectively to novel object categories and unseen contexts. This paper introduces Amodal SAM, a unified framework that leverages SAM (Segment Anything Model) for both amodal image and amodal video segmentation. Amodal SAM preserves the powerful generalization ability of SAM while extending its inherent capabilities to the amodal segmentation task. The improvements lie in three aspects: (1) a lightweight Spatial Completion Adapter that enables occluded region reconstruction, (2) a Target-Aware Occlusion Synthesis (TAOS) pipeline that addresses the scarcity of amodal annotations by generating diverse synthetic training data, and (3) novel learning objectives that enforce regional consistency and topological regularization. Extensive experiments demonstrate that Amodal SAM achieves state-of-the-art performance on standard benchmarks, while simultaneously exhibiting robust generalization to novel scenarios. We anticipate that this research will advance the field toward practical amodal segmentation systems capable of operating effectively in unconstrained real-world environments.

12. 【2604.20745】Lifecycle-Aware Federated Continual Learning in Mobile Autonomous Systems

链接https://arxiv.org/abs/2604.20745

作者:Beining Wu,Jun Huang

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:extended mission lifecycles, evolving terrain types, Federated continual learning, distributed autonomous fleets, continual learning

备注: Submitted to IEEE

点击查看摘要

Abstract:Federated continual learning (FCL) allows distributed autonomous fleets to adapt collaboratively to evolving terrain types across extended mission lifecycles. However, current approaches face several key challenges: 1) they use uniform protection strategies that do not account for the varying sensitivities to forgetting on different network layers; 2) they focus primarily on preventing forgetting during training, without addressing the long-term effects of cumulative drift; and 3) they often depend on idealized simulations that fail to capture the real-world heterogeneity present in distributed fleets. In this paper, we propose a lifecycle-aware dual-timescale FCL framework that incorporates training-time (pre-forgetting) prevention and (post-forgetting) recovery. Under this framework, we design a layer-selective rehearsal strategy that mitigates immediate forgetting during local training, and a rapid knowledge recovery strategy that restores degraded models after long-term cumulative drift. We present a theoretical analysis that characterizes heterogeneous forgetting dynamics and establishes the inevitability of long-term degradation. Our experimental results show that this framework achieves up to 8.3\% mIoU improvement over the strongest federated baseline and up to 31.7\% over conventional fine-tuning. We also deploy the FCL framework on a real-world rover testbed to assess system-level robustness under realistic constraints; the testing results further confirm the effectiveness of our FCL design.

13. 【2604.20730】Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

链接https://arxiv.org/abs/2604.20730

作者:Guotao Liang,Zhangcheng Wang,Juncheng Hu,Haitao Zhou,Ziteng Xue,Jing Zhang,Dong Xu,Qian Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Scalable Vector Graphics, generating Scalable Vector, Multimodal Large Language, Large Language Models, Vector Graphics

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop "blind drawing" approach, where models generate symbolic code sequences without perceiving intermediate visual outcomes. This methodology severely underutilizes the powerful visual priors embedded in MLLMs vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step, leveraging on-the-fly feedback to guide subsequent generation. However, we demonstrate that applying this visual loop naively to off-the-shelf models is suboptimal due to their inability to leverage incremental visual-code mappings. To address this, we first utilize fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy to condition the next primitive generation on intermediate visual states. Furthermore, a Render-and-Verify (RaV) inference mechanism is proposed to effectively filter degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the remarkable data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.

14. 【2604.20715】GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers

链接https://arxiv.org/abs/2604.20715

作者:Yuxuan Xue,Ruofan Liang,Egor Zakharov,Timur Bagautdinov,Chen Cao,Giljoo Nam,Shunsuke Saito,Gerard Pons-Moll,Javier Romero

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:image ambiguously entangles, intrinsic appearance, image ambiguously, ambiguously entangles, single photo

备注: CVPR 2026 Highlight; Project page: [this https URL](https://yuxuan-xue.com)

点击查看摘要

Abstract:Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than both sequential models and previous systems that ignored geometry.

15. 【2604.20705】SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

链接https://arxiv.org/abs/2604.20705

作者:Jiahao Xie,Alessio Tonioni,Nathalie Rauschmayr,Federico Tombari,Bernt Schiele

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Reinforcement learning, large language models, RLVR, multimodal large language, demonstrated the great

备注

点击查看摘要

Abstract:Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations prevents MLLMs' intrinsic visual understanding and scalable reward designs. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely-used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We think this work will provide useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: this https URL.

16. 【2604.20696】R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

链接https://arxiv.org/abs/2604.20696

作者:Jiahao Xie,Alessio Tonioni,Nathalie Rauschmayr,Federico Tombari,Bernt Schiele

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large vision-language models, demonstrated impressive performance, Large vision-language, reasoning tasks, demonstrated impressive

备注

点击查看摘要

Abstract:Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information -- often focusing on specific image regions or details within a given sample -- we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. As a simple yet effective method, R-CoV can be seamlessly integrated into various LVLMs in a training-free manner and without relying on external detection models. Extensive experiments on several widely used hallucination benchmarks across multiple LVLMs demonstrate that R-CoV can significantly alleviate object hallucinations in LVLMs. Project page: this https URL.

17. 【2604.20665】he Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

链接https://arxiv.org/abs/2604.20665

作者:Karan Goyal,Dikshant Kukreja

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:current VLMs faithfully, VLMs faithfully synthesise, faithfully synthesise multimodal, current VLMs, VLMs faithfully

备注

点击查看摘要

Abstract:The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.

18. 【2604.20650】MAPRPose: Mask-Aware Proposal and Amodal Refinement for Multi-Object 6D Pose Estimation

链接https://arxiv.org/abs/2604.20650

作者:Yang Luo,Yan Gong,Yongsheng Gao,Xiaoying Sun,Jie Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:cluttered scenes remains, scenes remains challenging, remains challenging due, sensor noise, estimation in cluttered

备注

点击查看摘要

Abstract:6D object pose estimation in cluttered scenes remains challenging due to severe occlusion and sensor noise. We propose MAPRPose, a two-stage framework that leverages mask-aware correspondences for pose proposal and amodal-driven Region-of-Interest (ROI) prediction for robust refinement. In the Mask-Aware Pose Proposal (MAPP) stage, we lift 2D correspondences into 3D space to establish reliable keypoint matches and generate geometrically consistent pose hypotheses based on correspondence-level scoring, from which the top-$K$ candidates are selected. In the refinement stage, we introduce a tensorized render-and-compare pipeline integrated with an Amodal Mask Prediction and ROI Re-Alignment (AMPR) module. By reconstructing complete object geometry and dynamically adjusting the ROI, AMPR mitigates localization errors and spatial misalignment under heavy occlusion. Furthermore, our GPU-accelerated RGB-XYZ reprojection enables simultaneous refinement of all $N \times B$ pose hypotheses in a single forward pass. Evaluated on the BOP benchmark, MAPRPose achieves a state-of-the-art Average Recall (AR) of 76.5%, outperforming FoundationPose by 3.1% AR while delivering a 43x speedup in multi-object inference.

19. 【2604.20623】RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

链接https://arxiv.org/abs/2604.20623

作者:Roie Kazoom,Yotam Gigi,George Leifman,Tomer Shekel,Genady Beryozkin

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Traditional change detection, change detection identifies, remote sensing change, Traditional change, natural language

备注

点击查看摘要

Abstract:Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at this https URL.

20. 【2604.20606】Beyond ZOH: Advanced Discretization Strategies for Vision Mamba

链接https://arxiv.org/abs/2604.20606

作者:Fady Ibrahim,Guangjun Liu,Guanghui Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:input signals remain, signals remain constant, state space model, Vision Mamba, Vision Mamba framework

备注

点击查看摘要

Abstract:Vision Mamba, as a state space model (SSM), employs a zero-order hold (ZOH) discretization, which assumes that input signals remain constant between sampling instants. This assumption degrades temporal fidelity in dynamic visual environments and constrains the attainable accuracy of modern SSM-based vision models. In this paper, we present a systematic and controlled comparison of six discretization schemes instantiated within the Vision Mamba framework: ZOH, first-order hold (FOH), bilinear/Tustin transform (BIL), polynomial interpolation (POL), higher-order hold (HOH), and the fourth-order Runge-Kutta method (RK4). We evaluate each method on standard visual benchmarks to quantify its influence in image classification, semantic segmentation, and object detection. Our results demonstrate that POL and HOH yield the largest gains in accuracy at the cost of higher training-time computation. In contrast, the BIL provides consistent improvements over ZOH with modest additional overhead, offering the most favorable trade-off between precision and efficiency. These findings elucidate the pivotal role of discretization in SSM-based vision architectures and furnish empirically grounded justification for adopting BIL as the default discretization baseline for state-of-the-art SSM models.

21. 【2604.20594】Physics-Informed Conditional Diffusion for Motion-Robust Retinal Temporal Laser Speckle Contrast Imaging

链接https://arxiv.org/abs/2604.20594

作者:Qian Chen,Yuehao Chen,Qiang Wang,Lei Zhu,Yanye Lu,Qiushi Ren

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:noninvasive optical modality, monitoring retinal blood, blood flow dynamics, speckle contrast imaging, Retinal laser speckle

备注

点击查看摘要

Abstract:Retinal laser speckle contrast imaging (LSCI) is a noninvasive optical modality for monitoring retinal blood flow dynamics. However, conventional temporal LSCI (tLSCI) reconstruction relies on sufficiently long speckle sequences to obtain stable temporal statistics, which makes it vulnerable to acquisition disturbances and limits effective temporal resolution. A physically informed reconstruction framework, termed RetinaDiff (Retinal Diffusion Model), is proposed for retinal tLSCI that is robust to motion and requires only a few frames. In RetinaDiff, registration based on phase correlation is first applied to stabilize the raw speckle sequence before contrast computation, reducing interframe misalignment so that fluctuations at each pixel primarily reflect true flow dynamics. This step provides a physics prior corrected for motion and a high quality multiframe tLSCI reference. Next, guided by the physics prior, a conditional diffusion model performs inverse reconstruction by jointly conditioning on the registered speckle sequence and the corrected prior. Experiments on data acquired with a retinal LSCI system developed in house show improved structural continuity and statistical stability compared with direct reconstruction from few frames and representative baselines. The framework also remains effective in a small number of extremely challenging cases, where both the direct 5-frame input and the conventional multiframe reconstruction are severely degraded. Overall, this work provides a practical and physically grounded route for reliable retinal tLSCI reconstruction from extremely limited frames. The source code and model weights will be publicly available at this https URL.

22. 【2604.20591】Structure-Augmented Standard Plane Detection with Temporal Aggregation in Blind-Sweep Fetal Ultrasound

链接https://arxiv.org/abs/2604.20591

作者:Keli Niu,He Zhao,Qianhui Men

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fetal growth restriction, identifying fetal growth, blind-sweep ultrasound, low-resource settings, growth restriction

备注

点击查看摘要

Abstract:In low-resource settings, blind-sweep ultrasound provides a practical and accessible method for identifying fetal growth restriction. However, unlike freehand ultrasound which is subjectively controlled, detection of biometry plane in blind-sweep ultrasound is more challenging due to the uncontrolled fetal structure to be observed and the variaties of oblique planes in the scan. In this work, we propose a structure-augmented system to detect fetal abdomen plane, where the abdominal structure is highlighted using a segmentation prior. Since standard planes are emerging gradually, the decision boundary of the keyframes is unstable to predict. We thus aggregated the structure-augmented planes with a temporal sliding window to help stabilise keyframe localisation. Extensive results indicate that the structure-augmented temporal sliding strategy significantly improves and stabilises the detection of anatomically meaningful planes, which enables more reliable biometric measurements in blind-sweep ultrasound.

23. 【2604.20585】On the Impact of Face Segmentation-Based Background Removal on Recognition and Morphing Attack Detection

链接https://arxiv.org/abs/2604.20585

作者:Eduarda Caldeira,Guray Ozgur,Fadi Boutros,Naser Damer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:unconstrained image capture, image capture scenarios, morphing attack detection, image background correction, European Entry

备注: Accepted at FG 2026

点击查看摘要

Abstract:This study investigates the impact of face image background correction through segmentation on face recognition and morphing attack detection performance in realistic, unconstrained image capture scenarios. The motivation is driven by operational biometric systems such as the European Entry/Exit System (EES), which require facial enrolment at airports and other border crossing points where controlled backgrounds usually required for such captures cannot always be guaranteed, as well as by accessibility needs that may necessitate image capture outside traditional office environments. By analyzing how such preprocessing steps influence both recognition accuracy and security mechanisms, this work addresses a critical gap between usability-driven image normalization and the reliability requirements of large-scale biometric identification systems. Our study evaluates a comprehensive range of segmentation techniques, three families of morphing attack detection methods, and four distinct face recognition models, using databases that include both controlled and in-the-wild image captures. The results reveal consistent patterns linking segmentation to both recognition performance and face image quality. Additionally, segmentation is shown to systematically influence morphing attack detection performance. These findings highlight the need for careful consideration when deploying such preprocessing techniques in operational biometric systems.

24. 【2604.20574】Where are they looking in the operating room?

链接https://arxiv.org/abs/2604.20574

作者:Keqi Chen,Séraphin Baributsa,Lilien Schewski,Vinkle Srivastav,Didier Mutter,Guido Beldi,Sandra Keller,Nicolas Padoy

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:visual attention modeling, social scene understanding, team communication detection, team communication, visual attention

备注

点击查看摘要

Abstract:Purpose: Gaze-following, the task of inferring where individuals are looking, has been widely studied in computer vision, advancing research in visual attention modeling, social scene understanding, and human-robot interaction. However, gaze-following has never been explored in the operating room (OR), a complex, high-stakes environment where visual attention plays an important role in surgical workflow analysis. In this work, we introduce the concept of gaze-following to the surgical domain, and demonstrate its great potential for understanding clinical roles, surgical phases, and team communications in the OR. Methods: We extend the 4D-OR dataset with gaze-following annotations, and extend the Team-OR dataset with gaze-following and a new team communication activity annotations. Then, we propose novel approaches to address clinical role prediction, surgical phase recognition, and team communication detection using a gaze-following model. For role and phase recognition, we propose a gaze heatmap-based approach that uses gaze predictions solely; for team communication detection, we train a spatial-temporal model in a self-supervised way that encodes gaze-based clip features, and then feed the features into a temporal activity detection model. Results: Experimental results on the 4D-OR and Team-OR datasets demonstrate that our approach achieves state-of-the-art performance on all downstream tasks. Quantitatively, our approach obtains F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition. Furthermore, it significantly outperforms existing baselines in team communication detection, improving previous best performances by over 30%. Conclusion: We introduce gaze-following in the OR as a novel research direction in surgical data science, highlighting its great potential to advance surgical workflow analysis in computer-assisted interventions.

25. 【2604.20570】Exploring Spatial Intelligence from a Generative Perspective

链接https://arxiv.org/abs/2604.20570

作者:Muzhi Zhu,Shunyao Jiang,Huanyi Zheng,Zekai Luo,Hao Zhong,Anzhou Li,Kaijun Wang,Jintao Rong,Yang Liu,Hao Chen,Tao Lin,Chunhua Shen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multimodal large language, large language models, current benchmarks largely, benchmarks largely assess, large language

备注: Accepted by CVPR 2026. Project page: [this https URL](https://aim-uofa.github.io/GSI-Bench/)

点击查看摘要

Abstract:Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.

26. 【2604.20544】Evian: Towards Explainable Visual Instruction-tuning Data Auditing

链接https://arxiv.org/abs/2604.20544

作者:Zimu Jia,Mingjie Xu,Andrew Estornell,Jiaheng Wei

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision-Language Models, efficacy of Large, Large Vision-Language, requiring a precise, instruction-following capability

备注: Accepted at ACL 2026

点击查看摘要

Abstract:The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.

27. 【2604.20543】RefAerial: A Benchmark and Approach for Referring Detection in Aerial Images

链接https://arxiv.org/abs/2604.20543

作者:Guyue Hu,Hao Song,Yuxing Tong,Duzhi Yuan,Dengdi Sun,Aihua Zheng,Chenglong Li,Jin Tang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:growing research interests, recently attracted growing, attracted growing research, Referring detection, ground referring detection

备注

点击查看摘要

Abstract:Referring detection refers to locate the target referred by natural languages, which has recently attracted growing research interests. However, existing datasets are limited to ground images with large object centered in relative small scenes. This paper introduces a large-scale challenging dataset for referring detection in aerial images, termed as RefAerial. It distinguishes from conventional ground referring detection datasets by 4 characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3)complex and fine-grained referring descriptions, (4) diverse and broad scenes in the aerial view. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient semi-automated referring pair annotation. Besides, we observe that existing ground referring detection approaches exhibiting serious performance degradation on our aerial dataset since the intrinsic scale variety issue within or across aerial images. Therefore, we further propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. It consists of a mixture-of-granularity (MoG) attention and a two-stage comprehensive-to-sensitive (CtS) decoding strategy. Specifically, the mixture-of-granularity attention is developed for scale-comprehensive target understanding. In addition, the two-stage comprehensive-to-sensitive decoding strategy is designed for coarse-to-fine referring target decoding. Eventually, the proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and even promising performance boost on conventional ground referring detection datasets.

28. 【2604.20522】From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR

链接https://arxiv.org/abs/2604.20522

作者:Nan Xu,Shiheng Li,Shengchao Hou

类目:ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)

关键词:two-stage Optical Music, Optical Music Recognition, Optical Music, practical two-stage Optical, two-stage Optical

备注: 49 pages, 16 figures, 16 tables

点击查看摘要

Abstract:We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.

29. 【2604.20511】CHASM: Unveiling Covert Advertisements on Chinese Social Media

链接https://arxiv.org/abs/2604.20511

作者:Jingyi Zheng,Tianyi Hu,Yule Liu,Zhen Sun,Zongmin Zhang,Zifan Peng,Wenhan Dong,Xinlei He

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

关键词:large language models, evaluating large language, Multimodal Large Language, moderation completely overlook, language models

备注: NeuIPS 2025 (Datasets and Benchmarks Track)

点击查看摘要

Abstract:Current benchmarks for evaluating large language models (LLMs) in social media moderation completely overlook a serious threat: covert advertisements, which disguise themselves as regular posts to deceive and mislead consumers into making purchases, leading to significant ethical and legal concerns. In this paper, we present the CHASM, a first-of-its-kind dataset designed to evaluate the capability of Multimodal Large Language Models (MLLMs) in detecting covert advertisements on social media. CHASM is a high-quality, anonymized, manually curated dataset consisting of 4,992 instances, based on real-world scenarios from the Chinese social media platform Rednote. The dataset was collected and annotated under strict privacy protection and quality control protocols. It includes many product experience sharing posts that closely resemble covert advertisements, making the dataset particularly this http URL results show that under both zero-shot and in-context learning settings, none of the current MLLMs are sufficiently reliable for detecting covert this http URL further experiments revealed that fine-tuning open-source MLLMs on our dataset yielded noticeable performance gains. However, significant challenges persist, such as detecting subtle cues in comments and differences in visual and textual this http URL provide in-depth error analysis and outline future research directions. We hope our study can serve as a call for the research community and platform moderators to develop more precise defenses against this emerging threat.

30. 【2604.20486】ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

链接https://arxiv.org/abs/2604.20486

作者:Wentao Yan,Shengqin Wang,Huichi Zhou,Yihang Chen,Kun Shao,Yuan Xie,Zhizhong Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:knowledge-intensive visual reasoning, knowledge-intensive visual, visual reasoning, reasoning is fundamentally, fundamentally hindered

备注

点击查看摘要

Abstract:Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2604.20486 [cs.CV]

(or
arXiv:2604.20486v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.20486

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
31. 【2604.20474】Random Walk on Point Clouds for Feature Detection

链接https://arxiv.org/abs/2604.20474

作者:Yuhe Zhang,Zhikun Tu,Zhi Li,Jian Gao,Bao Guo,Shunli Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:cloud processing tasks, numerous point cloud, point cloud processing, cloud processing, critical importance

备注: 20 pages, 11 figures. Published in Information Sciences

点击查看摘要

Abstract:The points on the point clouds that can entirely outline the shape of the model are of critical importance, as they serve as the foundation for numerous point cloud processing tasks and are widely utilized in computer graphics and computer-aided design. This study introduces a novel method, RWoDSN, for extracting such feature points, incorporating considerations of sharp-to-smooth transitions, large-to-small scales, and textural-to-detailed features. We approach feature extraction as a two-stage context-dependent analysis problem. In the first stage, we propose a novel neighborhood descriptor, termed the Disk Sampling Neighborhood (DSN), which, unlike traditional spatially and geometrically invariant approaches, preserves a matrix structure while maintaining normal neighborhood relationships. In the second stage, a random walk is performed on the DSN (RWoDSN), yielding a graph-based DSN that simultaneously accounts for the spatial distribution, topological properties, and geometric characteristics of the local surface surrounding each point. This enables the effective extraction of feature points. Experimental results demonstrate that the proposed RWoDSN method achieves a recall of 0.769-22% higher than the current state-of-the-art-alongside a precision of 0.784. Furthermore, it significantly outperforms several traditional and deep-learning techniques across eight evaluation metrics.

32. 【2604.20473】Video-ToC: Video Tree-of-Cue Reasoning

链接https://arxiv.org/abs/2604.20473

作者:Qizhong Tan,Zhuotao Tian,Guangming Lu,Jun Yu,Wenjie Pei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Existing Video Large, Large Language Models, Video Large Language, Large Language, exhibiting limited reasoning

备注

点击查看摘要

Abstract:Existing Video Large Language Models (Video LLMs) struggle with complex video understanding, exhibiting limited reasoning capabilities and potential hallucinations. In particular, these methods tend to perform reasoning solely relying on the pretrained inherent reasoning rationales whilst lacking perception-aware adaptation to the input video content. To address this, we propose \textbf{Video-ToC}, a novel video reasoning framework that enhances video understanding through tree-of-cue reasoning. Specifically, our approach introduces three key innovations: (1) A tree-guided visual cue localization mechanism, which endows the model with enhanced fine-grained perceptual capabilities through structured reasoning patterns; (2) A reasoning-demand reward mechanism, which dynamically adjusts the reward value for reinforcement learning (RL) based on the estimation of reasoning demands, enabling on-demand incentives for more effective reasoning strategies; and (3) An automated annotation pipeline that constructs the Video-ToC-SFT-1k and Video-ToC-RL-2k datasets for supervised fine-tuning (SFT) and RL training, respectively. Extensive evaluations on six video understanding benchmarks and a video hallucination benchmark demonstrate the superiority of Video-ToC over baselines and recent methods. Code is available at this https URL.

33. 【2604.20470】DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion

链接https://arxiv.org/abs/2604.20470

作者:Yongji Long,Shijun Liang,Jintao Li,Yun Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:natural spatiotemporal energy, spatiotemporal energy decay, video diffusion offers, rigid static masks, static masks risks

备注

点击查看摘要

Abstract:Leveraging the natural spatiotemporal energy decay in video diffusion offers a path to efficiency, yet relying solely on rigid static masks risks losing critical long-range information in complex dynamics. To address this issue, we propose \textbf{DynamicRad}, a unified sparse-attention paradigm that grounds adaptive selection within a radial locality prior. DynamicRad introduces a \textbf{dual-mode} strategy: \textit{static-ratio} for speed-optimized execution and \textit{dynamic-threshold} for quality-first filtering. To ensure robustness without online search overhead, we integrate an offline Bayesian Optimization (BO) pipeline coupled with a \textbf{semantic motion router}. This lightweight projection module maps prompt embeddings to optimal sparsity regimes with \textbf{minimal runtime overhead}. Unlike online profiling methods, our offline BO optimizes attention reconstruction error (MSE) on a physics-based proxy task, ensuring rapid convergence. Experiments on HunyuanVideo and Wan2.1-14B demonstrate that DynamicRad pushes the efficiency--quality Pareto frontier, achieving \textbf{1.7$\times$--2.5$\times$ inference speedups} with \textbf{over 80\% effective sparsity}. In some long-sequence settings, the dynamic mode even matches or exceeds the dense baseline, while mask-aware LoRA further improves long-horizon coherence. Code is available at this https URL.

34. 【2604.20460】CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs

链接https://arxiv.org/abs/2604.20460

作者:Xingcheng Zhou,Hao Guo,Rui Song,Walter Zimmer,Mingyu Liu,André Schamschurko,Hu Cao,Alois Knoll

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Safety-critical traffic reasoning, detect true hazards, traffic reasoning requires, near-identical counterfactual scenes, reasoning requires contrastive

备注

点击查看摘要

Abstract:Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs, and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video versus question consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and quadruple-level contrastive consistency, with unreliable none-of-the-above rejection as a key bottleneck. Finally, we introduce C-TCD, a contrastive decoding approach leveraging a semantically exclusive counterpart video as the contrast input at inference time, improving both instance-level QA and contrastive consistency.

35. 【2604.20429】Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing

链接https://arxiv.org/abs/2604.20429

作者:Xi Chen,Xu Chen,Xiangyang Jia,Xu Zhang,Shuquan Wei,Wei Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Remote sensing, image-text retrieval plays, plays a critical, critical role, role in understanding

备注

点击查看摘要

Abstract:Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that the FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.

36. 【2604.20395】SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

链接https://arxiv.org/abs/2604.20395

作者:Chris Choy,Junha Lee,Chunghyun Park,Minsu Cho,Jan Kautz

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:aggregate foundation-model outputs, pipelines aggregate foundation-model, approaches rely, core capability, capability for robotics

备注: Project page: [this https URL](https://nvlabs.github.io/SpaCeFormer/)

点击查看摘要

Abstract:Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8x improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.

37. 【2604.20393】MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement

链接https://arxiv.org/abs/2604.20393

作者:Haoyu Zhang,Jingyi Zhou,Peng Ye,Jiakang Yuan,Lin Zhang,Feng Xu,Tao Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:made significant progress, significant progress due, ViT-based stereo matching, deep learning, stereo matching methods

备注

点击查看摘要

Abstract:With the development of deep learning, ViT-based stereo matching methods have made significant progress due to their remarkable robustness and zero-shot ability. However, due to the limitations of ViTs in handling resolution sensitivity and their relative neglect of local information, the ability of ViT-based methods to predict details and handle arbitrary-resolution images is still weaker than that of CNN-based methods. To address these shortcomings, we propose MLG-Stereo, a systematic pipeline-level design that extends global modeling beyond the encoder stage. First, we propose a Multi-Granularity Feature Network to effectively balance global context and local geometric information, enabling comprehensive feature extraction from images of arbitrary resolution and bridging the gap between training and inference scales. Then, a Local-Global Cost Volume is constructed to capture both locally-correlated and global-aware matching information. Finally, a Local-Global Guided Recurrent Unit is introduced to iteratively optimize the disparity locally under the guidance of global information. Extensive experiments are conducted on multiple benchmark datasets, demonstrating that our MLG-Stereo exhibits highly competitive performance on the Middlebury and KITTI-2015 benchmarks compared to contemporaneous leading methods, and achieves outstanding results in the KITTI-2012 dataset.

38. 【2604.20392】Self-supervised pretraining for an iterative image size agnostic vision transformer

链接https://arxiv.org/abs/2604.20392

作者:Nedyalko Prisadnikov,Danda Pani Paudel,Yuqian Fu,Luc Van Gool

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:dominate self-supervised learning, dominate self-supervised, self-supervised learning, Abstract, ViTs

备注

点击查看摘要

Abstract:Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.

39. 【2604.20368】LaplacianFormer:Rethinking Linear Attention with Laplacian Kernel

链接https://arxiv.org/abs/2604.20368

作者:Zhe Feng,Sen Lian,Changwei Wang,Muyang Zhang,Tianlong Tan,Rongtao Xu,Weiliang Meng,Xiaopeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:high-resolution vision tasks, vision tasks, softmax attention presents, presents a major, major obstacle

备注

点击查看摘要

Abstract:The quadratic complexity of softmax attention presents a major obstacle for scaling Transformers to high-resolution vision tasks. Existing linear attention variants often replace the softmax with Gaussian kernels to reduce complexity, but such approximations lack theoretical grounding and tend to oversuppress mid-range token interactions. We propose LaplacianFormer, a Transformer variant that employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis. To address expressiveness degradation under low-rank approximations, we introduce a provably injective feature map that retains fine-grained token information. For efficient computation, we adopt a Nyström approximation of the kernel matrix and solve the resulting system using Newton--Schulz iteration, avoiding costly matrix inversion and SVD. We further develop custom CUDA implementations for both the kernel and solver, enabling high-throughput forward and backward passes suitable for edge deployment. Experiments on ImageNet show that LaplacianFormer achieves strong performance-efficiency trade-offs while improving attention expressiveness.

40. 【2604.20366】Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

链接https://arxiv.org/abs/2604.20366

作者:Xingyu Zhu,Junfeng Fang,Shuo Wang,Beier Zhu,Zhicai Wang,Yonghui Yang,Xiangnan He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, compromise output reliability, Large Vision-Language, Vision-Language Models, exhibit powerful generative

备注: ACL 2026 (Oral)

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) exhibit powerful generative capabilities but frequently produce hallucinations that compromise output reliability. Fine-tuning on annotated data devoid of hallucinations offers the most direct solution, while its high computational cost motivates recent representation-based methods, which focus on mitigating hallucinatory components within hidden representations. Though efficient, we empirically observe that these methods degrade general generation capacity due to incomplete extraction of hallucination components and non-selective parameter updates. To address these limitations, we propose MPD, a dual-stage framework for mitigating hallucinations without performance degradation. Specifically, our MPD relies on two essential factors: (1) semantic-aware component disentanglement to extract pure hallucination components, and (2) interpretable parameter updates that selectively modify parameters most relevant to hallucination. Extensive experiments demonstrate that MPD achieves state-of-the-art performance, reducing hallucinations by 23.4\% while maintaining 97.4\% of general generative capability as evaluated on LLaVA-Bench and MME, with no additional computational cost.

41. 【2604.20361】Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models

链接https://arxiv.org/abs/2604.20361

作者:Rong Quan,Yantao Lai,Dong Liang,Jie Qin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Referring-guided Scanpath Prediction, human attention scanpath, Object Referring-guided Scanpath, linguistic description describing, specific target object

备注: ICMR 2026

点击查看摘要

Abstract:Object Referring-guided Scanpath Prediction (ORSP) aims to predict the human attention scanpath when they search for a specific target object in a visual scene according to a linguistic description describing the object. Multimodal information fusion is a key point of ORSP. Therefore, we propose a novel model, ScanVLA, to first exploit a Vision-Language Model (VLM) to extract and fuse inherently aligned visual and linguistic feature representations from the input image and referring expression. Next, to enhance the ScanVLA's perception of fine-grained positional information, we not only propose a novel History Enhanced Scanpath Decoder (HESD) that directly takes historical fixations' position information as input to help predict a more reasonable position for the current fixation, but also adopt a frozen Segmentation LoRA as an auxiliary component to help localize the referred object more precisely, which improves the scanpath prediction task without incurring additional large computational and time costs. Extensive experimental results demonstrate that ScanVLA can significantly outperform existing scanpath prediction methods under object referring.

42. 【2604.20358】ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval

链接https://arxiv.org/abs/2604.20358

作者:Zixu Li,Yupeng Hu,Zhiwei Chen,Mingyu Zhang,Zhiheng Fu,Liqiang Nie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Composed Image Retrieval, flexible retrieval paradigm, error-prone triplet annotations, flexible retrieval, retrieval paradigm

备注: Accepted by CVPR 2026

点击查看摘要

Abstract:The Composed Image Retrieval (CIR) task provides a flexible retrieval paradigm via a reference image and modification text, but it heavily relies on expensive and error-prone triplet annotations. This paper systematically investigates the Noisy Triplet Correspondence (NTC) problem introduced by annotations. We find that NTC noise, particularly ``hard noise'' (i.e., the reference and target images are highly similar but the modification text is incorrect), poses a unique challenge to existing Noise Correspondence Learning (NCL) methods because it breaks the traditional ``small loss hypothesis''. We identify and elucidate three key, yet overlooked, challenges in the NTC task, namely (C1) Modality Suppression, (C2) Negative Anchor Deficiency, and (C3) Unlearning Backlash. To address these challenges, we propose a Cone-based robuSt noisE-unlearning comPositional network (ConeSep). Specifically, we first propose Geometric Fidelity Quantization, theoretically establishing and practically estimating a noise boundary to precisely locate noisy correspondence. Next, we introduce Negative Boundary Learning, which learns a ``diagonal negative combination'' for each query as its explicit semantic opposite-anchor in the embedding space. Finally, we design Boundary-based Targeted Unlearning, which models the noisy correction process as an optimal transport problem, elegantly avoiding Unlearning Backlash. Extensive experiments on benchmark datasets (FashionIQ and CIRR) demonstrate that ConeSep significantly outperforms current state-of-the-art methods, which fully demonstrates the effectiveness and robustness of our method.

43. 【2604.20357】SignDATA: Data Pipeline for Sign Language Translation

链接https://arxiv.org/abs/2604.20357

作者:Kuanwei Chen,Tingyi Lin

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:clip timing, signer framing, difficult to preprocess, preprocess consistently, vary in annotation

备注: 7 pages, 1 figure

点击查看摘要

Abstract:Sign-language datasets are difficult to preprocess consistently because they vary in annotation schema, clip timing, signer framing, and privacy constraints. Existing work usually reports downstream models, while the preprocessing pipeline that converts raw video into training-ready pose or video artifacts remains fragmented, backend-specific, and weakly documented. We present SignDATA, a config-driven preprocessing toolkit that standardizes heterogeneous sign-language corpora into comparable outputs for learning. The system supports two end-to-end recipes: a pose recipe that performs acquisition, manifesting, person localization, clipping, cropping, landmark extraction, normalization, and WebDataset export, and a video recipe that replaces pose extraction with signer-cropped video packaging. SignDATA exposes interchangeable MediaPipe and MMPose backends behind a common interface, typed job schemas, experiment-level overrides, and per-stage checkpointing with config- and manifest-aware hashes. We validate the toolkit through a research-oriented evaluation design centered on backend comparison, preprocessing ablations, and privacy-aware video generation on datasets. Our contribution is a reproducible preprocessing layer for sign-language research that makes extractor choice, normalization policy, and privacy tradeoffs explicit, configurable, and empirically this http URL is available at this https URL.

44. 【2604.20354】Hallucination Early Detection in Diffusion Models

链接https://arxiv.org/abs/2604.20354

作者:Federico Betti,Lorenzo Baraldi,Lorenzo Baraldi,Rita Cucchiara,Nicu Sebe

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:diffusion models, significant advancements, advancements in output, output realism, Hallucination Early Detection

备注: 21 pages, 6 figures, 4 tables. Published in International Journal of Computer Vision (IJCV)

点击查看摘要

Abstract:Text-to-Image generation has seen significant advancements in output realism with the advent of diffusion models. However, diffusion models encounter difficulties when tasked with generating multiple objects, frequently resulting in hallucinations where certain entities are omitted. While existing solutions typically focus on optimizing latent representations within diffusion models, the relevance of the initial generation seed is typically underestimated. While using various seeds in multiple iterations can improve results, this method also significantly increases time and energy costs. To address this challenge, we introduce HEaD+ (Hallucination Early Detection +), a novel approach designed to identify incorrect generations early in the diffusion process. The HEaD+ framework integrates cross-attention maps and textual information with a novel input, the Predicted Final Image. The objective is to assess whether to proceed with the current generation or restart it with a different seed, thereby exploring multiple-generation seeds while conserving time. HEaD+ is trained on the newly created InsideGen dataset of 45,000 generated images, each containing prompts with up to seven objects. Our findings demonstrate a 6-8% increase in the likelihood of achieving a complete generation (i.e., an image accurately representing all specified subjects) with four objects when applying HEaD+ alongside existing models. Additionally, HEaD+ reduces generation times by up to 32% when aiming for a complete image, enhancing the efficiency of generating complete and accurate object representations relative to leading models. Moreover, we propose an integrated localization module that predicts object centroid positions and verifies pairwise spatial relations (if requested by the users) at an intermediate timestep, gating generation together with object presence to further improve relation-consistent outcomes.

45. 【2604.20350】X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis

链接https://arxiv.org/abs/2604.20350

作者:Gui Wang,Zehao Zhong,YongSong Zhou,Yudong Li,Ende Wu,Wooi Ping Cheah,Rong Qu,Jianfeng Ren,Linlin Shen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Multi-modal Large Language, multi-modal diagnosis remains, Language Models, Large Language

备注: Accept by CVPR2026

点击查看摘要

Abstract:Despite significant progress in Multi-modal Large Language Models (MLLMs), their clinical reasoning capacity for multi-modal diagnosis remains largely unexamined. Current benchmarks, mostly single-modality data, can't evaluate progressive reasoning and cross-modal integration essential for clinical practice. We introduce the Cross-Modality Progressive Clinical Reasoning (X-PCR) benchmark, the first comprehensive evaluation of MLLMs through a complete ophthalmology diagnostic workflow, with two reasoning tasks: 1) a six-stage progressive reasoning chain spanning image quality assessment to clinical decision-making, and 2) a cross-modality reasoning task integrating six imaging modalities. The benchmark comprises 26,415 images and 177,868 expert-verified VQA pairs curated from 51 public datasets, covering 52 ophthalmic diseases. Evaluation of 21 MLLMs reveals critical gaps in progressive reasoning and cross-modal integration. Dataset and code: this https URL.

46. 【2604.20336】Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation

链接https://arxiv.org/abs/2604.20336

作者:Jiahao Xu,Xiaohan Yuan,Xingchen Wu,Chongyang Xu,Kun Li,Buzhen Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:requires multiple humans, Co-manipulation requires multiple, preserving stable states, ensuring reasonable interactions, requires multiple

备注: CVPR 2026

点击查看摘要

Abstract:Co-manipulation requires multiple humans to synchronize their motions with a shared object while ensuring reasonable interactions, maintaining natural poses, and preserving stable states. However, most existing motion generation approaches are designed for single-character scenarios or fail to account for payload-induced dynamics. In this work, we propose a flow-matching framework that ensures the generated co-manipulation motions align with the intended goals while maintaining naturalness and effectiveness. Specifically, we first introduce a generative model that derives explicit manipulation strategies from the object's affordance and spatial configuration, which guide the motion flow toward successful manipulation. To improve motion quality, we then design an adversarial interaction prior that promotes natural individual poses and realistic inter-person interactions during co-manipulation. In addition, we also incorporate a stability-driven simulation into the flow matching process, which refines unstable interaction states through sampling-based optimization and directly adjusts the vector field regression to promote more effective manipulation. The experimental results demonstrate that our method achieves higher contact accuracy, lower penetration, and better distributional fidelity compared to state-of-the-art human-object interaction baselines. The code is available at this https URL.

47. 【2604.20329】Image Generators are Generalist Vision Learners

链接https://arxiv.org/abs/2604.20329

作者:Valentin Gabeur,Shangbang Long,Songyou Peng,Paul Voigtlaender,Shuyang Sun,Yanan Bao,Karen Truong,Zhicheng Wang,Wenlei Zhou,Jonathan T. Barron,Kyle Genova,Nithish Kannen,Sherry Ben,Yandong Li,Mandy Guo,Suhas Yogin,Yiming Gu,Huizhong Chen,Oliver Wang,Saining Xie,Howard Zhou,Kaiming He,Thomas Funkhouser,Jean-Baptiste Alayrac,Radu Soricut

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:video generators exhibit, LLMs develop emergent, Recent works show, develop emergent capabilities, vision

备注: Project Page: [this http URL](http://vision-banana.github.io)

点击查看摘要

Abstract:Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.

48. 【2604.20328】Hybrid Latent Reasoning with Decoupled Policy Optimization

链接https://arxiv.org/abs/2604.20328

作者:Tao Cheng,Shi-Zhe Chen,Hao Zhang,Yixin Qin,Jinwen Luo,Zheng Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:large language models, complex problem-solving capabilities, reasoning significantly elevates, multimodal large language, language models

备注: Tech report

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically discretizes signals to fit LLM inputs, causing early semantic collapse and discarding fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT), we introduce DePO (Decoupled Policy Optimization) to enable effective reinforcement learning within this hybrid space. DePO decomposes the policy gradient objective, applying independent trust-region constraints to the textual and latent components, alongside an exact closed-form von Mises-Fisher (vMF) KL regularizer. Extensive experiments demonstrate that HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning approaches across fine-grained perception and general multimodal understanding benchmarks. Code is available at this https URL.

49. 【2604.20319】SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

链接https://arxiv.org/abs/2604.20319

作者:Gui Wang,YongSong Zhou,Kaijun Deng,Wooi Ping Cheah,Rong Qu,Jianfeng Ren,Linlin Shen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, remain largely unexplored

备注: Accept by CVPR2026

点击查看摘要

Abstract:Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code: this https URL.

50. 【2604.20318】UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval

链接https://arxiv.org/abs/2604.20318

作者:Haokun Wen,Xuemeng Song,Haoyu Zhang,Xiangyu Zhao,Weili Guan,Liqiang Nie

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:multi-turn composed image, Composed image retrieval, Composed image, composed video retrieval, image retrieval

备注

点击查看摘要

Abstract:Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval all share a common paradigm: composing the reference visual with modification text to retrieve the desired target. Despite this shared structure, the three tasks have been studied in isolation, with no prior work proposing a unified framework, let alone a zero-shot solution. In this paper, we propose UniCVR, the first unified zero-shot composed visual retrieval framework that jointly addresses all three tasks without any task-specific human-annotated data. UniCVR strategically combines two complementary strengths: Multimodal Large Language Models (MLLMs) for compositional query understanding and Vision-Language Pre-trained (VLP) models for structured visual retrieval. Concretely, UniCVR operates in two stages. In Stage I, we train the MLLM as a compositional query embedder via contrastive learning on a curated multi-source dataset of approximately 3.5M samples, bridging the heterogeneous embedding spaces between the MLLM and the frozen VLP gallery encoder. A cluster-based hard negative sampling strategy is proposed to strengthen contrastive supervision. In Stage II, we introduce an MLLM-guided dual-level reranking mechanism that applies adaptive budgeted subset scoring to a small number of top-ranked candidates, and then exploits the resulting relevance signals through a dual-level re-scoring scheme, producing more accurate final rankings with minimal computational overhead. Extensive experiments across five benchmarks covering all three tasks demonstrate that UniCVR achieves cutting-edge performance, validating its effectiveness and generalizability. Our data and code will be released upon acceptance.

51. 【2604.20317】MD-Face: MoE-Enhanced Label-Free Disentangled Representation for Interactive Facial Attribute Editing

链接https://arxiv.org/abs/2604.20317

作者:Xuan Cui,Yunfei Zhao,Bo Liu,Wei Duan,Xingrong Fan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:GAN-based facial attribute, face attribute unintentionally, attribute unintentionally alters, GAN-based facial, virtual avatars

备注

点击查看摘要

Abstract:GAN-based facial attribute editing is widely used in virtual avatars and social media but often suffers from attribute entanglement, where modifying one face attribute unintentionally alters others. While supervised disentangled representation learning can address this, it relies heavily on labeled data, incurring high annotation costs. To address these challenges, we propose MD-Face, a label-free disentangled representation learning framework based on Mixture of Experts (MoE). MD-Face utilizes a MoE backbone with a gating mechanism that dynamically allocates experts, enabling the model to learn semantic vectors with greater independence. To further enhance attribute entanglement, we introduce a geometry-aware loss, which aligns each semantic vector with its corresponding Semantic Boundary Vector (SBV) through a Jacobian-based pushforward method. Experiments with ProGAN and StyleGAN show that MD-Face outperforms unsupervised baselines and competes with supervised ones. Compared to diffusion-based methods, it offers better image quality and lower inference latency, making it ideal for interactive editing.

52. 【2604.20307】Improving Facial Emotion Recognition through Dataset Merging and Balanced Training Strategies

链接https://arxiv.org/abs/2604.20307

作者:Serap Kırbız

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:deep convolutional networks, deep learning framework, automatic facial emotion, facial emotion based, deep learning

备注

点击查看摘要

Abstract:In this paper, a deep learning framework is proposed for automatic facial emotion based on deep convolutional networks. In order to increase the generalization ability and the robustness of the method, the dataset size is increased by merging three publicly available facial emotion datasets: CK+, FER+ and KDEF. Despite the increase in dataset size, the minority classes still suffer from insufficient number of training samples, leading to data imbalance. The data imbalance problem is minimized by online and offline augmentation techniques and random weighted sampling. Experimental results demonstrate that the proposed method can recognize the seven basic emotions with 82% accuracy. The results demonstrate the effectiveness of the proposed approach in tackling the challenges of data imbalance and improving classification performance in facial emotion recognition.

53. 【2604.20306】Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

链接https://arxiv.org/abs/2604.20306

作者:Zibo Xu,Qiang Li,Ke Lu,Jin Wang,Weizhi Nie,Yuting Su

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Visual Question Answering, Question Answering, generate clinically reliable, Medical Visual Question, complex medical images

备注

点击查看摘要

Abstract:Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers conditioned on complex medical images and questions. However, existing methods often overfit to superficial cross-modal correlations, neglecting the intrinsic biases embedded in multimodal medical data. Consequently, models become vulnerable to cross-modal confounding effects, severely hindering their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly tackle both observable and unobserved confounders. Specifically, we formulate a Structural Causal Model (SCM) where observable cross-modal biases (e.g., frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its associations with the unobserved confounders and target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, demonstrate that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Furthermore, qualitative analyses confirm that DCI significantly enhances the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.

54. 【2604.20291】Efficient INT8 Single-Image Super-Resolution via Deployment-Aware Quantization and Teacher-Guided Training

链接https://arxiv.org/abs/2604.20291

作者:Pham Phuong Nam Nguyen,Nam Tien Le,Thi Kim Trang Vo,Nhu Tinh Anh Nguyen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Efficient single-image super-resolution, Efficient single-image, requires balancing reconstruction, model compactness, requires balancing

备注: 10 pages, 4 figures. Accepted at the Mobile AI (MAI) 2026 Workshop at CVPR 2026

点击查看摘要

Abstract:Efficient single-image super-resolution (SISR) requires balancing reconstruction fidelity, model compactness, and robustness under low-bit deployment, which is especially challenging for x3 SR. We present a deployment-oriented quantized SISR framework based on an extract-refine-upsample design. The student performs most computation in the low-resolution space and uses a lightweight re-parameterizable backbone with PixelShuffle reconstruction, yielding a compact inference graph. To improve quality without significantly increasing complexity, we adopt a three-stage training pipeline: Stage 1 learns a basic reconstruction mapping with spatial supervision; Stage 2 refines fidelity using Charbonnier loss, DCT-domain supervision, and confidence-weighted output-level distillation from a Mamba-based teacher; and Stage 3 applies quantization-aware training directly on the fused deploy graph. We further use weight clipping and BatchNorm recalibration to improve quantization stability. On the MAI 2026 Quantized 4K Image Super-Resolution Challenge test set, our final AIO MAI submission achieves 29.79 dB PSNR and 0.8634 SSIM, obtaining a final score of 1.8 under the target mobile INT8 deployment setting. Ablation on Stage 3 optimization shows that teacher-guided supervision improves the dynamic INT8 TFLite reconstruction from 29.91 dB/0.853 to 30.0003 dB/0.856, while the fixed-shape deployable INT8 TFLite artifact attains 30.006 dB/0.857.

55. 【2604.20289】X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference

链接https://arxiv.org/abs/2604.20289

作者:Yixiao Zeng,Jianlei Zheng,Chaoda Zheng,Shijia Chen,Mingdian Liu,Tongping Liu,Tengwei Luo,Yu Zhang,Boyang Wang,Linkun Xu,Siyuan Lu,Bo Tian,Xianming Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Real-time world simulation, online reinforcement learning, autonomous driving systems, Real-time world, infrastructure for scalable

备注: Technical Report

点击查看摘要

Abstract:Real-time world simulation is becoming a key infrastructure for scalable evaluation and online reinforcement learning of autonomous driving systems. Recent driving world models built on autoregressive video diffusion achieve high-fidelity, controllable multi-camera generation, but their inference cost remains a bottleneck for interactive deployment. However, existing diffusion caching methods are designed for offline video generation with multiple denoising steps, and do not transfer to this scenario. Few-step distilled models have no inter-step redundancy left for these methods to reuse, and sequence-level parallelization techniques require future conditioning that closed-loop interactive generation does not provide. We present X-Cache, a training-free acceleration method that caches along a different axis: across consecutive generation chunks rather than across denoising steps. X-Cache maintains per-block residual caches that persist across chunks, and applies a dual-metric gating mechanism over a structure- and action-aware block-input fingerprint to independently decide whether each block should recompute or reuse its cached residual. To prevent approximation errors from permanently contaminating the autoregressive KV cache, X-Cache identifies KV update chunks (the forward passes that write clean keys and values into the persistent cache) and unconditionally forces full computation on these chunks, cutting off error propagation. We implement X-Cache on X-world, a production multi-camera action-conditioned driving world model built on multi-block causal DiT with few-step denoising and rolling KV cache. X-Cache achieves 71% block skip rate with 2.6x wall-clock speedup while maintaining minimum degradation.

56. 【2604.20286】MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation

链接https://arxiv.org/abs/2604.20286

作者:Md Maklachur Rahman,Soon Ki Jung,Tracy Hammond

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Recent segmentation models, demonstrated promising efficiency, Mamba Feature Fusion, Recent segmentation, computational complexity

备注: Accepted at CVPR 2026 Main

点击查看摘要

Abstract:Recent segmentation models have demonstrated promising efficiency by aggressively reducing parameter counts and computational complexity. However, these models often struggle to accurately delineate fine lesion boundaries and texture patterns essential for early skin cancer diagnosis and treatment planning. In this paper, we propose MambaLiteUNet, a compact yet robust segmentation framework that integrates Mamba state space modeling into a U-Net architecture, along with three key modules: Adaptive Multi-Branch Mamba Feature Fusion (AMF), Local-Global Feature Mixing (LGFM), and Cross-Gated Attention (CGA). These modules are designed to enhance local-global feature interaction, preserve spatial details, and improve the quality of skip connections. MambaLiteUNet achieves an average IoU of 87.12% and average Dice score of 93.09% across ISIC2017, ISIC2018, HAM10000, and PH2 benchmarks, outperforming state-of-the-art models. Compared to U-Net, our model improves average IoU and Dice by 7.72 and 4.61 points, respectively, while reducing parameters by 93.6% and GFLOPs by 97.6%. Additionally, in domain generalization with six unseen lesion categories, MambaLiteUNet achieves 77.61% IoU and 87.23% Dice, performing best among all evaluated models. Our extensive experiments demonstrate that MambaLiteUNet achieves a strong balance between accuracy and efficiency, making it a competitive and practical solution for dermatological image segmentation. Our code is publicly available at: this https URL.

57. 【2604.20281】Fourier Series Coder: A Novel Perspective on Angle Boundary Discontinuity Problem for Oriented Object Detection

链接https://arxiv.org/abs/2604.20281

作者:Minghong Wei,Pu Cao,Zhihao Chen,Zhiyuan Zang,Lu Yang,Qing Song

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:gained widespread attention, remote sensing, widespread attention, rapid advancement, advancement of intelligent

备注

点击查看摘要

Abstract:With the rapid advancement of intelligent driving and remote sensing, oriented object detection has gained widespread attention. However, achieving high-precision performance is fundamentally constrained by the Angle Boundary Discontinuity (ABD) and Cyclic Ambiguity (CA) problems, which typically cause significant angle fluctuations near periodic boundaries. Although recent studies propose continuous angle coders to alleviate these issues, our theoretical and empirical analyses reveal that state-of-the-art methods still suffer from substantial cyclic errors. We attribute this instability to the structural noise amplification within their non-orthogonal decoding mechanisms. This mathematical vulnerability significantly exacerbates angular deviations, particularly for square-like objects. To resolve this fundamentally, we propose the Fourier Series Coder (FSC), a lightweight plug-and-play component that establishes a continuous, reversible, and mathematically robust angle encoding-decoding paradigm. By rigorously mapping angles onto a minimal orthogonal Fourier basis and explicitly enforcing a geometric manifold constraint, FSC effectively prevents feature modulus collapse. This structurally stabilized representation ensures highly robust phase unwrapping, intrinsically eliminating the need for heuristic truncations while achieving strict boundary continuity and superior noise immunity. Extensive experiments across three large-scale datasets demonstrate that FSC achieves highly competitive overall performance, yielding substantial improvements in high-precision detection. The code will be available at this https URL.

58. 【2604.20268】Opportunistic Bone-Loss Screening from Routine Knee Radiographs Using a Multi-Task Deep Learning Framework with Sensitivity-Constrained Threshold Optimization

链接https://arxiv.org/abs/2604.20268

作者:Zhaochen Li,Xinghao Yan,Runni Zhou,Xiaoyang Li,Chenjie Zhu,Gege Wang,Yu Shi,Lixin Zhang,Rongrong Fu,Liehao Yan,Yuan Chai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fragility fractures occur, Knee radiographs, routine knee radiographs, opportunistic bone-loss screening, Toggle

备注

点击查看摘要

Abstract:Background: Osteoporosis and osteopenia are often undiagnosed until fragility fractures occur. Dual-energy X-ray absorptiometry (DXA) is the reference standard for bone mineral density (BMD) assessment, but access remains limited. Knee radiographs are obtained at high volume for osteoarthritis evaluation and may offer an opportunity for opportunistic bone-loss screening. Objective: To develop and evaluate a multi-task deep learning system for opportunistic bone-loss screening from routine knee radiographs without additional imaging or patient visits. Methods: We developed STR-Net, a multi-task framework for single-channel grayscale knee radiographs. The model includes a shared backbone, global average pooling feature aggregation, a shared neck, and a task-aware representation routing module connected to three task-specific heads: binary screening (Normal vs. Bone Loss), severity sub-classification (Osteopenia vs. Osteoporosis), and weakly coupled T-score regression with optional clinical variables. A sensitivity-constrained threshold optimization strategy (minimum sensitivity = 0.86) was applied. The dataset included 1,570 knee radiographs, split at the patient level into training (n=1,120), validation (n=226), and test (n=224) sets. Results: On the held-out test set, STR-Net achieved an AUROC of 0.933, sensitivity of 0.904, specificity of 0.773, and AUPRC of 0.956 for binary screening. Severity sub-classification achieved an AUROC of 0.898. The T-score regression branch showed a Pearson correlation of 0.801 with DXA-measured T-scores in a pilot subset (n=31), with MAE of 0.279 and RMSE of 0.347. Conclusions: STR-Net enables single-pass bone-loss screening, severity stratification, and quantitative T-score estimation from routine knee radiographs. Prospective clinical validation is needed before deployment.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2604.20268 [cs.CV]

(or
arXiv:2604.20268v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.20268

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Xiaoyang Li [view email] [v1]
Wed, 22 Apr 2026 07:12:04 UTC (7,393 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled Opportunistic Bone-Loss Screening from Routine Knee Radiographs Using a Multi-Task Deep Learning Framework with Sensitivity-Constrained Threshold Optimization, by Zhaochen Li and 10 other authorsView PDFHTML (experimental)TeX Source

view license

Current browse context:
cs.CV

prev

|
next

new
|
recent
| 2026-04

Change to browse by:

cs

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked="checked"class=“labs-tab-input”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status

59. 【2604.20258】Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

链接https://arxiv.org/abs/2604.20258

作者:Jingxuan He,Xiyu Wang,Mengyu Zheng,Xiangyu Zeng,Yunke Wang,Chang Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:preserving irrelevant content, Instruction-based image editing, Instruction-based image, aims to modify, irrelevant content

备注

点击查看摘要

Abstract:Instruction-based image editing (IIE) aims to modify images according to textual instructions while preserving irrelevant content. Despite recent advances in diffusion transformers, existing methods often suffer from over-editing, introducing unintended changes to regions unrelated to the desired edit. We identify that this limitation arises from the lack of an explicit mechanism for edit localization. In particular, different editing operations (e.g., addition, removal and replacement) induce distinct spatial patterns, yet current IIE models typically treat localization in a task-agnostic manner. To address this limitation, we propose a training-free, task-aware edit localization framework that exploits the intrinsic source and target image streams within IIE models. For each image stream, We first obtain attention-based edit cues, and then construct feature centroids based on these attentive cues to partition tokens into edit and non-edit regions. Based on the observation that optimal localization is inherently task-dependent, we further introduce a unified mask construction strategy that selectively leverages source and target image streams for different editing tasks. We provide a systematic analysis for our proposed insights and approaches. Extensive experiments on EdiVal-Bench demonstrate our framework consistently improves non-edit region consistency while maintaining strong instruction-following performance on top of powerful recent image editing backbones, including Step1X-Edit and Qwen-Image-Edit.

60. 【2604.20245】Secure Rate-Distortion-Perception: A Randomized Distributed Function Computation Approach for Realism

链接https://arxiv.org/abs/2604.20245

作者:Gustaf Åhlgren,Onur Günlü

类目:Information Theory (cs.IT); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:neural image compression, applications requiring maintained, secure RDP, secure RDP region, RDP

备注: 20 pages, 6 figures, (submitted) journal version

点击查看摘要

Abstract:Fundamental rate-distortion-perception (RDP) trade-offs arise in applications requiring maintained perceptual quality of reconstructed data, such as neural image compression. When compressed data is transmitted over public communication channels, security risks emerge. We therefore study secure RDP under negligible information leakage over both noiseless channels and broadcast channels, BCs, with correlated noise components. For noiseless channels, the exact secure RDP region is characterized. For BCs, an inner bound is derived and shown to be tight for a class of more-capable BCs. Separate source-channel coding is further shown to be optimal for this exact secure RDP region with unlimited common randomness available. Moreover, when both encoder and decoder have access to side information correlated with the source and the channel is noiseless, the exact RDP region is established. If only the decoder has correlated side information in the noiseless setting, an inner bound is derived along with a special case where the region is exact. Binary and Gaussian examples demonstrate that common randomness can significantly reduce the communication rate in secure RDP settings, unlike in standard rate-distortion settings. Thus, our results illustrate that random binning-based coding achieves strong secrecy, low distortion, and high perceptual quality simultaneously.

61. 【2604.20243】Bio-inspired Color Constancy: From Gray Anchoring Theory to Gray Pixel Methods

链接https://arxiv.org/abs/2604.20243

作者:Kai-Fu Yang,Fu-Ya Luo,Yong-Jie Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:computer imaging systems, Color constancy, biological visual systems, bio-inspired color constancy, Color

备注: 13 pages, 5 figures

点击查看摘要

Abstract:Color constancy is a fundamental ability of many biological visual systems and a crucial step in computer imaging systems. Bio-inspired modeling offers a promising way to elucidate the computational principles underlying color constancy and to develop efficient computational methods. However, bio-inspired methods for color constancy remain underexplored and lack a comprehensive analysis. This paper presents a comprehensive technical framework that integrates biological mechanisms, computational theory, and algorithmic implementation for bio-inspired color constancy. Specifically, we systematically revisit the computational theory of biological color constancy, which shows that illuminant estimation can be reduced to the task of gray-anchor (pixel or surface) detection in early vision. Subsequently, typical gray-pixel detection methods, including Gray-Pixel and Grayness-Index, are reinterpreted within a unified theoretical framework with the Lambertian reflection model and biological color-opponent mechanisms. Finally, we propose a simple learning-based method that couples reflection-model constraints with feature learning to explore the potential of bio-inspired color constancy based on gray-pixel detection. Extensive experiments confirm the effectiveness of gray-pixel detection for color constancy and demonstrate the potential of bio-inspired methods.

62. 【2604.20226】Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation

链接https://arxiv.org/abs/2604.20226

作者:Tianshui Chen,Jianman Lin,Zhijing Yang,Chunmei Qing,Guangrun Wang,Liang Lin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Speech-preserving facial expression, Speech-preserving facial, facial expression manipulation, coherent correlation metric, coherent correlation

备注

点击查看摘要

Abstract:Speech-preserving facial expression manipulation (SPFEM) aims to modify facial emotions while meticulously maintaining the mouth animation associated with spoken content. Current works depend on inaccessible paired training samples for the person, where two aligned frames exhibit the same speech content yet differ in emotional expression, limiting the SPFEM applications in real-world scenarios. In this work, we discover that speakers who convey the same content with different emotions exhibit highly correlated local facial animations in both spatial and temporal spaces, providing valuable supervision for SPFEM. To capitalize on this insight, we propose a novel spatial-temporal coherent correlation learning (STCCL) algorithm, which models the aforementioned correlations as explicit metrics and integrates the metrics to supervise manipulating facial expression and meanwhile better preserving the facial animation of spoken content. To this end, it first learns a spatial coherent correlation metric, ensuring that the visual correlations of adjacent local regions within an image linked to a specific emotion closely resemble those of corresponding regions in an image linked to a different emotion. Simultaneously, it develops a temporal coherent correlation metric, ensuring that the visual correlations of specific regions across adjacent image frames associated with one emotion are similar to those in the corresponding regions of frames associated with another emotion. Recognizing that visual correlations are not uniform across all regions, we have also crafted a correlation-aware adaptive strategy that prioritizes regions that present greater challenges. During SPFEM model training, we construct the spatial-temporal coherent correlation metric between corresponding local regions of the input and output image frames as an additional loss to supervise the generation process.

63. 【2604.20213】Weighted Knowledge Distillation for Semi-Supervised Segmentation of Maxillary Sinus in Panoramic X-ray Images

链接https://arxiv.org/abs/2604.20213

作者:Juha Park,Jiho Choi,Jong Pil Yun,Yong Chan Park,Han-Gyeol Yeom,Byung Do Lee,Sang Jun Lee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:dental imaging research, Accurate segmentation, panoramic X-ray images, surgical planning, imaging research

备注: 14 pages, 6 figures. Under review

点击查看摘要

Abstract:Accurate segmentation of maxillary sinus in panoramic X-ray images is essential for dental diagnosis and surgical planning; however, this task remains relatively underexplored in dental imaging research. Structural overlap, ambiguous anatomical boundaries inherent to two-dimensional panoramic projections, and the limited availability of large scale clinical datasets with reliable pixel-level annotations make the development and evaluation of segmentation models challenging. To address these challenges, we propose a semi-supervised segmentation framework that effectively leverages both labeled and unlabeled panoramic radiographs, where knowledge distillation is utilized to train a student model with reliable structural information distilled from a teacher model. Specifically, we introduce a weighted knowledge distillation loss to suppress unreliable distillation signals caused by structural discrepancies between teacher and student predictions. To further enhance the quality of pseudo labels generated by the teacher network, we introduce SinusCycle-GAN which is a refinement network based on unpaired image-to-image translation. This refinement process improves the precision of boundaries and reduces noise propagation when learning from unlabeled data during semi-supervised training. To evaluate the proposed method, we collected clinical panoramic X-ray images from 2,511 patients, and experimental results demonstrate that the proposed method outperforms state-of-the-art segmentation models, achieving the Dice score of 96.35\% while reducing boundary error. The results indicate that the proposed semi-supervised framework provides robust and anatomically consistent segmentation performance under limited labeled data conditions, highlighting its potential for broader dental image analysis applications.

64. 【2604.20191】From Scene to Object: Text-Guided Dual-Gaze Prediction

链接https://arxiv.org/abs/2604.20191

作者:Zehong Ke,Yanbo Jiang,Jinhao Li,Zhiyuan Liu,Yiqian Tu,Qingwen Meng,Heye Huang,Jianqiang Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词:human-like autonomous driving, Interpretable driver attention, Interpretable driver, autonomous driving, crucial for human-like

备注

点击查看摘要

Abstract:Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitations leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve precise object-level attention prediction, this paper proposes a novel dual-branch gaze prediction framework, establishing a complete paradigm from data construction to model architecture. First, we construct G-W3DA, a object-level driver attention dataset. By integrating a multimodal large language model with the Segment Anything Model 3 (SAM3), we decouple macroscopic heatmaps into object-level masks under rigorous cross-validation, fundamentally eliminating annotation hallucinations. Building upon this high-quality data foundation, we propose the DualGaze-VLM architecture. This architecture extracts the hidden states of semantic queries and dynamically modulates visual features via a Condition-Aware SE-Gate, achieving intent-driven precise spatial anchoring. Extensive experiments on the W3DA benchmark demonstrate that DualGaze-VLM consistently surpasses existing state-of-the-art (SOTA) models in spatial alignment metrics, notably achieving up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios. Furthermore, a visual Turing test reveals that the attention heatmaps generated by DualGaze-VLM are perceived as authentic by 88.22% of human evaluators, proving its capability to generate rational cognitive priors.

65. 【2604.20190】WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring

链接https://arxiv.org/abs/2604.20190

作者:Mobin Habibpour,Niloufar Alipour Talemi,John Spodnik,Camren J. Khoury,Fatemeh Afghah

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:actionable situational awareness, monitoring requires timely, visual question answering, Wildfire monitoring requires, evaluate wildfire-specific multimodal

备注

点击查看摘要

Abstract:Wildfire monitoring requires timely, actionable situational awareness from airborne platforms, yet existing aerial visual question answering (VQA) benchmarks do not evaluate wildfire-specific multimodal reasoning grounded in thermal measurements. We introduce WildFireVQA, a large-scale VQA benchmark for aerial wildfire monitoring that integrates RGB imagery with radiometric thermal data. WildFireVQA contains 6,097 RGB-thermal samples, where each sample includes an RGB image, a color-mapped thermal visualization, and a radiometric thermal TIFF, and is paired with 34 questions, yielding a total of 207,298 multiple-choice questions spanning presence and detection, classification, distribution and segmentation, localization and direction, cross-modal reasoning, and flight planning for operational wildfire intelligence. To improve annotation reliability, we combine multimodal large language model (MLLM)-based answer generation with sensor-driven deterministic labeling, manual verification, and intra-frame and inter-frame consistency checks. We further establish a comprehensive evaluation protocol for representative MLLMs under RGB, Thermal, and retrieval-augmented settings using radiometric thermal statistics. Experiments show that across task categories, RGB remains the strongest modality for current models, while retrieved thermal context yields gains for stronger MLLMs, highlighting both the value of temperature-grounded reasoning and the limitations of existing MLLMs in safety-critical wildfire scenarios. The dataset and benchmark code are open-source at this https URL.

66. 【2604.20169】Semantic-Fast-SAM: Efficient Semantic Segmenter

链接https://arxiv.org/abs/2604.20169

作者:Byunghyun Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Fast Segment, combines the Fast, achieve real-time performance, framework that combines, pipeline to achieve

备注: APSIPA ASC 2025

点击查看摘要

Abstract:We propose Semantic-Fast-SAM (SFS), a semantic segmentation framework that combines the Fast Segment Anything model with a semantic labeling pipeline to achieve real-time performance without sacrificing accuracy. FastSAM is an efficient CNN-based re-implementation of the Segment Anything Model (SAM) that runs much faster than the original transformer-based SAM. Building upon FastSAM's rapid mask generation, we integrate a Semantic-Segment-Anything (SSA) labeling strategy to assign meaningful categories to each mask. The resulting SFS model produces high-quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM-based approach. Experiments on Cityscapes and ADE20K benchmarks demonstrate that SFS matches the accuracy of prior SAM-based methods (mIoU ~ 70.33 on Cityscapes and 48.01 on ADE20K) while achieving approximately 20x faster inference than SSA in the closed-set setting. We also show that SFS effectively handles open-vocabulary segmentation by leveraging CLIP-based semantic heads, outperforming recent open-vocabulary models on broad class labeling. This work enables practical real-time semantic segmentation with the "segment-anything" capability, broadening the applicability of foundation segmentation models in robotics scenarios. The implementation is available at this https URL.

67. 【2604.20157】HumanScore: Benchmarking Human Motions in Generated Videos

链接https://arxiv.org/abs/2604.20157

作者:Yusu Fang,Tiange Xiang,Tian Tan,Narayan Schuetz,Scott Delp,Li Fei-Fei,Ehsan Adeli

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:producing increasingly realistic, increasingly realistic content, driven rapid progress, Recent advances, producing increasingly

备注

点击查看摘要

Abstract:Recent advances in model architectures, compute, and data scale have driven rapid progress in video generation, producing increasingly realistic content. Yet, no prior method systematically measures how faithfully these systems render human bodies and motion dynamics. In this paper, we present HumanScore, a systematic framework to evaluate the quality of human motions in AI-generated videos. HumanScore defines six interpretable metrics spanning kinematic plausibility, temporal stability, and biomechanical consistency, enabling fine-grained diagnosis beyond visual realism alone. Through carefully designed prompts, we elicit a diverse set of movements at varying intensities and evaluate videos generated by thirteen state-of-the-art models. Our analysis reveals consistent gaps between perceptual plausibility and motion biomechanical fidelity, identifies recurrent failure modes (e.g., temporal jitter, anatomically implausible poses, and motion drift), and produces robust model rankings from quantitative and physically meaningful criteria.

68. 【2604.20155】GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds

链接https://arxiv.org/abs/2604.20155

作者:Ao Gao,Jingyu Gong,Xin Tan,Zhizhong Zhang,Yuan Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, revolutionized real-time rendering, severe geometric voids, performance degrades significantly, real-time rendering

备注

点击查看摘要

Abstract:While 3D Gaussian Splatting (3DGS) has revolutionized real-time rendering, its performance degrades significantly under sparse-view extrapolation, manifesting as severe geometric voids and artifacts. Existing solutions primarily rely on an iterative "Repair-then-Distill" paradigm, which is inherently unstable and prone to overfitting. In this work, we propose GSCompleter, a distillation-free plugin that shifts scene completion to a stable "Generate-then-Register" workflow. Our approach first synthesizes plausible 2D reference images and explicitly lifts them into metric-scale 3D primitives via a robust Stereo-Anchor mechanism. These primitives are then seamlessly integrated into the global context through a novel Ray-Constrained Registration strategy. This shift to a rapid registration paradigm delivers superior 3DGS completion performance across three distinct benchmarks, enhancing the quality and efficiency of various baselines and achieving new SOTA results.

69. 【2604.20136】IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

链接https://arxiv.org/abs/2604.20136

作者:Weitong Kong,Di Wen,Kunyu Peng,David Schneider,Zeyun Zhong,Alexander Jaus,Zdravko Marinov,Jiale Wei,Ruiping Liu,Junwei Zheng,Yufan Chen,Lei Qi,Rainer Stiefelhagen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:existing multimodal pipelines, pipelines produce opaque, multimodal pipelines produce, revisit raw video, reconstruct temporal logic

备注: 7 pages, 2 figures, code are available at [this https URL](https://github.com/MKong17/IMPACT_CYCLE)

点击查看摘要

Abstract:Correcting errors in long-video understanding is disproportionately costly: existing multimodal pipelines produce opaque, end-to-end outputs that expose no intermediate state for inspection, forcing annotators to revisit raw video and reconstruct temporal logic from scratch. The core bottleneck is not generation quality alone, but the absence of a supervisory interface through which human effort can be proportional to the scope of each error. We present IMPACT-CYCLE, a supervisory multi-agent system that reformulates long-video understanding as iterative claim-level maintenance of a shared semantic memory -- a structured, versioned state encoding typed claims, a claim dependency graph, and a provenance log. Role-specialized agents operating under explicit authority contracts decompose verification into local object-relation correctness, cross-temporal consistency, and global semantic coherence, with corrections confined to structurally dependent claims. When automated evidence is insufficient, the system escalates to human arbitration as the supervisory authority with final override rights; dependency-closure re-verification then ensures correction cost remains proportional to error scope. Experiments on VidOR show substantially improved downstream reasoning (VQA: 0.71 to 0.79) and a 4.8x reduction in human arbitration cost, with workload significantly lower than manual annotation. Code will be released at this https URL.

70. 【2604.20130】Pairing Regularization for Mitigating Many-to-One Collapse in GANs

链接https://arxiv.org/abs/2604.20130

作者:Kuan-Yu Lin,Yu-Chih Huang,Tie Liu

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:generative adversarial networks, training generative adversarial, Mode collapse remains, adversarial networks, remains a fundamental

备注

点击查看摘要

Abstract:Mode collapse remains a fundamental challenge in training generative adversarial networks (GANs). While existing works have primarily focused on inter-mode collapse, such as mode dropping, intra-mode collapse-where many latent variables map to the same or highly similar outputs-has received significantly less attention. In this work, we propose a pairing regularizer jointly optimized with the generator to mitigate the many-to-one collapse by enforcing local consistency between latent variables and generated samples. We show that the effect of pairing regularization depends on the dominant failure mode of training. In collapse-prone regimes with limited exploration, pairing encourages structured local exploration, leading to improved coverage and higher recall. In contrast, under stabilized training with sufficient exploration, pairing refines the generator's induced data density by discouraging redundant mappings, thereby improving precision without sacrificing recall. Extensive experiments on both toy distributions and real-image benchmarks demonstrate that the proposed regularizer effectively complements existing stabilization techniques by directly addressing intra-mode collapse.

71. 【2604.20128】Semi-Supervised Flow Matching for Mosaiced and Panchromatic Fusion Imaging

链接https://arxiv.org/abs/2604.20128

作者:Peiming Luo,Nan Wang,Litong Liu,Jiahan Huang,Chenxu Wu,Renwei Dian,Junming Hou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:severely ill-posed nature, ill-posed nature remains, video-rate HR-HSI imaging, Fusing a low, mosaiced hyperspectral image

备注

点击查看摘要

Abstract:Fusing a low resolution (LR) mosaiced hyperspectral image (HSI) with a high resolution (HR) panchromatic (PAN) image offers a promising avenue for video-rate HR-HSI imaging via single-shot acquisition, yet its severely ill-posed nature remains a significant challenge. In this work, we propose a novel semi-supervised flow matching framework for mosaiced and PAN image fusion. Unlike previous diffusion-based approaches constrained by specific protocols or handcrafted assumptions, our method seamlessly integrates an unsupervised scheme with flow matching, resulting in a generalizable and efficient generative framework. Specifically, our method follows a two-stage training pipeline. First, we pretrain an unsupervised prior network to produce an initial pseudo HR-HSI. Building on this, we then train a conditional flow matching model to generate the target HR-HSI, introducing a random voting mechanism that iteratively refines the initial HR-HSI estimate, enabling robust and effective fusion. During inference, we employ a conflict-free gradient guidance strategy that ensures spectrally and spatially consistent HR-HSI reconstruction. Experiments on multiple benchmark datasets demonstrate that our method achieves superior quantitative and qualitative performance by a significant margin compared to representative baselines. Beyond mosaiced and PAN fusion, our approach provides a flexible generative framework that can be readily extended to other image fusion tasks and integrated with unsupervised or blind image restoration algorithms.

72. 【2604.20123】opology-Aware Skeleton Detection via Lighthouse-Guided Structured Inference

链接https://arxiv.org/abs/2604.20123

作者:Daoyong Fu,Xiang Zhang,Zhaohuan Zhan,Fan Yang,Ke Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:represent geometric shapes, natural images, geometric shapes, skeleton, represent geometric

备注

点击查看摘要

Abstract:In natural images, object skeletons are used to represent geometric shapes. However, even slight variations in pose or movement can cause noticeable changes in skeleton structure, increasing the difficulty of detecting the skeleton and often resulting in discontinuous skeletons. Existing methods primarily focus on point-level skeleton point detection and overlook the importance of structural continuity in recovering complete skeletons. To address this issue, we propose Lighthouse-Skel, a topology-aware skeleton detection method via lighthouse-guided structured inference. Specifically, we introduce a dual-branch collaborative detection framework that jointly learns skeleton confidence field and structural anchors, including endpoints and junction points. The spatial distributions learned by the point branch guide the network to focus on topologically vulnerable regions, which improves the accuracy of skeleton detection. Based on the learned skeleton confidence field, we further propose a lighthouse-guided topology completion strategy, which uses detected junction points and breakpoints as lighthouses to reconnect discontinuous skeleton segments along low-cost paths, thereby improving skeleton continuity and structural integrity. Experimental results on four public datasets demonstrate that the proposed method achieves competitive detection accuracy while substantially improving skeleton connectivity and structural integrity.

73. 【2604.20093】FurnSet: Exploiting Repeats for 3D Scene Reconstruction

链接https://arxiv.org/abs/2604.20093

作者:Paul Dobre,Xin Wang,Hongzhou Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reconstruction involves inferring, involves inferring, geometry and spatial, Single-view, scene reconstruction involves

备注

点击查看摘要

Abstract:Single-view 3D scene reconstruction involves inferring both object geometry and spatial layout. Existing methods typically reconstruct objects independently or rely on implicit scene context, failing to exploit the repeated instances commonly present in realworld scenes. We propose FurnSet, a framework that explicitly identifies and leverages repeated object instances to improve reconstruction. Our method introduces per-object CLS tokens and a set-aware self-attention mechanism that groups identical instances and aggregates complementary observations across them, enabling joint reconstruction. We further combine scene-level and object-level conditioning to guide object reconstruction, followed by layout optimization using object point clouds with 3D and 2D projection losses for scene alignment. Experiments on 3D-Future and 3D-Front demonstrate improved scene reconstruction quality, highlighting the effectiveness of exploiting repetition for robust 3D scene reconstruction.

74. 【2604.20083】Energy-Based Open-Set Active Learning for Object Classification

链接https://arxiv.org/abs/2604.20083

作者:Zongyao Lyu,William J. Beksi

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Active learning, minimizing labeling costs, crucial methodology, methodology for minimizing, costs in deep

备注: To be published in the 2026 International Conference on Pattern Recognition (ICPR)

点击查看摘要

Abstract:Active learning (AL) has emerged as a crucial methodology for minimizing labeling costs in deep learning by selecting the most valuable samples from a pool of unlabeled data for annotation. Traditional AL operates under a closed-set assumption, where all classes in the dataset are known and consistent. However, real-world scenarios often present open-set conditions in which unlabeled data contains both known and unknown classes. In such environments, standard AL techniques struggle. They can mistakenly query samples from unknown categories, leading to inefficient use of annotation budgets. In this paper, we propose a novel dual-stage energy-based framework for open-set AL. Our method employs two specialized energy-based models (EBMs). The first, an energy-based known/unknown separator, filters out samples likely to belong to unknown classes. The second, an energy-based sample scorer, assesses the informativeness of the filtered known samples. Using the energy landscape, our models distinguish between data points from known and unknown classes in the unlabeled pool by assigning lower energy to known samples and higher energy to unknown samples, ensuring that only samples from classes of interest are selected for labeling. By integrating these components, our approach ensures efficient and targeted sample selection, maximizing learning impact in each iteration. Experiments on 2D (CIFAR-10, CIFAR-100, TinyImageNet) and 3D (ModelNet40) object classification benchmarks demonstrates that our framework outperforms existing approaches, achieving superior annotation efficiency and classification performance in open-set environments.

75. 【2604.20047】PASTA: A Patch-Agnostic Twofold-Stealthy Backdoor Attack on Vision Transformers

链接https://arxiv.org/abs/2604.20047

作者:Dazhuang Liu,Yanqi Qiao,Rui Wang,Kaitai Liang,Georgios Smaragdakis

类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

关键词:Vision Transformers, achieved remarkable success, recent studies show, trigger, achieved remarkable

备注

点击查看摘要

Abstract:Vision Transformers (ViTs) have achieved remarkable success across vision tasks, yet recent studies show they remain vulnerable to backdoor attacks. Existing patch-wise attacks typically assume a single fixed trigger location during inference to maximize trigger attention. However, they overlook the self-attention mechanism in ViTs, which captures long-range dependencies across patches. In this work, we observe that a patch-wise trigger can achieve high attack effectiveness when activating backdoors across neighboring patches, a phenomenon we term the Trigger Radiating Effect (TRE). We further find that inter-patch trigger insertion during training can synergistically enhance TRE compared to single-patch insertion. Prior ViT-specific attacks that maximize trigger attention often sacrifice visual and attention stealthiness, making them detectable. Based on these insights, we propose PASTA, a twofold stealthy patch-wise backdoor attack in both pixel and attention domains. PASTA enables backdoor activation when the trigger is placed at arbitrary patches during inference. To achieve this, we introduce a multi-location trigger insertion strategy to enhance TRE. However, preserving stealthiness while maintaining strong TRE is challenging, as TRE is weakened under stealthy constraints. We therefore formulate a bi-level optimization problem and propose an adaptive backdoor learning framework, where the model and trigger iteratively adapt to each other to avoid local optima. Extensive experiments show that PASTA achieves 99.13% attack success rate across arbitrary patches on average, while significantly improving visual and attention stealthiness (144.43x and 18.68x) and robustness (2.79x) against state-of-the-art ViT defenses across four datasets, outperforming CNN- and ViT-based baselines.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Cite as:
arXiv:2604.20047 [cs.CV]

(or
arXiv:2604.20047v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.20047

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
76. 【2604.20046】Gaussians on a Diet: High-Quality Memory-Bounded 3D Gaussian Splatting Training

链接https://arxiv.org/abs/2604.20046

作者:Yangming Zhang,Jian Xu,Kunxiong Zhu,Wei Niu,Miao Yin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, revolutionized novel view, view synthesis, synthesis with high-quality, continuous aggregations

备注

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has revolutionized novel view synthesis with high-quality rendering through continuous aggregations of millions of 3D Gaussian primitives. However, it suffers from a substantial memory footprint, particularly during training due to uncontrolled densification, posing a critical bottleneck for deployment on memory-constrained edge devices. While existing methods prune redundant Gaussians post-training, they fail to address the peak memory spikes caused by the abrupt growth of Gaussians early in the training process. To solve the training memory consumption problem, we propose a systematic memory-bounded training framework that dynamically optimizes Gaussians through iterative growth and pruning. In other words, the proposed framework alternates between incremental pruning of low-impact Gaussians and strategic growing of new primitives with an adaptive Gaussian compensation, maintaining a near-constant low memory usage while progressively refining rendering fidelity. We comprehensively evaluate the proposed training framework on various real-world datasets under strict memory constraints, showing significant improvements over existing state-of-the-art methods. Particularly, our proposed method practically enables memory-efficient 3DGS training on NVIDIA Jetson AGX Xavier, achieving similar visual quality with up to 80% lower peak training memory consumption than the original 3DGS.

77. 【2604.20041】Normalizing Flows with Iterative Denoising

链接https://arxiv.org/abs/2604.20041

作者:Tianrong Chen,Jiatao Gu,David Berthelot,Joshua Susskind,Shuangfei Zhai

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:received revived attention, Normalizing Flows, Normalizing Flow generative, revived attention, classical family

备注

点击查看摘要

Abstract:Normalizing Flows (NFs) are a classical family of likelihood-based methods that have received revived attention. Recent efforts such as TARFlow have shown that NFs are capable of achieving promising performance on image modeling tasks, making them viable alternatives to other methods such as diffusion models. In this work, we further advance the state of Normalizing Flow generative models by introducing iterative TARFlow (iTARFlow). Unlike diffusion models, iTARFlow maintains a fully end-to-end, likelihood-based objective during training. During sampling, it performs autoregressive generation followed by an iterative denoising procedure inspired by diffusion-style methods. Through extensive experiments, we show that iTARFlow achieves competitive performance across ImageNet resolutions of 64, 128, and 256 pixels, demonstrating its potential as a strong generative model and advancing the frontier of Normalizing Flows. In addition, we analyze the characteristic artifacts produced by iTARFlow, offering insights that may shed light on future improvements. Code is available at this https URL.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2604.20041 [cs.CV]

(or
arXiv:2604.20041v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.20041

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
78. 【2604.20038】FluSplat: Sparse-View 3D Editing without Test-Time Optimization

链接https://arxiv.org/abs/2604.20038

作者:Haitao Huang,Shin-Fang Chng,Huangying Zhan,Qingan Yan,Yi Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, Recent advances, enabled high-quality, advances in text-guided, scene manipulation

备注

点击查看摘要

Abstract:Recent advances in text-guided image editing and 3D Gaussian Splatting (3DGS) have enabled high-quality 3D scene manipulation. However, existing pipelines rely on iterative edit-and-fit optimization at test time, alternating between 2D diffusion editing and 3D reconstruction. This process is computationally expensive, scene-specific, and prone to cross-view inconsistencies. We propose a feed-forward framework for cross-view consistent 3D scene editing from sparse views. Instead of enforcing consistency through iterative 3D refinement, we introduce a cross-view regularization scheme in the image domain during training. By jointly supervising multi-view edits with geometric alignment constraints, our model produces view-consistent results without per-scene optimization at inference. The edited views are then lifted into 3D via a feedforward 3DGS model, yielding a coherent 3DGS representation in a single forward pass. Experiments demonstrate competitive editing fidelity and substantially improved cross-view consistency compared to optimization-based methods, while reducing inference time by orders of magnitude.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2604.20038 [cs.CV]

(or
arXiv:2604.20038v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.20038

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
79. 【2604.20030】Learning to count small and clustered objects with application to bacterial colonies

链接https://arxiv.org/abs/2604.20030

作者:Minghua Zheng,Na Helian,Peter C. R. Lane,Yi Sun,Allen Donald

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Automated bacterial colony, obtain data required, Automated bacterial, vaccines and antibiotics, development of vaccines

备注: 59 pages, 26 figures

点击查看摘要

Abstract:Automated bacterial colony counting from images is an important technique to obtain data required for the development of vaccines and antibiotics. However, bacterial colonies present unique machine vision challenges that affect counting, including (1) small physical size, (2) object clustering, (3) high data annotation cost, and (4) limited cross-species generalisation. While FamNet is an established object counting technique effective for clustered objects and costly data annotation, its effectiveness for small colony sizes and cross-species generalisation remains unknown. To address the first three challenges, we propose ACFamNet, an extension of FamNet that handles small and clustered objects using a novel region of interest pooling with alignment and optimised feature engineering. To address all four challenges above, we introduce ACFamNet Pro, which augments ACFamNet with multi-head attention and residual connections, enabling dynamic weighting of objects and improved gradient flow. Experiments show that ACFamNet Pro achieves a mean normalised absolute error (MNAE) of 9.64% under 5-fold cross-validation, outperforming ACFamNet and FamNet by 2.23% and 12.71%, respectively.

80. 【2604.20027】Cognitive Alignment At No Cost: Inducing Human Attention Biases For Interpretable Vision Transformers

链接https://arxiv.org/abs/2604.20027

作者:Ethan Knights

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:processing diverges substantially, human attentional characteristics, image understanding, attentional characteristics, Vision Transformers

备注

点击查看摘要

Abstract:For state-of-the-art image understanding, Vision Transformers (ViTs) have become the standard architecture but their processing diverges substantially from human attentional characteristics. We investigate whether this cognitive gap can be shrunk by fine-tuning the self-attention weights of Google's ViT-B/16 on human saliency fixation maps. To isolate the effects of semantically relevant signals from generic human supervision, the tuned model is compared against a shuffled control. Fine-tuning significantly improved alignment across five saliency metrics and induced three hallmark human-like biases: tuning reversed the baseline's anti-human large-object bias toward small-objects, amplified the animacy preference and diminished extreme attention entropy. Bayesian parity analysis provides decisive to very-strong evidence that this cognitive alignment comes at no cost to the model's original classification performance on in- (ImageNet), corrupted (ImageNet-C) and out-of-distribution (ObjectNet) benchmarks. An equivalent procedure applied to a ResNet-50 Convolutional Neural Network (CNN) instead degraded both alignment and accuracy, suggesting that the ViT's modular self-attention mechanism is uniquely suited for dissociating spatial priority from representational logic. These findings demonstrate that biologically grounded priors can be instilled as a free emergent property of human-aligned attention, to improve transformer interpretability.

81. 【2604.20026】Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence

链接https://arxiv.org/abs/2604.20026

作者:Minghua Zheng,Na Helian,Peter C. R. Lane,Yi Sun,Allen Donald

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Automatic bacterial colony, highly sought-after technology, modern biological laboratories, Automatic bacterial, manual counting effort

备注: 54 pages, 48 figures

点击查看摘要

Abstract:Automatic bacterial colony counting is a highly sought-after technology in modern biological laboratories because it eliminates manual counting effort. Previous work has observed that MicrobiaNet, currently the best-performing cardinality classification model for colony counting, has difficulty distinguishing colonies of three or more individuals. However, it is unclear if this is due to properties of the data together with inherent characteristics of the MicrobiaNet model. By analysing MicrobiaNet with explainable artificial intelligence (XAI), we demonstrate that XAI can provide insights into how data properties constrain cardinality classification performance in colony counting. Our results show that high visual similarity across classes is the key issue hindering further performance improvement, revising prior assertions about MicrobiaNet. These findings suggest future work should focus on models that explicitly incorporate visual similarity or explore density estimation approaches, with broader implications for neural network classifiers trained on imbalanced datasets.

82. 【2604.20012】EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

链接https://arxiv.org/abs/2604.20012

作者:Yiyang Du,Zhanqiu Guo,Xin Ye,Liu Ren,Chenyan Xiong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:inherit their visual, embodied domain, visual and linguistic, linguistic capabilities, capabilities from Vision-Language

备注

点击查看摘要

Abstract:Vision-Language-Action Models (VLAs) inherit their visual and linguistic capabilities from Vision-Language Models (VLMs), yet most VLAs are built from off-the-shelf VLMs that are not adapted to the embodied domain, limiting their downstream performance. In this work, we propose EmbodiedMidtrain to bridge the gap between VLMs and VLAs. We first characterize the data distribution gap between them, showing that VLA data occupy compact regions that are largely separated from the broader VLM distribution, while the degree of alignment varies substantially both across and within VLM data sources. Then, we build a mid-training data engine that leverages a lightweight learnable proximity estimator to select the most VLA-aligned candidates from a large VLM pool, and mid-trains the VLM on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show that mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and off-the-shelf VLMs trained with larger model scale and training budgets. Further analysis reveals that mid-training provides a stronger initialization for VLA fine-tuning, with gains emerging from the earliest steps and widening throughout training. Moreover, the data engine captures both dataset-level and sample-level alignment signals, favoring spatial reasoning over text-centric tasks while preserving the diversity of the VLM data. We will release all code, data and models for future research.

83. 【2604.20000】RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery

链接https://arxiv.org/abs/2604.20000

作者:Bowen Zhang,Jesse T. Boulerice,Charvi Mendiratta,Nikhil Kuniyil,Satish Kumar,Hila Shamon,B. S. Manjunath

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Automated wildlife monitoring, large-scale expert annotation, rare species, Automated wildlife, imagery is vital

备注

点击查看摘要

Abstract:Automated wildlife monitoring from aerial imagery is vital for conservation but remains limited by two persistent challenges: the difficulty of detecting small, rare species and the high cost of large-scale expert annotation. Prairie dogs exemplify this problem -- they are ecologically important yet appear tiny, sparsely distributed, and visually indistinct from their surroundings, posing a severe challenge for conventional detection models. To overcome these limitations, we present RareSpot+, a detection framework that integrates multi-scale consistency learning, context-aware augmentation, and geospatially guided active learning to address these issues. A novel multi-scale consistency loss aligns intermediate feature maps across detection heads, enhancing localization of small (approx. 30 pixels wide) objects without architectural changes, while context-aware augmentation improves robustness by synthesizing hard, ecologically plausible examples. A geospatial active learning module exploits domain-specific spatial priors linking prairie dogs and burrows, together with test-time augmentation and a meta-uncertainty model, to reduce redundant labeling. On a 2 km^2 aerial dataset, RareSpot+ improves detection over the baseline mAP@50 by +35.2% (absolute +0.13). Cross-dataset tests on HerdNet, AED, and several other wildlife benchmarks demonstrate robust detector-level transferability. The active learning module further boosts prairie dog AP by 14.5% using an annotation budget of just 1.7% of the unlabeled tiles. Beyond detection, RareSpot+ enables spatial ecological analyses such as clustering and co-occurrence, linking vision-based detection with quantitative ecology.

84. 【2604.19999】Optimizing Data Augmentation for Real-Time Small UAV Detection: A Lightweight Context-Aware Approach

链接https://arxiv.org/abs/2604.19999

作者:Amir Zamani(Comprehensive University of the Islamic Revolution),Zeinab Abedini(Sharif University of Technology)

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Unmanned Aerial Vehicles, small physical size, Aerial Vehicles, Unmanned Aerial, Visual detection

备注: Accepted for presentation at the 34th International Conference on Electrical Engineering (ICEE 2026)

点击查看摘要

Abstract:Visual detection of Unmanned Aerial Vehicles (UAVs) is a critical task in surveillance systems due to their small physical size and environmental challenges. Although deep learning models have achieved significant progress, deploying them on edge devices necessitates the use of lightweight models, such as YOLOv11 Nano, which possess limited learning capacity. In this research, an efficient and context-aware data augmentation pipeline, combining Mosaic strategies and HSV color-space adaptation, is proposed to enhance the performance of these models. Experimental results on four standard datasets demonstrate that the proposed approach, compared to heavy and instance-level methods like Copy-Paste, not only prevents the generation of synthetic artifacts and overfitting but also significantly improves mean Average Precision (mAP) across all scenarios. Furthermore, the evaluation of generalization capability under foggy conditions revealed that the proposed method offers the optimal balance between Precision and stability for real-time systems, whereas alternative methods, such as MixUp, are effective only in specific applications.

85. 【2604.19995】A Computational Model of Message Sensation Value in Short Video Multimodal Features that Predicts Sensory and Behavioral Engagement

链接https://arxiv.org/abs/2604.19995

作者:Haoning Xue,Jingwen Zhang,Xiaohui Wang,Diane Dagyong Kim,Yunya Song

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:contemporary media landscape, contemporary media, media landscape, landscape is characterized, characterized by sensational

备注

点击查看摘要

Abstract:The contemporary media landscape is characterized by sensational short videos. While prior research examines the effects of individual multimodal features, the collective impact of multimodal features on viewer engagement with short videos remains unknown. Grounded in the theoretical framework of Message Sensation Value (MSV), this study develops and tests a computational model of MSV with multimodal feature analysis and human evaluation of 1,200 short videos. This model that predicts sensory and behavioral engagement was further validated across two unseen datasets from three short video platforms (combined N = 14,492). While MSV is positively associated with sensory engagement, it shows an inverted U-shaped relationship with behavioral engagement: Higher MSV elicits stronger sensory stimulation, but moderate MSV optimizes behavioral engagement. This research advances the theoretical understanding of short video engagement and introduces a robust computational tool for short video research.

86. 【2604.19989】Online CS-based SAR Edge-Mapping

链接https://arxiv.org/abs/2604.19989

作者:Conor Flynn,Radoslav Ivanov,Birsen Yazici

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Unmanned Aerial Vehicles, small Unmanned Aerial, Automatic Target Recognition, Aerial Vehicles, efficient onboard Automatic

备注: SPIE Defense and Commercial Sensing 2026, Algorithms for Synthetic Aperture Radar Imagery XXXIII

点击查看摘要

Abstract:With modern defense applications increasingly relying on inexpensive, small Unmanned Aerial Vehicles (UAVs), a major challenge lies in designing intelligent and computationally efficient onboard Automatic Target Recognition (ATR) algorithms to carry out operational objectives. This is especially critical in Synthetic Aperture Radar (SAR), where processing techniques such as ATR are often carried out post data collection, requiring onboard systems to bear the memory burden of storing the back-scattered signals. To alleviate this high cost, we propose an online, direct, edge-mapping technique which bypasses the image reconstruction step to classify scenes and targets. Furthermore, by reconstructing the scene as an edge-map we inherently promote sparsity, requiring fewer measurements and computational power than classic SAR reconstruction algorithms such as backprojection.

87. 【2604.19979】Fast Amortized Fitting of Scientific Signals Across Time and Ensembles via Transferable Neural Fields

链接https://arxiv.org/abs/2604.19979

作者:Sophia Zorek,Kushal Vyas,Yuhao Liu,David Lenz,Tom Peterka,Guha Balakrishnan

类目:Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)

关键词:implicit neural representations, modeling continuous geometry, high-dimensional scientific settings, implicit neural, neural representations

备注

点击查看摘要

Abstract:Neural fields, also known as implicit neural representations (INRs), offer a powerful framework for modeling continuous geometry, but their effectiveness in high-dimensional scientific settings is limited by slow convergence and scaling challenges. In this study, we extend INR models to handle spatiotemporal and multivariate signals and show how INR features can be transferred across scientific signals to enable efficient and scalable representation across time and ensemble runs in an amortized fashion. Across controlled transformation regimes (e.g., geometric transformations and localized perturbations of synthetic fields) and high-fidelity scientific domains-including turbulent flows, fluid-material impact dynamics, and astrophysical systems-we show that transferable features improve not only signal fidelity but also the accuracy of derived geometric and physical quantities, including density gradients and vorticity. In particular, transferable features reduce iterations to reach target reconstruction quality by up to an order of magnitude, increase early-stage reconstruction quality by multiple dB (with gains exceeding 10 dB in some cases), and consistently improve gradient-based physical accuracy.

88. 【2604.19976】Lucky High Dynamic Range Smartphone Imaging

链接https://arxiv.org/abs/2604.19976

作者:Baiang Li,Ruyu Yan,Ethan Tseng,Zhoutong Zhang,Adam Finkelstein,Jiawen Chen,Felix Heide

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:sensors remain limited, impressive twenty stops, dynamic range, camera sensors remain, decades of research

备注: 13 pages, 12 figures

点击查看摘要

Abstract:While the human eye can perceive an impressive twenty stops of dynamic range, smartphone camera sensors remain limited to about twelve stops despite decades of research. A variety of high dynamic range (HDR) image capture and processing techniques have been proposed, and, in practice, they can extend the dynamic range by 3-5 stops for handheld photography. This paper proposes an approach that robustly captures dynamic range using a handheld smartphone camera and lightweight networks suitable for running on mobile devices. Our method operates indirectly on linear raw pixels in bracketed exposures. Every pixel in the final HDR image is a convex combination of input pixels in the neighborhood, adjusted for exposure, and thus avoids hallucination artifacts typical of recent deep image synthesis networks. We validate our system on both synthetic imagery and unseen real bracketed images -- we confirm zero-shot generalization of the method to smartphone camera captures. Our iterative inference architecture is capable of processing an arbitrary number of bracketed input photos, and we show examples from capture stacks containing 3--9 images. Our training process relies only on synthetic captures yet generalizes to unseen real photos from several cameras. Moreover, we show that this training scheme improves other SOTA methods over their pretrained counterparts.

89. 【2604.19966】DistortBench: Benchmarking Vision Language Models on Image Distortion Identification

链接https://arxiv.org/abs/2604.19966

作者:Divyanshu Goyal,Akhil Eppa,Vanya Bannihatti Kumar

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:image degradations matters, including content moderation, low-level image degradations, image restoration, degradations matters

备注

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly used in settings where sensitivity to low-level image degradations matters, including content moderation, image restoration, and quality monitoring. Yet their ability to recognize distortion type and severity remains poorly understood. We present DistortBench, a diagnostic benchmark for no-reference distortion perception in VLMs. DistortBench contains 13,500 four-choice questions covering 27 distortion types, six perceptual categories, and five severity levels: 25 distortions inherit KADID-10k calibrations, while two added rotation distortions use monotonic angle-based levels. We evaluate 18 VLMs, including 17 open-weight models from five families and one proprietary model. Despite strong performance on high-level vision-language tasks, the best model reaches only 61.9% accuracy, just below the human majority-vote baseline of 65.7% (average individual: 60.2%), indicating that low-level perceptual understanding remains a major weakness of current VLMs. Our analysis further reveals weak and non-monotonic scaling with model size, performance drops in most base--thinking pairs, and distinct severity-response patterns across model families. We hope DistortBench will serve as a useful benchmark for measuring and improving low-level visual perception in VLMs.

90. 【2604.19954】Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

链接https://arxiv.org/abs/2604.19954

作者:Xinxuan Lu,Charless Fowlkes,Alexander C. Berg

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:precise camera control, provide precise camera, struggle to provide, natural language, Current

备注

点击查看摘要

Abstract:Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories. Our work shows that text-vision latent spaces can be endowed with explicit 3D camera structure, offering a pathway toward geometrically-aware prompts for text-to-image generation. Project page: this https URL

91. 【2604.19945】Visual Reasoning through Tool-supervised Reinforcement Learning

链接https://arxiv.org/abs/2604.19945

作者:Qihua Dong,Gozde Sahin,Pei Wang,Zhaowei Cai,Robik Shrestha,Hao Yang,Davide Modolo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language

备注: Accepted to CVPR 2026 Findings. 17 pages

点击查看摘要

Abstract:In this paper, we investigate the problem of how to effectively master tool-use to solve complex visual reasoning tasks for Multimodal Large Language Models. To achieve that, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, with direct tool supervision for more effective tool-use learning. We focus on a series of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, whose tool supervision is easy to collect. A reinforcement learning curriculum is developed, where the first stage is solely optimized by a set of well motivated tool-specific rewards, and the second stage is trained with the accuracy targeted rewards while allowing calling tools. In this way, tool calling capability is mastered before using tools to complete visual reasoning tasks, avoiding the potential optimization conflict among those heterogeneous tasks. Our experiments have shown that the tool-supervised curriculum training is efficient and ToolsRL can achieve strong tool-use capabilities for complex visual reasoning tasks.

92. 【2604.19941】CrackForward: Context-Aware Severity Stage Crack Synthesis for Data Augmentation

链接https://arxiv.org/abs/2604.19941

作者:Nassim Sadallah,Mohand Saïd Allili

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:structural health monitoring, Reliable crack detection, well-annotated data constitutes, health monitoring, major challenge

备注: 6

点击查看摘要

Abstract:Reliable crack detection and segmentation are vital for structural health monitoring, yet the scarcity of well-annotated data constitutes a major challenge. To address this limitation, we propose a novel context-aware generative framework designed to synthesize realistic crack growth patterns for data augmentation. Unlike existing methods that primarily manipulate textures or background content, CrackForward explicitly models crack morphology by combining directional crack elongation with learned thickening and branching. Our framework integrates two key innovations: (i) a contextually guided crack expansion module, which uses local directional cues and adaptive random walk to simulate realistic propagation paths; and (ii) a two-stage U-Net-style generator that learns to reproduce spatially varying crack characteristics such as thickness, branching, and growth. Experimental results show that the generated samples preserve target-stage saturation and thickness characteristics and improve the performance of several crack segmentation architectures. These results indicate that structure-aware synthetic crack generation can provide more informative training data than conventional augmentation alone.

93. 【2604.19937】Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

链接https://arxiv.org/abs/2604.19937

作者:Palawat Busaranuvong,Reza Saadati Fard,Emmanuel Agu,Deepak Kumar,Shefalika Gautam,Bengisu Tulu,Diane Strong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:visual appearance varies, Assessing chronic wound, Assessing chronic, anatomical locations, imaging conditions

备注

点击查看摘要

Abstract:Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8\% accuracy, 86.4\% sensitivity, and 87.1\% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8\% of rationales as Correct and 32.4\% as Partially Correct.

94. 【2604.19923】UniCon3R: Contact-aware 3D Human-Scene Reconstruction from Monocular Video

链接https://arxiv.org/abs/2604.19923

作者:Tanuj Sur,Shashank Tripathi,Nikos Athanasiou,Ha Linh Nguyen,Kai Xu,Michael J. Black,Angela Yao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:unified feed-forward framework, Unified Contact-aware, unified feed-forward, Unified, feed-forward framework

备注: Project page: [this https URL](https://surtantheta.github.io/UniCon3R)

点击查看摘要

Abstract:We introduce UniCon3R (Unified Contact-aware 3D Reconstruction), a unified feed-forward framework for online human-scene 4D reconstruction from monocular videos. Recent feed-forward methods enable real-time world-coordinate human motion and scene reconstruction, but they often produce physically implausible artifacts such as bodies floating above the ground or penetrating parts of the scene. The key reason is that existing approaches fail to model physical interactions between the human and the environment. A natural next step is to predict human-scene contact as an auxiliary output -- yet we find this alone is not sufficient: contact must actively correct the reconstruction. To address this, we explicitly model interaction by inferring 3D contact from the human pose and scene geometry and use the contact as a corrective cue for generating the final pose. This enables UniCon3R to jointly recover high-fidelity scene geometry and spatially aligned 3D humans within the scene. Experiments on standard human-centric video benchmarks such as RICH, EMDB, 3DPW and SLOPER4D show that UniCon3R outperforms state-of-the-art baselines on physical plausibility and global human motion estimation while achieving real-time online inference. We experimentally demonstrate that contact serves as a powerful internal prior rather than just an external metric, thus establishing a new paradigm for physically grounded joint human-scene reconstruction. Project page is available at this https URL .

95. 【2604.19907】SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

链接https://arxiv.org/abs/2604.19907

作者:Yun He,Kelin Yu,Matthias Zwicker

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent agentic frameworks, Recent agentic, integrating heterogeneous generation, synthesis have advanced, advanced realism

备注

点击查看摘要

Abstract:Recent agentic frameworks for 3D scene synthesis have advanced realism and diversity by integrating heterogeneous generation and editing tools. These tools are organized into workflows orchestrated by an off-the-shelf LLM. Current approaches typically adopt an execute-review-reflect loop: at each step, the orchestrator executes a tool, renders intermediate results for review, and then decides on the tool and its parameters for the next step. However, this design has two key limitations. First, next-step tool selection and parameter configuration are driven by heuristic rules, which can lead to suboptimal execution flows, unnecessary tool invocations, degraded output quality, and increased runtime. Second, rendering and reviewing intermediate results after each step introduces additional latency. To address these issues, we propose SceneOrchestra, a trainable orchestration framework that optimizes the tool-call execution flow and eliminates the step-by-step review loop, improving both efficiency and output quality. SceneOrchestra consists of an orchestrator and a discriminator, which we fine-tune with a two-phase training strategy. In the first phase, the orchestrator learns context-aware tool selection and complete tool-call trajectory generation, while the discriminator is trained to assess the quality of full trajectories, enabling it to select the best trajectory from multiple candidates. In the second phase, we perform interleaved training, where the discriminator adapts to the orchestrator's evolving trajectory distribution and distills its discriminative capability back into the orchestrator. At inference, we only use the orchestrator to generate and execute full tool-call trajectories from instructions, without requiring the discriminator. Extensive experiments show that our method achieves state-of-the-art scene quality while reducing runtime compared to previous work.

96. 【2604.19902】MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

链接https://arxiv.org/abs/2604.19902

作者:Zijie Li,Yichun Shi,Jingxiang Sun,Ye Wang,Yixuan Huang,Zhiyao Guo,Xiaochen Lian,Peihao Zhu,Yu Tian,Zhonghua Zhai,Peng Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:unified framework designed, unified framework, framework designed, MMCORE, present MMCORE

备注

点击查看摘要

Abstract:We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently serve as conditioning signals for a diffusion model. This streamlined design effectively transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models or training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Cite as:
arXiv:2604.19902 [cs.CV]

(or
arXiv:2604.19902v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2604.19902

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
97. 【2604.19888】SGAP-Gaze: Scene Grid Attention Based Point-of-Gaze Estimation Network for Driver Gaze

链接https://arxiv.org/abs/2604.19888

作者:Pavan Kumar Sharma,Pranamesh Chakraborty

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:gaze estimation, Driver gaze estimation, gaze estimation models, driver situational awareness, gaze

备注

点击查看摘要

Abstract:Driver gaze estimation is essential for understanding the driver's situational awareness of surrounding traffic. Existing gaze estimation models use driver facial information to predict the Point-of-Gaze (PoG) or the 3D gaze direction vector. We propose a benchmark dataset, Urban Driving-Face Scene Gaze (UD-FSG), comprising synchronized driver-face and traffic-scene images. The scene images provide cues about surrounding traffic, which can help improve the gaze estimation model, along with the face images. We propose SGAP-Gaze, Scene-Grid Attention based Point-of-Gaze estimation network, trained and tested on our UD-FSG dataset, which explicitly incorporates the scene images into the gaze estimation modelling. The gaze estimation network integrates driver face, eye, iris, and scene contextual information. First, the extracted features from facial modalities are fused to form a gaze intent vector. Then, attention scores are computed over the spatial scene grid using a Transformer-based attention mechanism fusing face and scene image features to obtain the PoG. The proposed SGAP-Gaze model achieves a mean pixel error of 104.73 on the UD-FSG dataset and 63.48 on LBW dataset, achieving a 23.5% reduction in mean pixel error compared to state-of-the-art driver gaze estimation models. The spatial pixel distribution analysis shows that SGAP-Gaze consistently achieves lower mean pixel error than existing methods across all spatial ranges, including the outer regions of the scene, which are rare but critical for understanding driver attention. These results highlight the effectiveness of integrating multi-modal gaze cues with scene-aware attention for a robust driver PoG estimation model in real-world driving environments.

98. 【2604.19858】Wan-Image: Pushing the Boundaries of Generative Visual Intelligence

链接https://arxiv.org/abs/2604.19858

作者:Chaojie Mao,Chen-Wei Xie,Chongyang Zhong,Haoyou Deng,Jiaxing Zhao,Jie Xiao,Jinbo Xing,Jingfeng Zhang,Jingren Zhou,Jingyi Zhang,Jun Dan,Kai Zhu,Kang Zhao,Keyu Yan,Minghui Chen,Pandeng Li,Shuangle Chen,Tong Shen,Yu Liu,Yue Jiang,Yulin Pan,Yuxiang Tuo,Zeyinzi Jiang,Zhen Han,Ang Wang,Bang Zhang,Baole Ai,Bin Wen,Boang Feng,Feiwu Yu,Gang Wang,Haiming Zhao,He Kang,Jianjing Xiang,Jianyuan Zeng,Jinkai Wang,Ke Sun,Linqian Wu,Pei Gong,Pingyu Wu,Ruiwen Wu,Tongtong Su,Wenmeng Zhou,Wenting Shen,Wenyuan Yu,Xianjun Xu,Xiaoming Huang,Xiejie Shen,Xin Xu,Yan Kou,Yangyu Lv,Yifan Zhai,Yitong Huang,Yun Zheng,Yuntao Hong,Zhicheng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:system explicitly engineered, generation system explicitly, professional-grade productivity tools, paradigm-shift image generation, system explicitly

备注

点击查看摘要

Abstract:We present Wan-Image, a unified visual generation system explicitly engineered to paradigm-shift image generation models from casual synthesizers into professional-grade productivity tools. While contemporary diffusion models excel at aesthetic generation, they frequently encounter critical bottlenecks in rigorous design workflows that demand absolute controllability, complex typography rendering, and strict identity preservation. To address these challenges, Wan-Image features a natively unified multi-modal architecture by synergizing the cognitive capabilities of large language models with the high-fidelity pixel synthesis of diffusion transformers, which seamlessly translates highly nuanced user intents into precise visual outputs. It is fundamentally powered by large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and curated reinforcement learning data to surpass basic instruction following and unlock expert-level professional capabilities. These include ultra-long complex text rendering, hyper-diverse portrait generation, palette-guided generation, multi-subject identity preservation, coherent sequential visual generation, precise multi-modal interactive editing, native alpha-channel generation, and high-efficiency 4K synthesis. Across diverse human evaluations, Wan-Image exceeds Seedream 5.0 Lite and GPT Image 1.5 in overall performance, reaching parity with Nano Banana Pro in challenging tasks. Ultimately, Wan-Image revolutionizes visual content creation across e-commerce, entertainment, education, and personal productivity, redefining the boundaries of professional visual synthesis.

99. 【2604.19844】If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

链接https://arxiv.org/abs/2604.19844

作者:Jiamin Chang,Minhui Xue,Ruoxi Sun,Shuchao Pang,Salil S. Kanhere,Hammond Pearce

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Vision-Language Agentic Systems, large vision-language models, embodied Vision-Language Agentic, Agentic Systems, Vision-Language Agentic

备注

点击查看摘要

Abstract:Recent advances in embodied Vision-Language Agentic Systems (VLAS), powered by large vision-language models (LVLMs), enable AI systems to perceive and reason over real-world scenes. Within this context, environmental signals such as traffic lights are essential in-band signals that can and should influence agent behavior. However, similar signals could also be crafted to operate as misleading visual injections, overriding user intent and posing security risks. This duality creates a fundamental challenge: agents must respond to legitimate environmental cues while remaining robust to misleading ones. We refer to this tension as trust boundary confusion. To study this behavior, we design a dual-intent dataset and evaluation framework, through which we show that current LVLM-based agents fail to reliably balance this trade-off, either ignoring useful signals or following harmful ones. We systematically evaluate 7 LVLM agents across multiple embodied settings under both structure-based and noise-based visual injections. To address these vulnerabilities, we propose a multi-agent defense framework that separates perception from decision-making to dynamically assess the reliability of visual inputs. Our approach significantly reduces misleading behaviors while preserving correct responses and provides robustness guarantees under adversarial perturbations. The code of the evaluation framework and artifacts are made available at this https URL.

100. 【2604.19839】Environmental Understanding Vision-Language Model for Embodied Agent

链接https://arxiv.org/abs/2604.19839

作者:Jinsik Bang,Jaeyeon Bae,Donggyu Lee,Siyeol Jung,Taehwan Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Vision-language models, Understanding Embodied Agent, shown strong perception, instruction-following embodied agents, Environmental Understanding Embodied

备注: CVPR Findings 2026, Project Page: [this https URL](https://eu-ea.github.io)

点击查看摘要

Abstract:Vision-language models (VLMs) have shown strong perception and reasoning abilities for instruction-following embodied agents. However, despite these abilities and their generalization performance, they still face limitations in environmental understanding, often failing on interactions or relying on environment metadata during execution. To address this challenge, we propose a novel framework named Environmental Understanding Embodied Agent (EUEA), which fine-tunes four core skills: 1) object perception for identifying relevant objects, 2) task planning for generating interaction subgoals, 3) action understanding for judging success likelihood, and 4) goal recognition for determining goal completion. By fine-tuning VLMs with EUEA skills, our framework enables more reliable task execution for instruction-following. We further introduce a recovery step that leverages these core skills and a group relative policy optimization (GRPO) stage that refines inconsistent skill predictions. The recovery step samples alternative actions to correct failure cases, and the GRPO stage refines inconsistent skill predictions. Across ALFRED tasks, our VLM significantly outperforms a behavior-cloning baseline, achieving an 8.86% improvement in average success rate. The recovery and GRPO stages provide an additional 3.03% gain, further enhancing overall performance. Finally, our skill-level analyses reveal key limitations in the environmental understanding of closed- and open-source VLMs and identify the capabilities necessary for effective agent-environment interaction.

101. 【2604.19834】KD-Judge: A Knowledge-Driven Automated Judge Framework for Functional Fitness Movements on Edge Devices

链接https://arxiv.org/abs/2604.19834

作者:Shaibal Saha,Fan Li,Yunge Li,Arun Iyengar,Lucas Alves,Lanyu Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:health-oriented exercise programs, consistently enforcing repetition, remains challenging due, Functional fitness movements, standards remains challenging

备注: Accepted at IEEE/ACM CHASE 2026

点击查看摘要

Abstract:Functional fitness movements are widely used in training, competition, and health-oriented exercise programs, yet consistently enforcing repetition (rep) standards remains challenging due to subjective human judgment, time constraints, and evolving rules. Existing AI-based approaches mainly rely on learned scoring or reference-based comparisons and lack explicit rule-based, limiting transparency and deterministic rep-level validation. To address these limitations, we propose KD-Judge, a novel knowledge-driven automated judging framework for functional fitness movements. It converts unstructured rulebook standards into executable, machine-readable representations using an LLM-based retrieval-augmented generation and chain-of-thought rule-structuring pipeline. The structured rules are then incorporated by a deterministic rule-based judging system with pose-guided kinematic reasoning to assess rep validity and temporal boundaries. To improve efficiency on edge devices, including a high-performance desktop and the resource-constrained Jetson AGX Xavier, we introduce a dual strategy caching mechanism that can be selectively applied to reduce redundant and unnecessary computation. Experiments demonstrate reliable rule-structuring performance and accurate rep-level assessment, with judgment evaluation conducted on the CFRep dataset, achieving faster-than-real-time execution (real-time factor (RTF) 1). When the proposed caching strategy is enabled, the system achieves up to 3.36x and 15.91x speedups on resource-constrained edge device compared to the non-caching baseline for pre-recorded and live-streaming scenarios, respectively. These results show that KD-Judge enables transparent, efficient, and scalable rule-grounded rep-level analysis that can complement human judging in practice.

102. 【2604.19829】actileEval: A Step Towards Automated Fine-Grained Evaluation and Editing of Tactile Graphics

链接https://arxiv.org/abs/2604.19829

作者:Adnan Khan,Abbas Akkasi,Majid Komeili

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Tactile graphics require, actionable repair signal, graphics require careful, Tactile graphics, require careful expert

备注: Code, data, and models are available at [this https URL](https://TactileEval.github.io/)

点击查看摘要

Abstract:Tactile graphics require careful expert validation before reaching blind and visually impaired (BVI) learners, yet existing datasets provide only coarse holistic quality ratings that offer no actionable repair signal. We present TactileEval, a three-stage pipeline that takes a first step toward automating this process. Drawing on expert free-text comments from the TactileNet dataset, we establish a five-category quality taxonomy; encompassing view angle, part completeness, background clutter, texture separation, and line quality aligned with BANA standards. We subsequently gathered 14,095 structured annotations via Amazon Mechanical Turk, spanning 66 object classes organized into six distinct families. A reproducible ViT-L/14 feature probe trained on this data achieves 85.70% overall test accuracy across 30 different tasks, with consistent difficulty ordering suggesting the taxonomy suggesting the taxonomy captures meaningful perceptual structure. Building on these evaluations, we present a ViT-guided automated editing pipeline that routes classifier scores through family-specific prompt templates to produce targeted corrections via gpt-image-1 image editing. Code, data, and models are available at this https URL

103. 【2604.19823】Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning

链接https://arxiv.org/abs/2604.19823

作者:Khalil Akremi,Mariem Handous,Zied Bouslama,Farah Bassalah,Maryem Jebali,Mariem Hanachi,Ines Abdeljaoued-Tej

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:African and Asian, Asian countries, effective epidemiological surveillance, epidemiological surveillance, remains a major

备注: This work has been accepted for publication in ICMI IEEE Conference (04/2026)

点击查看摘要

Abstract:Rabies remains a major public health concern across many African and Asian countries, where accurate diagnosis is critical for effective epidemiological surveillance. The gold standard diagnostic methods rely heavily on fluorescence microscopy, necessitating skilled laboratory personnel for the accurate interpretation of results. Such expertise is often scarce, particularly in regions with low annual sample volumes. This paper presents an automated, AI-driven diagnostic system designed to address these challenges. We developed a robust pipeline utilizing fluorescent image analysis through transfer learning with four deep learning architectures: EfficientNetB0, EfficientNetB2, VGG16, and Vision Transformer (ViTB16). Three distinct data augmentation strategies were evaluated to enhance model generalization on a dataset of 155 microscopic images (123 positive and 32 negative). Our results demonstrate that TrivialAugmentWide was the most effective augmentation technique, as it preserved critical fluorescent patterns while improving model robustness. The EfficientNetB0 model, utilizing Geometric Color augmentation and selected through stratified 3fold cross-validation, achieved optimal classification performance on cropped images. Despite constraints posed by class imbalance and a limited dataset size, this work confirms the viability of deep learning for automating rabies diagnosis. The proposed method enables fast and reliable detection with significant potential for further optimization. An online tool was deployed to facilitate practical access, establishing a framework for future medical imaging applications. This research underscores the potential of optimized deep learning models to transform rabies diagnostics and improve public health outcomes.

104. 【2604.19798】Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics

链接https://arxiv.org/abs/2604.19798

作者:Xinxin Zhuo,Mengyuan Niu,Ruizhe Wang,Junyan Yang,Qiao Wang

类目:Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Econometrics (econ.EM)

关键词:Micro-scale street-level economic, street-level economic assessment, spatial resource allocation, Micro-scale street-level, Street View Imagery

备注: Submitted to ACM Transactions on Spatial Computing. This paper is currently under review

点击查看摘要

Abstract:Micro-scale street-level economic assessment is fundamental for precision spatial resource allocation. While Street View Imagery (SVI) advances urban sensing, existing approaches remain semantically superficial and overlook brand hierarchy heterogeneity and structural recession. To address this, we propose a visual-semantic and field-based spatiotemporal framework, operationalized via the Street Economic Vitality Index (SEVI). Our approach integrates physical and semantic streetscape parsing through instance segmentation of signboards, glass interfaces, and storefront closures. A dual-stage VLM-LLM pipeline standardizes signage into global hierarchies to quantify a spatially smoothed brand premium index. To overcome static SVI limitations, we introduce a temporal lag design using Location-Based Services (LBS) data to capture realized demand. Combined with a category-weighted Gaussian spillover model, we construct a three-dimensional diagnostic system covering Commercial Activity, Spatial Utilization, and Physical Environment. Experiments based on time-lagged geographically weighted regression across eight tidal periods in Nanjing reveal quasi-causal spatiotemporal heterogeneity. Street vibrancy arises from interactions between hierarchical brand clustering and mall-induced externalities. High-quality interfaces show peak attraction during midday and evening, while structural recession produces a lagged nighttime repulsion effect. The framework offers evidence-based support for precision spatial governance.

Comments:
Submitted to ACM Transactions on Spatial Computing. This paper is currently under review

Subjects:

Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Econometrics (econ.EM)

Cite as:
arXiv:2604.19798 [cs.CY]

(or
arXiv:2604.19798v1 [cs.CY] for this version)

https://doi.org/10.48550/arXiv.2604.19798

Focus to learn more

              arXiv-issued DOI via DataCite</p>
105. 【2604.19770】Hybrid Multi-Phase Page Matching and Multi-Layer Diff Detection for Japanese Building Permit Document Review

链接https://arxiv.org/abs/2604.19770

作者:Mitsumasa Wada

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Japanese building permit, permit document sets, comparison of Japanese, hybrid multi-phase page, PDF document sets

备注: 9 pages, 3 figures

点击查看摘要

Abstract:We present a hybrid multi-phase page matching algorithm for automated comparison of Japanese building permit document sets. Building permit review in Japan requires cross-referencing large PDF document sets across revision cycles, a process that is labor-intensive and error-prone when performed manually. The algorithm combines longest common subsequence (LCS) structural alignment, a seven-phase consensus matching pipeline, and a dynamic programming optimal alignment stage to robustly pair pages across revisions even when page order, numbering, or content changes substantially. A subsequent multi-layer diff engine -- comprising text-level, table-level, and pixel-level visual differencing -- produces highlighted difference reports. Evaluation on real-world permit document sets achieves F1=0.80 and precision=1.00 on a manually annotated ground-truth benchmark, with zero false-positive matched pairs.

106. 【2604.20154】Maximum Likelihood Reconstruction for Multi-Look Digital Holography with Markov-Modeled Speckle Correlation

链接https://arxiv.org/abs/2604.20154

作者:Xi Chen,Arian Maleki,Shirin Jalali

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:coherent imaging systems, reducing speckle noise, digital holography, widely used strategy, strategy for reducing

备注

点击查看摘要

Abstract:Multi-look acquisition is a widely used strategy for reducing speckle noise in coherent imaging systems such as digital holography. By acquiring multiple measurements, speckle can be suppressed through averaging or joint reconstruction, typically under the assumption that speckle realizations across looks are statistically independent. In practice, however, hardware constraints limit measurement diversity, leading to inter-look correlation that degrades the performance of conventional methods. In this work, we study the reconstruction of speckle-free reflectivity from complex-valued multi-look measurements in the presence of correlated speckle. We model the inter-look dependence using a first-order Markov process and derive the corresponding likelihood under a first-order Markov approximation, resulting in a constrained maximum likelihood estimation problem. To solve this problem, we develop an efficient projected gradient descent framework that combines gradient-based updates with implicit regularization via deep image priors, and leverages Monte Carlo approximation and matrix-free operators for scalable computation. Simulation results demonstrate that the proposed approach remains robust under strong inter-look correlation, achieving performance close to the ideal independent-look scenario and consistently outperforming methods that ignore such dependencies. These results highlight the importance of explicitly modeling inter-look correlation and provide a practical framework for multi-look holographic reconstruction under realistic acquisition conditions. Our code is available at: this https URL.