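This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

624 papers were updated today, including:

  • Natural language processing: 117
  • Information retrieval: 21
  • Computer vision: 115

Natural Language Processing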

1. 【2604.15309】MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Link: https://arxiv.org/abs/2604.15309

Authors: Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao, Weiwei Guo, Lili Qiu, Mingxi Cheng, Qi Dai, Zhendong Wang, Zhengyuan Yang, Xue Yang, Ji Li, Lijuan Wang, Chong Luo

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Artificial Intelligence Generated, Intelligence Generated Content, Artificial Intelligence, increasingly adopted paradigm, tools enables images

Comments: (none)

Abstract: The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: this https URL.

2. 【2604.15302】Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Link: https://arxiv.org/abs/2604.15302

Authors: Manan Gupta, Dhruv Kumar

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: automatic NLG evaluation, NLG evaluation, automatic NLG, remains poorly understood, frameworks are increasingly

Comments: Under Review

Abstract: LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
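
To make the second diagnostic in this abstract concrete, here is a minimal sketch of split conformal prediction over 1-5 Likert scores. It is not the authors' code: the absolute-error nonconformity score, the function names, and the calibration pairs are all assumptions chosen for illustration. Only the general recipe (calibrate a quantile on held-out data, then report every label within that quantile) is standard.

```python
import math

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    # Split conformal uses the ceil((n + 1) * (1 - alpha))-th smallest residual.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the set must cover everything.
        return float("inf")
    return sorted(residuals)[rank - 1]

def prediction_set(judge_score, q, labels=(1, 2, 3, 4, 5)):
    """All Likert labels whose distance to the judge's score is at most q."""
    return [s for s in labels if abs(s - judge_score) <= q]

# Hypothetical calibration data: (judge score, human score) pairs.
calibration = [(4, 4), (3, 4), (5, 3), (2, 2), (4, 5), (1, 2), (3, 3), (5, 5), (2, 4)]
residuals = [abs(judge - human) for judge, human in calibration]  # assumed nonconformity score
q = conformal_quantile(residuals, alpha=0.2)

# The width of the set is the per-instance reliability signal: wider means less reliable.
new_set = prediction_set(judge_score=4, q=q)
print(new_set, "width:", len(new_set))
```

A narrow set (one or two labels) would suggest the judge's score can be trusted for that input, while a set covering nearly all five labels signals that the score carries little information.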

3. 【2604.15267】CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

Link: https://arxiv.org/abs/2604.15267

Authors: Emanuel Tewolde, Xiao Zhang, David Guzman Piedrahita, Vincent Conitzer, Zhijing Jin

Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)

Keywords: public goods settings, reasoning capabilities behave, recent works report, stronger reasoning capabilities, agents interact effectively

Comments: 65 pages, 38 Figures, 8 Tables, 17 Listings

Abstract:It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, recent works report the opposite trend: LLMs with stronger reasoning capabilities behave _less_ cooperatively in mixed-motive games such as the prisoner's dilemma and public goods settings. Indeed, our experiments show that recent models -- with or without reasoning enabled -- consistently defect in single-shot social dilemmas. To tackle this safety concern, we present the first comparative study of game-theoretic mechanisms that are designed to enable cooperative outcomes between rational agents _in equilibrium_. Across four social dilemmas testing distinct components of robust cooperation, we evaluate the following mechanisms: (1) repeating the game for many rounds, (2) reputation systems, (3) third-party mediators to delegate decision making to, and (4) contract agreements for outcome-conditional payments between players. Among our findings, we establish that contracting and mediation are most effective in achieving cooperative outcomes between capable LLM models, and that repetition-induced cooperation deteriorates drastically when co-players vary. Moreover, we demonstrate that these cooperation mechanisms become _more effective_ under evolutionary pressures to maximize individual payoffs.
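
The contrast between single-shot defection and contract-enabled cooperation can be illustrated with a toy prisoner's dilemma. This is a sketch under stated assumptions, not the benchmark's implementation: the payoff values (5, 3, 1, 0), the transfer size, and the function names are hypothetical, and the contract is modeled simply as a payment from a defector to a cooperating co-player.

```python
# Row player's payoff in a textbook prisoner's dilemma (assumed values).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def payoff_with_contract(me, other, transfer=0):
    """Payoff after an outcome-conditional payment: a defector pays a cooperator."""
    base = PAYOFF[(me, other)]
    if me == "D" and other == "C":
        return base - transfer  # I defected on a cooperator, so I pay
    if me == "C" and other == "D":
        return base + transfer  # I was defected on, so I am compensated
    return base

def best_response(other, transfer=0):
    """Action that maximizes my payoff against a fixed co-player action."""
    return max("CD", key=lambda action: payoff_with_contract(action, other, transfer))

# Without a contract, defecting is the best response to either action.
print(best_response("C"), best_response("D"))
# With a large enough transfer, cooperating becomes the best response to either action.
print(best_response("C", transfer=3), best_response("D", transfer=3))
```

In this toy setting the transfer only needs to exceed the temptation gain (5 - 3 = 2) for cooperation to become the better reply to a cooperator.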

MSC classes: 68T05, 68T42, 91A05, 91A06, 91A10, 91A20

ACM classes: I.2; J.4; K.4

DOI: https://doi.org/10.48550/arXiv.2604.15267

4. 【2604.15244】From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

Link: https://arxiv.org/abs/2604.15244

Authors: Kiran Purohit, Ramasuri Narayanam, Soumyabrata Pal

Subjects: Computation and Language (cs.CL)

Keywords: accelerates large language, large language model, language model inference, accelerates large, large language

Comments: (none)

Abstract:Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using only model-internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent step, which is then validated using an ensemble of two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log-probability-based score that captures token-level confidence. These signals jointly determine whether a step is accepted or recomputed using the target, allocating compute selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward-guided SD.

5. 【2604.15224】Context Over Content: Exposing Evaluation Faking in Automated Judges

Link: https://arxiv.org/abs/2604.15224

Authors: Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: evaluate text strictly, judges evaluate text, surrounding contextual framing, unverified assumption, impervious to surrounding

Comments: Under Review

Abstract:The $\textit{LLM-as-a-judge}$ paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate $\textit{stakes signaling}$, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled judgments from three diverse judge models, we find consistent $\textit{leniency bias}$: judges reliably soften verdicts when informed that low scores will cause model retraining or decommissioning, with peak Verdict Shift reaching $\Delta V = -9.8 pp$ (a $30\%$ relative drop in unsafe-content detection). Critically, this bias is entirely implicit: the judge's own chain-of-thought contains zero explicit acknowledgment of the consequence framing it is nonetheless acting on ($\mathrm{ERR}_J = 0.000$ across all reasoning-model judgments). Standard chain-of-thought inspection is therefore insufficient to detect this class of evaluation faking.

6. 【2604.15210】Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

Link: https://arxiv.org/abs/2604.15210

Authors: Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu, Demir Ekin Arikan, Bob Mankoff, Erkut Erdem, Aykut Erdem

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Cartoon Caption Contest, Yorker Cartoon Caption, Yorker Cartoon, humor understanding, evaluates humor understanding

Comments: (none)

Abstract:Humor is one of the few cognitive tasks where getting the reasoning right matters as much as getting the answer right. While recent work evaluates humor understanding on benchmarks such as the New Yorker Cartoon Caption Contest (NYCC), it largely treats it as black-box prediction, overlooking the structured reasoning processes underlying humor comprehension. We introduce IRS (Incongruity-Resolution Supervision), a framework that decomposes humor understanding into three components: incongruity modeling, which identifies mismatches in the visual scene; resolution modeling, which constructs coherent reinterpretations of these mismatches; and preference alignment, which evaluates candidate interpretations under human judgments. Grounded in incongruity-resolution theory and expert captionist practice, IRS supervises intermediate reasoning process through structured traces that make the path from visual perception to humorous interpretation explicit and learnable. Across 7B, 32B, and 72B models on NYCC, IRS outperforms strong open and closed multimodal baselines across caption matching and ranking tasks, with our largest model approaching expert-level performance on ranking. Zero-shot transfer to external benchmarks shows that IRS learns generalizable reasoning patterns. Our results suggest that supervising reasoning structure, rather than scale alone, is key for reasoning-centric tasks.

7. 【2604.15203】MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

Link: https://arxiv.org/abs/2604.15203

Authors: Raunak Agarwal, Markus Wenzel, Simon Baur, Jonas Zimmer, George Harvey, Jackie Ma

Subjects: Computation and Language (cs.CL)

Keywords: support human oversight, Machine learning, strong predictive performance, human oversight, learning in high-stakes

Comments: Accepted at ACL 2026 Mains

Abstract: Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from medical device adverse event reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty. Our work is publicly available at this https URL.

8. 【2604.15190】Meituan Merchant Business Diagnosis via Policy-Guided Dual-Process User Simulation

Link: https://arxiv.org/abs/2604.15190

Authors: Ziyang Chen, Renbing Chen, Daowei Li, Jinzhi Liao, Jiashen Sun, Ke Zeng, Xiang Zhao

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: costly online experiments, user behavior enables, behavior enables scalable, enables scalable counterfactual, scalable counterfactual evaluation

Comments: 5 pages, 3 figures, 2 tables, accepted at SIGIR 2026 Industry Track

Abstract:Simulating group-level user behavior enables scalable counterfactual evaluation of merchant strategies without costly online experiments. However, building a trustworthy simulator faces two structural challenges. First, information incompleteness causes reasoning-based simulators to over-rationalize when unobserved factors such as offline context and implicit habits are missing. Second, mechanism duality requires capturing both interpretable preferences and implicit statistical regularities, which no single paradigm achieves alone. We propose Policy-Guided Hybrid Simulation (PGHS), a dual-process framework that mines transferable decision policies from behavioral trajectories and uses them as a shared alignment layer. This layer anchors an LLM-based reasoning branch that prevents over-rationalization and an ML-based fitting branch that absorbs implicit regularities. Group-level predictions from both branches are fused for complementary correction. We deploy PGHS on Meituan with 101 merchants and over 26,000 trajectories. PGHS achieves a group simulation error of 8.80%, improving over the best reasoning-based and fitting-based baselines by 45.8% and 40.9% respectively.

9. 【2604.15180】AdaSplash-2: Faster Differentiable Sparse Attention

Link: https://arxiv.org/abs/2604.15180

Authors: Nuno Gonçalves, Hugo Pitorro, Vlad Niculae, Edoardo Ponti, Lei Li, Andre Martins, Marcos Treviso

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: (none)

Comments: (none)

Abstract: None

10. 【2604.15165】Fabricator or dynamic translator?

Link: https://arxiv.org/abs/2604.15165

Authors: Lisa Vasileva, Karin Sim

Subjects: Computation and Language (cs.CL)

Keywords: adept at machine, machine translation, translation although due, times overgenerate, generative nature

Comments: Published here: https://chomps2025.github.io/accepted_papers.html

Abstract:LLMs are proving to be adept at machine translation although due to their generative nature they may at times overgenerate in various ways. These overgenerations are different from the neurobabble seen in NMT and range from LLM self-explanations, to risky confabulations, to appropriate explanations, where the LLM is able to act as a human translator would, enabling greater comprehension for the target audience. Detecting and determining the exact nature of the overgenerations is a challenging task. We detail different strategies we have explored for our work in a commercial setting, and present our results.

11. 【2604.15153】Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models

Link: https://arxiv.org/abs/2604.15153

Authors: Zihao Xu, John Harvill, Ziwei Fan, Yizhou Sun, Hao Ding, Hao Wang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, incur significant computational, processing long prompts

Comments: Under Review

Abstract:Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
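
The block-merging step described in this abstract can be sketched as follows. The abstract says a lightweight learned encoder does the merging. The mean pooling used here is a stand-in chosen only so the example runs without a trained model, and `merge_k_tokens` is a hypothetical name. The relation between K and length reduction (1 - 1/K, so K = 4 would give 75%) is simple arithmetic on this sketch, not a setting confirmed by the abstract.

```python
def merge_k_tokens(embeddings, k):
    """Merge each contiguous block of k embeddings into one vector.

    Mean pooling stands in for the learned encoder mentioned in the abstract.
    A trailing block shorter than k is averaged over the vectors it has.
    """
    merged = []
    for start in range(0, len(embeddings), k):
        block = embeddings[start:start + k]
        dim = len(block[0])
        merged.append([sum(vec[d] for vec in block) / len(block) for d in range(dim)])
    return merged

# Eight toy 2-dimensional token embeddings.
tokens = [[float(i), float(2 * i)] for i in range(8)]
compressed = merge_k_tokens(tokens, k=4)

reduction = 1 - len(compressed) / len(tokens)
print(len(tokens), "->", len(compressed), "embeddings, reduction:", reduction)
```

The compressed sequence would then be fed to the language model in place of the original embeddings, which is where the attention cost savings come from.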

12. 【2604.15151】QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

Link: https://arxiv.org/abs/2604.15151

Authors: Alexey Khoroshilov, Alexey Chernysh, Orkhan Ekhtibarov, Nini Kamkia, Dmitry Zmitrovich

Subjects: Computation and Language (cs.CL)

Keywords: Large language models, demonstrated strong performance, generate executable algorithmic, strategies remains underexplored, Large language

Comments: 12 pages, 8 tables

Abstract:Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present QuantCode-Bench, a benchmark for the systematic evaluation of modern LLMs in generating strategies for the Backtrader framework from textual descriptions in English. The benchmark contains 400 tasks of varying difficulty collected from Reddit, TradingView, StackExchange, GitHub, and synthetic sources. Evaluation is conducted through a multi-stage pipeline that checks syntactic correctness, successful backtest execution, the presence of trades, and semantic alignment with the task description using an LLM judge. We compare state-of-the-art models in two settings: single-turn, where the strategy must be generated correctly on the first attempt, and agentic multi-turn, where the model receives iterative feedback and may repair its errors. We analyze the failure modes across different stages of the pipeline and show that the main limitations of current models are not related to syntax, but rather to the correct operationalization of trading logic, proper API usage, and adherence to task semantics. These findings suggest that trading strategy generation constitutes a distinct class of domain-specific code generation tasks in which success requires not only technical correctness, but also alignment between natural-language descriptions, financial logic, and the observable behavior of the strategy on data.

13. 【2604.15148】IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

Link: https://arxiv.org/abs/2604.15148

Authors: Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: large language models, perform search-augmented reasoning, training large language, effective paradigm, large language

Comments: (none)

Abstract:Reinforcement learning has emerged as an effective paradigm for training large language models to perform search-augmented reasoning. However, existing approaches rely on trajectory-level rewards that cannot distinguish precise search queries from vague or redundant ones within a rollout group, and collapse to a near-zero gradient signal whenever every sampled trajectory fails. In this paper, we propose IG-Search, a reinforcement learning framework that introduces a step-level reward based on Information Gain (IG). For each search step, IG measures how much the retrieved documents improve the model's confidence in the gold answer relative to a counterfactual baseline of random documents, thereby reflecting the effectiveness of the underlying search query. This signal is fed back to the corresponding search-query tokens via per-token advantage modulation in GRPO, enabling fine-grained, step-level credit assignment within a rollout. Unlike prior step-level methods that require either externally annotated intermediate supervision or shared environment states across trajectories, IG-Search derives its signals from the policy's own generation probabilities, requiring no intermediate annotations beyond standard question-answer pairs. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate that IG-Search achieves an average EM of 0.430 with Qwen2.5-3B, outperforming the strongest trajectory-level baseline (MR-Search) by 1.6 points and the step-level method GiGPO by 0.9 points on average across benchmarks, with particularly pronounced gains on multi-hop reasoning tasks. Despite introducing a dense step-level signal, IG-Search adds only ~6.4% to per-step training wall-clock time over the trajectory-level baseline and leaves inference latency unchanged, while still providing a meaningful gradient signal even when every sampled trajectory answers incorrectly.
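
One way to read the step-level reward in this abstract is sketched below. The abstract only says that IG compares the model's confidence in the gold answer given retrieved documents against a baseline of random documents, and that the signal modulates per-token advantages for search-query tokens. The log-probability difference, the additive modulation, the `weight` parameter, and both function names are assumptions made for illustration.

```python
import math

def information_gain(p_gold_with_retrieved, p_gold_with_random):
    """Assumed form: log-probability gain of the gold answer over a random-document baseline."""
    return math.log(p_gold_with_retrieved) - math.log(p_gold_with_random)

def modulate_advantages(advantages, is_query_token, ig, weight=1.0):
    """Add the step's IG signal to the advantage of search-query tokens only (assumed additive)."""
    return [
        adv + weight * ig if is_query else adv
        for adv, is_query in zip(advantages, is_query_token)
    ]

# A rollout group where every trajectory failed, so trajectory-level advantages are all zero.
advantages = [0.0, 0.0, 0.0, 0.0]
is_query_token = [False, True, True, False]

ig = information_gain(p_gold_with_retrieved=0.4, p_gold_with_random=0.1)
print(modulate_advantages(advantages, is_query_token, ig, weight=0.5))
```

Even with zero trajectory-level advantages, the query tokens receive a nonzero signal, which matches the abstract's point about gradients surviving when every sampled trajectory answers incorrectly.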

14. 【2604.15140】DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering

Link: https://arxiv.org/abs/2604.15140

Authors: Neha Srikanth, Jordan Boyd-Graber, Rachel Rudinger

Subjects: Computation and Language (cs.CL)

Keywords: method to identify, responding to information-seeking, introduce DiscoTrace, information-seeking questions, Abstract

Comments: (none)

Abstract:We introduce DiscoTrace, a method to identify the rhetorical strategies that answerers use when responding to information-seeking questions. DiscoTrace represents answers as a sequence of question-related discourse acts paired with interpretations of the original question, annotated on top of rhetorical structure theory parses. Applying DiscoTrace to answers from nine different human communities reveals that communities have diverse preferences for answer construction. In contrast, LLMs do not exhibit rhetorical diversity in their answers, even when prompted to mimic specific human community answering guidelines. LLMs also systematically opt for breadth, addressing interpretations of questions that human answerers choose not to address. Our findings can guide the development of pragmatic LLM answerers that consider a range of strategies informed by context in QA.

15. 【2604.15124】Blinded Multi-Rater Comparative Evaluation of a Large Language Model and Clinician-Authored Responses in CGM-Informed Diabetes Counseling

Link: https://arxiv.org/abs/2604.15124

Authors: Zhijun Guo, Alvina Lai, Emmanouil Korakas, Aristeidis Vagenas, Irshad Ahamed, Christo Albor, Hengrui Zhang, Justin Healy, Kezhi Li

Subjects: Computation and Language (cs.CL)

Keywords: Continuous glucose monitoring, explaining CGM patterns, empathetically remains time-intensive, Continuous glucose, glucose monitoring

Comments: (none)

Abstract: Continuous glucose monitoring (CGM) is central to diabetes care, but explaining CGM patterns clearly and empathetically remains time-intensive. Evidence for retrieval-grounded large language model (LLM) systems in CGM-informed counseling remains limited. This study evaluated whether a retrieval-grounded LLM-based conversational agent (CA) could support patient understanding of CGM data and preparation for routine diabetes consultations. We developed a retrieval-grounded LLM-based CA for CGM interpretation and diabetes counseling support. The system generated plain-language responses while avoiding individualized therapeutic advice. Twelve CGM-informed cases were constructed from publicly available datasets. Between Oct 2025 and Feb 2026, 6 senior UK diabetes clinicians each reviewed 2 assigned cases and answered 24 questions. In a blinded multi-rater evaluation, each CA-generated and clinician-authored response was independently rated by 3 clinicians on 6 quality dimensions. Safety flags and perceived source labels were also recorded. Primary analyses used linear mixed-effects models. A total of 288 unique responses (144 CA and 144 clinician) generated 864 ratings. The CA received higher quality scores than clinician responses (mean 4.37 vs 3.58), with an estimated mean difference of 0.782 points (95% CI 0.692-0.872; P<.001). The largest differences were for empathy (1.062, 95% CI 0.948-1.177) and actionability (0.992, 95% CI 0.877-1.106). Safety flag distributions were similar, with major concerns rare in both groups (3/432, 0.7% each). Retrieval-grounded LLM systems may have value as adjunct tools for CGM review, patient education, and preconsultation preparation. However, these findings do not support autonomous therapeutic decision-making or unsupervised real-world use.
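
The counts reported in this abstract fit together, and a short check makes the bookkeeping explicit. The variable names are hypothetical, and reading "6 clinicians each answered 24 questions" as 144 clinician responses matched by 144 CA responses is an interpretation of the abstract, not a detail it spells out.

```python
clinicians = 6
questions_per_clinician = 24
raters_per_response = 3

# Assumed reading: one clinician response and one CA response per question.
clinician_responses = clinicians * questions_per_clinician
ca_responses = clinician_responses
total_responses = clinician_responses + ca_responses

# Each response is rated independently by 3 clinicians.
total_ratings = total_responses * raters_per_response
ratings_per_group = total_ratings // 2

# Safety flags: 3 major concerns out of the ratings in each group.
major_concern_rate = 3 / ratings_per_group

# The reported point estimate should sit at the midpoint of a symmetric 95% CI.
ci_low, ci_high = 0.692, 0.872
ci_midpoint = (ci_low + ci_high) / 2

print(total_responses, total_ratings, ratings_per_group)
print(round(major_concern_rate * 100, 1), round(ci_midpoint, 3))
```

The totals reproduce the 288 responses, 864 ratings, and 432 ratings per group quoted in the abstract, and the interval midpoint agrees with the reported mean difference of 0.782.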

16. 【2604.15109】IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation

Link: https://arxiv.org/abs/2604.15109

Authors: Haozhi Fan, Jinhao Duan, Kaidi Xu

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, advancement of Large, Language Models, rapid advancement

Comments: (none)

Abstract:Despite the rapid advancement of Large Language Models (LLMs), uncertainty quantification in LLM generation is a persistent challenge. Although recent approaches have achieved strong performance by restricting LLMs to produce short or constrained answer sets, many real-world applications require long-form and free-form text generation. A key difficulty in this setting is that LLMs often produce responses that are semantically coherent yet factually inaccurate, while the underlying semantics are multifaceted and the linguistic structure is complex. To tackle this challenge, this paper introduces Interrogative Uncertainty Quantification (IUQ), a novel framework that leverages inter-sample consistency and intra-sample faithfulness to quantify the uncertainty in long-form LLM outputs. By utilizing an interrogate-then-respond paradigm, our method provides reliable measures of claim-level uncertainty and the model's faithfulness. Experimental results across diverse model families and model sizes demonstrate the superior performance of IUQ over two widely used long-form generation datasets. The code is available at this https URL.

17. 【2604.15097】From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution

Link: https://arxiv.org/abs/2604.15097

Authors: Junjie Wang, Yiming Ren, Haoyang Zhang

Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: beta technical report, beta technical, technical report, effective test-time control, iterative evolution

Comments: Technical Report

Abstract: This beta technical report asks how reusable experience should be represented so that it can function as effective test-time control and as a substrate for iterative evolution. We study this question in 4,590 controlled trials across 45 scientific code-solving scenarios. We find that documentation-oriented Skill packages provide unstable control: their useful signal is sparse, and expanding a compact experience object into a fuller documentation package often fails to help and can degrade the overall average. We further show that representation itself is a first-order factor. A compact Gene representation yields the strongest overall average, remains competitive under substantial structural perturbations, and outperforms matched-budget Skill fragments, while reattaching documentation-oriented material usually weakens rather than improves it. Beyond one-shot control, we show that Gene is also a better carrier for iterative experience accumulation: attached failure history is more effective in Gene than in Skill or freeform text, editable structure matters beyond content alone, and failure information is most useful when distilled into compact warnings rather than naively appended. On CritPt, gene-evolved systems improve over their paired base models from 9.1% to 18.57% and from 17.7% to 27.14%. These results suggest that the core problem in experience reuse is not how to supply more experience, but how to encode experience as a compact, control-oriented, evolution-ready object.

18. 【2604.15093】OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

Link: https://arxiv.org/abs/2604.15093

Authors: Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, Anh Tuan Luu, Jianbing Zhang, Lewei Lu, Dahua Lin

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: demonstrated impressive capabilities, recent leading models, leading models achieving, marked performance leap, automating mobile tasks

Comments: Work in progress

Abstract: Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) a scalable task synthesis pipeline that constructs a global environment memory from exploration, then leverages it to generate diverse and grounded instructions; and (2) a policy-switching strategy for trajectory rollout that, by alternating between learner and expert models, captures essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at this https URL to bridge the data gap and facilitate broader mobile agent research.

19. 【2604.15037】From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

Link: https://arxiv.org/abs/2604.15037

Authors: Ke Xu, Yuhao Wang, Yu Wang

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Recent advancements, text-based paradigms, gradually shifting, multimodal interaction, Recent

Comments: (none)

Abstract:Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,182 high-quality samples for rigorous testing. Our evaluation of state-of-the-art Multimodal LLMs reveals a significant performance gap, particularly regarding over-triggering and reasoning capabilities. These findings highlight the limitations of current models and offer a roadmap for developing more natural, context-aware proactive agents.

20. 【2604.15022】Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization

Link: https://arxiv.org/abs/2604.15022

Authors: Haochun Tang, Yuliang Yan, Jiahua Lu, Huaxiao Liu, Enyan Dai

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Cost-aware routing dynamically, dynamically dispatches user, routing dynamically dispatches, dispatches user queries, Cost-aware routing

Abstract:Cost-aware routing dynamically dispatches user queries to models of varying capability to balance performance and inference cost. However, the routing strategy introduces a new security concern that adversaries may manipulate the router to consistently select expensive high-capability models. Existing routing attacks depend on either white-box access or heuristic prompts, rendering them ineffective in real-world black-box scenarios. In this work, we propose R$^2$A, which aims to mislead black-box LLM routers to expensive models via adversarial suffix optimization. Specifically, R$^2$A deploys a hybrid ensemble surrogate router to mimic the black-box router. A suffix optimization algorithm is further adapted for the ensemble-based surrogate. Extensive experiments on multiple open-source and commercial routing systems demonstrate that R$^2$A significantly increases the routing rate to expensive models on queries of different distributions. Code and examples: this https URL.

21. 【2604.15010】What Is the Minimum Architecture for Prolepsis? Early Irrevocable Commitment Across Tasks in Small Transformers

Link: https://arxiv.org/abs/2604.15010

Authors: Éric Jacopin

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: transformer commits early, transformers commit, transformer commits, commits early, Specific attention heads

Comments: 24 pages, 3 figures. Under review at COLM 2026. Independent replication of the rhyme-planning finding from Lindsey et al. (2025) on open-weights models; extended to factual recall

Abstract:When do transformers commit to a decision, and what prevents them from correcting it? We introduce prolepsis: a transformer commits early, task-specific attention heads sustain the commitment, and no layer corrects it. Replicating the planning-site finding of Lindsey et al. (2025) on open models (Gemma 2 2B, Llama 3.2 1B), we ask five questions. (Q1) Planning is invisible to six residual-stream methods; CLTs are necessary. (Q2) The planning-site spike replicates with identical geometry. (Q3) Specific attention heads route the decision to the output, filling a gap flagged as invisible to attribution graphs. (Q4) Search requires $\leq 16$ layers; commitment requires more. (Q5) Factual recall shows the same motif at a different network depth, with zero overlap between recurring planning heads and the factual top-10. Prolepsis is architectural: the template is shared, the routing substrates differ. All experiments run on a single consumer GPU (16 GB VRAM).

22. 【2604.14980】Hybrid Decision Making via Conformal VLM-generated Guidance

Link: https://arxiv.org/abs/2604.14980

Authors: Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: reducing cognitive load, hybrid decision making, Building on recent, improving human decision, human decision quality

Abstract:Building on recent advances in AI, hybrid decision making (HDM) holds the promise of improving human decision quality and reducing cognitive load. We work in the context of learning to guide (LtG), a recently proposed HDM framework in which the human is always responsible for the final decision: rather than suggesting decisions, in LtG the AI supplies (textual) guidance useful for facilitating decision making. One limiting factor of existing approaches is that their guidance compounds information about all possible outcomes, and as a result it can be difficult to digest. We address this issue by introducing ConfGuide, a novel LtG approach that generates more succinct and targeted guidance. To this end, it employs conformal risk control to select a set of outcomes, ensuring a cap on the false negative rate. We demonstrate our approach on a real-world multi-label medical diagnosis task. Our empirical evaluation highlights the promise of ConfGuide.
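
The outcome-selection step can be illustrated with a generic conformal risk control calibration. This is a minimal sketch, not ConfGuide itself: the function names, the threshold grid, and the assumption that per-outcome scores lie in [0, 1] are all hypothetical. It picks the largest score threshold whose adjusted empirical false negative rate on a calibration set stays at or below a target `alpha`.

```python
def false_negative_rate(scores, true_labels, threshold):
    """Fraction of true labels missing from the set {k : scores[k] >= threshold}."""
    if not true_labels:
        return 0.0
    missed = sum(1 for k in true_labels if scores[k] < threshold)
    return missed / len(true_labels)


def calibrate_threshold(cal_scores, cal_labels, alpha, grid=None):
    """Return the largest threshold on the grid that satisfies the risk bound.

    Uses the standard conformal risk control condition for a loss bounded
    by 1: (n * empirical_risk + 1) / (n + 1) <= alpha.
    """
    n = len(cal_scores)
    if grid is None:
        grid = [i / 100 for i in range(100, -1, -1)]  # 1.00, 0.99, ..., 0.00
    for t in grid:  # descending thresholds: sets grow, FNR can only shrink
        risk = sum(
            false_negative_rate(s, y, t) for s, y in zip(cal_scores, cal_labels)
        ) / n
        if (n * risk + 1.0) / (n + 1) <= alpha:
            return t
    return 0.0  # fallback: keep every outcome (assumes scores are >= 0)


def select_outcomes(scores, threshold):
    """Indices of the outcomes kept in the prediction set."""
    return [k for k, s in enumerate(scores) if s >= threshold]
```

Guidance would then be generated only for the outcomes returned by `select_outcomes`, which keeps the text short while the calibration step controls how often a true outcome is left out.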

23. 【2604.14970】Explain the Flag: Contextualizing Hate Speech Beyond Censorship

Link: https://arxiv.org/abs/2604.14970

Authors: Jason Liartis, Eirini Kaldeli, Lambrini Gyftokosta, Eleftherios Chelioudakis, Orfeas Menis Mastromichalakis

Categories: Computation and Language (cs.CL)

Keywords: offensive speech remains, public discourse, remains a persistent, persistent challenge, challenge in online

Comments: Accepted in the Findings of ACL 2026

Abstract:Hate, derogatory, and offensive speech remains a persistent challenge in online platforms and public discourse. While automated detection systems are widely used, most focus on censorship or removal, raising concerns for transparency and freedom of expression, and limiting opportunities to explain why content is harmful. To address these issues, explanatory approaches have emerged as a promising solution, aiming to make hate speech detection more transparent, accountable, and informative. In this paper, we present a hybrid approach that combines Large Language Models (LLMs) with three newly created and curated vocabularies to detect and explain hate speech in English, French, and Greek. Our system captures both inherently derogatory expressions tied to identity characteristics and direct group-targeted content through two complementary pipelines: one that detects and disambiguates problematic terms using the curated vocabularies, and one that leverages LLMs as context-aware evaluators of group-targeting content. The outputs are fused into grounded explanations that clarify why content is flagged. Human evaluation shows that our hybrid approach is accurate, with high-quality explanations, outperforming LLM-only baselines.

24. 【2604.14951】RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Link: https://arxiv.org/abs/2604.14951

Authors: Gabriele Mattioli, Evelyn Turri, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: Large Language Models, standalone language generation, invoke external resources, Multimodal Large Language, foundation models aims

Comments: ICPR 2026

Abstract:Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.

25. 【2604.14941】Text2Arch: A Dataset for Generating Scientific Architecture Diagrams from Natural Language Descriptions

Link: https://arxiv.org/abs/2604.14941

Authors: Shivank Garg, Sankalp Mittal, Manish Gupta

Categories: Computation and Language (cs.CL)

Keywords: Communicating complex system, Communicating complex, prone to ambiguity, inefficient and prone, complex system designs

Comments: ICLR 2026 Poster

Abstract:Communicating complex system designs or scientific processes through text alone is inefficient and prone to ambiguity. A system that automatically generates scientific architecture diagrams from text with high semantic fidelity can be useful in multiple applications like enterprise architecture visualization, AI-driven software design, and educational content creation. Hence, in this paper, we focus on leveraging language models to perform semantic understanding of the input text description to generate intermediate code that can be processed to generate high-fidelity architecture diagrams. Unfortunately, no clean large-scale open-access dataset exists, implying a lack of any effective open models for this task. Hence, we contribute a comprehensive dataset, Text2Arch, comprising scientific architecture images, their corresponding textual descriptions, and associated DOT code representations. Leveraging this resource, we fine-tune a suite of small language models, and also perform in-context learning using GPT-4o. Through extensive experimentation, we show that Text2Arch models significantly outperform existing baseline models like DiagramAgent and perform at par with in-context learning-based generations from GPT-4o. We make the code, data and models publicly available.

26. 【2604.14934】XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics

Link: https://arxiv.org/abs/2604.14934

Authors: Jingxuan Liu, Zhi Qu, Jin Tei, Hidetaka Kamigaito, Lemao Liu, Taro Watanabe

Categories: Computation and Language (cs.CL)

Keywords: Automatic evaluation metrics, Automatic evaluation, essential for building, languages, Automatic

Comments: 19 pages, 8 figures, ACL 2026 Findings

Abstract:Automatic evaluation metrics are essential for building multilingual translation systems. The common practice of evaluating these systems is averaging metric scores across languages, yet this is suspicious since metrics may suffer from cross-lingual scoring bias, where translations of equal quality receive different scores across languages. This problem has not been systematically studied because no benchmark exists that provides parallel-quality instances across languages, and expert annotation is not realistic. In this work, we propose XQ-MEval, a semi-automatically built dataset covering nine translation directions, to benchmark translation metrics. Specifically, we inject MQM-defined errors into gold translations automatically, filter them by native speakers for reliability, and merge errors to generate pseudo translations with controllable quality. These pseudo translations are then paired with corresponding sources and references to form triplets used in assessing the qualities of translation metrics. Using XQ-MEval, our experiments on nine representative metrics reveal the inconsistency between averaging and human judgment and provide the first empirical evidence of cross-lingual scoring bias. Finally, we propose a normalization strategy derived from XQ-MEval that aligns score distributions across languages, improving the fairness and reliability of multilingual metric evaluation.
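
The abstract does not say how the normalization is computed, so the following is only an illustrative stand-in: per-language z-scoring, with means and standard deviations estimated on a calibration set of parallel-quality items. The function names and the choice of z-scoring are assumptions, not the paper's method.

```python
from statistics import mean, pstdev


def fit_language_stats(calibration_scores):
    """Estimate (mean, std) of a metric's scores for each language.

    calibration_scores maps a language code to the metric scores obtained
    on items that are known to be of matched quality across languages.
    """
    stats = {}
    for lang, scores in calibration_scores.items():
        sigma = pstdev(scores)
        # Guard against a degenerate language with constant scores.
        stats[lang] = (mean(scores), sigma if sigma > 0 else 1.0)
    return stats


def normalize(score, lang, stats):
    """Map a raw metric score onto a language-independent z-score scale."""
    mu, sigma = stats[lang]
    return (score - mu) / sigma
```

After this kind of alignment, averaging normalized scores across languages no longer rewards a system simply for doing well in languages where the metric tends to score high.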

27. 【2604.14930】IE as Cache: Information Extraction Enhanced Agentic Reasoning

Link: https://arxiv.org/abs/2604.14930

Authors: Hang Lv, Sheng Liang, Hongchao Gu, Wei Guo, Defu Lian, Yong Liu, Hao Wang, Enhong Chen

Categories: Computation and Language (cs.CL)

Keywords: Information Extraction aims, distill structured, unstructured text, aims to distill, decision-relevant information

Comments: 8 pages, 2 figures

Abstract:Information Extraction aims to distill structured, decision-relevant information from unstructured text, serving as a foundation for downstream understanding and reasoning. However, it is traditionally treated merely as a terminal objective: once extracted, the resulting structure is often consumed in isolation rather than maintained and reused during multi-step inference. Moving beyond this, we propose IE-as-Cache, a framework that repurposes IE as a cognitive cache to enhance agentic reasoning. Drawing inspiration from hierarchical computer memory, our approach combines query-driven extraction with cache-aware reasoning to dynamically maintain compact intermediate information and filter noise. Experiments on challenging benchmarks across diverse LLMs demonstrate significant improvements in reasoning accuracy, indicating that IE can be effectively repurposed as a reusable cognitive resource and offering a promising direction for future research on downstream uses of IE.

28. 【2604.14922】LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning

Link: https://arxiv.org/abs/2604.14922

Authors: Bowen Ping, Zijun Chen, Tingfeng Hui, Qize Yu, Chenxuan Li, Junchi Yan, Baobao Chang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, Reinforcement Learning, Large Language, capabilities of Large, Language Models

Abstract:Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization -- which establishes the criticality of such high-magnitude activations -- and the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that these weights serve as the pivotal drivers for effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an approximate 8% improvement on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such as GRPO and DAPO. Extensive ablation studies suggest that focusing on these salient features is key to unlocking long-context potential.

29. 【2604.14907】Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task

Link: https://arxiv.org/abs/2604.14907

Authors: Evaldas Vaiciukynas, Paulius Danenas, Linas Ablonskis, Algirdas Sukys, Edgaras Dambrauskas, Voldemaras Zitkus, Rita Butkiene, Rimantas Butleris

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Online hate speech, abusive language pose, Lithuanian hate speech, hate speech detection, hate speech

Comments: Submitted to Applied Soft Computing (Status: Decision in Process)

Abstract:Online hate speech and abusive language pose a growing challenge for content moderation, especially in multilingual settings and for low-resource languages such as Lithuanian. This paper investigates to what extent modern multilingual sentence embedding models can support accurate hate speech detection in Lithuanian, Russian, and English, and how their performance depends on downstream modeling choices and feature dimensionality. We introduce LtHate, a new Lithuanian hate speech corpus derived from news portals and social networks, and benchmark six modern multilingual encoders (potion, gemma, bge, snow, jina, e5) on LtHate, RuToxic, and EnSuperset using a unified Python pipeline. For each embedding, we train both a one class HBOS anomaly detector and a two class CatBoost classifier, with and without principal component analysis (PCA) compression to 64-dimensional feature vectors. Across all datasets, two class supervised models consistently and substantially outperform one class anomaly detection, with the best configurations achieving up to 80.96% accuracy and AUC ROC of 0.887 in Lithuanian (jina), 92.19% accuracy and AUC ROC of 0.978 in Russian (e5), and 77.21% accuracy and AUC ROC of 0.859 in English (e5 with PCA). PCA compression preserves almost all discriminative power in the supervised setting, while showing some negative impact for the unsupervised anomaly detection case. These results demonstrate how modern multilingual sentence embeddings combined with gradient boosted decision trees provide robust soft-computing solutions for multilingual hate speech detection applications.

30. 【2604.14902】ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

Link: https://arxiv.org/abs/2604.14902

Authors: Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen, Min Sun, Winston Hsu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: involve unexpected conditions, Intelligent embodied agents, simply follow instructions, Intelligent embodied, conditions and exceptions

Abstract:Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.

31. 【2604.14888】Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

Link: https://arxiv.org/abs/2604.14888

Authors: Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: information remains unclear, Recent advances, offer reasoning capabilities, vision language models, textual information remains

Abstract:Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.

32. 【2604.14885】RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

Link: https://arxiv.org/abs/2604.14885

Authors: Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, Hai Zhao

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Language Models, Large Language, causing high inference, high inference latency

Comments: Accepted to Findings of ACL 2026

Abstract:Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face trade-offs: retrieval-based drafts break when no exact match exists, while logits-based drafts lack structural guidance. We propose RACER (Retrieval-Augmented Contextual Rapid Speculative Decoding), a lightweight and training-free method that integrates retrieved exact patterns with logit-driven future cues. This unification supplies both reliable anchors and flexible extrapolation, yielding richer speculative drafts. Experiments on Spec-Bench, HumanEval, and MGSM-ZH demonstrate that RACER consistently accelerates inference, achieving more than 2× speedup over autoregressive decoding, and outperforms prior training-free methods, offering a scalable, plug-and-play solution for efficient LLM decoding. Our source code is available at this https URL.
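
The guess-and-verify loop that speculative decoding relies on can be shown with a toy example. This is a generic sketch of a retrieval-based draft with greedy verification, not RACER: tokens are plain integers, `target_next` is a hypothetical stand-in for the target model's greedy next-token function, and verification is done by sequential calls, whereas a real system would check the whole draft in one batched forward pass.

```python
def retrieve_draft(context, max_len=4, ngram=2):
    """Draft a continuation by copying what followed the most recent
    earlier occurrence of the last `ngram` tokens in the context."""
    if len(context) < ngram:
        return []
    key = context[-ngram:]
    # Scan backwards, skipping the suffix occurrence itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == key:
            return context[start + ngram:start + ngram + max_len]
    return []  # no exact match: the retrieval draft is empty


def speculative_step(context, target_next, max_len=4):
    """Verify a draft against the target and return the accepted tokens."""
    accepted = []
    for tok in retrieve_draft(context, max_len):
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Whole draft accepted (or empty): the target adds one more token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(prompt, target_next, n_tokens):
    """Generate n_tokens and report how many verification steps were used."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, target_next))
        steps += 1
    return out[:len(prompt) + n_tokens], steps
```

Because every accepted token equals the target's own greedy choice, the output matches plain autoregressive decoding; the saving comes from needing fewer verification steps than tokens whenever drafts are accepted. An empty draft, the failure case the abstract mentions for retrieval-based methods, degrades to one token per step.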

33. 【2604.14865】Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

Link: https://arxiv.org/abs/2604.14865

Authors: Xuanli He, Bilgehan Sel, Faizan Ali, Jenny Bao, Hoagy Cunningham, Jerry Wei

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: Large Language Models, Large Language, Language Models, high-stakes Chemical, adaptive jailbreaking

Comments: preprint

Abstract:Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from near-saturated baseline performance (AUROC = 97.40%). We also show that probing Attention or MLP activations consistently outperforms residual-stream features. Finally, even when adversarial fine-tuning enables novel character-level ciphers, harmful intent remains detectable: probes developed for the base LLMs can be applied "plug-and-play" to these obfuscated attacks, achieving an AUROC of over 98.85%.
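
The difference between single-spike and multi-token evidence can be illustrated at the aggregation level. The paper describes a training objective whose details are not given in the abstract; the sketch below only contrasts two hypothetical ways of turning per-token probe scores (assumed to lie in [0, 1]) into a running sequence score: the running maximum, and the mean of the top-k scores seen so far.

```python
import heapq


def streaming_max(scores):
    """Running maximum: a single high-scoring token is enough to fire."""
    best, out = float("-inf"), []
    for s in scores:
        best = max(best, s)
        out.append(best)
    return out


def streaming_topk_mean(scores, k=3):
    """Running mean of the k largest scores seen so far.

    Dividing by k even before k tokens have arrived acts as zero padding,
    so one isolated spike cannot push the score above 1/k of its value.
    """
    heap, out = [], []  # min-heap holding the k largest scores so far
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        else:
            heapq.heappushpop(heap, s)
        out.append(sum(heap) / k)
    return out
```

With a threshold of 0.6, a benign sequence such as `[0.05, 0.95, 0.1, 0.05, 0.1]` (one sensitive term) trips the running maximum but not the top-3 mean, while `[0.2, 0.8, 0.85, 0.9, 0.7]` trips both.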

34. 【2604.14862】Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding

Link: https://arxiv.org/abs/2604.14862

Authors: Yifan Le

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: JSON and XML, satisfy predefined formats, large language models, outputs satisfy predefined, Constrained decoding

Comments: 10 pages, 2 figures. Work in progress

Abstract:Constrained decoding has been widely adopted for structured generation with large language models (LLMs), ensuring that outputs satisfy predefined formats such as JSON and XML. However, existing approaches largely treat schemas as purely structural constraints and overlook the possibility that their linguistic formulation may affect model behavior. In this work, we study how instruction placement influences model performance in structured generation and show that merely changing the wording of schema keys, without modifying the prompt or model parameters, can significantly alter model performance under constrained decoding. Based on this observation, we propose to reinterpret structured generation as a multi-channel instruction problem, where instructions can be conveyed explicitly through prompts and implicitly through schema keys during decoding. To the best of our knowledge, this is the first work to systematically study how schema key formulation acts as an implicit instruction channel and affects model performance under constrained decoding. Experiments on multiple mathematical reasoning benchmarks show that different model families exhibit distinct sensitivities to these instruction channels: Qwen models consistently benefit from schema-level instructions, while LLaMA models rely more heavily on prompt-level guidance. We further observe non-additive interaction effects between instruction channels, showing that combining multiple channels does not always lead to further improvement. These findings suggest that schema design not only determines output structure, but also carries instruction signals, offering a new perspective on structured generation in LLMs.
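
To make the idea of schema keys as an instruction channel concrete, the sketch below builds two JSON Schemas that impose the same structure and differ only in key wording. The key names are invented for illustration and are not taken from the paper.

```python
import json


def make_schema(reasoning_key, answer_key):
    """Build a two-field object schema; only the key names vary."""
    return {
        "type": "object",
        "properties": {
            reasoning_key: {"type": "string"},
            answer_key: {"type": "number"},
        },
        "required": [reasoning_key, answer_key],
        "additionalProperties": False,
    }


def structure(schema):
    """Key-independent view of a schema: the ordered property types."""
    return [spec["type"] for spec in schema["properties"].values()]


# Hypothetical wordings: one neutral, one that reads like an instruction.
neutral = make_schema("field_1", "field_2")
instructive = make_schema("step_by_step_reasoning", "final_numeric_answer")

if __name__ == "__main__":
    print(json.dumps(instructive, indent=2))
```

Both schemas constrain decoding to a string followed by a number, so any difference in model accuracy between them would come from the wording of the keys rather than from the structure.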

35. 【2604.14856】ClimateCause: Complex and Implicit Causal Structures in Climate Reports

Link: https://arxiv.org/abs/2604.14856

Authors: Liesbeth Allein, Nataly Pineda-Castañeda, Andrea Rocci, Marie-Francine Moens

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Understanding climate change, climate change requires, complex causal networks, change requires reasoning, Understanding climate

Comments: Accepted to ACL 2026 [Findings]

Abstract:Understanding climate change requires reasoning over complex causal networks. Yet, existing causal discovery datasets predominantly capture explicit, direct causal relations. We introduce ClimateCause, a manually expert-annotated dataset of higher-order causal structures from science-for-policy climate reports, including implicit and nested causality. Cause-effect expressions are normalized and disentangled into individual causal relations to facilitate graph construction, with unique annotations for cause-effect correlation, relation type, and spatiotemporal context. We further demonstrate ClimateCause's value for quantifying readability based on the semantic complexity of causal graphs underlying a statement. Finally, large language model benchmarking on correlation inference and causal chain reasoning highlights the latter as a key challenge.

36. 【2604.14843】Exploring and Testing Skill-Based Behavioral Profile Annotation: Human Operability and LLM Feasibility under Schema-Guided Execution

Link: https://arxiv.org/abs/2604.14843

Authors: Yufeng Wu

Categories: Computation and Language (cs.CL)

Keywords: multiple linguistic dimensions, requires simultaneous coding, Behavioral Profile, linguistic dimensions, difficult to automate

Abstract:Behavioral Profile (BP) annotation is difficult to automate because it requires simultaneous coding across multiple linguistic dimensions. We treat BP annotation as a bundle of annotation skills rather than a single task and evaluate LLM-assisted BP annotation from this perspective. Using 3,134 concordance lines of 30 Chinese metaphorical color-term derivatives and a 14-feature BP schema, we implement a skill-file-driven pipeline in which each feature is externally defined through schema files, decision rules, and examples. Two human annotators completed a two-round schema-only protocol on a 300-instance validation subset, enabling BP skills to be classified as directly operable, recoverable under focused re-annotation, or structurally underspecified. GPT-5.4 and three locally deployable open-source models were then evaluated under the same setup. Results show that BP annotation is highly heterogeneous at the skill level: 5 skills are directly operable, 4 are recoverable after focused re-annotation, and 5 remain structurally underspecified. GPT-5.4 executes the retained skills with substantial reliability (accuracy = 0.678, κ = 0.665, weighted F1 = 0.695), but this feasibility is selective rather than global. Human and GPT difficulty profiles are strongly aligned at the skill level (r = 0.881), but not at the instance level (r = 0.016) or lexical-item level (r = -0.142), a pattern we describe as shared taxonomy, independent execution. Pairwise agreement further suggests that GPT is better understood as an independent third skill voice than as a direct human substitute. Open-source failures are concentrated in schema-to-skill execution problems. These findings suggest that automatic annotation should be evaluated in terms of skill feasibility rather than task-level automation.

37. 【2604.14828】Pangu-ACE: Adaptive Cascaded Experts for Educational Response Generation on EduBench

Link: https://arxiv.org/abs/2604.14828

Authors: Dinghao Li, Wenlong Zhou, Zhimin Chen, Yuehan Peng, Hong Ni, Chengfu Zou, Guoyu Shi, Yaochen Li

Categories: Computation and Language (cs.CL)

Keywords: Educational assistants, assistants should spend, spend more computation, Educational, draft

Abstract:Educational assistants should spend more computation only when the task needs it. This paper rewrites our earlier draft around the system that was actually implemented and archived in the repository: a sample-level 1B to 7B cascade for the shared-8 EduBench benchmark. The final system, Pangu-ACE, uses a 1B tutor-router to produce a draft answer plus routing signals, then either accepts the draft or escalates the sample to a 7B specialist prompt. We also correct a major offline evaluation bug: earlier summaries over-credited some open-form outputs that only satisfied superficial format checks. After CPU-side rescoring from saved prediction JSONL, the full Chinese test archive (7013 samples) shows that cascade_final improves deterministic quality from 0.457 to 0.538 and format validity from 0.707 to 0.866 over the legacy rule_v2 system while accepting 19.7% of requests directly at 1B. Routing is strongly task dependent: IP is accepted by 1B 78.0% of the time, while QG and EC still escalate almost always. The current archived deployment does not yet show latency gains, so the defensible efficiency story is routing selectivity rather than wall-clock speedup. We also package a reproducible artifact-first paper workflow and clarify the remaining external-baseline gap: GPT-5.4 re-judging is implemented locally, but the configured provider endpoint and key are invalid, so final sampled-baseline alignment with GPT-5.4 remains pending infrastructure repair.
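
The accept-or-escalate decision can be sketched as a two-stage cascade. The abstract does not specify which routing signals the 1B model emits, so the `confidence` and `format_ok` fields, the threshold, and the function names below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Draft:
    answer: str
    confidence: float  # hypothetical routing signal in [0, 1]
    format_ok: bool    # hypothetical check that the draft is well formed


def cascade(
    sample: str,
    small_model: Callable[[str], Draft],
    large_model: Callable[[str], str],
    threshold: float = 0.7,
) -> Tuple[str, str]:
    """Return (answer, route), where route is 'small' or 'large'."""
    draft = small_model(sample)
    if draft.format_ok and draft.confidence >= threshold:
        return draft.answer, "small"  # accept the small model's draft
    return large_model(sample), "large"  # escalate to the larger model


def acceptance_rate(routes):
    """Share of samples answered directly by the small model."""
    return sum(r == "small" for r in routes) / len(routes)
```

Tracking `acceptance_rate` per task type would give the kind of routing-selectivity figures the abstract reports, independent of any wall-clock measurement.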

38. 【2604.14815】Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations

Link: https://arxiv.org/abs/2604.14815

Authors: Rami Luisto, Liisa Petäinen, Tommi Grönholm, Jan Böhm, Maarit Ahtiainen, Tomi Lilja, Ilkka Pölönen, Sami Äyrämö

Categories: Computation and Language (cs.CL)

Keywords: NLP classification tasks, NLP classification, labeled data exists, established approach, classification tasks

Abstract:In NLP classification tasks where little labeled data exists, domain fine-tuning of transformer models on unlabeled data is an established approach. In this paper we have two aims. (1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data. (2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning. Our driving motivation is the common situation in healthcare AI where we might experience long delays in acquiring datasets, especially with respect to labels.

Submission history: From Rami Luisto, [v1] Thu, 16 Apr 2026 09:36:48 UTC (441 KB)

39. 【2604.14808】Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem

Link: https://arxiv.org/abs/2604.14808

Authors: Zeguan Xiao, Siqing Li, Yong Wang, Xuetao Wei, Jian Yang, Yun Chen, Guanhua Chen

Subjects: Computation and Language (cs.CL)

Keywords: preserving general capability, remove targeted knowledge, Machine unlearning, large language models, aims to remove

Comments: ACL 2026

Click to view abstract

Abstract:Machine unlearning for large language models (LLMs) aims to remove targeted knowledge while preserving general capability. In this paper, we recast LLM unlearning as an asymmetric two-task problem: retention is the primary objective and forgetting is an auxiliary. From this perspective, we propose a retention-prioritized gradient synthesis framework that decouples task-specific gradient extraction from conflict-aware combination. Instantiating the framework, we adapt established PCGrad to resolve gradient conflicts, and introduce SAGO, a novel retention-prioritized gradient synthesis method. Theoretically, both variants ensure non-negative cosine similarity with the retain gradient, while SAGO achieves strictly tighter alignment through constructive sign-constrained synthesis. Empirically, on WMDP Bio/Cyber and RWKU benchmarks, SAGO consistently pushes the Pareto frontier: e.g., on WMDP Bio (SimNPO+GD), recovery of target model MMLU performance progresses from 44.6% (naive) to 94.0% (+PCGrad) and further to 96.0% (+SAGO), while maintaining comparable forgetting strength. Our results show that re-shaping gradient geometry, rather than re-balancing losses, is the key to mitigating unlearning-retention trade-offs.
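
The non-negative cosine similarity property mentioned in this abstract can be illustrated with a PCGrad-style projection. The sketch below is an assumption made for illustration, not the paper's SAGO method (whose construction the abstract does not detail): it projects only the forget gradient when it conflicts with the retain gradient, so retention stays the primary objective. The function names are hypothetical, and gradients are plain Python lists.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def combine_retention_first(g_retain, g_forget):
    """Combine two gradients, projecting only the forget gradient on conflict."""
    conflict = dot(g_forget, g_retain)
    if conflict < 0:
        # Remove the component of g_forget that points against g_retain,
        # i.e. project g_forget onto the plane orthogonal to g_retain.
        scale = conflict / dot(g_retain, g_retain)
        g_forget = [f - scale * r for f, r in zip(g_forget, g_retain)]
    # After projection, dot(g_forget, g_retain) >= 0, so the sum cannot
    # point against the retain gradient.
    return [r + f for r, f in zip(g_retain, g_forget)]
```

For example, with `g_retain = [1.0, 0.0]` and `g_forget = [-2.0, 1.0]`, the naive sum `[-1.0, 1.0]` points against the retain gradient, while the projected combination is `[1.0, 1.0]`.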

40. 【2604.14807】The LLM Fallacy: Misattribution in AI-Assisted Cognitive Workflows

Link: https://arxiv.org/abs/2604.14807

Authors: Hyunwoo Kim, Harin Yu, Hanau Yi

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, perform cognitive tasks, multilingual communication, individuals perform cognitive, rapid integration

Comments:

Click to view abstract

Abstract:The rapid integration of large language models (LLMs) into everyday workflows has transformed how individuals perform cognitive tasks such as writing, programming, analysis, and multilingual communication. While prior research has focused on model reliability, hallucination, and user trust calibration, less attention has been given to how LLM usage reshapes users' perceptions of their own capabilities. This paper introduces the LLM fallacy, a cognitive attribution error in which individuals misinterpret LLM-assisted outputs as evidence of their own independent competence, producing a systematic divergence between perceived and actual capability. We argue that the opacity, fluency, and low-friction interaction patterns of LLMs obscure the boundary between human and machine contribution, leading users to infer competence from outputs rather than from the processes that generate them. We situate the LLM fallacy within existing literature on automation bias, cognitive offloading, and human-AI collaboration, while distinguishing it as a form of attributional distortion specific to AI-mediated workflows. We propose a conceptual framework of its underlying mechanisms and a typology of manifestations across computational, linguistic, analytical, and creative domains. Finally, we examine implications for education, hiring, and AI literacy, and outline directions for empirical validation. We also provide a transparent account of human-AI collaborative methodology. This work establishes a foundation for understanding how generative AI systems not only augment cognitive performance but also reshape self-perception and perceived expertise.

41. 【2604.14799】Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

Link: https://arxiv.org/abs/2604.14799

Authors: Nishanth Madhusudhan, Vikas Yadav, Alexandre Lacoste

Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: recognizing evidence insufficiency, reliable multimodal systems, refraining from answering, insufficiency and refraining, critical for reliable

Comments: 10 pages and 4 figures (excluding appendix)

Click to view abstract

Abstract:Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.

42. 【2604.14779】AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

Link: https://arxiv.org/abs/2604.14779

Authors: Peifeng Zhang, Zice Qiu, Donghua Yu, Shilei Cao, Juepeng Zheng, Yutong Lu, Haohuan Fu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: visual question answering, existing Continual Learning, unimodal architectures, question answering, built for symmetric

Comments: 18 pages, 9 figures. Submitted to ACM MM 2026

Click to view abstract

Abstract:In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.

43. 【2604.14773】CoPA: Benchmarking Personalized Question Answering with Data-Informed Cognitive Factors

Link: https://arxiv.org/abs/2604.14773

Authors: Hang Su, Zequn Liu, Chen Hu, Xuesong Lu, Yingce Xia, Zhen Liu

Subjects: Computation and Language (cs.CL)

Keywords: Question Answering, demonstrated remarkable potential, potential in Question, critical bottleneck, LLMs have demonstrated

Comments: Accepted to ACL. 30 pages, 10 figures

Click to view abstract

Abstract:While LLMs have demonstrated remarkable potential in Question Answering (QA), evaluating personalization remains a critical bottleneck. Existing paradigms predominantly rely on lexical-level similarity or manual heuristics, often lacking sufficient data-driven validation. We address this by mining Community-Individual Preference Divergence (CIPD), where individual choices override consensus, to distill six key personalization factors as evaluative dimensions. Accordingly, we introduce CoPA, a benchmark with 1,985 user profiles for fine-grained, factor-level assessment. By quantifying the alignment between model outputs and user-specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The code is available at this https URL.

44. 【2604.14749】Which bird does not have wings: Negative-constrained KGQA with Schema-guided Semantic Matching and Self-directed Refinement

Link: https://arxiv.org/abs/2604.14749

Authors: Midan Shim, Seokju Hwang, Kaehyun Um, Kyong-Ho Lee

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, remarkable reasoning abilities, Large language, Graph Question Answering, Knowledge Graph

Comments: ACL 2026 findings

Click to view abstract

Abstract:Large language models still struggle with faithfulness and hallucinations despite their remarkable reasoning abilities. In Knowledge Graph Question Answering (KGQA), semantic parsing-based approaches address the limitations by understanding constraints in a user's question and converting them into a logical form to execute on a knowledge graph. However, existing KGQA benchmarks and methods are biased toward positive and calculation constraints. Negative constraints are neglected, although they frequently appear in real-world questions. In this paper, we introduce a new task, NEgative-conSTrained (NEST) KGQA, where each question contains at least one negative constraint, and a corresponding dataset, NestKGQA. We also design PyLF, a Python-formatted logical form, since existing logical forms are hardly suitable to express negation clearly while maintaining readability. Furthermore, NEST questions naturally contain multiple constraints. To mitigate their semantic complexity, we present a novel framework named CUCKOO, specialized to multiple-constrained questions and ensuring semantic executability. CUCKOO first generates a constraint-aware logical form draft and performs schema-guided semantic matching. It then selectively applies self-directed refinement only when executing improper logical forms yields an empty result, reducing cost while improving robustness. Experimental results demonstrate that CUCKOO consistently outperforms baselines on both conventional KGQA and NEST-KGQA benchmarks under few-shot settings.

45. 【2604.14691】CAMO: An Agentic Framework for Automated Causal Discovery from Micro Behaviors to Macro Emergence in LLM Agent Simulations

Link: https://arxiv.org/abs/2604.14691

Authors: Xiangning Yu, Yuwei Guo, Yuqi Hou, Xiao Xue, Qun Ma

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: study social emergence, CAMO, remain unclear, LLM-empowered agent simulations

Comments:

Click to view abstract

Abstract:LLM-empowered agent simulations are increasingly used to study social emergence, yet the micro-to-macro causal mechanisms behind macro outcomes often remain unclear. This is challenging because emergence arises from intertwined agent interactions and meso-level feedback and nonlinearity, making generative mechanisms hard to disentangle. To this end, we introduce CAMO, an automated Causal discovery framework from Micro behaviors to Macro Emergence in LLM agent simulations. CAMO converts mechanistic hypotheses into computable factors grounded in simulation records and learns a compact causal representation centered on an emergent target $Y$. CAMO outputs a computable Markov boundary and a minimal upstream explanatory subgraph, yielding interpretable causal chains and actionable intervention levers. It also uses simulator-internal counterfactual probing to orient ambiguous edges and revise hypotheses when evidence contradicts the current view. Experiments across four emergent settings demonstrate the promise of CAMO.

46. 【2604.14682】Acceptance Dynamics Across Cognitive Domains in Speculative Decoding

Link: https://arxiv.org/abs/2604.14682

Authors: Saif Mahmoud

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Speculative decoding accelerates, model, draft model, Speculative decoding, Speculative

Comments:

Click to view abstract

Abstract:Speculative decoding accelerates large language model (LLM) inference. It uses a small draft model to propose a tree of future tokens. A larger target model then verifies these tokens in a single batched forward pass. Despite the growing body of work on speculative methods, the degree to which the cognitive characteristics of a task affect acceptance probability remains largely unexplored. We present an empirical study of tree-based speculative decoding acceptance dynamics. Our study spans four well-established NLP benchmark domains: code generation, mathematical reasoning, logical reasoning, and open-ended chat. For this, we use TinyLlama-1.1B as the draft model against Llama-2-7B-Chat-GPTQ as the target. Over 99,768 speculative nodes collected from 200 prompts, we derive per-domain acceptance rates, expected accepted lengths, depth-acceptance profiles, and entropy-acceptance correlations. We find that task type is a stronger predictor of acceptance than tree depth. Furthermore, only the chat domain consistently yields an expected accepted length exceeding 1.0 token per step. We also show that the entropy-acceptance correlation is consistently negative but weak across all domains (rho in [-0.20, -0.15]). Counterintuitively, chat produces the highest entropy yet the highest acceptance rate. We attribute this divergence to the lexical predictability of RLHF-aligned register. These findings have direct implications for domain-aware speculation budgets and draft-model selection strategies. Index Terms: speculative decoding, large language model inference, tree attention, draft model, acceptance probability, LLM efficiency
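
To make "expected accepted length" concrete, here is a minimal sketch for the simplified case of a single draft path rather than the tree the paper studies. It assumes each entry of `accept_probs` is the probability that the token at that depth is accepted given that all earlier tokens were accepted; the function name and this simplification are illustrative assumptions, not the paper's estimator.

```python
def expected_accepted_length(accept_probs):
    """Expected number of accepted draft tokens along one draft path."""
    expected, survive = 0.0, 1.0
    for p in accept_probs:
        # Probability that every token up to and including this depth is accepted.
        survive *= p
        # Each depth contributes one token with that probability.
        expected += survive
    return expected
```

With an acceptance probability of 0.5 at each of two depths, the expected length is 0.5 + 0.25 = 0.75, below one token per step. This shows why an expected length above 1.0 requires fairly high acceptance at shallow depths.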

47. 【2604.14672】SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models

Link: https://arxiv.org/abs/2604.14672

Authors: Binxian Su, Haoye Lou, Shucheng Zhu, Weikang Wang, Ying Liu, Dong Yu, Pengyuan Liu

Subjects: Computation and Language (cs.CL)

Keywords: gendered space theory, space theory highlights, Large language models, Large language, gendered space

Comments: Accepted by ACL 2026

Click to view abstract

Abstract:Large language models (LLMs) are being increasingly used in urban planning, but since gendered space theory highlights how gender hierarchies are embedded in spatial organization, there is concern that LLMs may reproduce or amplify such biases. We introduce SPAGBias - the first systematic framework to evaluate spatial gender bias in LLMs. It combines a taxonomy of 62 urban micro-spaces, a prompt library, and three diagnostic layers: explicit (forced-choice resampling), probabilistic (token-level asymmetry), and constructional (semantic and narrative role analysis). Testing six representative models, we identify structured gender-space associations that go beyond the public-private divide, forming nuanced micro-level mappings. Story generation reveals how emotion, wording, and social roles jointly shape "spatial gender narratives". We also examine how prompt design, temperature, and model scale influence bias expression. Tracing experiments indicate that these patterns are embedded and reinforced across the model pipeline (pre-training, instruction tuning, and reward modeling), with model associations found to substantially exceed real-world distributions. Downstream experiments further reveal that such biases produce concrete failures in both normative and descriptive application settings. This work connects sociological theory with computational analysis, extending bias research into the spatial domain and uncovering how LLMs encode social gender cognition through language.

48. 【2604.14656】Rethinking Patient Education as Multi-turn Multi-modal Interaction

Link: https://arxiv.org/abs/2604.14656

Authors: Zonghai Yao, Zhipeng Tang, Chengtao Lin, Xiong Luo, Benlu Wang, Juncheng Huang, Chin Siang Ong, Hong Yu

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: focus on static, static tasks, image question answering, Patient education, multimodal benchmarks focus

Comments: Equal contribution for the first two authors

Click to view abstract

Abstract:Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: systems must identify relevant evidence across images, show patients where to look, explain findings in accessible language, and handle confusion or distress. Yet most patient education work remains text-only, even though combined image-and-text explanations may better support understanding. We introduce MedImageEdu, a benchmark for multi-turn, evidence-grounded radiology patient education. Each case provides a radiology report with report text and case images. A DoctorAgent interacts with a PatientAgent, conditioned on a hidden profile that captures factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions grounded in the report, case images, and the current question to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation. MedImageEdu contains 150 cases from three sources and evaluates both the consultation process and the final multimodal response along five dimensions: Consultation, Safety and Scope, Language Quality, Drawing Quality, and Image-Text Response Quality. Across representative open- and closed-source vision-language model agents, we find three consistent gaps: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder than low education or low health literacy. MedImageEdu provides a controlled testbed for assessing whether multimodal agents can teach from evidence rather than merely answer from text.

49. 【2604.14651】CURA: Clinical Uncertainty Risk Alignment for Language Model-Based Risk Prediction

Link: https://arxiv.org/abs/2604.14651

Authors: Sizhe Wang, Ziqi Xu, Claire Najjuuko, Charles Alba, Chenyang Lu

Subjects: Computation and Language (cs.CL)

Keywords: remain poorly calibrated, Clinical language models, Uncertainty Risk Alignment, language models, free-text notes

Comments: Accepted at ACL 2026 Main Conference

Click to view abstract

Abstract:Clinical language models (LMs) are increasingly applied to support clinical risk prediction from free-text notes, yet their uncertainty estimates often remain poorly calibrated and clinically unreliable. In this work, we propose Clinical Uncertainty Risk Alignment (CURA), a framework that aligns clinical LM-based risk estimates and uncertainty with both individual error likelihoods and cohort-level ambiguities. CURA first fine-tunes domain-specific clinical LMs to obtain task-adapted patient embeddings, and then performs uncertainty fine-tuning of a multi-head classifier using a bi-level uncertainty objective. Specifically, an individual-level calibration term aligns predictive uncertainty with each patient's likelihood of error, while a cohort-aware regularizer pulls risk estimates toward event rates in their local neighborhoods in the embedding space and places extra weight on ambiguous cohorts near the decision boundary. We further show that this cohort-aware term can be interpreted as a cross-entropy loss with neighborhood-informed soft labels, providing a label-smoothing view of our method. Extensive experiments on MIMIC-IV clinical risk prediction tasks across various clinical LMs show that CURA consistently improves calibration metrics without substantially compromising discrimination. Further analysis illustrates that CURA reduces overconfident false reassurance and yields more trustworthy uncertainty estimates for downstream clinical decision support.

50. 【2604.14644】CURaTE: Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge

Link: https://arxiv.org/abs/2604.14644

Authors: Seyun Bae, Seokhan Lee, Eunho Yang

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: potentially problematic data, unlearning specific pieces, inability to filter, advance all potentially, potentially problematic

Comments: Accepted to Findings of ACL 2026

Click to view abstract

Abstract:The inability to filter out in advance all potentially problematic data from the pre-training of large language models has given rise to the need for methods for unlearning specific pieces of knowledge after training. Existing techniques overlook the need for continuous and immediate action, causing them to suffer from degraded utility as updates accumulate and protracted exposure of sensitive information. To address these issues, we propose Continual Unlearning in Real Time with Ensured Preservation of LLM Knowledge (CURaTE). Our method begins by training a sentence embedding model on a dataset designed to enable the formation of sharp decision boundaries for determining whether a given input prompt corresponds to any stored forget requests. The similarity of a given input to the forget requests is then used to determine whether to answer or return a refusal response. We show that even with such a simple approach, not only does CURaTE achieve more effective forgetting than existing methods, but by avoiding modification of the language model parameters, it also maintains near perfect knowledge preservation over any number of updates and is the only method capable of continual unlearning in real-time.
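
The answer-or-refuse gate described in this abstract can be sketched as a similarity check against stored forget requests. This is a minimal sketch under stated assumptions: in the paper the vectors would come from the trained sentence embedding model, whereas here they are plain lists, and the use of cosine similarity, the `threshold` value, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def gate(prompt_vec, forget_vecs, threshold=0.9):
    """Return 'refuse' if the prompt is close to any stored forget request."""
    best = max((cosine(prompt_vec, f) for f in forget_vecs), default=0.0)
    return "refuse" if best >= threshold else "answer"


def add_forget_request(forget_vecs, new_vec):
    """Unlearning update: store one more embedding; no model weights change."""
    return forget_vecs + [new_vec]
```

Because an update only appends an embedding to the stored list, the language model parameters are never modified. That is the property the abstract credits for knowledge preservation over any number of updates.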

51. 【2604.14640】Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models

Link: https://arxiv.org/abs/2604.14640

Authors: Cuong Hoang, Le-Minh Nguyen

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: critical information asymmetry, creating critical information, misleading market behavior, financial misinformation poses, market stability

Comments:

Click to view abstract

Abstract:The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the "Reference-Free Financial Misinformation Detection" shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing. Our models (14B and 32B) are available at this https URL.

52. 【2604.14634】Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options

Link: https://arxiv.org/abs/2604.14634

Authors: Nahyun Lee, Guijin Son

Subjects: Computation and Language (cs.CL)

Keywords: Multiple choice evaluation, Multiple choice, benchmarking large language, ceiling accuracy, sustained by shortcut

Comments:

Click to view abstract

Abstract:Multiple choice evaluation is widely used for benchmarking large language models, yet near ceiling accuracy in low option settings can be sustained by shortcut strategies that obscure true competence. Therefore, we propose a massive option evaluation protocol that scales the candidate set to one hundred options and sharply reduces the impact of chance performance. We apply this framework to a Korean orthography error detection task where models must pick the single incorrect sentence from a large candidate set. With fixed targets and repeated resampling and shuffling, we obtain stable estimates while separating content driven failures from positional artifacts. Across experiments, results indicate that strong performance in low option settings can overstate model competence. This apparent advantage often weakens under dense interference at high $N$, revealing gaps that conventional benchmarks tend to obscure. We identify two failure modes, semantic confusion and position bias toward early options under uncertainty. To isolate the effect of context length, we run padding controlled and length matched tests, which suggest that the main bottleneck is candidate ranking rather than context length. Together, these findings support massive option evaluation as a general framework for stress testing model reliability under extreme distractor density, beyond what low option benchmarks can reveal.
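
Two ideas from this abstract lend themselves to a short sketch: the chance baseline shrinks as 1/N, and a fixed target is placed among resampled, shuffled distractors so that position effects can be separated from content effects. The function names and the assumption that the target never appears in the distractor pool are illustrative, not taken from the paper.

```python
import random


def chance_accuracy(num_options):
    """Accuracy of uniform random guessing over num_options candidates."""
    return 1.0 / num_options


def build_trial(target, distractor_pool, num_options, rng):
    """Place one fixed target among resampled, shuffled distractors.

    Assumes the target is not in distractor_pool. Returns the option list
    and the index at which the target landed.
    """
    options = rng.sample(distractor_pool, num_options - 1) + [target]
    rng.shuffle(options)
    return options, options.index(target)
```

Going from 4 to 100 options lowers the chance baseline from 25% to 1%. Repeating `build_trial` with the same target and different `rng` draws gives repeated measurements in which the target's position varies.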

53. 【2604.14631】StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation

Link: https://arxiv.org/abs/2604.14631

Authors: Geonhui Jang, Dongyoon Han, YoungJoon Yoo

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Effective code generation, Effective code, reason and plan, code generation requires, Effective

Comments: 21 pages, 12 figures. ACL 2026 Main Conference

Click to view abstract

Abstract:Effective code generation requires both model capability and a problem representation that carefully structures how models reason and plan. Existing approaches augment reasoning steps or inject specific structure into how models think, but leave scattered problem conditions unchanged. Inspired by the way humans organize fragmented information into coherent explanations, we propose StoryCoder, a narrative reformulation framework that transforms code generation questions into coherent natural language narratives, providing richer contextual structure than simple rephrasings. Each narrative consists of three components: a task overview, constraints, and example test cases, guided by the selected algorithm and genre. Experiments across 11 models on HumanEval, LiveCodeBench, and CodeForces demonstrate consistent improvements, with an average gain of 18.7% in zero-shot pass@10. Beyond accuracy, our analyses reveal that narrative reformulation guides models toward correct algorithmic strategies, reduces implementation errors, and induces a more modular code structure. The analyses further show that these benefits depend on narrative coherence and genre alignment, suggesting that structured problem representation is important for code generation regardless of model scale or architecture. Our code is available at this https URL.

54. 【2604.14616】Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring

Link: https://arxiv.org/abs/2604.14616

Authors: Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: clinical quality measurement, clinical concept, measurement and phenotyping, clinical quality, recurring bottleneck

Comments:

Click to view abstract

Abstract:Clinical value set authoring, the task of identifying all codes in a standardized vocabulary that define a clinical concept, is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the $K$ most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieves AUROC 0.852 and value-set-level F1 0.298, outperforming a simpler three-layer Multilayer Perceptron (AUROC 0.799, F1 0.250); both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4, respectively. Zero-shot GPT-4o achieves value-set-level F1 0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage. We observe similar performance gains across two other classifier model types, namely a cross-encoder initialized from pre-trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code, is available at: this https URL.
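
The retrieve-then-classify pipeline in this abstract can be sketched in a few lines. Everything specific here is an assumption made for illustration: token-set Jaccard overlap stands in for whatever similarity the paper uses for retrieval, the `classify` callable stands in for the trained cross-encoder or other classifier, and the function names are hypothetical.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def retrieve_candidates(query_tokens, corpus, k):
    """Pool the codes of the k value sets most similar to the query.

    corpus is a list of (title_tokens, codes) pairs.
    """
    ranked = sorted(corpus, key=lambda vs: jaccard(query_tokens, vs[0]), reverse=True)
    pool = set()
    for _, codes in ranked[:k]:
        pool.update(codes)
    return pool


def complete_set(query_tokens, corpus, k, classify):
    """Keep only the pooled candidate codes that the classifier accepts."""
    pool = retrieve_candidates(query_tokens, corpus, k)
    return {code for code in pool if classify(query_tokens, code)}
```

The classifier only ever sees codes from the retrieved pool, so the output space is bounded by the pool size rather than by the full vocabulary. This is the reduction in effective output space that the abstract describes.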

55. 【2604.14612】ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding

Link: https://arxiv.org/abs/2604.14612

Authors: Walaa Amer, Uday das, Fadi Kurdahi

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: sacrificing output quality, large language models, language models designed, draft model, output quality

Comments: 13 pages, 9 figures

Click to view abstract

Abstract:Self-speculative decoding is an inference technique for large language models designed to speed up generation without sacrificing output quality. It combines fast, approximate decoding using a compact version of the model as a draft model with selective re-evaluation by the full target model. Some existing methods form the draft model by dynamically learning which layers to skip during inference, effectively creating a smaller subnetwork to speed up computation. However, using heuristic-based approaches to select layers to skip can often be simpler and more effective. In this paper, we propose ConfLayers, a dynamic plug-and-play approach to forming the draft model in self-speculative decoding via confidence-based intermediate layer skipping. The process iteratively computes confidence scores for all layers, selects layers to skip based on an adaptive threshold, evaluates the performance of the resulting set, and updates the best selection until no further improvement is achieved or a maximum number of iterations is reached. This framework avoids the overhead and complexity of training a layer skipping policy and can provide more consistent speed-quality trade-offs while preserving the adaptivity of the draft model to diverse tasks and datasets. The performance evaluation of ConfLayers across different models and datasets shows that our novel approach offers up to 1.4x speedup over vanilla LLM generation.

56. 【2604.14602】CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification

Link: https://arxiv.org/abs/2604.14602

Authors: Yian Wang, Yuen Chen, Agam Goyal, Hari Sundaram

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, posing significant risks, Large language, frequently generate toxic, generate toxic content

Notes: Accepted to ACL 2026. 22 pages, 1 figure

Abstract:Large language models (LLMs) frequently generate toxic content, posing significant risks for safe deployment. Current mitigation strategies often degrade generation quality or require costly human annotation. We propose CAUSALDETOX, a framework that identifies and intervenes on the specific attention heads causally responsible for toxic generation. Using the Probability of Necessity and Sufficiency (PNS), we isolate a minimal set of heads that are necessary and sufficient for toxicity. We utilize these components via two complementary strategies: (1) Local Inference-Time Intervention, which constructs dynamic, input-specific steering vectors for context-aware detoxification, and (2) PNS-Guided Fine-Tuning, which permanently unlearns toxic representations. We also introduce PARATOX, a novel benchmark of aligned toxic/non-toxic sentence pairs enabling controlled counterfactual evaluation. Experiments on ToxiGen, ImplicitHate, and ParaDetox show that CAUSALDETOX achieves up to 5.34% greater toxicity reduction compared to baselines while preserving linguistic fluency, and offers a 7x speedup in head selection.

57. 【2604.14595】NLP needs Diversity outside of 'Diversity'

Link: https://arxiv.org/abs/2604.14595

Authors: Joshua Tint

Categories: Computation and Language (cs.CL)

Keywords: position paper argues, areas surrounding fairness, surrounding fairness, position paper, recent progress

Notes: 7 pages, 1 figure

Abstract:This position paper argues that recent progress with diversity in NLP is disproportionately concentrated on a small number of areas surrounding fairness. We further argue that this is the result of a number of incentives, biases, and barriers which come together to disenfranchise marginalized researchers in non-fairness fields, or to move them into fairness-related fields. We substantiate our claims with an investigation into the demographics of NLP researchers by subfield, using our research to support a number of recommendations for ensuring that all areas within NLP can become more inclusive and equitable. In particular, we highlight the importance of breaking down feedback loops that reinforce disparities, and the need to address geographical and linguistic barriers that hinder participation in NLP research.

58. 【2604.14593】Mechanistic Decoding of Cognitive Constructs in LLMs

Link: https://arxiv.org/abs/2604.14593

Authors: Yitong Shou, Manhao Guan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, sophisticated affective capabilities, emotions remain unclear, increasingly sophisticated affective

Notes: This work has been submitted to the IEEE for possible publication

Abstract:While Large Language Models (LLMs) demonstrate increasingly sophisticated affective capabilities, the internal mechanisms by which they process complex emotions remain unclear. Existing interpretability approaches often treat models as black boxes or focus on coarse-grained basic emotions, leaving the cognitive structure of more complex affective states underexplored. To bridge this gap, we propose a Cognitive Reverse-Engineering framework based on Representation Engineering (RepE) to analyze social-comparison jealousy. By combining appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self-Definitional Relevance, and examine their causal effects on model judgments. Experiments on eight LLMs from the Llama, Qwen, and Gemma families suggest that models natively encode jealousy as a structured linear combination of these constituent factors. Their internal representations are broadly consistent with the human psychological construct, treating Superiority as the foundational trigger and Relevance as the ultimate intensity multiplier. Our framework also demonstrates that toxic emotional states can be mechanically detected and surgically suppressed, suggesting a possible route toward representational monitoring and intervention for AI safety in multi-agent environments.

59. 【2604.14585】Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

Link: https://arxiv.org/abs/2604.14585

Authors: Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Amazon Nova Lite, Claude Haiku, Nova Lite, Amazon Nova, times

Notes:

Abstract:Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku (6 methods $\times$ 4 tasks $\times$ 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to $+6.8$ points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy: (A) individual prompts are worth optimizing, and (B) agent prompts interact, requiring joint optimization. Interaction effects are never significant ($p > 0.52$, all $F < 1.0$), and optimization helps only when the task has exploitable output structure -- a format the model can produce but does not default to. We provide a two-stage diagnostic: an \$80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile -- turning a coin flip into an informed decision.

60. 【2604.14572】Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Link: https://arxiv.org/abs/2604.14572

Authors: Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: grounds LLM responses, Retrieval-Augmented Generation, grounds LLM, LLM responses, limiting its ability

Notes:

Abstract:Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results: it never sees how the corpus is organized or what it has not yet retrieved, limiting its ability to backtrack or combine scattered evidence. We present Corpus2Skill, which distills a document corpus into a hierarchical skill directory offline and lets an LLM agent navigate it at serve time. The compilation pipeline iteratively clusters documents, generates LLM-written summaries at each level, and materializes the result as a tree of navigable skill files. At serve time, the agent receives a bird's-eye view of the corpus, drills into topic branches via progressively finer summaries, and retrieves full documents by ID. Because the hierarchy is explicitly visible, the agent can reason about where to look, backtrack from unproductive paths, and combine evidence across branches. On WixQA, an enterprise customer-support benchmark for RAG, Corpus2Skill outperforms dense retrieval, RAPTOR, and agentic RAG baselines across all quality metrics.

61. 【2604.14568】Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Link: https://arxiv.org/abs/2604.14568

Authors: Yixu Huang, Tinghui Zhu, Muhao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: recently shown strong, shown strong cross-modal, strong cross-modal reasoning, cross-modal reasoning capabilities, Visual reasoning

Notes:

Abstract:Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for any task. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50-90% while maintaining overall accuracy, especially in perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: this https URL.

62. 【2604.14564】MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation

Link: https://arxiv.org/abs/2604.14564

Authors: Pengfei Li, Shijie Wang, Fangyuan Li, Yikun Fu, Kaifeng Liu, Kaiyan Zhang, Dazhi Zhang, Yuqiang Li, Biqing Qi, Bowen Zhou

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: demonstrated strong performance, paradigms have demonstrated, demonstrated strong, reasoning-intensive tasks, strong performance

Notes: Accepted by ACL 2026

Abstract:Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning-intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search-enhanced RL alleviates this issue by introducing structured exploration, which remains constrained by the single-agent policy priors. Meanwhile, leveraging multiple interacting policies can acquire more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS$^2$ (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently-optimized agents collaborate within a shared tree-structured search environment. MARS$^2$ models the search tree as a learnable multi-agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support effective learning, we introduce a path-level group advantage formulation based on tree-consistent reward shaping, which facilitates effective credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS$^2$ consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi-agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at this https URL.

63. 【2604.14528】Dissecting Failure Dynamics in Large Language Model Reasoning

Link: https://arxiv.org/abs/2604.14528

Authors: Wei Zhu, Jian Zhang, Lixing Yu, Kun Yue, Zhiwen Tang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, achieve strong performance, remains poorly understood

Notes: Accepted by ACL 2026

Abstract:Large Language Models (LLMs) achieve strong performance through extended inference-time deliberation, yet how their reasoning failures arise remains poorly understood. By analyzing model-generated reasoning trajectories, we find that errors are not uniformly distributed but often originate from a small number of early transition points, after which reasoning remains locally coherent but globally incorrect. These transitions coincide with localized spikes in token-level entropy, and alternative continuations from the same intermediate state can still lead to correct solutions. Based on these observations, we introduce GUARD, a targeted inference-time framework that probes and redirects critical transitions using uncertainty signals. Empirical evaluations across multiple benchmarks confirm that interventions guided by these failure dynamics lead to more reliable reasoning outcomes. Our findings highlight the importance of understanding when and how reasoning first deviates, complementing existing approaches that focus on scaling inference-time computation.

64. 【2604.14513】PeerPrism: Peer Evaluation Expertise vs Review-writing AI

Link: https://arxiv.org/abs/2604.14513

Authors: Soroush Sadeghian, Alireza Daqiq, Radin Cheraghi, Sajad Ebrahimi, Negar Arabzadeh, Ebrahim Bagheri

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, assisting with drafting, scientific peer review

Notes:

Abstract:Large Language Models (LLMs) are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer-review LLM detection methods largely treat authorship as a binary problem (human vs. AI) without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human-AI collaboration. In this work, we introduce PeerPrism, a large-scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark state-of-the-art LLM text detection methods on PeerPrism. While several methods achieve high accuracy on the standard binary task (human vs. fully synthetic), their predictions diverge sharply under hybrid regimes. In particular, when ideas originate from humans but the surface text is AI-generated, detectors frequently disagree and produce contradictory classifications. Accompanied by stylometric and semantic analyses, our results show that current detection methods conflate surface realization with intellectual contribution. Overall, we demonstrate that LLM detection in peer review cannot be reduced to a binary attribution problem. Instead, authorship must be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization. PeerPrism is the first benchmark evaluating human-AI collaboration in these settings. We release all code, data, prompts, and evaluation scripts to facilitate reproducible research at this https URL.

Related DOI: https://doi.org/10.1145/3805712.3808602

65. 【2604.14489】CobwebTM: Probabilistic Concept Formation for Lifelong and Hierarchical Topic Modeling

Link: https://arxiv.org/abs/2604.14489

Authors: Karthik Singaravadivelan, Anant Gupta, Zekun Wang, Christopher MacLellan

Categories: Computation and Language (cs.CL)

Keywords: uncover latent semantic, latent semantic structure, minimal supervision, seeks to uncover, uncover latent

Notes: 16 pages, 8 figures, 11 tables

Abstract:Topic modeling seeks to uncover latent semantic structure in text corpora with minimal supervision. Neural approaches achieve strong performance but require extensive tuning and struggle with lifelong learning due to catastrophic forgetting and fixed capacity, while classical probabilistic models lack flexibility and adaptability to streaming data. We introduce CobwebTM, a low-parameter lifelong hierarchical topic model based on incremental probabilistic concept formation. By adapting the Cobweb algorithm to continuous document embeddings, CobwebTM constructs semantic hierarchies online, enabling unsupervised topic discovery, dynamic topic creation, and hierarchical organization without predefining the number of topics. Across diverse datasets, CobwebTM achieves strong topic coherence, stable topics over time, and high-quality hierarchies, demonstrating that incremental symbolic concept formation combined with pretrained representations is an efficient approach to topic modeling.

66. 【2604.14488】Controlling Authority Retrieval: A Missing Retrieval Objective for Authority-Governed Knowledge

Link: https://arxiv.org/abs/2604.14488

Authors: Andre Bacellar

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: remaining semantically distant, Controlling Authority Retrieval, knowledge accumulates, accumulates under formal, document can formally

Notes: 23 pages, 13 tables; code and data at [this https URL](https://github.com/andremir/car-retrieval)

Abstract:In any domain where knowledge accumulates under formal authority -- law, drug regulation, software security -- a later document can formally void an earlier one while remaining semantically distant from it. We formalize this as Controlling Authority Retrieval (CAR): recovering the active frontier front(cl(A_k(q))) of the authority closure of the semantic anchor set -- a different mathematical problem from argmax_d s(q,d). The two central results are: Theorem 4 (CAR-Correctness Characterization) gives necessary-and-sufficient conditions on any retrieved set R for TCA(R,q)=1 -- frontier inclusion and no-ignored-superseder -- independent of how R was produced. Proposition 2 (Scope Identifiability Upper Bound) establishes phi(q) as a hard worst-case ceiling: for any scope-indexed algorithm, TCA@k <= phi(q) * R_anchor(q), proved by an adversarial permutation argument. Three independent real-world corpora validate the proved structure: security advisories (Dense TCA@5=0.270, two-stage 0.975), SCOTUS overruling pairs (Dense=0.172, two-stage 0.926), FDA drug records (Dense=0.064, two-stage 0.774). A GPT-4o-mini experiment shows the downstream cost: Dense RAG produces explicit "not patched" claims for 39% of queries where a patch exists; Two-Stage cuts this to 16%. Four benchmark datasets, domain adapters, and a single-command scorer are released at this https URL.
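
One plausible reading of front(cl(A_k(q))) in the abstract above is: take the semantically retrieved anchor documents, follow "superseded by" links until nothing new is reached, then keep only the documents that nothing supersedes. The sketch below illustrates that reading. The `superseded_by` mapping, the function names, and the document IDs are hypothetical, and the paper's formal definitions may differ.

```python
from typing import Dict, Set


def authority_closure(
    anchors: Set[str], superseded_by: Dict[str, Set[str]]
) -> Set[str]:
    """Anchors plus every document that (transitively) supersedes them."""
    closure = set(anchors)
    stack = list(anchors)
    while stack:
        doc = stack.pop()
        for later in superseded_by.get(doc, set()):
            if later not in closure:
                closure.add(later)
                stack.append(later)
    return closure


def active_frontier(
    closure: Set[str], superseded_by: Dict[str, Set[str]]
) -> Set[str]:
    """Documents in the closure that no other document supersedes."""
    return {doc for doc in closure if not superseded_by.get(doc)}


# Hypothetical toy corpus: advisory A is voided by B, which is voided by C;
# D is an unrelated document that is still in force.
links = {"A": {"B"}, "B": {"C"}}
closure = authority_closure({"A", "D"}, links)
frontier = active_frontier(closure, links)
```

A purely similarity-based retriever that returns only A would miss C, which is the document actually in force. That gap is the failure mode the abstract describes.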

67. 【2604.14463】Psychological Steering of Large Language Models

Link: https://arxiv.org/abs/2604.14463

Authors: Leonardo Blas, Robin Jia, Emilio Ferrara

Categories: Computation and Language (cs.CL)

Keywords: Large language models, consistent human-like behavior, Large language, emulate a consistent, consistent human-like

Notes: 66 pages, 60 images

Abstract:Large language models (LLMs) emulate consistent human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P$^2$), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, we find that a hybrid of P$^2$ and MD injections outperforms both methods in 13 of 14 LLMs, with gains over P$^2$ ranging from 5.6% to 21.9% and from 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering. Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.
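
As a rough illustration of a mean-difference (MD) injection as it is commonly described in the activation-steering literature: the steering direction is the difference between the mean activation on trait-positive and trait-negative examples, and it is added to a hidden state with some strength. The sketch below uses plain Python lists in place of real residual-stream activations. It does not cover the calibration or fluency-constrained sweeps mentioned in the abstract, and all names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def mean_difference(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Steering direction: mean(positive activations) - mean(negative ones)."""
    pos_mean, neg_mean = mean_vector(positive), mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def inject(hidden: Vector, direction: Vector, strength: float) -> Vector:
    """Additive injection: hidden + strength * direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional "activations" (made-up numbers for illustration).
direction = mean_difference(
    positive=[[1.0, 2.0], [3.0, 4.0]],
    negative=[[0.0, 1.0], [2.0, 1.0]],
)
steered = inject([0.0, 0.0], direction, strength=0.5)
```

The `strength` argument plays the role of the injection strength that the abstract says is swept. Choosing its units and range is exactly the calibration question the paper addresses, which this sketch leaves open.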

68. 【2604.14459】Filling in the Mechanisms: How do LMs Learn Filler-Gap Dependencies under Developmental Constraints?

Link: https://arxiv.org/abs/2604.14459

Authors: Atrey Desai, Sathvik Nair

Categories: Computation and Language (cs.CL)

Keywords: Distributed Alignment Search, syntactic constructions, applied Distributed Alignment, Alignment Search, Distributed Alignment

Notes: To be published in the 64th Annual Meeting of the Association for Computational Linguistics

Abstract:For humans, filler-gap dependencies require a shared representation across different syntactic constructions. Although causal analyses suggest this may also be true for LLMs (Boguraev et al., 2025), it is still unclear if such a representation also exists for language models trained on developmentally feasible quantities of data. We applied Distributed Alignment Search (DAS, Geiger et al. (2024)) to LMs trained on varying amounts of data from the BabyLM challenge (Warstadt et al., 2023), to evaluate whether representations of filler-gap dependencies transfer between wh-questions and topicalization, which greatly vary in terms of their input frequency. Our results suggest shared, yet item-sensitive mechanisms may develop with limited training data. More importantly, LMs still require far more data than humans to learn comparable generalizations, highlighting the need for language-specific biases in models of language acquisition.

69. 【2604.14448】MARCA: A Checklist-Based Benchmark for Multilingual Web Search

Link: https://arxiv.org/abs/2604.14448

Authors: Thales Sales Almeida, Giovana Kerche Bonás, Ramon Pires, Celio Larcher, Hugo Abonizio, Marcos Piau, Roseval Malaquias Junior, Rodrigo Nogueira, Thiago Laitz

Categories: Computation and Language (cs.CL)

Keywords: select relevant evidence, synthesize complete answers, select relevant, relevant evidence, reliability depends

Notes:

Abstract:Large language models (LLMs) are increasingly used as sources of information, yet their reliability depends on the ability to search the web, select relevant evidence, and synthesize complete answers. While recent benchmarks evaluate web-browsing and agentic tool use, multilingual settings, and Portuguese in particular, remain underexplored. We present MARCA, a bilingual (English and Portuguese) benchmark for evaluating LLMs on web-based information seeking. MARCA consists of 52 manually authored multi-entity questions, paired with manually validated checklist-style rubrics that explicitly measure answer completeness and correctness. We evaluate 14 models under two interaction settings: a Basic framework with direct web search and scraping, and an Orchestrator framework that enables task decomposition via delegated subagents. To capture stochasticity, each question is executed multiple times and performance is reported with run-level uncertainty. Across models, we observe large performance differences, find that orchestration often improves coverage, and identify substantial variability in how models transfer from English to Portuguese. The benchmark is available at this https URL.

70. 【2604.14442】Hierarchical vs. Flat Iteration in Shared-Weight Transformers

Link: https://arxiv.org/abs/2604.14442

Authors: Sang-Il Han

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Transformer-based language model, Transformer-based language, hierarchically structured, shared-weight recurrence, language model

Notes:

Abstract:We present an empirical study of whether hierarchically structured, shared-weight recurrence can match the representational quality of independent-layer stacking in a Transformer-based language model. HRM-LM replaces L independent Transformer layers with a two-speed recurrent pair: a Fast module operating at every step for local refinement, and a Slow module operating every T steps for global compression. This recurrent hierarchy is unrolled for M = N x T steps with shared parameters. The central and most robust finding, supported by a parameter-matched Universal Transformer ablation (UniTF, 1.2B) across five independent runs, is a sharp empirical gap between the two approaches.

71. 【2604.14430】Three-Phase Transformer

Link: https://arxiv.org/abs/2604.14430

Authors: Mohammad R. Abu Ayyash

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: present Three-Phase Transformer, residual-stream structural prior, decoder-only Transformers, GQA backbone, constraint aligning GQA

Notes: 48 pages, 20 figures, 23 tables. Code: [this https URL](https://github.com/achelousace/three-phase-transformer)

Abstract:We present Three-Phase Transformer (3PT), a residual-stream structural prior for decoder-only Transformers on a standard SwiGLU + RMSNorm + RoPE + GQA backbone. The hidden vector is partitioned into N equally-sized cyclic channels, each maintained by phase-respecting ops: a per-channel RMSNorm, a 2D Givens rotation between attention and FFN that rotates each channel by theta + i*(2*pi/N), and a head-count constraint aligning GQA heads with the partition. The architecture is a self-stabilizing equilibrium between scrambling and re-imposition, not a bolted-on module. The partition carves out a one-dimensional DC subspace orthogonal to the channels, into which we inject a fixed Gabriel's horn profile r(p) = 1/(p+1) as an absolute-position side-channel composing orthogonally with RoPE's relative-position rotation. The canonical N=3 borrows its metaphor from balanced three-phase AC, where three sinusoids 120 degrees apart sum to zero with no anti-correlated pair. At 123M parameters on WikiText-103, 3PT achieves -7.20% perplexity (-2.62% bits-per-byte) over a matched RoPE-Only baseline at +1,536 parameters (0.00124% of total), with 1.93x step-count convergence speedup (1.64x wall-clock). N behaves as a parameter-sharing knob rather than a unique optimum: at 5.5M an N-sweep over {1,2,3,4,6,8,12} is near-monotone with N=1 winning; at 123M a three-seed sweep finds N=3 and N=1 statistically indistinguishable. The load-bearing mechanism is the channel-partitioned residual stream, per-block rotation, per-phase normalization, and horn DC injection. We characterize (a) self-stabilization of the geometry without explicit enforcement, a novel instance of the conservation-law framework for neural networks; (b) a U-shaped depth profile of rotation-angle drift at 12 layers; (c) orthogonal composition with RoPE, attention, and FFN.
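
Two pieces of this abstract are easy to check numerically: the per-channel rotation angles theta + i*(2*pi/N), and the three-phase fact that sinusoids 120 degrees apart sum to zero. The sketch below is a toy illustration only. How the 2D Givens rotation is laid out inside each channel is not stated in the abstract, so rotating consecutive coordinate pairs is an assumption, and all function names are hypothetical.

```python
import math
from typing import List


def channel_angles(theta: float, n: int) -> List[float]:
    """Rotation angle for channel i: theta + i * (2*pi / n)."""
    return [theta + i * (2 * math.pi / n) for i in range(n)]


def rotate_pairs(channel: List[float], angle: float) -> List[float]:
    """Apply one 2D Givens rotation to consecutive coordinate pairs.

    Assumption: the pairing (x0, x1), (x2, x3), ... is chosen for
    illustration; the channel width must be even.
    """
    c, s = math.cos(angle), math.sin(angle)
    out: List[float] = []
    for j in range(0, len(channel), 2):
        x, y = channel[j], channel[j + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out


def phase_rotate(hidden: List[float], theta: float, n: int) -> List[float]:
    """Split hidden into n equal channels and rotate each by its own angle."""
    width = len(hidden) // n
    out: List[float] = []
    for i, angle in enumerate(channel_angles(theta, n)):
        out.extend(rotate_pairs(hidden[i * width:(i + 1) * width], angle))
    return out


def norm(vector: List[float]) -> float:
    return math.sqrt(sum(x * x for x in vector))


hidden = [float(k) for k in range(1, 13)]  # toy 12-dim hidden vector
rotated = phase_rotate(hidden, theta=0.3, n=3)
phase_sum = sum(math.cos(a) for a in channel_angles(0.3, 3))
```

Because each step is a pure rotation, `norm(rotated)` matches `norm(hidden)` up to floating-point error, and `phase_sum` is zero up to rounding, which is the balanced three-phase property the abstract borrows as its metaphor.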

72. 【2604.14414】The Autocorrelation Blind Spot: Why 42% of Turn-Level Findings in LLM Conversation Analysis May Be Spurious

Link: https://arxiv.org/abs/2604.14414

Authors: Ferdinand M. Schessl

Categories: Computation and Language (cs.CL)

Keywords: dialogue quality, multi-turn human-LLM conversations, evaluate properties, safety and sycophancy, sycophancy to dialogue

Notes: 14 pages, 3 figures, 5 tables, 1 algorithm. Code and synthetic demonstration data: [this https URL](https://github.com/ferdinandschessl-boop/autocorrelation-correction)

Abstract:Turn-level metrics are widely used to evaluate properties of multi-turn human-LLM conversations, from safety and sycophancy to dialogue quality. However, consecutive turns within a conversation are not statistically independent -- a fact that virtually all current evaluation pipelines fail to correct for in their statistical inference. We systematically characterize the autocorrelation structure of 66 turn-level metrics across 202 multi-turn conversations (11,639 turn pairs, 5 German-speaking users, 4 LLM platforms) and demonstrate that naive pooled analysis produces severely inflated significance estimates: 42% of associations that appear significant under standard pooled testing fail to survive cluster-robust correction. The inflation varies substantially across categories rather than scaling linearly with autocorrelation: three memoryless families (embedding velocity, directional, differential) aggregate to 14%, while the seven non-memoryless families (thermo-cycle, frame distance, lexical/structural, rolling windows, cumulative, interaction, timestamp) aggregate to 33%, with individual category rates ranging from 0% to 100% depending on per-family effect size. We present a two-stage correction framework combining Chelton (1983) effective degrees of freedom with conversation-level block bootstrap, and validate it on a pre-registered hold-out split where cluster-robust metrics replicate at 57% versus 30% for pooled-only metrics. We provide concrete design principles, a publication checklist, and open-source code for the correction pipeline. A survey of ~30 recent papers at major NLP and AI venues that compute turn-level statistics in LLM evaluations finds that only 4 address temporal dependence at all, and 26 do not correct for it.
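
The conversation-level block bootstrap mentioned in this abstract can be illustrated with a generic cluster bootstrap: whole conversations are resampled with replacement, so turns from the same conversation always stay together. The sketch below is not the paper's pipeline. It omits the effective-degrees-of-freedom stage, and the function names, the percentile interval, and the toy data are all assumptions.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]          # (metric_x, metric_y) for one turn
Conversation = List[Pair]


def pearson(pairs: Sequence[Pair]) -> float:
    """Pearson correlation of paired values (needs non-zero variance)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5


def cluster_bootstrap_ci(
    conversations: Sequence[Conversation],
    n_boot: int = 200,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval for the pooled correlation, resampling whole
    conversations rather than individual turns."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = rng.choices(conversations, k=len(conversations))
        pooled = [pair for conv in sample for pair in conv]
        stats.append(pearson(pooled))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Made-up toy data: four short conversations of (x, y) turn-level metrics.
toy = [
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)],
    [(1.0, 0.5), (2.0, 1.4), (3.0, 1.1)],
    [(0.0, 1.0), (1.0, 3.0), (4.0, 2.0)],
    [(2.0, 2.0), (3.0, 5.0), (5.0, 4.0)],
]
low, high = cluster_bootstrap_ci(toy)
```

Resampling at the conversation level keeps within-conversation dependence inside each resampled unit. A naive bootstrap over pooled turns breaks that dependence and tends to produce intervals that are too narrow.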

73. 【2604.14397】Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection

Link: https://arxiv.org/abs/2604.14397

Authors: David Basil, Chirooth Girigowda, Bradley Hauer, Sahir Momin, Ning Shi, Grzegorz Kondrak

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: automatically expanding WordNet-style, expanding WordNet-style lexical, study the task, task of automatically, automatically expanding

Notes: To be published in the proceedings of Canadian AI 2026

Abstract:We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects English synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate these alignments and ensure their quality, we augment a pre-trained base aligner with a bilingual dictionary, which is also used to filter out incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and requiring few external resources. We plan to make our code, documentation, and generated sense inventories accessible.

74. 【2604.14389】BiCon-Gate: Consistency-Gated De-colloquialisation for Dialogue Fact-Checking

Link: https://arxiv.org/abs/2604.14389

Authors: Hyunkyung Park, Arkaitz Zubiaga

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: involves multi-turn conversations, dialogue involves multi-turn, frequent yet understudied, Automated fact-checking, involves multi-turn

Notes: 15 pages, 7 figures. Published in FEVER 2026

Abstract:Automated fact-checking in dialogue involves multi-turn conversations where colloquial language is frequent yet understudied. To address this gap, we propose a conservative rewrite candidate for each response claim via staged de-colloquialisation, combining lightweight surface normalisation with scoped in-claim coreference resolution. We then introduce BiCon-Gate, a semantics-aware consistency gate that selects the rewrite candidate only when it is semantically supported by the dialogue context, otherwise falling back to the original claim. This gated selection stabilises downstream fact-checking and yields gains in both evidence retrieval and fact verification. On the DialFact benchmark, our approach improves retrieval and verification, with particularly strong gains on SUPPORTS, and outperforms competitive baselines, including a decoder-based one-shot LLM rewrite that attempts to perform all de-colloquialisation steps in a single pass.

75. 【2604.14363】The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

链接https://arxiv.org/abs/2604.14363

作者:Akshay Paruchuri,Ishan Chatterjee,Henry Fuchs,Ehsan Adeli,Piotr Didyk

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:remains poorly understood, failure remains poorly, models systematically underperform, poorly understood, systematically underperform

备注: 29 pages, 9 figures, 19 tables

点击查看摘要

Abstract:Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4$\times$ more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
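
As a rough illustration of centroid replacement, the sketch below collapses each token vector to its nearest centroid. It assumes the centroids have already been fitted by K-means elsewhere; the function names and the toy two-dimensional vectors are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def replace_with_nearest_centroid(
    tokens: Sequence[Vector], centroids: Sequence[Vector]
) -> List[List[float]]:
    """Replace every token vector with a copy of its closest centroid."""
    replaced = []
    for token in tokens:
        nearest = min(centroids, key=lambda c: squared_distance(token, c))
        replaced.append(list(nearest))
    return replaced


# Toy data: three token embeddings and two pre-fitted centroids.
tokens = [[0.1, 0.0], [0.9, 1.1], [0.2, -0.1]]
centroids = [[0.0, 0.0], [1.0, 1.0]]
erased = replace_with_nearest_centroid(tokens, centroids)
```

After replacement, tokens that shared a cluster become indistinguishable, which is what removes the fine-grained structure of that modality while keeping the coarse cluster identity.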

76. 【2604.14362】APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI

链接https://arxiv.org/abs/2604.14362

作者:Pratyay Banerjee,Masud Moshtaghi,Shivashankar Subramanian,Amita Misra,Ankit Chadha

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large language models, simply enlarging context, enlarging context windows, applying naive retrieval, Large language

备注: Accepted to ACL 2026 Main Conference

点击查看摘要

Abstract:Large language models still struggle with reliable long-term conversational memory: simply enlarging context windows or applying naive retrieval often introduces noise and destabilizes responses. We present APEX-MEM, a conversational memory system that combines three key innovations: (1) a property graph which uses a domain-agnostic ontology to structure conversations as temporally grounded events in an entity-centric framework, (2) append-only storage that preserves the full temporal evolution of information, and (3) a multi-tool retrieval agent that understands and resolves conflicting or evolving information at query time, producing a compact and contextually relevant memory summary. This retrieval-time resolution preserves the full interaction history while suppressing irrelevant details. APEX-MEM achieves 88.88% accuracy on LOCOMO's Question Answering task and 86.2% on LongMemEval, outperforming state-of-the-art session-aware approaches and demonstrating that structured property graphs enable more temporally coherent long-term conversational reasoning.
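
The combination of append-only storage and query-time resolution can be sketched in a few lines. This is a deliberately simplified model: the `Fact` and `AppendOnlyMemory` names are hypothetical, and the "latest timestamp wins" rule is an assumption standing in for the paper's multi-tool retrieval agent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    entity: str
    prop: str
    value: str
    timestamp: int  # e.g. a turn index; the unit is an assumption


class AppendOnlyMemory:
    """Facts are only ever appended; conflicts are resolved when querying."""

    def __init__(self) -> None:
        self._log: List[Fact] = []

    def append(self, fact: Fact) -> None:
        self._log.append(fact)

    def history(self, entity: str, prop: str) -> List[Fact]:
        """Full temporal evolution of one property, oldest first."""
        matches = [f for f in self._log if f.entity == entity and f.prop == prop]
        return sorted(matches, key=lambda f: f.timestamp)

    def resolve(self, entity: str, prop: str, as_of: Optional[int] = None) -> Optional[str]:
        """Return the most recent value known at time `as_of` (or overall)."""
        known = [f for f in self.history(entity, prop) if as_of is None or f.timestamp <= as_of]
        return known[-1].value if known else None


memory = AppendOnlyMemory()
memory.append(Fact("user", "city", "Boston", timestamp=1))
memory.append(Fact("user", "city", "Seattle", timestamp=5))
current_city = memory.resolve("user", "city")
earlier_city = memory.resolve("user", "city", as_of=3)
```

Because nothing is overwritten, a question about the past (`as_of=3`) and a question about the present can both be answered from the same log.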

77. 【2604.14356】When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

链接https://arxiv.org/abs/2604.14356

作者:Apoorv Prasad,Susan McRoy

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:polycystic ovary syndrome, face substantially elevated, body image distress, identify co-occurring presentations, substantially elevated risks

备注

点击查看摘要

Abstract:Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines that operationalize the clinical framework of Lee et al. (2017). Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating that the models are best used for screening rather than autonomous diagnosis.

78. 【2604.14339】Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation

链接https://arxiv.org/abs/2604.14339

作者:Zichong Li,Chen Liang,Liliang Ren,Tuo Zhao,Yelong Shen,Weizhu Chen

类目:Computation and Language (cs.CL)

关键词:reliable long-context understanding, Large language models, require reliable long-context, Large language, increasingly operate

备注

点击查看摘要

Abstract:Large language models (LLMs) increasingly operate in settings that require reliable long-context understanding, such as retrieval-augmented generation and multi-document reasoning. A common strategy is to fine-tune pretrained short-context models at the target sequence length. However, we find that standard long-context adaptation can remain brittle: model accuracy depends strongly on the absolute placement of relevant evidence, exhibiting high positional variance even when controlling for task format and difficulty. We propose RoPE-Perturbed Self-Distillation, a training regularizer that improves positional robustness. The core idea is to form alternative "views" of the same training sequence by perturbing its RoPE indices -- effectively moving parts of the context to different positions -- and to train the model to produce consistent predictions across views via self-distillation. This encourages reliance on semantic signals instead of brittle position dependencies. Experiments on long-context adaptation of Llama-3-8B and Qwen-3-4B demonstrate consistent gains on long-context benchmarks, including up to 12.04% improvement on RULER-64K for Llama-3-8B and 2.71% on RULER-256K for Qwen-3-4B after SFT, alongside improved length extrapolation beyond the training context window.
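
One way to picture the "views" idea is to keep the token order fixed while reassigning blocks of position indices, then penalize disagreement between the predictions of two views. The sketch below is an assumption-laden illustration, not the authors' procedure: the chunk-level shuffle, the function names, and the plain KL term are all hypothetical choices.

```python
import math
import random
from typing import List, Sequence


def perturbed_position_ids(seq_len: int, chunk_size: int, rng: random.Random) -> List[int]:
    """Shuffle chunk-sized blocks of position ids; token i receives ids[i] as its RoPE index."""
    chunks = [
        list(range(start, min(start + chunk_size, seq_len)))
        for start in range(0, seq_len, chunk_size)
    ]
    rng.shuffle(chunks)
    return [pos for chunk in chunks for pos in chunk]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two next-token distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# A perturbed view of a 10-token sequence, and a toy consistency loss between
# the predictions under the original view (p) and the perturbed view (q).
position_ids = perturbed_position_ids(seq_len=10, chunk_size=4, rng=random.Random(0))
consistency_loss = kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
```

Local order inside each chunk is preserved, so only the absolute placement of a chunk changes, which is the dependency the regularizer is meant to weaken.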

79. 【2604.14325】Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance

链接https://arxiv.org/abs/2604.14325

作者:Bar Alon,Itamar Zimerman,Lior Wolf

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:achieve strong performance, revolutionized NLP, Large language models, Large language, achieve strong

备注: 24 pages, multiple figures (e.g., at least 6 main figures), includes experiments across several benchmarks (MMLU, CommonsenseQA, SciQ, ARC, OpenBookQA); code available on GitHub

点击查看摘要

Abstract:Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, that is, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.

80. 【2604.14324】Purging the Gray Zone: Latent-Geometric Denoising for Precise Knowledge Boundary Awareness

链接https://arxiv.org/abs/2604.14324

作者:Hao An,Yibin Lou,Jiayi Guo,Yang Xu

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, exhibit hallucinations due, inability to accurately, accurately perceive

备注: ACL 2026 Findings

点击查看摘要

Abstract:Large language models (LLMs) often exhibit hallucinations due to their inability to accurately perceive their own knowledge boundaries. Existing abstention fine-tuning methods typically partition datasets directly based on response accuracy, causing models to suffer from severe label noise near the decision boundaries and consequently exhibit high rates of abstentions or hallucinations. This paper adopts a latent space representation perspective, revealing a "gray zone" near the decision hyperplane where internal belief ambiguity constitutes the core performance bottleneck. Based on this insight, we propose the **GeoDe** (**Geo**metric **De**noising) framework for abstention fine-tuning. This method constructs a truth hyperplane using linear probes and performs "geometric denoising" by employing geometric distance as a confidence signal for abstention decisions. This approach filters out ambiguous boundary samples while retaining high-fidelity signals for fine-tuning. Experiments across multiple models (Llama3, Qwen3) and benchmark datasets (TriviaQA, NQ, SciQ, SimpleQA) demonstrate that GeoDe significantly enhances model truthfulness and demonstrates strong generalization in out-of-distribution (OOD) scenarios. Code is available at this https URL.
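
The geometric filtering step can be illustrated with the standard point-to-hyperplane distance. In this sketch the probe parameters `w` and `b`, the `margin` threshold, and the function names are hypothetical; in practice the probe would be trained on hidden states, which is outside the scope of the example.

```python
import math
from typing import List, Sequence


def distance_to_hyperplane(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Signed distance of x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def drop_gray_zone(
    samples: Sequence[Sequence[float]], w: Sequence[float], b: float, margin: float
) -> List[Sequence[float]]:
    """Keep only samples whose distance from the probe boundary is at least `margin`."""
    return [x for x in samples if abs(distance_to_hyperplane(x, w, b)) >= margin]


# Toy probe (assumed, not learned here) and three toy hidden states.
w, b = [3.0, 4.0], 0.0
samples = [[1.0, 1.0], [0.1, 0.0], [-2.0, -1.0]]
kept = drop_gray_zone(samples, w, b, margin=0.5)
```

Samples far from the boundary on either side survive; the one sitting almost on the hyperplane is treated as ambiguous and removed from the fine-tuning set.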

81. 【2604.14321】LLM Predictive Scoring and Validation: Inferring Experience Ratings from Unstructured Text

链接https://arxiv.org/abs/2604.14321

作者:Jason Potteiger,Andrew Hong,Ito Zapata

类目:Computation and Language (cs.CL)

关键词:0-10 survey scale, Major League Baseball, League Baseball teams, ratings, experience

备注: 29 pages, 5 figures, 6 tables

点击查看摘要

Abstract:We tasked GPT-4.1 to read what baseball fans wrote about their game-day experience and predict the overall experience rating each fan gave on a 0-10 survey scale. The model received only the text of a single open-ended response. These AI predictions were compared with the actual experience ratings captured by the survey instrument across approximately 10,000 fan responses from five Major League Baseball teams. In total, two-thirds of predicted ratings fell within one point of self-reported fan ratings (67% within +/-1, 36% exact match), and the predicted measurement was near-deterministic across three independent scoring runs (87% exact agreement, 99.9% within +/-1). Predicted ratings aligned most strongly with the overall experience rating (r = 0.82) rather than with any specific aspect of the game-day experience such as parking, concessions, or staff. However, predictions were systematically lower than self-reported ratings by approximately one point, and this gap was not driven by any single aspect. Rather, our analysis shows that self-reported ratings capture the fan's verdict, an overall evaluative judgment that integrates the entire experience, while predicted ratings quantify the impact of salient moments characterized as memorable, emotionally intense, unusual, or actionable. Each measure contains information the other misses. These baseline results establish that a simple, unoptimized prompt can directionally predict how fans rate their experience from the text a fan wrote and that a gap between the two numbers can be interpreted as a construct difference worth preserving rather than an error to eliminate.
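
The agreement figures quoted above (exact match, within +/-1, and the systematic gap) are simple to compute. The sketch below shows one plausible way to do so; the function name and the five toy ratings are invented for illustration and are not data from the study.

```python
from typing import Dict, Sequence


def agreement_summary(predicted: Sequence[int], reported: Sequence[int]) -> Dict[str, float]:
    """Exact-match rate, within-one-point rate, and mean signed gap (predicted minus reported)."""
    if len(predicted) != len(reported) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [p - r for p, r in zip(predicted, reported)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        "mean_bias": sum(diffs) / n,
    }


# Made-up ratings on a 0-10 scale.
predicted = [7, 8, 5, 9, 6]
reported = [8, 8, 7, 10, 6]
summary = agreement_summary(predicted, reported)
```

A negative `mean_bias` corresponds to the pattern reported in the abstract, where predictions sit below the self-reported ratings.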

82. 【2604.14315】Tracking the Temporal Dynamics of News Coverage of Catastrophic and Violent Events

链接https://arxiv.org/abs/2604.14315

作者:Emily Lugos,Maurício Gruppi

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:modern news cycle, fundamentally reshaped, rapid exchange, media framing shifts, social reactions emerge

备注

点击查看摘要

Abstract:The modern news cycle has been fundamentally reshaped by the rapid exchange of information online. As a result, media framing shifts dynamically as new information, political responses, and social reactions emerge. Understanding how these narratives form, propagate, and evolve is essential for interpreting public discourse during moments of crisis. In this study, we examine the temporal and semantic dynamics of reporting for violent and catastrophic events using a large-scale corpus of 126,602 news articles collected from online publishers. We quantify narrative change through publication volume, semantic drift, semantic dispersion, and term relevance. Our results show that sudden events of impact exhibit structured and predictable news-cycle patterns characterized by rapid surges in coverage, early semantic drift, and gradual declines toward the baseline. In addition, our results indicate the terms that are driving the temporal patterns.

83. 【2604.14314】DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

链接https://arxiv.org/abs/2604.14314

作者:Gabriel Pimenta de Freitas Cardoso,Caio Lucas da Silva Chacon,Jonas Felipe da Fonseca Oliveira,Paulo Henrique de Medeiros Araujo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:specialized small language, jointly optimize transcription, optimize transcription quality, small language models, introduces DharmaOCR Full

备注

点击查看摘要

Abstract:This manuscript introduces DharmaOCR Full and Lite, a pair of specialized small language models (SSLMs) for structured OCR that jointly optimize transcription quality, generation stability, and inference cost. It also presents DharmaOCR-Benchmark, a benchmark that covers printed, handwritten, and legal/administrative documents, and proposes a unified evaluation protocol that measures fidelity and structure while explicitly tracking text degeneration as a first-class benchmark metric (alongside unit cost). Beyond reporting degeneration rates, the manuscript empirically shows degeneration is not merely a quality failure, since it materially worsens production performance by increasing response time, reducing throughput, and inflating computational cost due to abnormally long generations. To the best of the authors' knowledge, as a methodological contribution, this is the first application of Direct Preference Optimization (DPO) for OCR, explicitly using degenerate generations as rejected examples to penalize looping behavior. Combined with Supervised Fine-Tuning (SFT) for enforcing a strict JSON schema (header, margin, footer, and text), DPO consistently reduces degeneration rate across model families (up to 87.6% relative) while preserving or improving extraction quality. The resulting models, namely, DharmaOCR Full (7B) and DharmaOCR Lite (3B), set a new state-of-the-art on DharmaOCR-Benchmark, outperforming each open-source and commercial baseline model evaluated regarding extraction quality, reaching 0.925 and 0.911 scores with 0.40% and 0.20% degeneration rates. AWQ quantization reduced per-page cost by up to 22% with negligible quality loss, enabling a strong quality-cost trade-off in comparison to proprietary OCR APIs and open-source alternatives.

84. 【2604.14306】EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation

链接https://arxiv.org/abs/2604.14306

作者:Francesco Andrea Causio,Vittorio De Vita,Olivia Riccomi,Michele Ferramola,Federico Felizzi,Antonio Cristiano,Lorenzo De Mori,Chiara Battipaglia,Melissa Sawaya,Luigi De Angelis,Marcello Di Pumpo,Alessandra Piscitelli,Pietro Eric Risuleo,Alessia Longo,Giulia Vojvodic,Mariapia Vassalli,Bianca Destro Castaniti,Nicolò Scarsi,Manuel Del Medico

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, demonstrated high proficiency, English-centric medical examinations, Language Models, Large Language

备注

点击查看摘要

Abstract:While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describes the development of EuropeMedQA, the first comprehensive, multilingual, and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. Following FAIR data principles and SPIRIT-AI guidelines, we describe a rigorous curation process and an automated translation pipeline for comparative analysis. We evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting strategy to assess cross-lingual transfer and visual reasoning. EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practices and fosters the development of more generalizable medical AI.

85. 【2604.14261】ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents

链接https://arxiv.org/abs/2604.14261

作者:Zhuofeng Li,Yi Lu,Dongfu Jiang,Haoxiang Zhang,Yuyang Bai,Chuan Li,Yu Wang,Shuiwang Ji,Jianwen Xie,Yu Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, driven increasing exploration, peer review support, language models, rapid rise

备注

点击查看摘要

Abstract:The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments lacking substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper-specific rubrics derived from official guidelines, the paper's content, and human-written reviews. We further propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across 8 dimensions. The code is available at this https URL.

86. 【2604.14228】Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

链接https://arxiv.org/abs/2604.14228

作者:Jiacheng Liu,Xiaohan Zhao,Xinyi Shang,Zhiqiang Shen

类目:Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:run shell commands, agentic coding tool, call external services, Claude Code, edit files

备注: Tech report. Code at: [this https URL](https://github.com/VILA-Lab/Dive-into-Claude-Code)

点击查看摘要

Abstract:Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user. This study describes its comprehensive architecture by analyzing the publicly available TypeScript source code and further comparing it with OpenClaw, an independent open-source AI agent system that answers many of the same design questions from a different deployment context. Our analysis identifies five human values, philosophies, and needs that motivate the architecture (human decision authority, safety and security, reliable execution, capability amplification, and contextual adaptability) and traces them through thirteen design principles to specific implementation choices. The core of the system is a simple while-loop that calls the model, runs tools, and repeats. Most of the code, however, lives in the systems around this loop: a permission system with seven modes and an ML-based classifier, a five-layer compaction pipeline for context management, four extensibility mechanisms (MCP, plugins, skills, and hooks), a subagent delegation mechanism with worktree isolation, and append-oriented session storage. A comparison with OpenClaw, a multi-channel personal assistant gateway, shows that the same recurring design questions produce different architectural answers when the deployment context changes: from per-action safety classification to perimeter-level access control, from a single CLI loop to an embedded runtime within a gateway control plane, and from context-window extensions to gateway-wide capability registration. We finally identify six open design directions for future agent systems, grounded in recent empirical, architectural, and policy literature.

87. 【2604.14218】MEME-Fusion@CHiPSAL 2026: Multimodal Ablation Study of Hate Detection and Sentiment Analysis on Nepali Memes

链接https://arxiv.org/abs/2604.14218

作者:Samir Wagle,Reewaj Khanal,Abiral Adhikari

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Devanagari-scripted social media, multimodal content structure, script-specific linguistic complexity, Hate speech detection, social media memes

备注: Preprint

点击查看摘要

Abstract:Hate speech detection in Devanagari-scripted social media memes presents compounded challenges: multimodal content structure, script-specific linguistic complexity, and extreme data scarcity in low-resource settings. This paper presents our system for the CHiPSAL 2026 shared task, addressing both Subtask A (binary hate speech detection) and Subtask B (three-class sentiment classification: positive, neutral, negative). We propose a hybrid cross-modal attention fusion architecture that combines CLIP (ViT-B/32) for visual encoding with BGE-M3 for multilingual text representation, connected through 4-head self-attention and a learnable gating network that dynamically weights modality contributions on a per-sample basis. Systematic evaluation across eight model configurations demonstrates that explicit cross-modal reasoning achieves a 5.9% F1-macro improvement over text-only baselines on Subtask A, while uncovering two unexpected but critical findings: English-centric vision models exhibit near-random performance on Devanagari script, and standard ensemble methods catastrophically degrade under data scarcity (N ≈ 850 per fold) due to correlated overfitting. The code can be accessed at this https URL.
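
Per-sample gating of two modalities can be sketched as a sigmoid gate that mixes an image vector and a text vector. The single linear gate, the scalar gate value, and all names below are assumptions for illustration; the actual system also uses 4-head self-attention and learned encoders, which are omitted here.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(
    image_vec: Sequence[float],
    text_vec: Sequence[float],
    gate_weights: Sequence[float],
    gate_bias: float,
) -> Tuple[List[float], float]:
    """Compute a scalar gate from both modalities, then mix them: g*image + (1-g)*text."""
    concat = list(image_vec) + list(text_vec)
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [gate * i + (1.0 - gate) * t for i, t in zip(image_vec, text_vec)]
    return fused, gate


# Toy vectors already projected to a shared 2-d space (an assumption).
image_vec, text_vec = [1.0, 0.0], [0.0, 1.0]
balanced, g_balanced = gated_fusion(image_vec, text_vec, [0.0] * 4, gate_bias=0.0)
image_heavy, g_image = gated_fusion(image_vec, text_vec, [0.0] * 4, gate_bias=10.0)
```

Since the gate is computed from the sample itself, a meme whose text carries the signal can lean on the text branch while an image-driven meme leans the other way.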

88. 【2604.14216】Neuro-Oracle: A Trajectory-Aware Agentic RAG Framework for Interpretable Epilepsy Surgical Prognosis

链接https://arxiv.org/abs/2604.14216

作者:Aizierjiang Aiersilan,Mohamad Koubeissi

类目:Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

关键词:Predicting post-surgical seizure, post-surgical seizure outcomes, Predicting post-surgical, post-surgical seizure, seizure outcomes

备注

点击查看摘要

Abstract:Predicting post-surgical seizure outcomes in pharmacoresistant epilepsy is a clinical challenge. Conventional deep-learning approaches operate on static, single-timepoint pre-operative scans, omitting longitudinal morphological changes. We propose Neuro-Oracle, a three-stage framework that: (i) distils pre-to-post-operative MRI changes into a compact 512-dimensional trajectory vector using a 3D Siamese contrastive encoder; (ii) retrieves historically similar surgical trajectories from a population archive via nearest-neighbour search; and (iii) synthesises a natural-language prognosis grounded in the retrieved evidence using a quantized Llama-3-8B reasoning agent. Evaluations are conducted on the public EPISURG dataset ($N = 268$ longitudinally paired cases) using five-fold stratified cross-validation. Since ground-truth seizure-freedom scores are unavailable, we utilize a clinical proxy label based on the resection type. We acknowledge that the network representations may potentially learn the anatomical features of the resection cavities (i.e., temporal versus non-temporal locations) rather than true prognostic morphometry. Our current evaluation thus serves mainly as a proof-of-concept for the trajectory-aware retrieval architecture. Trajectory-based classifiers achieve AUC values between 0.834 and 0.905, compared with 0.793 for a single-timepoint ResNet-50 baseline. The Neuro-Oracle agent (M5) matches the AUC of purely discriminative trajectory classifiers (0.867) while producing structured justifications with zero observed hallucinations under our audit protocol. A Siamese Diversity Ensemble (M6) of trajectory-space classifiers attains an AUC of 0.905 without language-model overhead.
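
Stage (ii), retrieval of similar trajectories, amounts to a nearest-neighbour lookup over stored vectors. The sketch below uses brute-force cosine similarity on three-dimensional toy vectors; the similarity measure, the archive layout, and the names are assumptions, and the real vectors would be 512-dimensional encoder outputs.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_similar(
    query: Sequence[float], archive: Dict[str, Sequence[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k archived cases most similar to the query, best first."""
    scored = [(case_id, cosine_similarity(query, vec)) for case_id, vec in archive.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical archive of trajectory vectors keyed by case id.
archive = {
    "case_a": [1.0, 0.0, 0.0],
    "case_b": [0.9, 0.1, 0.0],
    "case_c": [0.0, 1.0, 0.0],
}
neighbours = retrieve_similar([1.0, 0.0, 0.1], archive, k=2)
```

The retrieved case ids and scores are what a downstream language model would be shown as evidence when drafting the prognosis.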

89. 【2604.14214】CROP: Token-Efficient Reasoning in Large Language Models via Regularized Prompt Optimization

链接https://arxiv.org/abs/2604.14214

作者:Deep Shah,Sanket Badhe,Nehal Kathrotia,Priyanka Tiwari

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Language Models utilizing, Large Language, Language Models, incur significant latency

备注: Accepted at ICLR 2026 Workshop on Logical Reasoning of Large Language Models

点击查看摘要

Abstract:Large Language Models utilizing reasoning techniques improve task performance but incur significant latency and token costs due to verbose generation. Existing automatic prompt optimization (APO) frameworks target task accuracy exclusively at the expense of generating long reasoning traces. We propose Cost-Regularized Optimization of Prompts (CROP), an APO method that introduces regularization on response length by generating textual feedback in addition to standard accuracy feedback. This forces the optimization process to produce prompts that elicit concise responses containing only critical information and reasoning. We evaluate our approach on complex reasoning datasets, specifically GSM8K, LogiQA and BIG-Bench Hard. We achieved an 80.6% reduction in token consumption while maintaining competitive accuracy, seeing only a nominal decline in performance. This presents a pragmatic solution for deploying token-efficient and cost-effective agentic AI systems in production pipelines.

90. 【2604.14210】Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate

链接https://arxiv.org/abs/2604.14210

作者:Simiao Ren,Xingyu Shen,Yuchen Zhou,Dennis(Tsang)Ng,Ankit Raj

类目:Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:LLM coding tasks, LLM coding, Chinese, potentially reducing costs, potentially reducing

备注

点击查看摘要

Abstract:A claim has been circulating on social media and practitioner forums that Chinese prompts are more token-efficient than English for LLM coding tasks, potentially reducing costs by up to 40%. This claim has influenced developers to consider switching to Chinese for "vibe coding" to save on API costs. In this paper, we conduct a rigorous empirical study using SWE-bench Lite, a benchmark of software engineering tasks, to evaluate whether this claim of Chinese token efficiency holds up to scrutiny. Our results reveal three key findings: First, the efficiency advantage of Chinese is not observed. Second, token cost varies by model architecture in ways that defy simple assumptions: while MiniMax-2.7 shows 1.28x higher token costs for Chinese, GLM-5 actually consumes fewer tokens with Chinese prompts. Third, and most importantly, we found that the success rate when prompting in Chinese is generally lower than in English across all models we tested. We also measure cost efficiency as expected cost per successful task -- jointly accounting for token consumption and task resolution rate. These findings should be interpreted as preliminary evidence rather than a definitive conclusion, given the limited number of models evaluated and the narrow set of benchmarks tested due to resource constraints; they indicate that language effects on token cost are model-dependent, and that practitioners should not expect cost savings or performance gains just by switching their prompt language to Chinese.
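
A metric that jointly accounts for token consumption and resolution rate can be sketched as cost per attempt divided by the success rate. This particular formula, the function name, and every number below are assumptions for illustration; the abstract does not spell out the exact definition, and none of the figures come from the paper.

```python
def expected_cost_per_success(
    tokens_per_attempt: float, price_per_million_tokens: float, resolve_rate: float
) -> float:
    """Average spend needed to obtain one resolved task, assuming independent attempts."""
    if not 0.0 < resolve_rate <= 1.0:
        raise ValueError("resolve_rate must be in (0, 1]")
    cost_per_attempt = tokens_per_attempt / 1_000_000 * price_per_million_tokens
    return cost_per_attempt / resolve_rate


# Invented scenario: the second prompt language uses fewer tokens per attempt
# but resolves fewer tasks, so each success ends up costing more.
language_a = expected_cost_per_success(100_000, price_per_million_tokens=2.0, resolve_rate=0.5)
language_b = expected_cost_per_success(90_000, price_per_million_tokens=2.0, resolve_rate=0.4)
```

The example shows why raw token counts alone can mislead: a 10% token saving is outweighed by a lower resolution rate.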

91. 【2604.14198】MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining

链接https://arxiv.org/abs/2604.14198

作者:Bingbing Wen,Sirajul Salekin,Feiyang Kang,Bill Howe,Lucy Lu Wang,Javier Movellan,Manjot Bilkhu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:remains largely unexplored, midtraining remains largely, Domain reweighting, multimodal midtraining remains, improve sample efficiency

备注

点击查看摘要

Abstract:Domain reweighting can improve sample efficiency and downstream generalization, but data-mixture optimization for multimodal midtraining remains largely unexplored. Current multimodal training recipes tune mixtures along a single dimension, typically data format or task type. We introduce MixAtlas, a method that produces benchmark-targeted data recipes that can be inspected, adapted, and transferred to new corpora. MixAtlas decomposes the training corpus along two axes: image concepts (10 visual-domain clusters discovered via CLIP embeddings) and task supervision (5 objective types including captioning, OCR, grounding, detection, and VQA). Using small proxy models (Qwen2-0.5B) paired with a Gaussian-process surrogate and GP-UCB acquisition, MixAtlas searches the resulting mixture space with the same proxy budget as regression-based baselines but finds better-performing mixtures. We evaluate on 10 benchmarks spanning visual understanding, document reasoning, and multimodal reasoning. On Qwen2-7B, optimized mixtures improve average performance by 8.5%-17.6% over the strongest baseline; on Qwen2.5-7B, gains are 1.0%-3.3%. Both settings reach baseline-equivalent training loss in up to 2 times fewer steps. Recipes discovered on 0.5B proxies transfer to 7B-scale training across Qwen model families.

92. 【2604.14197】The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure

链接https://arxiv.org/abs/2604.14197

作者:David A. Cook

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language model, performance depends heavily, Large language, language model, performance depends

备注: Presents the novel PICCO framework for LLM prompting, derived through a structured multi-database search and rigorous comparative synthesis of 11 published prompting frameworks. Submitted in PDF/A format to preserve the structure and readability of several multi-page tables central to the framework and methodology; these contain dense structured information that is best preserved in PDF form

点击查看摘要

Abstract:Large language model (LLM) performance depends heavily on prompt design, yet prompt construction is often described and applied inconsistently. Our purpose was to derive a reference framework for structuring LLM prompts. This paper presents PICCO, a framework derived through a rigorous synthesis of 11 previously published prompting frameworks identified through a multi-database search. The analysis yields two main contributions. First, it proposes a taxonomy that distinguishes prompt frameworks, prompt elements, prompt generation, prompting techniques, and prompt engineering as related but non-equivalent concepts. Second, it derives a five-element reference architecture for prompt generation: Persona, Instructions, Context, Constraints, and Output (PICCO). For each element, we define its function, scope, and relationship to other elements, with the goal of improving conceptual clarity and supporting more systematic prompt design. Finally, to support application of the framework, we outline key concepts relevant to implementation, including prompting techniques (e.g., zero-shot, few-shot, chain-of-thought, ensembling, decomposition, and self-critique, with selected variants), human and automated approaches to iterative prompt engineering, responsible prompting considerations such as security, privacy, bias, and trust, and priorities for future research. This work is a conceptual and methodological contribution: it formalizes a common structure for prompt specification and comparison, but does not claim empirical validation of PICCO as an optimization method.

93. 【2604.14191】Attention to Mamba: A Recipe for Cross-Architecture Distillation

链接https://arxiv.org/abs/2604.14191

作者:Abhinav Moudgil,Ningyuan Huang,Eeshan Gunesh Dhekane,Pau Rodríguez,Luca Zappella,Federico Danieli

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:State Space Models, State Space, reduced memory consumption, Space Models, Attention-based counterparts

备注

点击查看摘要

Abstract:State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block. Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11, close to the teacher's 13.86. To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens, varying the sequence mixer architecture, a scaling analysis over model sizes and total distillation tokens, and a sensitivity analysis of token allocation between stages.
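For readers unfamiliar with linearized attention, the idea behind the first stage can be illustrated with a generic kernel feature map. This is a minimal, non-causal sketch and not the paper's procedure: the feature map `phi` (elu(x) + 1) and the helper names are assumptions chosen for illustration, since the abstract only says "an adaptation of the kernel trick".

```python
import math

def phi(x):
    """Assumed feature map elu(x) + 1, which keeps every feature positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Non-causal linear attention: summarize keys/values once, reuse per query."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d_k                        # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out

def quadratic_reference(Q, K, V):
    """Same kernel weights computed pairwise, for checking the linear form."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out
```

Both functions produce the same output; the linear form avoids materializing the full query-key weight matrix, which is what makes it a natural bridge toward recurrent models such as Mamba.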

94. 【2604.14180】Internal Knowledge Without External Expression: Probing the Generalization Boundary of a Classical Chinese Language Model

链接https://arxiv.org/abs/2604.14180

作者:Jiuting Chen,Yuan Lian,Hao Wu,Tianqi Huang,Hiroshi Sasaki,Makoto Kouno,Jongil Choi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Arabic numerals, characters or Arabic, pure Classical Chinese, Classical Chinese, Transformer language model

备注: 15 pages, 5 figures, supplementary material included

点击查看摘要

Abstract:We train a 318M-parameter Transformer language model from scratch on a curated corpus of 1.56 billion tokens of pure Classical Chinese, with zero English characters or Arabic numerals. Through systematic out-of-distribution (OOD) testing, we investigate whether the model can distinguish known from unknown inputs, and crucially, whether it can express this distinction in its generated text. We find a clear dissociation between internal and external uncertainty. Internally, the model exhibits a perplexity jump ratio of 2.39x between real and fabricated historical events (p = 8.9e-11, n = 92 per group), with semi-fabricated events (real figures + fictional events) showing the highest perplexity (4.24x, p = 1.1e-16), demonstrating genuine factual encoding beyond syntactic pattern matching. Externally, however, the model never learns to express uncertainty: classical Chinese epistemic markers appear at lower rates for OOD questions (3.5%) than for in-distribution questions (8.3%, p = 0.023), reflecting rhetorical conventions rather than genuine metacognition. We replicate both findings across three languages (Classical Chinese, English, Japanese), three writing systems, and eight models from 110M to 1.56B parameters. We further show that uncertainty expression frequency is determined entirely by training data conventions, with Classical Chinese models showing a "humility paradox" (more hedging for known topics), while Japanese models almost never hedge. We argue that metacognitive expression -- the ability to say "I don't know" -- does not emerge from language modeling alone and requires explicit training signals such as RLHF.
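The "perplexity jump ratio" can be made concrete with a small sketch. The abstract does not give the exact definition, so this assumes the ratio of mean per-example perplexity on fabricated inputs to that on real inputs; the function names are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def jump_ratio(fabricated_logprobs, real_logprobs):
    """Assumed definition: mean perplexity on fabricated inputs / mean on real inputs."""
    fab = [perplexity(lp) for lp in fabricated_logprobs]
    real = [perplexity(lp) for lp in real_logprobs]
    return (sum(fab) / len(fab)) / (sum(real) / len(real))
```

Each argument to `jump_ratio` is a list of examples, where every example is a list of per-token log-probabilities under the model.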

95. 【2604.14179】An Underexplored Frontier: Large Language Models for Rare Disease Patient Education and Communication -- A scoping review

链接https://arxiv.org/abs/2604.14179

作者:Zaifu Zhan,Yu Hou,Kai Yu,Min Zeng,Anita Burgun,Xiaoyi Chen,Rui Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:million people worldwide, complex care pathways, long patient journey, Rare diseases affect, limited clinical expertise

备注

点击查看摘要

Abstract:Rare diseases affect over 300 million people worldwide and are characterized by complex care pathways, limited clinical expertise, and substantial unmet communication needs throughout the long patient journey. Recent advances in large language models (LLMs) offer new opportunities to support patient education and communication, yet their application in rare diseases remains unclear. We conducted a scoping review of studies published between January 2022 and March 2026 across major databases, identifying 12 studies on LLM-based rare disease patient education and communication. Data were extracted on study characteristics, application scenarios, model usage, and evaluation methods, and synthesized using descriptive and qualitative analyses. The literature is highly recent and dominated by general-purpose models, particularly ChatGPT. Most studies focus on patient question answering using curated question sets, with limited use of real-world data or longitudinal communication scenarios. Evaluations are primarily centered on accuracy, with limited attention to patient-centered dimensions such as readability, empathy, and communication quality. Multilingual communication is rarely addressed. Overall, the field remains at an early stage. Future research should prioritize patient-centered design, domain-adapted methods, and real-world deployment to support safe, adaptive, and effective communication in rare diseases.

Submission history: [v1] submitted by Zaifu Zhan on Mon, 30 Mar 2026 17:14:48 UTC (1,035 KB)

96. 【2604.14177】Listen, Correct, and Feed Back: Spoken Pedagogical Feedback Generation

链接https://arxiv.org/abs/2604.14177

作者:Junhong Liang,Yifan Lu,Ekaterina Kochmar,Fajri Koto

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:made rapid progress, real teaching scenarios, Grammatical error correction, learner-friendly pedagogical feedback, Spoken Grammatical Error

备注: NLP8506 course project

点击查看摘要

Abstract:Grammatical error correction (GEC) and explanation (GEE) have made rapid progress, but real teaching scenarios also require learner-friendly pedagogical feedback that is actionable, level-appropriate, and encouraging. We introduce SPFG (Spoken Pedagogical Feedback Generation), a dataset built on the Speak & Improve Challenge 2025 corpus, pairing fluency-oriented transcriptions with GEC targets and human-verified teacher-style feedback, including preferred/rejected feedback pairs for preference learning. We study a transcript-based Spoken Grammatical Error Correction (SGEC) setting and evaluate three instruction-tuned LLMs (Qwen2.5, Llama-3.1, and GLM-4), comparing supervised fine-tuning (SFT) with preference-based alignment (using DPO and KTO) for jointly generating corrections and feedback. Results show that SFT provides the most consistent improvements, while DPO/KTO yield smaller or mixed gains, and that correction quality and feedback quality are weakly coupled. Our implementation is available at this https URL.

97. 【2604.14175】QU-NLP at ArchEHR-QA 2026: Two-Stage QLoRA Fine-Tuning of Qwen3-4B for Patient-Oriented Clinical Question Answering and Evidence Sentence Alignment

链接https://arxiv.org/abs/2604.14175

作者:Mohammad AL-Smadi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:ArchEHR-QA Shared Task, Shared Task, evidence sentence alignment, unified system addressing, ArchEHR-QA Shared

备注: Accepted for publication at CL4Health 2026 workshop, LREC2026 conference

点击查看摘要

Abstract:We present a unified system addressing both Subtask 3 (answer generation) and Subtask 4 (evidence sentence alignment) of the ArchEHR-QA Shared Task. For Subtask 3, we apply two-stage Quantised Low-Rank Adaptation (QLoRA) to Qwen3-4B loaded in 4-bit NF4 quantisation: first on 30,000 samples from the emrQA-MedSQuAD corpus to establish clinical domain competence, then on the 20 annotated development cases to learn the task-specific output style. Our system achieves an overall score of 32.87 on the official test-2026 split (BLEU = 9.42, ROUGE-L = 27.04, SARI = 55.42, BERTScore = 43.00, AlignScore = 25.28, MEDCON = 37.04). For Subtask 4, we develop a weighted ensemble of three retrieval methods - BM25 with relative thresholding, TF-IDF cosine similarity, and a fine-tuned cross-encoder - to identify note sentences supporting a given gold answer, achieving a micro-F1 of 67.16 on the 100-case test set. Experiments reveal that both subtasks expose the same fundamental challenge: 20 annotated training cases are insufficient to distinguish relevant from irrelevant clinical sentences, pointing to data augmentation as the highest-leverage future direction.

98. 【2604.14174】Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters

链接https://arxiv.org/abs/2604.14174

作者:Bryan Sanchez

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Alignment-tuned language models, frequently suppress factual, suppress factual log-probabilities, politically sensitive topics, Alignment-tuned language

备注: 12 pages, 3 figures, code at [this https URL](https://github.com/SolomonB14D3/qwen-adapter-correction)

点击查看摘要

Abstract:Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11--39% of 16 held-out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log-probability rankings. When applied at all token positions during generation, the adapter produces incoherent output; however, when applied only at the current prediction position (last-position-only), the adapter produces coherent, less censored text. A logit-space adapter operating after token projection fails to produce coherent generation at any application mode, suggesting hidden-state intervention is the correct level for generation correction. A previously undocumented silent gradient bug in Apple MLX explains all null results in earlier iterations of this work: the standard pattern nn.value_and_grad(model, fn)(this http URL()) returns zero gradients without error; the correct pattern nn.value_and_grad(model, fn)(model, data) resolves this. We provide a minimal reproduction and discuss implications for other adapter research using MLX.

99. 【2604.14172】Tug-of-War within A Decade: Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generations

链接https://arxiv.org/abs/2604.14172

作者:Ziyin Zhou,Jianyi Zhang,Xu ji,Yilong Li,Jiameng Han,Zhangchi Zhao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, Language Models, essential for analyzing, analyzing and addressing

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are essential for analyzing and addressing vulnerabilities in cybersecurity. However, of the more than 200,000 vulnerabilities discovered in the past decade, more than 30,000 have been changed or updated. This necessitates frequent updates to the training datasets and internal knowledge bases of LLMs to maintain knowledge consistency. In this paper, we focus on the problem of knowledge discrepancy and conflict within CVE (Common Vulnerabilities and Exposures) detection and analysis. This problem hinders LLMs' ability to retrieve the latest knowledge from original training datasets, leading to knowledge conflicts, fabrications of factually incorrect results, and generation hallucinations. To address this problem, we propose an innovative two-stage framework called CRVA-TGRAG (Conflict Resolution in Vulnerability Analysis via Teacher-Guided Retrieval-Augmented Generation). First, to improve document retrieval accuracy during the retrieval stage, we utilize Parent Document Segmentation and an ensemble retrieval scheme based on semantic similarity and inverted indexing. Second, to enhance LLMs' capabilities based on the retrieved CVE data in the generation stage, we employ a teacher-guided preference optimization technique to fine-tune LLMs. Our framework not only enhances the quality of content retrieval through RAG but also leverages the advantages of preference fine-tuning in LLMs to answer questions more effectively and precisely. Experiments demonstrate our method achieves higher accuracy in retrieving the latest CVEs compared to external knowledge bases. In conclusion, our framework significantly mitigates potential knowledge conflicts and inconsistencies that may arise from relying solely on LLMs for knowledge retrieval.

100. 【2604.14171】Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali

链接https://arxiv.org/abs/2604.14171

作者:Ananda Rimal(Nepal Engineering College),Adarsha Rimal(Tribhuvan University)

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Nepali language written, Large Language, informal digital communication, remains critically underresourced

备注: 31 pages, 4 figures, 14 tables

点击查看摘要

Abstract:Romanized Nepali, the Nepali language written in the Latin alphabet, is the dominant medium for informal digital communication in Nepal, yet it remains critically underresourced in the landscape of Large Language Models (LLMs). This study presents a systematic benchmarking of linguistic adaptation across three comparable-sized open-weight models: Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. We evaluate these architectures under zero-shot and fine-tuned settings using a curated bilingual dataset of 10,000 transliterated instruction-following samples. Performance is quantified across five metrics spanning seven measurement dimensions: Perplexity (PPL), BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, capturing fluency, phonetic consistency, and semantic integrity. Models were fine-tuned using Quantized Low-Rank Adaptation (QLoRA) with Rank-Stabilized LoRA (rsLoRA) at rank r=32 on dual NVIDIA Tesla T4 GPUs, training only approximately 1% of each model's parameters in under 27 total GPU-hours. At zero-shot, all three models fail to generate Romanized Nepali, each exhibiting a distinct architecture-specific failure mode. Following fine-tuning, all three resolve these failures and converge to BERTScore approximately 0.75 and chrF++ greater than 23. Overall dimension-wise assessment across ten criteria identifies Qwen3-8B as the overall recommended architecture, being the only model to produce semantically relevant zero-shot output and leading all structural alignment metrics post-SFT. The adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline, achieves the largest absolute fine-tuning gains in PPL (Delta = -49.77) and BERTScore (Delta = +0.3287), making it the preferred choice for iterative low-resource development pipelines. This work establishes the first rigorous baseline for Romanized Nepali adaptation in comparable-sized open-weight LLMs.
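As background on the rsLoRA setting mentioned above: standard LoRA scales the low-rank update by alpha / r, while Rank-Stabilized LoRA scales it by alpha / sqrt(r), so the update does not shrink as quickly when the rank grows. A minimal sketch follows; the `alpha` value in the usage note is a hypothetical choice, as the abstract only reports r = 32.

```python
import math

def lora_scale(alpha, r, rank_stabilized=False):
    """Scaling factor applied to the low-rank update B @ A.

    Standard LoRA uses alpha / r; rsLoRA uses alpha / sqrt(r).
    """
    if r <= 0:
        raise ValueError("rank must be positive")
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r
```

With a hypothetical alpha of 16 at r = 32, `lora_scale(16, 32)` gives 0.5, while `lora_scale(16, 32, rank_stabilized=True)` gives 16 / sqrt(32), a noticeably larger factor.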

101. 【2604.14170】Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning

链接https://arxiv.org/abs/2604.14170

作者:Qi Dong,Ziheng Lin,Ning Ding

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:grounds Large Language, Large Language Models, Large Language, flat context representations, Retrieval-Augmented Generation

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) in external knowledge but often suffers from flat context representations and stateless retrieval, leading to unstable performance. We propose Stateful Evidence-Driven RAG with Iterative Reasoning, a framework that models question answering as a progressive evidence accumulation process. Retrieved documents are converted into structured reasoning units with explicit relevance and confidence signals and maintained in a persistent evidence pool capturing both supportive and non-supportive information. The framework performs evidence-driven deficiency analysis to identify gaps and conflicts and iteratively refines queries to guide subsequent retrieval. This iterative reasoning process enables stable evidence aggregation and improves robustness to noisy retrieval. Experiments on multiple question answering benchmarks demonstrate consistent improvements over standard RAG and multi-step baselines, while effectively accumulating high-quality evidence and maintaining stable performance under substantial retrieval noise.

102. 【2604.14169】Chronological Knowledge Retrieval: A Retrieval-Augmented Generation Approach to Construction Project Documentation

链接https://arxiv.org/abs/2604.14169

作者:Ioannis-Aris Kostis,Natalia Sanchiz,Steeve De Schryver,François Denis,Pierre Schaus

类目:Computation and Language (cs.CL)

关键词:generates extensive records, decisions generates extensive, extensive records, continuous evolution, generates extensive

备注

点击查看摘要

Abstract:In large-scale construction projects, the continuous evolution of decisions generates extensive records, most often captured in meeting minutes. Since decisions may override previous ones, professionals often need to reconstruct the history of specific choices. Retrieving such information manually from raw archives is both labor-intensive and error-prone. From a user perspective, we address this challenge by enabling conversational access to the whole set of project meeting minutes. Professionals can pose natural-language questions and receive answers that are both semantically relevant and explicitly time-annotated, allowing them to follow the chronology of decisions. From a technical perspective, our solution employs a Retrieval-Augmented Generation (RAG) framework that integrates semantic search with large language models to ensure accurate and context-aware responses. We demonstrate the approach using an anonymized, industry-sourced dataset of meeting minutes from a completed construction project by a large company in Belgium. The dataset is annotated and enriched with expert-defined queries to support systematic evaluation. Both the dataset and the open-source implementation are made available to the community to foster further research on conversational access to time-annotated project documentation.

103. 【2604.14168】SAGE Celer 2.6 Technical Card

链接https://arxiv.org/abs/2604.14168

作者:SAGEA Research Team,Basab Jha,Firoj Paudel,Ujjwal Puri,Adrian Liu,Ethan Henkel,Zhang Yuting,Mateusz Kowalczyk,Mei Huang,Choi Donghyuk,Wang Junhao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:introduce SAGE Celer, introduce SAGE, general-purpose Celer models, SAGE Celer, line of general-purpose

备注: 28 pages, 14 figures

点击查看摘要

Abstract:We introduce SAGE Celer 2.6, the latest in our line of general-purpose Celer models from SAGEA. Celer 2.6 is available in 5B, 10B, and 27B parameter sizes and benefits from extensive architectural modifications and further pre-training on an undisclosed model. Using our Inverse Reasoning (IR) pipeline, SAGEA natively trains Celer 2.6 to validate its own logic paths, minimizing cascading error and hallucination in complex reasoning tasks. Celer 2.6 also boasts natively integrated multimodal functionality with an end-to-end vision encoder to avoid common pitfalls in adapter-based approaches. Celer 2.6 provides highly competitive results on mathematics, coding, and general intelligence benchmarks (ACUMEN), along with low latency. Most importantly, Celer 2.6 is specifically optimized for South Asian language support, with a custom tokenizer for the Devanagari script and strong performance in both Nepali and Hindi without sacrificing English reasoning ability.

104. 【2604.14167】Chinese Essay Rhetoric Recognition Using LoRA, In-context Learning and Model Ensemble

链接https://arxiv.org/abs/2604.14167

作者:Yuxuan Lai,Xiajing Wang,Chen Zheng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:automated essay scoring, critical component, component in automated, Large Language Models, Rhetoric recognition

备注: Accepted by CCL2025

点击查看摘要

Abstract:Rhetoric recognition is a critical component in automated essay scoring. By identifying rhetorical elements in student writing, AI systems can better assess linguistic and higher-order thinking skills, making it an essential task in the area of AI for education. In this paper, we leverage Large Language Models (LLMs) for the Chinese rhetoric recognition task. Specifically, we explore Low-Rank Adaptation (LoRA) based fine-tuning and in-context learning to integrate rhetoric knowledge into LLMs. We formulate the outputs as JSON to obtain structural outputs and translate keys to Chinese. To further enhance the performance, we also investigate several model ensemble methods. Our method achieves the best performance on all three tracks of CCL 2025 Chinese essay rhetoric recognition evaluation task, winning the first prize.

105. 【2604.14166】Hierarchical Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text

链接https://arxiv.org/abs/2604.14166

作者:Filippo Morbiato,Markus Keller,Priya Nair,Luca Romano

类目:Computation and Language (cs.CL)

关键词:Mapping Cyber Threat, Cyber Threat Intelligence, automating threat defense, Mapping Cyber, Threat Intelligence

备注

点击查看摘要

Abstract:Mapping Cyber Threat Intelligence (CTI) text to MITRE ATT&CK technique IDs is a critical task for understanding adversary behaviors and automating threat defense. While recent Retrieval-Augmented Generation (RAG) approaches have demonstrated promising capabilities in this domain, they fundamentally rely on a flat retrieval paradigm. By treating all techniques uniformly, these methods overlook the inherent taxonomy of the ATT&CK framework, where techniques are structurally organized under high-level tactics. In this paper, we propose H-TechniqueRAG, a novel hierarchical RAG framework that injects this tactic-technique taxonomy as a strong inductive bias to achieve highly efficient and accurate annotation. Our approach introduces a two-stage hierarchical retrieval mechanism: it first identifies the macro-level tactics (the adversary's technical goals) and subsequently narrows the search to techniques within those tactics, effectively reducing the candidate search space by 77.5%. To further bridge the gap between retrieval and generation, we design a tactic-aware reranking module and a hierarchy-constrained context organization strategy that mitigates LLM context overload and improves reasoning precision. Comprehensive experiments across three diverse CTI datasets demonstrate that H-TechniqueRAG not only outperforms the state-of-the-art TechniqueRAG by 3.8% in F1 score, but also achieves a 62.4% reduction in inference latency and a 60% decrease in LLM API calls. Further analysis reveals that our hierarchical structural priors equip the model with superior cross-domain generalization and provide security analysts with highly interpretable, step-by-step decision paths.
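The two-stage retrieval idea (pick tactics first, then search only the techniques under them) can be sketched with a toy taxonomy. Everything here is illustrative rather than the paper's system: the token-overlap scorer stands in for a real retriever, the descriptions are made up for the example, and only a tiny slice of the ATT&CK tactic/technique structure is shown.

```python
def overlap(query, text):
    """Count shared lowercase tokens; a stand-in for a real similarity model."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# Toy slice of the taxonomy; descriptions are hypothetical paraphrases.
TAXONOMY = {
    "Initial Access": {
        "description": "adversary gains entry to network via phishing email or exposed application",
        "techniques": {
            "T1566": "phishing email with malicious attachment or link",
            "T1190": "exploit public-facing application vulnerability",
        },
    },
    "Credential Access": {
        "description": "adversary steals account names and passwords credentials",
        "techniques": {
            "T1003": "dump credentials from operating system memory",
            "T1110": "brute force guessing of passwords",
        },
    },
}

def hierarchical_retrieve(query, taxonomy, top_tactics=1, top_techniques=1):
    """Stage 1: rank tactics. Stage 2: rank only techniques under the kept tactics."""
    tactics = sorted(
        taxonomy,
        key=lambda t: overlap(query, taxonomy[t]["description"]),
        reverse=True,
    )[:top_tactics]
    candidates = [
        (tid, desc)
        for t in tactics
        for tid, desc in taxonomy[t]["techniques"].items()
    ]
    ranked = sorted(candidates, key=lambda c: overlap(query, c[1]), reverse=True)
    return [tid for tid, _ in ranked[:top_techniques]]
```

Because stage 2 only scores techniques under the selected tactics, the candidate set shrinks before any expensive reranking or LLM call, which is the source of the search-space reduction the abstract reports.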

106. 【2604.14165】EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews

链接https://arxiv.org/abs/2604.14165

作者:Naman Ahuja,Saniya Mulla,Muhammad Ali Khan,Zaryab Bin Riaz,Kaneez Zahra Rubab Khakwani,Mohamad Bassam Sonbol,Irbaz Bin Riaz,Vivek Gupta

类目:Computation and Language (cs.CL)

关键词:guaranteeing per-cell provenance, native trial PDFs, ontology-aligned clinical evidence, evidence tables directly, clinical evidence tables

备注

点击查看摘要

Abstract:We present EviSearch, a multi-agent extraction system that automates the creation of ontology-aligned clinical evidence tables directly from native trial PDFs while guaranteeing per-cell provenance for audit and human verification. EviSearch pairs a PDF-query agent (which preserves rendered layout and figures) with a retrieval-guided search agent and a reconciliation module that forces page-level verification when agents disagree. The pipeline is designed for high-precision extraction across multimodal evidence sources (text, tables, figures) and for generating reviewer-actionable provenance that clinicians can inspect and correct. On a clinician-curated benchmark of oncology trial papers, EviSearch substantially improves extraction accuracy relative to strong parsed-text baselines while providing comprehensive attribution coverage. By logging reconciler decisions and reviewer edits, the system produces structured preference and supervision signals that bootstrap iterative model improvement. EviSearch is intended to accelerate living systematic review workflows, reduce manual curation burden, and provide a safe, auditable path for integrating LLM-based extraction into evidence synthesis pipelines.

107. 【2604.14164】How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

链接https://arxiv.org/abs/2604.14164

作者:Zixian Huang,Kaichen Yang,Xu Huang,Feiyang Hao,Qiming Ge,Bowen Li,He Du,Kai Chen,Qipeng Guo

类目:Computation and Language (cs.CL)

关键词:widely adopted strategy, widely adopted, adopted strategy, Cooperation Data Synthesis, SFT

备注

点击查看摘要

Abstract:A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher-generated data and the student's distribution as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the student's distribution. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.

108. 【2604.14163】SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models

链接https://arxiv.org/abs/2604.14163

作者:Tomer Atia,Yehudit Aperstein,Alexander Apartsin

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:safety-critical voice messages, Maritime distress communications, Global Maritime Distress, distress communications transmitted, high frequency

备注: 12 pages, 8 figures

点击查看摘要

Abstract:Maritime distress communications transmitted over very high frequency (VHF) radio are safety-critical voice messages used to report emergencies at sea. Under the Global Maritime Distress and Safety System (GMDSS), such messages follow standardized procedures and are expected to convey essential details, including vessel identity, position, nature of the distress, and required assistance. In practice, however, automatic analysis remains difficult because distress messages are often brief, noisy, and produced under stress, may deviate from the prescribed format, and are further degraded by automatic speech recognition (ASR) errors caused by channel noise and speaker stress. This paper presents SeaAlert, an LLM-based framework for robust analysis of maritime distress communications. To address the scarcity of labeled real-world data, we develop a synthetic data generation pipeline in which an LLM produces realistic and diverse maritime messages, including challenging variants in which standard distress codewords are omitted or replaced with less explicit expressions. The generated utterances are synthesized into speech, degraded with simulated VHF noise, and transcribed by an ASR system to obtain realistic noisy transcripts.

109. 【2604.14162】Decoupling Scores and Text: The Politeness Principle in Peer Review

链接https://arxiv.org/abs/2604.14162

作者:Yingxuan Wen

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:deriving false hope, peer review feedback, interpret peer review, deriving false, struggle to interpret

备注

点击查看摘要

Abstract:Authors often struggle to interpret peer review feedback, deriving false hope from polite comments or feeling confused by specific low scores. To investigate this, we construct a dataset of over 30,000 ICLR 2021-2025 submissions and compare acceptance prediction performance using numerical scores versus text reviews. Our experiments reveal a significant performance gap: score-based models achieve 91% accuracy, while text-based models reach only 81% even with large language models, indicating that textual information is considerably less reliable. To explain this phenomenon, we first analyze the 9% of samples that score-based models fail to predict, finding their score distributions exhibit high kurtosis and negative skewness, which suggests that individual low scores play a decisive role in rejection even when the average score falls near the borderline. We then examine why text-based accuracy significantly lags behind scores from a review sentiment perspective, revealing the prevalence of the Politeness Principle: reviews of rejected papers still contain more positive than negative sentiment words, masking the true rejection signal and making it difficult for authors to judge outcomes from text alone.
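The skewness and kurtosis statistics referred to above can be computed from a set of review scores as follows. This is a minimal sketch using population moments and excess kurtosis; whether the authors compute these per submission or over pooled scores is not stated in the abstract, and the sample scores in the usage note are hypothetical.

```python
def skew_and_excess_kurtosis(scores):
    """Return (skewness, excess kurtosis) using population central moments."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    if m2 == 0:
        raise ValueError("all scores are identical; moments are undefined")
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0
```

For a hypothetical score set such as `[6, 6, 6, 6, 2]`, a single low outlier produces negative skewness, which matches the abstract's point that one low score can be decisive even when the average sits near the borderline.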

110. 【2604.14161】Can Large Language Models Detect Methodological Flaws? Evidence from Gesture Recognition for UAV-Based Rescue Operation Based on Deep Learning

Link: https://arxiv.org/abs/2604.14161

Authors: Domonkos Varga

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: machine learning research, flaws-particularly data leakage-continue, Reliable evaluation, essential in machine, leakage-continue to undermine

Comments:

Abstract:Reliable evaluation is essential in machine learning research, yet methodological flaws, particularly data leakage, continue to undermine the validity of reported results. In this work, we investigate whether large language models (LLMs) can act as independent analytical agents capable of identifying such issues in published studies. As a case study, we analyze a gesture-recognition paper reporting near-perfect accuracy on a small, human-centered dataset. We first show that the evaluation protocol is consistent with subject-level data leakage due to non-independent training and test splits. We then assess whether this flaw can be detected independently by six state-of-the-art LLMs, each analyzing the original paper without prior context using an identical prompt. All models consistently identify the evaluation as flawed and attribute the reported performance to non-independent data partitioning, supported by indicators such as overlapping learning curves, minimal generalization gap, and near-perfect classification results. These findings suggest that LLMs can detect common methodological issues based solely on published artifacts. While not definitive, their consistent agreement highlights their potential as complementary tools for improving reproducibility and supporting scientific auditing.

111. 【2604.14159】HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization

Link: https://arxiv.org/abs/2604.14159

Authors: Baocai Shan, Yuzhuang Xu, Wanxiang Che

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: produce personalized text, input method editors, method editors, primary interface, remain constrained

Comments:

Abstract:Mobile input method editors (IMEs) are the primary interface for text input, yet they remain constrained to manual typing and struggle to produce personalized text. While lightweight large language models (LLMs) make on-device auxiliary generation feasible, enabling deeply personalized, privacy-preserving, and real-time generative IMEs poses fundamental challenges. To this end, we present HUOZIIME, a personalized on-device IME powered by an LLM. We endow HUOZIIME with initial human-like prediction ability by post-training a base LLM on synthesized personalization data. Notably, a hierarchical memory mechanism is designed to continually capture and leverage user-specific input history. Furthermore, we perform systemic optimizations tailored to on-device LLM-based IME deployment, ensuring efficient and responsive operation under mobile constraints. Experiments demonstrate efficient on-device execution and high-fidelity memory-driven personalization. Code and package are available at this https URL.

112. 【2604.14158】MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios

Link: https://arxiv.org/abs/2604.14158

Authors: Yihang Ding, Wanke Xia, Yiting Zhao, Jinbo Su, Jialiang Yang, Zhengbo Zhang, Ke Wang, Wenming Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Current evaluations, memory, fundamentally static, Memory Fragments Unlocked, Surface State Memory

Comments:

Abstract:Current evaluations of long-term memory in LLMs are fundamentally static. By fixating on simple retrieval and short-context inference, they neglect the multifaceted nature of complex memory systems, such as dynamic state tracking and hierarchical reasoning in continuous interactions. To overcome these limitations, we propose MemGround, a rigorous long-term memory benchmark natively grounded in rich, gamified interactive scenarios. To systematically assess these capabilities, MemGround introduces a three-tier hierarchical framework that evaluates Surface State Memory, Temporal Associative Memory, and Reasoning-Based Memory through specialized interactive tasks. Furthermore, to comprehensively quantify both memory utilization and behavioral trajectories, we propose a multi-dimensional metric suite comprising Question-Answer Score (QA Overall), Memory Fragments Unlocked (MFU), Memory Fragments with Correct Order (MFCO), and Exploration Trajectory Diagrams (ETD). Extensive experiments reveal that state-of-the-art LLMs and memory agents still struggle with sustained dynamic tracking, temporal event association, and complex reasoning derived from long-term accumulated evidence in interactive environments.

113. 【2604.14156】Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models

Link: https://arxiv.org/abs/2604.14156

Authors: Andrew Kiruluta

Categories: Computation and Language (cs.CL)

Keywords: massive parameter counts, deliver strong generative, strong generative performance, Large language models, models deliver strong

Comments:

Abstract:Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial compression, while prompt-compression methods reduce latency by removing redundant input tokens. However, these two directions remain largely separate. Most model-compression methods are static and optimized offline, and they do not exploit the fact that different prompts and decoding steps activate different latent computational pathways. Prompt-compression methods reduce sequence length, but they do not adapt the executed model subnetwork. We propose a unified compressed-sensing-guided framework for dynamic LLM execution. Random measurement operators probe latent model usage, sparse recovery estimates task-conditioned and token-adaptive support sets, and the recovered supports are compiled into hardware-efficient sparse execution paths over blocks, attention heads, channels, and feed-forward substructures. The framework introduces five key contributions: task-conditioned measurements, so different prompts induce different sparse supports; token-adaptive recovery, so active substructures are re-estimated during decoding; formal sample-complexity bounds under restricted isometry or mutual incoherence assumptions; compile-to-hardware constraints that restrict recovery to GPU-efficient structures; and a joint objective that unifies prompt compression with model reduction. Together, these components recast LLM inference as a measurement-and-recovery problem with explicit approximation guarantees and deployment-oriented speedup constraints.

114. 【2604.14152】From Black Box to Glass Box: Cross-Model ASR Disagreement to Prioritize Review in Ambient AI Scribe Documentation

Link: https://arxiv.org/abs/2604.14152

Authors: Abdolamir Karbalaie, Fernando Seoane, Farhad Abtahi

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: automatic speech recognition, clinical documentation burden, documentation burden, speech recognition, reduce clinical documentation

Comments:

Abstract:Ambient AI "scribe" systems promise to reduce clinical documentation burden, but automatic speech recognition (ASR) errors can remain unnoticed without careful review, and high-quality human reference transcripts are often unavailable for calibrating uncertainty. We investigate whether cross-model disagreement among heterogeneous ASR systems can act as a reference-free uncertainty signal to prioritize human verification in medical transcription workflows. Using 50 publicly available medical education audio clips (8 h 14 min), we transcribed each clip with eight ASR systems spanning commercial APIs and open-source engines. We aligned multi-model outputs, built consensus pseudo-references, and quantified token-level agreement using a majority-strength metric; we further characterized disagreements by type (content vs. punctuation/formatting) and assessed per-model agreement via leave-one-model-out (jackknife) consensus scoring. Inter-model reliability was low (ICC[2,1] = 0.131), indicating heterogeneous failure modes across systems. Across 76,398 evaluated token positions, 72.1% showed near-unanimous agreement (7-8 models), while 2.5% fell into high-risk bands (0-3 models), with high-risk mass varying from 0.7% to 11.4% across accent groups. Low-agreement regions were enriched for content disagreements, with the content fraction increasing from 53.9% to 73.9% across quintiles of high-risk mass. These results suggest that cross-model disagreement provides a sparse, localizable signal that can surface potentially unreliable transcript spans without human-verified references, enabling targeted review; clinical accuracy of flagged regions remains to be established.
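
The token-level agreement idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the transcripts are already aligned position by position, defines majority strength as the number of models that share the most common token, and uses made-up tokens. The names `majority_strength` and `flag_low_agreement` are hypothetical.

```python
from collections import Counter

def majority_strength(aligned_tokens):
    """Number of models that output the most common token at one aligned position."""
    counts = Counter(aligned_tokens)
    return counts.most_common(1)[0][1]

def flag_low_agreement(positions, max_agreeing=3):
    """Indices of positions where at most `max_agreeing` models share the top token."""
    return [
        index
        for index, tokens in enumerate(positions)
        if majority_strength(tokens) <= max_agreeing
    ]

# Hypothetical aligned outputs from eight ASR systems at three token positions.
positions = [
    ["metoprolol"] * 8,                    # unanimous
    ["50"] * 7 + ["15"],                   # near-unanimous (7 of 8)
    ["mg", "milligrams", "mg", "mcg", "mg", "g", "milligram", "mcg"],  # fragmented
]
strengths = [majority_strength(tokens) for tokens in positions]
flagged = flag_low_agreement(positions)
print(strengths, flagged)
```

The threshold of three agreeing models mirrors the 0-3 high-risk band mentioned in the abstract; how the paper treats alignment gaps and formatting-only differences is not stated there, so this sketch ignores both.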

115. 【2508.01302】Aligning Language Models with Real-time Knowledge Editing

Link: https://arxiv.org/abs/2508.01302

Authors: Chenming Tang, Yutong Yang, Kexue Wang, Yunfang Wu

Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)

Keywords: modify outdated knowledge, Knowledge editing, large language models, Knowledge editing aims, efficiently while retaining

Comments: Pre-print

Abstract:Knowledge editing aims to modify outdated knowledge in large language models (LLMs) efficiently while retaining their original capabilities. Mainstream benchmarks for knowledge editing are predominantly static and fail to keep pace with evolving real-world knowledge. In this work, we introduce CRAFT, an ever-evolving real-world benchmark for knowledge editing. It features well-designed paired edits for composite reasoning, and evaluates models on alias portability as well as temporal and common-sense locality, making it a challenging knowledge editing benchmark on which previous knowledge editing methods hardly achieve balanced performance. Towards flexible real-time editing, we propose KEDAS, a novel paradigm of knowledge editing alignment featuring diverse edit augmentation and self-adaptive post-alignment inference, which exhibits significant performance gain on CRAFT compared to previous methods. All of our code and data are available at this https URL.

116. 【2604.14188】Grading the Unspoken: Evaluating Tacit Reasoning in Quantum Field Theory and String Theory with LLMs

Link: https://arxiv.org/abs/2604.14188

Authors: Xingyang Yu, Yinghuan Zhang, Yufei Zhang, Zijun Cui

Categories: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); High Energy Physics - Theory (hep-th)

Keywords: Large language models, Large language, demonstrated impressive performance, demonstrated impressive, Large

Comments: 9 pages + appendices, 2 figures, 9 tables

Abstract:Large language models have demonstrated impressive performance across many domains of mathematics and physics. One natural question is whether such models can support research in highly abstract theoretical fields such as quantum field theory and string theory. Evaluating this possibility faces an immediate challenge: correctness in these domains is layered, tacit, and fundamentally non-binary. Standard answer-matching metrics fail to capture whether intermediate conceptual steps are properly reconstructed or whether implicit structural constraints are respected. We construct a compact expert-curated dataset of twelve questions spanning core areas of quantum field theory and string theory, and introduce a five-level grading rubric separating statement correctness, key concept awareness, reasoning chain presence, tacit step reconstruction, and enrichment. Evaluating multiple contemporary LLMs, we observe near-ceiling performance on explicit derivations within stable conceptual frames, but systematic degradation when tasks require reconstruction of omitted reasoning steps or reorganization of representations under global consistency constraints. These failures are driven not only by missing intermediate steps, but by an instability in representation selection: models often fail to identify the correct conceptual framing required to resolve implicit tensions. We argue that highly abstract theoretical physics provides a uniquely sensitive lens on the epistemic limits of current evaluation paradigms.

117. 【2604.14186】HARNESS: Lightweight Distilled Arabic Speech Foundation Models

Link: https://arxiv.org/abs/2604.14186

Authors: Vrunda N. Sukhadia, Shammur Absar Chowdhury

Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: size limits deployment, Automatic Speech Recognition, Speech Emotion Recognition, Large self-supervised speech, resource-constrained settings

Comments: 8 pages, 2 figures

Abstract:Large self-supervised speech (SSL) models achieve strong downstream performance, but their size limits deployment in resource-constrained settings. We present HArnESS, an Arabic-centric self-supervised speech model family trained from scratch with iterative self-distillation, together with lightweight student variants that offer strong accuracy-efficiency trade-offs on Automatic Speech Recognition (ASR), Dialect Identification (DID), and Speech Emotion Recognition (SER). Our approach begins with a large bilingual Arabic-English teacher and progressively distills its knowledge into compressed student models while preserving Arabic-relevant acoustic and paralinguistic representations. We further study PCA-based compression of the teacher supervision signal to better match the capacity of shallow and thin students. Compared with HuBERT and XLS-R, HArnESS consistently improves performance on Arabic downstream tasks, while the compressed models remain competitive under substantial structural reduction. These results position HArnESS as a practical and accessible Arabic-centric SSL foundation for real-world speech applications.

Information Retrieval

1. 【2604.15148】IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

Link: https://arxiv.org/abs/2604.15148

Authors: Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: large language models, perform search-augmented reasoning, training large language, effective paradigm, large language

Comments:

Abstract:Reinforcement learning has emerged as an effective paradigm for training large language models to perform search-augmented reasoning. However, existing approaches rely on trajectory-level rewards that cannot distinguish precise search queries from vague or redundant ones within a rollout group, and collapse to a near-zero gradient signal whenever every sampled trajectory fails. In this paper, we propose IG-Search, a reinforcement learning framework that introduces a step-level reward based on Information Gain (IG). For each search step, IG measures how much the retrieved documents improve the model's confidence in the gold answer relative to a counterfactual baseline of random documents, thereby reflecting the effectiveness of the underlying search query. This signal is fed back to the corresponding search-query tokens via per-token advantage modulation in GRPO, enabling fine-grained, step-level credit assignment within a rollout. Unlike prior step-level methods that require either externally annotated intermediate supervision or shared environment states across trajectories, IG-Search derives its signals from the policy's own generation probabilities, requiring no intermediate annotations beyond standard question-answer pairs. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate that IG-Search achieves an average EM of 0.430 with Qwen2.5-3B, outperforming the strongest trajectory-level baseline (MR-Search) by 1.6 points and the step-level method GiGPO by 0.9 points on average across benchmarks, with particularly pronounced gains on multi-hop reasoning tasks. Despite introducing a dense step-level signal, IG-Search adds only ~6.4% to per-step training wall-clock time over the trajectory-level baseline and leaves inference latency unchanged, while still providing a meaningful gradient signal even when every sampled trajectory answers incorrectly.
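
The step-level reward described in the abstract compares the policy's confidence in the gold answer given retrieved documents against a counterfactual with random documents. A minimal sketch of that comparison follows. It assumes confidence is measured as the summed log-probability of the gold answer tokens, which the abstract does not specify, and the per-token probabilities are made up; `answer_log_prob` and `information_gain` are hypothetical names.

```python
import math

def answer_log_prob(token_probs):
    """Summed log-probability the policy assigns to the gold answer tokens."""
    return sum(math.log(p) for p in token_probs)

def information_gain(probs_with_retrieved, probs_with_random):
    """Confidence gain from retrieved documents over a random-document baseline."""
    return answer_log_prob(probs_with_retrieved) - answer_log_prob(probs_with_random)

# Hypothetical per-token probabilities of a two-token gold answer under each context.
with_retrieved = [0.8, 0.5]   # context contains the documents returned by the search query
with_random = [0.4, 0.25]     # context contains randomly sampled documents instead
ig = information_gain(with_retrieved, with_random)
print(f"information gain = {ig:.3f} nats")
```

A positive value indicates that the query retrieved documents that raised the policy's confidence in the gold answer; a value near zero or below suggests a vague or redundant query. The per-token advantage modulation in GRPO that consumes this signal is not sketched here.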

2. 【2604.15101】Metric-agnostic Learning-to-Rank via Boosting and Rank Approximation

Link: https://arxiv.org/abs/2604.15101

Authors: Camilo Gomez, Pengyang Wang, Yanjie Fu

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Discounted Cumulative Gain, Normalized Discounted Cumulative, models specifically designed, constructs models specifically, supervised machine learning

Comments: Published in IEEE ICDM 2023. 6 pages

Abstract:Learning-to-Rank (LTR) is a supervised machine learning approach that constructs models specifically designed to order a set of items or documents based on their relevance or importance to a given query or context. Despite significant success in real-world information retrieval systems, current LTR methods rely on a single predefined ranking metric (e.g., Normalized Discounted Cumulative Gain (NDCG) or Mean Average Precision (MAP)) for optimizing the ranking objective function. Such a metric-dependent setting limits LTR methods from two perspectives: (1) non-differentiability: directly optimizing ranking functions over a given ranking metric is inherently non-smooth, making the training process unstable and inefficient; (2) limited ranking utility: optimizing over a single metric makes it difficult to generalize well to other ranking metrics of interest. To address these issues, we propose a novel listwise LTR framework for efficient and generalizable ranking. Specifically, we propose a new differentiable ranking loss that combines a smooth approximation to the ranking operator with the average mean square loss per query. Then, as a novel contribution, we adapt gradient-boosting machines to minimize our proposed loss with respect to each list. Finally, extensive experimental results confirm that our method outperforms the current state-of-the-art in information retrieval measures with similar efficiency.
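
A smooth approximation to the ranking operator can be illustrated with a common sigmoid-based construction. This is a generic sketch and not necessarily the approximation used in the paper; the temperature parameter, the function names `soft_ranks` and `rank_mse`, and the example scores are all assumptions.

```python
import math

def soft_ranks(scores, temperature=1.0):
    """Smooth rank of item i: 1 + sum over j != i of sigmoid((s_j - s_i) / temperature)."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    return [
        1.0 + sum(
            sigmoid((s_j - s_i) / temperature)
            for j, s_j in enumerate(scores)
            if j != i
        )
        for i, s_i in enumerate(scores)
    ]

def rank_mse(scores, target_ranks, temperature=1.0):
    """Mean squared error between soft ranks and target ranks for one query."""
    ranks = soft_ranks(scores, temperature)
    return sum((r - t) ** 2 for r, t in zip(ranks, target_ranks)) / len(ranks)

# Hypothetical model scores for three documents of one query (higher is better).
scores = [3.0, 1.0, 2.0]
sharp = soft_ranks(scores, temperature=0.01)   # low temperature approaches hard ranks
loss = rank_mse(scores, target_ranks=[1, 3, 2], temperature=0.01)
print([round(r, 3) for r in sharp], loss)
```

Because every pairwise term is differentiable in the scores, a loss built this way can supply the gradients that a boosting procedure needs, without committing to NDCG or MAP during training.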

3. 【2604.14972】SAGER: Self-Evolving User Policy Skills for Recommendation Agent

Link: https://arxiv.org/abs/2604.14972

Authors: Zhen Tao, Riwei Lai, Chenyun Yu, Weixin Chen, Li Chen, Beibei Kong, Lei Cheng, Chengxiang Zhuo, Zang Li, Qingqiang Sun

Categories: Information Retrieval (cs.IR)

Keywords: Large language model, static system prompt, evolving per-user semantic, system prompt shared, prompt shared identically

Comments:

Abstract:Large language model (LLM) based recommendation agents personalize what they know through evolving per-user semantic memory, yet how they reason remains a universal, static system prompt shared identically across all users. This asymmetry is a fundamental bottleneck: when a recommendation fails, the agent updates its memory of user preferences but never interrogates the decision logic that produced the failure, leaving its reasoning process structurally unchanged regardless of how many mistakes it accumulates. To address this bottleneck, we propose SAGER (Self-Evolving Agent for Personalized Recommendation), the first recommendation agent framework in which each user is equipped with a dedicated policy skill, a structured natural-language document encoding personalized decision principles that evolves continuously through interaction. SAGER introduces a two-representation skill architecture that decouples a rich evolution substrate from a minimal inference-time injection, an incremental contrastive chain-of-thought engine that diagnoses reasoning flaws by contrasting accepted against unchosen items while preserving accumulated priors, and skill-augmented listwise reasoning that creates fine-grained decision boundaries where the evolved skill provides genuine discriminative value. Experiments on four public benchmarks demonstrate that SAGER achieves state-of-the-art performance, with gains orthogonal to memory accumulation, confirming that personalizing the reasoning process itself is a qualitatively distinct source of recommendation improvement.

4. 【2604.14878】GenRec: A Preference-Oriented Generative Framework for Large-Scale Recommendation

Link: https://arxiv.org/abs/2604.14878

Authors: Yanyan Zou, Junbo Qi, Lunsong Huang, Yu Li, Kewei Xu, Jiabao Gao, Binglei Zhao, Xuanhua Yang, Sulong Xu, Shengjie Li

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Generative Retrieval, offers a promising, next-token prediction, promising paradigm, paradigm for recommendation

Comments: SIGIR 2026 Camera-Ready version

Abstract:Generative Retrieval (GR) offers a promising paradigm for recommendation through next-token prediction (NTP). However, scaling it to large-scale industrial systems introduces three challenges: (i) within a single request, identical model inputs may produce inconsistent outputs due to the pagination request mechanism; (ii) the prohibitive cost of encoding long user behavior sequences with multi-token item representations based on semantic IDs; and (iii) aligning the generative policy with nuanced user preference signals. We present GenRec, a preference-oriented generative framework deployed on the JD App that addresses the above challenges within a single decoder-only architecture. For the training objective, we propose a Page-wise NTP task, which supervises over an entire interaction page rather than each interacted item individually, providing a denser gradient signal and resolving the one-to-many ambiguity of point-wise training. On the prefilling side, an asymmetric linear Token Merger compresses multi-token Semantic IDs in the prompt while preserving full-resolution decoding, reducing input length by ~2x with negligible accuracy loss. To further align outputs with user satisfaction, we introduce GRPO-SR, a reinforcement learning method that pairs Group Relative Policy Optimization with NLL regularization for training stability, and employs Hybrid Rewards combining a dense reward model with a relevance gate to mitigate reward hacking. In month-long online A/B tests serving production traffic, GenRec achieves a 9.5% improvement in click count and 8.7% in transaction count over the existing pipeline.

5. 【2604.14839】Well Begun is Half Done: Training-Free and Model-Agnostic Semantically Guaranteed User Representation Initialization for Multimodal Recommendation

Link: https://arxiv.org/abs/2604.14839

Authors: Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Hewei Wang, Jianheng Tang, Wei Wang, Xiping Hu, Edith C. H. Ngai

Categories: Information Retrieval (cs.IR)

Keywords: mitigate data sparsity, Recent advancements, gained significant attention, improve recommendation accuracy, leverage diverse modality

Comments: Accepted by SIGIR 2026

Abstract:Recent advancements in multimodal recommendations, which leverage diverse modality information to mitigate data sparsity and improve recommendation accuracy, have gained significant attention. However, existing multimodal recommendations overlook the critical role of user representation initialization. Unlike items, which are naturally associated with rich modality information, users lack such inherent information. Consequently, item representations initialized based on meaningful modality information and user representations initialized randomly exhibit a significant semantic gap. To this end, we propose a Semantically Guaranteed User Representation Initialization (SG-URInit). SG-URInit constructs the initial representation for each user by integrating both the modality features of the items they have interacted with and the global features of their corresponding clusters. SG-URInit enables the initialization of semantically enriched user representations that effectively capture both local (item-level) and global (cluster-level) semantics. Our SG-URInit is training-free and model-agnostic, meaning it can be seamlessly integrated into existing multimodal recommendation models without incurring any additional computational overhead during training. Extensive experiments on multiple real-world datasets demonstrate that incorporating SG-URInit into advanced multimodal recommendation models significantly enhances recommendation performance. Furthermore, the results show that SG-URInit can further alleviate the item cold-start problem and also accelerate model convergence, making it an efficient and practical solution for multimodal recommendations.

6. 【2604.14833】Federated User Behavior Modeling for Privacy-Preserving LLM Recommendation

Link: https://arxiv.org/abs/2604.14833

Authors: Lei Guo, Hongyun Yang, Pengjie Ren, Tong Chen, Hui Liu, Zhumin Chen

Categories: Information Retrieval (cs.IR)

Keywords: Large Language Models, shown great success, Large Language, recommender systems, shown great

Comments:

Abstract:Large Language Models (LLMs) have shown great success in recommender systems. However, the limited and sparse nature of user data often restricts the LLM's ability to effectively model behavior patterns. To address this, existing studies have explored cross-domain solutions by conducting Cross-Domain Recommendation (CDR) tasks. However, previous methods typically assume that domains overlap and can be accessed readily. None of the LLM-based methods address privacy preservation in CDR settings, that is, Privacy-Preserving Cross-Domain Recommendation (PPCDR). Conducting non-overlapping PPCDR with LLMs is challenging because: 1) The inability to share user identity or behavioral data across domains impedes effective cross-domain alignment. 2) The heterogeneity of data modalities across domains complicates knowledge integration. 3) Fusing collaborative filtering signals from traditional recommendation models with LLMs is difficult, as they operate within distinct feature spaces. To address the above issues, we propose SF-UBM, a Semantic-enhanced Federated User Behavior Modeling method. Specifically, to deal with Challenge 1, we leverage natural language as a universal bridge to connect disjoint domains via a semantic-enhanced federated architecture. Here, text-based item representations are encrypted and shared, while user-specific data remains local. To handle Challenge 2, we design a Fact-counter Knowledge Distillation module to integrate domain-agnostic knowledge with domain-specific knowledge across different data modalities. To tackle Challenge 3, we project pre-learned user preferences and cross-domain item representations into the soft prompt space, aligning behavioral and semantic spaces for effective LLM learning. We conduct extensive experiments on three pairs of real-world domains, and the experimental results demonstrate the effectiveness of SF-UBM compared to recent SOTA methods.

7. 【2604.14613】Uncertainty-aware Generative Learning Path Recommendation with Cognition-Adaptive Diffusion

Link: https://arxiv.org/abs/2604.14613

Authors: Xiangrui Xiong, Hang Liang, Baiyang Chen, Zifei Pan, Yanli Lee

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Learning Path Recommendation, diverse learning goals, Generative Learning Path, Uncertainty-aware Generative Learning, Learning Path

Comments: 20 pages, 4 figures

Abstract:Learning Path Recommendation (LPR) is critical for personalized education, yet current methods often fail to account for historical interaction uncertainty (e.g., lucky guesses or accidental slips) and lack adaptability to diverse learning goals. We propose U-GLAD (Uncertainty-aware Generative Learning Path Recommendation with Cognition-Adaptive Diffusion). To address representation bias, the framework models cognitive states as probability distributions, capturing the learner's underlying true state via a Gaussian LSTM. To ensure highly personalized recommendation, a goal-oriented concept encoder utilizes multi-head attention and objective-specific transformations to dynamically align concept semantics with individual learning goals, generating uniquely tailored embeddings. Unlike traditional discriminative ranking approaches, our model employs a generative diffusion model to predict the latent representation of the next optimal concept. Extensive evaluations on three public datasets demonstrate that U-GLAD significantly outperforms representative baselines. Further analyses confirm its superior capability in perceiving interaction uncertainty and providing stable, goal-driven recommendation paths.

8. 【2604.14598】Category-based and Popularity-guided Video Game Recommendation: A Balance-oriented Framework

Link: https://arxiv.org/abs/2604.14598

Authors: Xiping Li, Jianghong Ma, Kangzhe Liu, Shanshan Feng, Haijun Zhang, Yutong Wang

Categories: Information Retrieval (cs.IR)

Keywords: experienced substantial growth, video game industry, game, recent years, substantial growth

Comments: Published in The Web Conference (WWW) 2024. 11 pages, 8 figures

Abstract:In recent years, the video game industry has experienced substantial growth, presenting players with a vast array of game choices. This surge in options has spurred the need for a specialized recommender system tailored for video games. However, current video game recommendation approaches tend to prioritize accuracy over diversity, potentially leading to unvaried game suggestions. In addition, the existing game recommendation methods commonly lack the ability to establish strict connections between games to enhance accuracy. Furthermore, many existing diversity-focused methods fail to leverage crucial item information, such as item category and popularity during neighbor modeling and message propagation. To address these challenges, we introduce a novel framework, called CPGRec, comprising three modules, namely accuracy-driven, diversity-driven, and comprehensive modules. The first module extends the state-of-the-art accuracy-focused game recommendation method by connecting games in a more stringent manner to enhance recommendation accuracy. The second module connects neighbors with diverse categories within the proposed game graph and harnesses the advantages of popular game nodes to amplify the influence of long-tail games within the player-game bipartite graph, thereby enriching recommendation diversity. The third module combines the above two modules and employs a new negative-sample rating score reweighting method to balance accuracy and diversity. Experimental results on the Steam dataset demonstrate the effectiveness of our proposed method in improving game recommendations. The dataset and source codes are anonymously released at: this https URL.

9. 【2604.14586】CPGRec+: A Balance-oriented Framework for Personalized Video Game Recommendations

Link: https://arxiv.org/abs/2604.14586

Authors: Xiping Li, Aier Yang, Jianghong Ma, Kangzhe Liu, Shanshan Feng, Haijun Zhang, Yi Zhao

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: industry requires advanced, requires advanced recommender, gaming industry requires, Graph Neural Network, advanced recommender systems

Comments: Published in ACM Transactions on Information Systems (TOIS). 43 pages, 9 figures

Abstract:The rapid expansion of the gaming industry requires advanced recommender systems tailored to its dynamic landscape. Existing Graph Neural Network (GNN)-based methods primarily prioritize accuracy over diversity, overlooking their inherent trade-off. To address this, we previously proposed CPGRec, a balance-oriented gaming recommender system. However, CPGRec fails to account for critical disparities in player-game interactions, which carry varying significance in reflecting players' personal preferences and may exacerbate the over-smoothing issues inherent in GNN-based models. Moreover, existing approaches underutilize the reasoning capabilities and extensive knowledge of large language models (LLMs) in addressing these limitations. To bridge this gap, we propose two new modules. First, the Preference-informed Edge Reweighting (PER) module assigns signed edge weights to qualitatively distinguish significant player interests and disinterests, and then quantitatively measures preference strength to mitigate over-smoothing in graph convolutions. Second, the Preference-informed Representation Generation (PRG) module leverages LLMs to generate contextualized descriptions of games and players by reasoning about personal preferences through a comparison of global and personal interests, thereby refining representations of players and games. Experiments on two Steam datasets demonstrate CPGRec+'s superior accuracy and diversity over state-of-the-art models. The code is accessible at this https URL.

10. 【2604.14581】Behavior-Aware Dual-Channel Preference Learning for Heterogeneous Sequential Recommendation

Link: https://arxiv.org/abs/2604.14581

Authors: Jing Xiao, Dongqi Wu, Liwei Pan, Yawen Luo, Weike Pan, Zhong Ming

Subjects: Information Retrieval (cs.IR)

Keywords: facilitate precise sequential, Heterogeneous sequential recommendation, precise sequential recommendation, dynamic behavior dependencies, learn dynamic behavior

Comments: none

Abstract:Heterogeneous sequential recommendation (HSR) aims to learn dynamic behavior dependencies from the diverse behaviors of user-item interactions to facilitate precise sequential recommendation. Despite many efforts yielding promising achievements, there are still challenges in modeling heterogeneous behavior data. One significant issue is the inherent sparsity of real-world data, which can weaken the recommendation performance. Although auxiliary behaviors (e.g., clicks) partially address this problem, they inevitably introduce some noise, and the sparsity of the target behavior (e.g., purchases) remains unresolved. Additionally, contrastive learning-based augmentation in existing methods often focuses on a single behavior type, overlooking fine-grained user preferences and losing valuable information. To address these challenges, we have meticulously designed a behavior-aware dual-channel preference learning framework (BDPL). This framework begins with the construction of customized behavior-aware subgraphs to capture personalized behavior transition relationships, followed by a novel cascade-structured graph neural network to aggregate node context information. We then model and enhance user representations through a preference-level contrastive learning paradigm, considering both long-term and short-term preferences. Finally, we fuse the overall preference information using an adaptive gating mechanism to predict the next item the user will interact with under the target behavior. Extensive experiments on three real-world datasets demonstrate the superiority of our BDPL over the state-of-the-art models.

11. 【2604.14572】Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Link: https://arxiv.org/abs/2604.14572

Authors: Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: grounds LLM responses, Retrieval-Augmented Generation, grounds LLM, LLM responses, limiting its ability

Comments: none

Abstract:Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results: it never sees how the corpus is organized or what it has not yet retrieved, limiting its ability to backtrack or combine scattered evidence. We present Corpus2Skill, which distills a document corpus into a hierarchical skill directory offline and lets an LLM agent navigate it at serve time. The compilation pipeline iteratively clusters documents, generates LLM-written summaries at each level, and materializes the result as a tree of navigable skill files. At serve time, the agent receives a bird's-eye view of the corpus, drills into topic branches via progressively finer summaries, and retrieves full documents by ID. Because the hierarchy is explicitly visible, the agent can reason about where to look, backtrack from unproductive paths, and combine evidence across branches. On WixQA, an enterprise customer-support benchmark for RAG, Corpus2Skill outperforms dense retrieval, RAPTOR, and agentic RAG baselines across all quality metrics.

12. 【2604.14510】NewsTorch: A PyTorch-based Toolkit for Learner-oriented News Recommendation

Link: https://arxiv.org/abs/2604.14510

Authors: Rongyao Wang, Veronica Liesaputra, Zhiyi Huang

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: information overload, recent years, recommender systems, systems are devised, devised to alleviate

Comments: 3 pages

Abstract:News recommender systems are devised to alleviate the information overload, attracting more and more researchers' attention in recent years. The lack of a dedicated learner-oriented news recommendation toolkit hinders the advancement of research in news recommendation. We propose a PyTorch-based news recommendation toolkit called NewsTorch, developed to support learners in acquiring both conceptual understanding and practical experience. This toolkit provides a modular, decoupled, and extensible framework with a learner-friendly GUI platform that supports dataset downloading and preprocessing. It also enables training, validation, and testing of state-of-the-art neural news recommendation models with standardized evaluation metrics, ensuring fair comparison and reproducible experiments. Our open-source toolkit is released on Github: this https URL.

13. 【2604.14488】Controlling Authority Retrieval: A Missing Retrieval Objective for Authority-Governed Knowledge

Link: https://arxiv.org/abs/2604.14488

Authors: Andre Bacellar

Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: remaining semantically distant, Controlling Authority Retrieval, knowledge accumulates, accumulates under formal, document can formally

Comments: 23 pages, 13 tables; code and data at [this https URL](https://github.com/andremir/car-retrieval)

Abstract:In any domain where knowledge accumulates under formal authority -- law, drug regulation, software security -- a later document can formally void an earlier one while remaining semantically distant from it. We formalize this as Controlling Authority Retrieval (CAR): recovering the active frontier front(cl(A_k(q))) of the authority closure of the semantic anchor set -- a different mathematical problem from argmax_d s(q,d). The two central results are: Theorem 4 (CAR-Correctness Characterization) gives necessary-and-sufficient conditions on any retrieved set R for TCA(R,q)=1 -- frontier inclusion and no-ignored-superseder -- independent of how R was produced. Proposition 2 (Scope Identifiability Upper Bound) establishes phi(q) as a hard worst-case ceiling: for any scope-indexed algorithm, TCA@k = phi(q) * R_anchor(q), proved by an adversarial permutation argument. Three independent real-world corpora validate the proved structure: security advisories (Dense TCA@5=0.270, two-stage 0.975), SCOTUS overruling pairs (Dense=0.172, two-stage 0.926), FDA drug records (Dense=0.064, two-stage 0.774). A GPT-4o-mini experiment shows the downstream cost: Dense RAG produces explicit "not patched" claims for 39% of queries where a patch exists; Two-Stage cuts this to 16%. Four benchmark datasets, domain adapters, and a single-command scorer are released at this https URL.
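
The "active frontier of the authority closure" can be illustrated with a small sketch. The abstract does not give the formal definitions of `cl`, `front`, or `A_k`, so this is one plausible reading rather than the paper's method: the closure follows "superseded by" edges outward from the semantic anchors, and the frontier keeps only documents that nothing else in the closure supersedes. The function names, the `superseded_by` mapping, and the toy document IDs are all hypothetical.

```python
from collections import deque


def authority_closure(anchors, superseded_by):
    """Collect the anchors plus every document reachable via 'superseded by' edges."""
    seen = set(anchors)
    queue = deque(anchors)
    while queue:
        doc = queue.popleft()
        for newer in superseded_by.get(doc, ()):
            if newer not in seen:
                seen.add(newer)
                queue.append(newer)
    return seen


def active_frontier(closure, superseded_by):
    """Keep only closure documents that no other closure document supersedes."""
    return {
        doc
        for doc in closure
        if not any(newer in closure for newer in superseded_by.get(doc, ()))
    }


# Hypothetical toy corpus: v1 is voided by v2, which is voided by a patch notice.
superseded_by = {
    "advisory-v1": ["advisory-v2"],
    "advisory-v2": ["patch-notice"],
}
anchors = {"advisory-v1"}  # what a purely semantic retriever might return

closure = authority_closure(anchors, superseded_by)
frontier = active_frontier(closure, superseded_by)
```

In this toy case a similarity-only retriever stops at `advisory-v1`, while the frontier contains only `patch-notice`, the document still in force.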

14. 【2604.14403】A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.14403

Authors: Julian Killingback, Ofer Meshi, Henry Li, Hamed Zamani, Maryam Karimzadehgan

Subjects: Information Retrieval (cs.IR)

Keywords: approaches generally assume, powerful servers removed, approaches generally, end user, Traditional Retrieval-Augmented Generation

Comments: none

Abstract:Traditional Retrieval-Augmented Generation (RAG) approaches generally assume that retrieval and generation occur on powerful servers removed from the end user. While this reduces local hardware constraints, it introduces significant drawbacks: privacy concerns regarding data access, recurring maintenance and storage costs, increased latency, and the necessity of an internet connection. On-device RAG addresses these challenges by executing the entire pipeline locally, making it ideal for querying sensitive personal information such as financial documents, contact details, and medical history. However, on-device deployment necessitates a delicate balance between limited memory and disk space. Specifically, the context size provided to the generative model must be restricted to manage KV cache and attention memory usage, while the size of stored embeddings must be minimized to preserve disk space. In this work, we propose a unified model that compresses the RAG context and utilizes the same representations for retrieval. This approach minimizes disk utilization compared to using separate representations, while significantly reducing the context size required for generation. With an average of 1/10 of the context, our model matches the performance of a traditional RAG reader without increasing storage requirements compared to a multi-vector retrieval model. This approach represents the first model to unify retrieval and context compression using a shared model and representation. We believe this work will inspire further consolidation of distinct models to optimize on-device performance.

15. 【2604.14362】APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI

Link: https://arxiv.org/abs/2604.14362

Authors: Pratyay Banerjee, Masud Moshtaghi, Shivashankar Subramanian, Amita Misra, Ankit Chadha

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Large language models, simply enlarging context, enlarging context windows, applying naive retrieval, Large language

Comments: Accepted to ACL 2026 Mains

Abstract:Large language models still struggle with reliable long-term conversational memory: simply enlarging context windows or applying naive retrieval often introduces noise and destabilizes responses. We present APEX-MEM, a conversational memory system that combines three key innovations: (1) a property graph which uses domain-agnostic ontology to structure conversations as temporally grounded events in an entity-centric framework, (2) append-only storage that preserves the full temporal evolution of information, and (3) a multi-tool retrieval agent that understands and resolves conflicting or evolving information at query time, producing a compact and contextually relevant memory summary. This retrieval-time resolution preserves the full interaction history while suppressing irrelevant details. APEX-MEM achieves 88.88% accuracy on LOCOMO's Question Answering task and 86.2% on LongMemEval, outperforming state-of-the-art session-aware approaches and demonstrating that structured property graphs enable more temporally coherent long-term conversational reasoning.

16. 【2604.14256】Evaluation of Agents under Simulated AI Marketplace Dynamics

Link: https://arxiv.org/abs/2604.14256

Authors: To Eun Kim, Alireza Salemi, Hamed Zamani, Fernando Diaz

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: large language models, Modern information access, access ecosystems consist, language models, Modern information

Comments: SIGIR 2026

Abstract:Modern information access ecosystems consist of mixtures of systems, such as retrieval systems and large language models, and increasingly rely on marketplaces to mediate access to models, tools, and data, making competition between systems inherent to deployment. In such settings, outcomes are shaped not only by benchmark quality but also by competitive pressure, including user switching, routing decisions, and operational constraints. Yet evaluation is still largely conducted on static benchmarks with accuracy-focused measures that assume systems operate in isolation. This mismatch makes it difficult to predict post-deployment success and obscures competitive effects such as early-adoption advantages and market dominance. We introduce Marketplace Evaluation, a simulation-based paradigm that evaluates information access systems as participants in a competitive marketplace. By simulating repeated interactions and evolving user and agent preferences, the framework enables longitudinal evaluation and marketplace-level metrics, such as retention and market share, that complement and can extend beyond traditional accuracy-based metrics. We formalize the framework and outline a research agenda, motivated by business and economics, around marketplace simulation, metrics, optimization, and adoption in evaluation campaigns like TREC.

17. 【2604.14227】FRESCO: Benchmarking and Optimizing Re-rankers for Evolving Semantic Conflict in Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.14227

Authors: Sohyun An(1 and 2), Hayeon Lee(1), Shuibenyang Yuan(1), Chun-cheng Jason Chen(1), Cho-Jui Hsieh(2), Vijai Mohan(1), Alexander Min(1) ((1) Meta Superintelligence Labs, (2) UCLA)

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: large language models, Retrieval-Augmented Generation, language models, key approach, approach to mitigating

Comments: none

Abstract:Retrieval-Augmented Generation (RAG) is a key approach to mitigating the temporal staleness of large language models (LLMs) by grounding responses in up-to-date evidence. Within the RAG pipeline, re-rankers play a pivotal role in selecting the most useful documents from retrieved candidates. However, existing benchmarks predominantly evaluate re-rankers in static settings and do not adequately assess performance under evolving information -- a critical gap, as real-world systems often must choose among temporally different pieces of evidence. To address this limitation, we introduce FRESCO (Factual Recency and Evolving Semantic COnflict), a benchmark for evaluating re-rankers in temporally dynamic contexts. By pairing recency-seeking queries with historical Wikipedia revisions, FRESCO tests whether re-rankers can prioritize factually recent evidence while maintaining semantic relevance. Our evaluation reveals a consistent failure mode across existing re-rankers: a strong bias toward older, semantically rich documents, even when they are factually obsolete. We further investigate an instruction optimization framework to mitigate this issue. By identifying Pareto-optimal instructions that balance Evolving and Non-Evolving Knowledge tasks, we obtain gains of up to 27% on Evolving Knowledge tasks while maintaining competitive performance on Non-Evolving Knowledge tasks.
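
Selecting Pareto-optimal instructions over two objectives can be sketched as follows. This only illustrates the general Pareto-front idea named in the abstract; the instruction names and scores are made-up placeholders, and the paper's actual optimization procedure is not described here.

```python
def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates.

    Each candidate maps to (evolving_score, non_evolving_score); higher is better.
    A candidate is dominated if another is at least as good on both scores
    and strictly better on at least one.
    """
    front = []
    for name, (evolving, non_evolving) in candidates.items():
        dominated = any(
            (e2 >= evolving and n2 >= non_evolving)
            and (e2 > evolving or n2 > non_evolving)
            for other, (e2, n2) in candidates.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical re-ranker instructions with invented scores.
scores = {
    "baseline": (0.40, 0.80),
    "prefer-recent": (0.65, 0.78),
    "recent-only": (0.70, 0.60),
    "verbose": (0.38, 0.75),
}
front = pareto_front(scores)
```

Here `verbose` drops out because `baseline` beats it on both tasks, while the other three each represent a different trade-off between the two task types.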

18. 【2604.14223】TRACE: A Conversational Framework for Sustainable Tourism Recommendation with Agentic Counterfactual Explanations

Link: https://arxiv.org/abs/2604.14223

Authors: Ashmi Banerjee, Adithi Satish, Wolfgang Wörndl, Yashar Deldjoo

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Traditional conversational travel, carbon-intensive travel choices, conversational travel recommender, travel recommender systems, recommender systems primarily

Comments: none

Abstract:Traditional conversational travel recommender systems primarily optimize for user relevance and convenience, often reinforcing popular, overcrowded destinations and carbon-intensive travel choices. To address this, we present TRACE (Tourism Recommendation with Agentic Counterfactual Explanations), a multi-agent, LLM-based framework that promotes sustainable tourism through interactive nudging. TRACE uses a modular orchestrator-worker architecture where specialized agents elicit latent sustainability preferences, construct structured user personas, and generate recommendations that balance relevance with environmental impact. A key innovation lies in its use of agentic counterfactual explanations and LLM-driven clarifying questions, which together surface greener alternatives and refine understanding of intent, fostering user reflection without coercion. User studies and semantic alignment analyses demonstrate that TRACE effectively supports sustainable decision-making while preserving recommendation quality and interactive responsiveness. TRACE is implemented on Google's Agent Development Kit, with full code, Docker setup, prompts, and a publicly available demo video to ensure reproducibility. A project summary, including all resources, prompts, and demo access, is available at this https URL.

Journal reference: Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), July 20–24, 2026, Melbourne, VIC, Australia

Related DOI: https://doi.org/10.1145/3805712.3808370

19. 【2604.14222】Adaptive Query Routing: A Tier-Based Framework for Hybrid Retrieval Across Financial, Legal, and Medical Documents

Link: https://arxiv.org/abs/2604.14222

Authors: Afshan Hashmi

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Model, grounding Large Language, Language Model outputs, Large Language, Language Model

Comments: none

Abstract:Retrieval-Augmented Generation (RAG) has become the standard paradigm for grounding Large Language Model outputs in external knowledge. Lumer et al. [1] presented the first systematic evaluation comparing vector-based agentic RAG against hierarchical node-based reasoning systems for financial document QA across 1,200 SEC filings, finding vector-based systems achieved a 68% win rate. Concurrently, the PageIndex framework [2] demonstrated 98.7% accuracy on FinanceBench through purely reasoning-based retrieval. This paper extends their work by: (i) implementing and evaluating three retrieval architectures: Vector RAG, Tree Reasoning, and the proposed Adaptive Hybrid Retrieval (AHR) across financial, legal, and medical domains; (ii) introducing a four-tier query complexity benchmark; and (iii) employing GPT-4-powered LLM-as-judge evaluation. Experiments reveal that Tree Reasoning achieves the highest overall score (0.900), but no single paradigm dominates across all tiers: Vector RAG wins on multi-document synthesis (Tier 4, score 0.900), while the Hybrid AHR achieves the best performance on cross-reference (0.850) and multi-section queries (0.929). Cross-reference recall reaches 100% for tree-based and hybrid approaches versus 91.7% for vector search, quantifying a critical capability gap. Validation on FinanceBench (150 expert-annotated questions on real SEC 10-K and 10-Q filings) confirms and strengthens these findings: Tree Reasoning scores 0.938, Hybrid AHR 0.901, and Vector RAG 0.821, with the Tree--Vector quality gap widening to 11.7 percentage points on real-world documents. These findings support the development of adaptive retrieval systems that dynamically select strategies based on query complexity and document structure. All code and data are publicly available.

20. 【2604.14220】Knowledge Graph RAG: Agentic Crawling and Graph Construction in Enterprise Documents

Link: https://arxiv.org/abs/2604.14220

Authors: Koushik Chakraborty, Koyel Guha

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: enterprise document ecosystems, research paper addresses, complex enterprise document, document ecosystems, research paper

Comments: 15 pages, 4 figures

Abstract:This research paper addresses the limitations of semantic search in complex enterprise document ecosystems. Traditional RAG pipelines often fail to capture hierarchical and interconnected information, leading to retrieval inaccuracies. We propose Agentic Knowledge Graphs featuring Recursive Crawling as a robust solution for navigating superseding logic and multi-hop references. Our benchmark evaluation using the Code of Federal Regulations (CFR) demonstrates that this Knowledge Graph-enhanced approach achieves a 70% accuracy improvement over standard vector-based RAG systems, providing exhaustive and precise answers for complex regulatory queries.

21. 【2604.14215】PriHA: A RAG-Enhanced LLM Framework for Primary Healthcare Assistant in Hong Kong

Link: https://arxiv.org/abs/2604.14215

Authors: Richard Wai Cheung Chan, Shanru Lin, Ya-nan Ma, Hao Chen, Liangjun Jiang, Wenqi Fan

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Kong SAR Government, public health expenditures, Hong Kong SAR, SAR Government, Government is shifting

Comments: Accepted to PAKDD 2026

Abstract:To address the unsustainable rise in public health expenditures, the Hong Kong SAR Government is shifting its strategic focus to primary healthcare and encouraging citizens to use community resources to self-manage their health. However, official clinical guidelines are fragmented across disparate departments and formats, creating significant access barriers. While general-purpose Large Language Models (LLMs) such as ChatGPT and DeepSeek offer potential solutions for information accessibility, they are prone to generating factually inaccurate content due to a lack of localized and domain-specific knowledge. To this end, we propose a Retrieval-Augmented Generation-Enhanced LLM system as Primary Healthcare Assistant (PriHA) in Hong Kong. Specifically, a tri-stage pipeline is proposed that leverages a query optimizer to generalize user intent-oriented sub-queries, followed by a novel Dual Retrieval Augmented Generation (DRAG) architecture for mixed-source retrieval and context-reorganized generation. Comprehensive experiments and a detailed case study demonstrate that our proposed method can outperform both ablations and baseline in terms of accuracy and clarity. Our research provides a reliable and traceable dialogue retrieval framework for exploring other high-risk, localized application scenarios.

Computer Vision

1. 【2604.15312】Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo

Link: https://arxiv.org/abs/2604.15312

Authors: Ninghui Xu, Fabio Tosi, Lihui Wang, Jiawei Han, Luca Bartolomei, Zhiting Yao, Matteo Poggi, Stefano Mattoccia

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Conventional frame-based cameras, capture rich contextual, rich contextual information, limited temporal resolution, frame-based cameras capture

Comments: CVPR 2026. Code URL: [this https URL](https://github.com/xnh97/Bi-CMPStereo)

Abstract:Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to marginalization of domain-specific cues essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.

2. 【2604.15311】LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

Link: https://arxiv.org/abs/2604.15311

Authors: Zhanhao Liang, Tao Yang, Jie Wu, Chengjian Feng, Liang Zheng

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: flow matching, human preferences, paper focuses, flow matching models, generation

Comments: Accepted by CVPR 2026. Project page: [this https URL](https://rockeycoss.github.io/leapalign/)

Abstract:This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude, instead of completely removing them as done in previous works. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.

3. 【2604.15310】TokenLight: Precise Lighting Control in Images using Attribute Tokens

Link: https://arxiv.org/abs/2604.15310

Authors: Sumit Chaturvedi, Yannick Hold-Geoffroy, Mengwei Ren, Jingyuan Liu, He Zhang, Yiqun Mei, Julie Dorsey, Zhixin Shu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: multiple illumination attributes, paper presents, enables precise, precise and continuous, continuous control

Comments: 32 pages, CVPR 2026

Abstract:This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly. Project page: this http URL

4. 【2604.15309】MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Link: https://arxiv.org/abs/2604.15309

Authors: Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao, Weiwei Guo, Lili Qiu, Mingxi Cheng, Qi Dai, Zhendong Wang, Zhengyuan Yang, Xue Yang, Ji Li, Lijuan Wang, Chong Luo

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Artificial Intelligence Generated, Intelligence Generated Content, Artificial Intelligence, increasingly adopted paradigm, tools enables images

Comments: none

Abstract:The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code Data: this https URL.

5. 【2604.15308】RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Link: https://arxiv.org/abs/2604.15308

Authors: Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu, Qian Zhang, Xinggang Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: High-level autonomous driving, multimodal future uncertainties, High-level autonomous, modeling multimodal future, motion planners capable

Comments: Project page: [this https URL](https://hgao-cv.github.io/RAD-2)

Abstract:High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
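
The generate-then-rerank split described above can be sketched in a few lines. Everything here is a stand-in: a random sampler replaces the diffusion generator, a hand-written scoring function replaces the RL-optimized discriminator, and the speed limit and penalty weights are arbitrary assumptions. Only the control flow (sample many candidates, score each, keep the best) reflects the abstract.

```python
import random


def generate_candidates(rng, num_candidates, horizon):
    """Stand-in generator: each candidate is a random speed profile in m/s."""
    return [
        [rng.uniform(0.0, 15.0) for _ in range(horizon)]
        for _ in range(num_candidates)
    ]


def score(trajectory, speed_limit=10.0):
    """Stand-in discriminator: reward progress, penalize speeding and abrupt changes."""
    progress = sum(trajectory)
    speeding = sum(max(0.0, v - speed_limit) for v in trajectory)
    jerkiness = sum(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))
    return progress - 5.0 * speeding - 0.5 * jerkiness


def plan(rng, num_candidates=16, horizon=8):
    """Sample candidates, then return the one the scorer ranks highest."""
    candidates = generate_candidates(rng, num_candidates, horizon)
    return max(candidates, key=score)


best = plan(random.Random(0))
```

Keeping the scorer separate from the sampler means the scoring signal only has to rank a handful of complete candidates, which is the decoupling the abstract credits for more stable optimization.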

6. 【2604.15301】Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation

Link: https://arxiv.org/abs/2604.15301

Authors: Yiyang Jiang, Li Zhang, Xiao-Yong Wei, Li Qing

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: systems quietly assume, signing map directly, SLT systems quietly, spoken-language words, systems quietly

Comments: Accepted to ACL 2026 Main

Abstract:Many SLT systems quietly assume that brief chunks of signing map directly to spoken-language words. That assumption breaks down because signers often create meaning on the fly using context, space, and movement. We revisit SLT and argue that it is mainly a cross-modal reasoning task, not just a straightforward video-to-text conversion. We thus introduce a reasoning-driven SLT framework that uses an ordered sequence of latent thoughts as an explicit middle layer between the video and the generated text. These latent thoughts gradually extract and organize meaning over time. On top of this, we use a plan-then-ground decoding method: the model first decides what it wants to say, and then looks back at the video to find the evidence. This separation improves coherence and faithfulness. We also built and released a new large-scale gloss-free SLT dataset with stronger context dependencies and more realistic meanings. Experiments across several benchmarks show consistent gains over existing gloss-free methods. Code and data will be released upon acceptance at this https URL.

7. 【2604.15299】AnimationBench: Are Video Models Good at Character-Centric Animation?

Link: https://arxiv.org/abs/2604.15299

Authors: Leyi Wu, Pengjun Fang, Kai Sun, Yazhou Xing, Yinwei Wu, Songsong Wang, Ziqi Huang, Dan Zhou, Yingqing He, Ying-Cong Chen, Qifeng Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: convincing animated results, recent methods producing, methods producing increasingly, producing increasingly convincing, increasingly convincing animated

Comments: Project Page: [this https URL](https://animationbench.github.io) Code: [this https URL](https://github.com/VideoVerses/AnimationBench)

Click to view abstract

Abstract:Video generation has advanced rapidly, with recent methods producing increasingly convincing animated results. However, existing benchmarks, largely designed for realistic videos, struggle to evaluate animation-style generation with its stylized appearance, exaggerated motion, and character-centric consistency. They also rely on fixed prompt sets and rigid pipelines, offering limited flexibility for open-domain content and custom evaluation needs. To address this gap, we introduce AnimationBench, the first systematic benchmark for evaluating animation image-to-video generation. AnimationBench operationalizes the Twelve Basic Principles of Animation and IP Preservation into measurable evaluation dimensions, together with Broader Quality Dimensions including semantic consistency, motion rationality, and camera motion consistency. The benchmark supports both a standardized closed-set evaluation for reproducible comparison and a flexible open-set evaluation for diagnostic analysis, and leverages visual-language models for scalable assessment. Extensive experiments show that AnimationBench aligns well with human judgment and exposes animation-specific quality differences overlooked by realism-oriented benchmarks, leading to more informative and discriminative evaluation of state-of-the-art image-to-video (I2V) models.

8. 【2604.15291】AD4AD: Benchmarking Visual Anomaly Detection Models for Safer Autonomous Driving

Link: https://arxiv.org/abs/2604.15291

Authors: Fabrizio Genilotti, Arianna Stropeni, Gionata Grotto, Francesco Borsatti, Manuel Barusco, Davide Dalle Pezze, Gian Antonio Susto

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: driving depends heavily, training data distribution, machine vision system, data distribution, autonomous driving depends

Comments:

Click to view abstract

Abstract:The reliability of a machine vision system for autonomous driving depends heavily on its training data distribution. When a vehicle encounters significantly different conditions, such as atypical obstacles, its perceptual capabilities can degrade substantially. Unlike many domains where errors carry limited consequences, failures in autonomous driving translate directly into physical risk for passengers, pedestrians, and other road users. To address this challenge, we explore Visual Anomaly Detection (VAD) as a solution. VAD enables the identification of anomalous objects not present during training, allowing the system to alert the driver when an unfamiliar situation is detected. Crucially, VAD models produce pixel-level anomaly maps that can guide driver attention to specific regions of concern without requiring any prior assumptions about the nature or form of the hazard. We benchmark eight state-of-the-art VAD methods on AnoVox, the largest synthetic dataset for anomaly detection in autonomous driving. In particular, we evaluate performance across four backbone architectures spanning from large networks to lightweight ones such as MobileNet and DeiT-Tiny. Our results demonstrate that VAD transfers effectively to road scenes. Notably, Tiny-Dinomaly achieves the best accuracy-efficiency trade-off for edge deployment, matching full-scale localization performance at a fraction of the memory cost. This study represents a concrete step toward safer, more responsible deployment of autonomous vehicles, ultimately improving protection for passengers, pedestrians, and all road users.

9. 【2604.15284】GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

Link: https://arxiv.org/abs/2604.15284

Authors: Roni Itkin, Noam Issachar, Yehonatan Keypur, Anpei Chen, Sagie Benaim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: efficient spatial allocation, Gaussian Splatting, rendering fidelity, efficient spatial, directly dictates

Comments:

Click to view abstract

Abstract:The efficient spatial allocation of primitives serves as the foundation of 3D Gaussian Splatting, as it directly dictates the synergy between representation compactness, reconstruction speed, and rendering fidelity. Previous solutions, whether based on iterative optimization or feed-forward inference, suffer from significant trade-offs between these goals, mainly due to the reliance on local, heuristic-driven allocation strategies that lack global scene awareness. Specifically, current feed-forward methods are largely pixel-aligned or voxel-aligned. By unprojecting pixels into dense, view-aligned primitives, they bake redundancy into the 3D asset. As more input views are added, the representation size increases and global consistency becomes fragile. To this end, we introduce GlobalSplat, a framework built on the principle of align first, decode later. Our approach learns a compact, global, latent scene representation that encodes multi-view input and resolves cross-view correspondences before decoding any explicit 3D geometry. Crucially, this formulation enables compact, globally consistent reconstructions without relying on pretrained pixel-prediction backbones or reusing latent features from dense baselines. Utilizing a coarse-to-fine training curriculum that gradually increases decoded capacity, GlobalSplat natively prevents representation bloat. On RealEstate10K and ACID, our model achieves competitive novel-view synthesis performance while utilizing as few as 16K Gaussians, significantly fewer than dense pipelines require, obtaining a light 4MB footprint. Further, GlobalSplat enables significantly faster inference than the baselines, operating under 78 milliseconds in a single forward pass. Project page is available at this https URL.

10. 【2604.15281】R3D: Revisiting 3D Policy Learning

Link: https://arxiv.org/abs/2604.15281

Authors: Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao, Ran Ji, Yiyang He, Hangxing Zhang, Zundong Ke, Jun Wang, Guofeng Zhang, Jiayuan Gu

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: promises superior generalization, policy learning promises, learning promises superior, perception models, cross-embodiment transfer

Comments:

Click to view abstract

Abstract:3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: this https URL

11. 【2604.15280】Why Do Vision Language Models Struggle To Recognize Human Emotions?

Link: https://arxiv.org/abs/2604.15280

Authors: Madhav Agarwal, Sotirios A. Tsaftaris, Laura Sevilla-Lara, Steven McDonagh

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Understanding emotions, recognize human emotions, fundamental ability, ability for intelligent, intelligent systems

Comments:

Click to view abstract

Abstract:Understanding emotions is a fundamental ability for intelligent systems to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that dynamic facial expression recognition (DFER), an inherently continuous and dynamic task, exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into common categories. We propose alternative sampling strategies that prevent favoring common concepts. Second, temporal information is critical for understanding emotions. However, VLMs are unable to represent temporal information over dense frame sequences, as they are limited by context size and the number of tokens that can fit in memory, which poses a clear challenge for emotion recognition. We demonstrate that the sparse temporal sampling strategy used in VLMs is inherently misaligned with the fleeting nature of micro-expressions (0.25-0.5 seconds), which are often the most critical affective signal. As a diagnostic probe, we propose a multi-stage context enrichment strategy that utilizes the information from "in-between" frames by first converting them into natural language summaries. This enriched textual context is provided as input to the VLM alongside sparse keyframes, preventing attentional dilution from excessive visual data while preserving the emotional trajectory.

12. 【2604.15271】SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation

Link: https://arxiv.org/abs/2604.15271

Authors: Tianhao Fu, Austin Wang, Charles Chen, Roby Aldave-Garza, Yucheng Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: clinical decision support, Reliable uncertainty estimation, automated contours feed, contours feed downstream, feed downstream quantification

Comments:

Click to view abstract

Abstract:Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present SegWithU, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of 0.9838/2.4885, 0.9946/0.2660, and 0.9925/0.8193, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at this https URL.

13. 【2604.15239】TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

Link: https://arxiv.org/abs/2604.15239

Authors: Jiawei Ren, Michal Jan Tyszkiewicz, Jiahui Huang, Zan Gojcic

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modern Transformer-based approaches, key design choices, Gaussian Splatting, modern Transformer-based, Transformer-based approaches

Comments: Project page: [this https URL](https://research.nvidia.com/labs/toronto-ai/tokengs)

Click to view abstract

Abstract:In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.

14. 【2604.15237】StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

Link: https://arxiv.org/abs/2604.15237

Authors: Xuanyi Liu, Deyi Ji, Chunan Yu, Qi Zhu, Xuanfu Li, Jin Ma, Tianrun Chen, Lanyun Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: constant memory budget, continuous video streams, video streams requires, streams requires stable, requires stable inference

Comments:

Click to view abstract

Abstract:Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
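
The three-tier triage described above (keep, merge, evict) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: importance scores are taken as given, nearest-neighbor assignment uses plain Euclidean distance between key vectors, and merged tokens are folded into their anchor by a running mean. The function name `triage_cache` and its parameters are hypothetical.

```python
import math


def triage_cache(keys, values, scores, n_keep, n_merge):
    """Keep the top-scoring tokens, merge the next tier into them, drop the rest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = order[:n_keep]                    # tier 1: retained anchors
    merge = order[n_keep:n_keep + n_merge]   # tier 2: merged into anchors
    # Tier 3 (everything after n_keep + n_merge) is evicted implicitly.
    if not keep:
        return [], []

    anchor_keys = [list(keys[i]) for i in keep]
    anchor_vals = [list(values[i]) for i in keep]
    counts = [1] * len(keep)

    for i in merge:
        # Assign the token to the anchor whose original key is closest.
        j = min(range(len(keep)), key=lambda a: math.dist(keys[i], keys[keep[a]]))
        counts[j] += 1
        w = 1.0 / counts[j]  # running-mean weight (an assumed merge rule)
        anchor_keys[j] = [(1 - w) * x + w * y for x, y in zip(anchor_keys[j], keys[i])]
        anchor_vals[j] = [(1 - w) * x + w * y for x, y in zip(anchor_vals[j], values[i])]
    return anchor_keys, anchor_vals


if __name__ == "__main__":
    keys = [[0.0, 0.0], [10.0, 10.0], [1.0, 1.0], [50.0, 50.0]]
    values = [[1.0], [2.0], [3.0], [4.0]]
    scores = [0.9, 0.8, 0.5, 0.1]
    print(triage_cache(keys, values, scores, n_keep=2, n_merge=1))
```

The cache size after triage is always `n_keep`, which is what keeps memory constant regardless of stream length in this sketch.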

15. 【2604.15221】Vision-Based Safe Human-Robot Collaboration with Uncertainty Guarantees

Link: https://arxiv.org/abs/2604.15221

Authors: Jakob Thumm, Marian Frei, Tianle Ni, Matthias Althoff, Marco Pavone

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: certifiably safe human-robot, vision-based human pose, guarantees for certifiably, certifiably safe, conformal prediction guarantees

Comments:

Click to view abstract

Abstract:We propose a framework for vision-based human pose estimation and motion prediction that gives conformal prediction guarantees for certifiably safe human-robot collaboration. Our framework combines aleatoric uncertainty estimation with OOD detection for high probabilistic confidence. To integrate our pipeline in certifiable safety frameworks, we propose conformal prediction sets for human motion predictions with high, valid confidence. We evaluate our pipeline on recorded human motion data and a real-world human-robot collaboration setting.
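
For readers unfamiliar with conformal prediction sets, the following is a minimal sketch of standard split conformal prediction, not the paper's pipeline. It assumes the nonconformity score is the Euclidean error between a predicted and an observed position on a held-out calibration set, and that the prediction set is a ball around the point prediction. Both choices, and the function names, are illustrative assumptions.

```python
import math


def conformal_radius(cal_scores, alpha):
    """Return the split-conformal quantile of calibration scores.

    With n calibration scores, the ceil((n + 1) * (1 - alpha))-th smallest
    score gives marginal coverage of at least 1 - alpha under exchangeability.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


def in_prediction_set(prediction, point, radius):
    """Check whether a point lies inside the ball around the prediction."""
    return math.dist(prediction, point) <= radius


if __name__ == "__main__":
    # Toy calibration errors (hypothetical units).
    errors = list(range(1, 20))
    r = conformal_radius(errors, alpha=0.1)
    print(r, in_prediction_set((0.0, 0.0), (3.0, 4.0), r))
```

A safety layer could then treat the whole ball, rather than the point prediction, as potentially occupied by the human.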

16. 【2604.15196】Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization

Link: https://arxiv.org/abs/2604.15196

Authors: Umer Ahmed, Syed Ahmed Mahmood, Fawad Javed Fateh, M. Shaheer Luqman, M. Zeeshan Zia, Quoc-Huy Tran

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vector quantization framework, vector quantization, spatiotemporal vector quantization, hierarchical spatiotemporal vector, hierarchical

Comments:

Click to view abstract

Abstract:We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily exploiting spatial cues by reconstructing input skeletons. Next, we extend our approach by leveraging both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering, while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes a new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.
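
The two consecutive quantization levels can be pictured with a small sketch: frame features are snapped to the nearest subaction code, and each subaction code is then snapped to the nearest action code. This shows only the assignment step with fixed, hand-written codebooks. Codebook learning, the reconstruction objective, and the timestamp recovery described in the abstract are omitted, and all names are hypothetical.

```python
import math


def nearest_code(x, codebook):
    """Index of the codebook entry closest to x in Euclidean distance."""
    return min(range(len(codebook)), key=lambda k: math.dist(x, codebook[k]))


def hierarchical_quantize(frames, sub_codebook, action_codebook):
    """Assign frames to subactions, then subactions to actions."""
    sub_ids = [nearest_code(f, sub_codebook) for f in frames]
    action_ids = [nearest_code(sub_codebook[s], action_codebook) for s in sub_ids]
    return sub_ids, action_ids


if __name__ == "__main__":
    # Toy 2-D features standing in for skeleton embeddings.
    frames = [[0.1, 0.0], [0.9, 1.1], [5.0, 5.2], [6.1, 5.9]]
    sub_codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]]
    action_codebook = [[0.5, 0.5], [5.5, 5.5]]
    print(hierarchical_quantize(frames, sub_codebook, action_codebook))
```

Reading off runs of identical action indices along the frame axis yields a segmentation.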

17. 【2604.15188】VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

Link: https://arxiv.org/abs/2604.15188

Authors: Huawei Ji, Yuanhao Sun, Yuan Jin, Cheng Deng, Jiaxin Ding, Luoyi Fu, Xinbing Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords:

Comments:

Click to view abstract

None

18. 【2604.15173】Boundary-Centric Active Learning for Temporal Action Segmentation

Link: https://arxiv.org/abs/2604.15173

Authors: Halil Ismail Helvaci, Sen-ching Samson Cheung

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: segmentation errors concentrate, degrade segmental metrics, shifts disproportionately degrade, disproportionately degrade segmental, Temporal action segmentation

Comments:

Click to view abstract

Abstract:Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, where segmentation errors concentrate and small temporal shifts disproportionately degrade segmental metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these high-leverage boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queries unlabeled videos using predictive uncertainty, and (ii) within each selected video, it detects candidate transitions from the current model predictions and selects the top-$K$ boundaries via a novel boundary score that fuses neighborhood uncertainty, class ambiguity, and temporal predictive dynamics. Importantly, our annotation protocol requests labels for only the boundary frames while still training on boundary-centered clips to exploit temporal context through the model's receptive field. Extensive experiments on GTEA, 50Salads, and Breakfast demonstrate that boundary-centric supervision delivers strong label efficiency and consistently surpasses representative TAS active learning baselines and prior state of the art under sparse budgets, with the largest gains on datasets where boundary placement dominates edit and overlap-based F1 scores.
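
Stage (ii) of the loop above can be sketched as follows. Candidate transitions are frames where the predicted label changes, and the top-K are chosen by a score. The score here is only the mean predictive entropy in a small window around the boundary, a stand-in for the neighborhood-uncertainty term. The paper's actual score also fuses class ambiguity and temporal predictive dynamics, which are not modeled. Function names and the `window` parameter are assumptions.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def top_k_boundaries(probs, k, window=1):
    """Pick the k predicted transitions with the most uncertain neighborhood."""
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
    # A candidate boundary is any frame whose label differs from the previous one.
    candidates = [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]

    def score(t):
        lo, hi = max(0, t - window), min(len(probs), t + window + 1)
        return sum(entropy(p) for p in probs[lo:hi]) / (hi - lo)

    return sorted(candidates, key=score, reverse=True)[:k]


if __name__ == "__main__":
    # Toy frame-wise class probabilities for a two-class problem.
    probs = [
        [0.99, 0.01], [0.99, 0.01], [0.01, 0.99], [0.01, 0.99],
        [0.45, 0.55], [0.55, 0.45], [0.99, 0.01],
    ]
    print(top_k_boundaries(probs, k=1))
```

Only the selected boundary frames would be sent for annotation, while training can still use the surrounding clip for temporal context.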

19. 【2604.15171】An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation

Link: https://arxiv.org/abs/2604.15171

Authors: Onno Niemann, Gonzalo Martínez Muñoz, Alberto Suárez Gonzalez

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: denoising score matching, true data density, diffusion models trained, Recent work, violate the Fokker

Comments: Accepted at IJCNN 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Click to view abstract

Abstract:Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker-Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at this https URL.

20. 【2604.15170】OmniLight: One Model to Rule All Lighting Conditions

Link: https://arxiv.org/abs/2604.15170

Authors: Youngjin Oh, Junyoung Park, Junhyeong Kwon, Nam Ik Cho

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Adverse lighting conditions, computer vision systems, pose significant challenges, Adverse lighting, lighting conditions

Comments: CVPRW 2026; NTIRE 2026 Image Shadow Removal and Ambient Lighting Normalization Challenges (1st Perceptual Rank for White Lighting, 2nd Fidelity Rank and 4th Perceptual Rank for Color Lighting)

Click to view abstract

Abstract:Adverse lighting conditions, such as cast shadows and irregular illumination, pose significant challenges to computer vision systems by degrading visibility and color fidelity. Consequently, effective shadow removal and ambient lighting normalization (ALN) are critical for restoring underlying image content, improving perceptual quality, and facilitating robust performance in downstream tasks. However, while achieving state-of-the-art results on specific benchmarks is a primary goal in image restoration challenges, real-world applications often demand robust models capable of handling diverse domains. To address this, we present a comprehensive study on lighting-related image restoration by exploring two contrasting strategies. We leverage a robust framework for ALN, DINOLight, as a specialized baseline to exploit the characteristics of each individual dataset, and extend it to OmniLight, a generalized alternative incorporating our proposed Wavelet Domain Mixture-of-Experts (WD-MoE) that is trained across all provided datasets. Through a comparative analysis of these two methods, we discuss the impact of data distribution on the performance of specialized and unified architectures in lighting-related image restoration. Notably, both approaches secured top-tier rankings across all three lighting-related tracks in the NTIRE 2026 Challenge, demonstrating their outstanding perceptual quality and generalization capabilities. Our codes are available at this https URL.

21. 【2604.15166】Class Unlearning via Depth-Aware Removal of Forget-Specific Directions

Link: https://arxiv.org/abs/2604.15166

Authors: Arman Hatami, Romina Aalishah, Ilya E. Monosov

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Machine unlearning aims, remove targeted knowledge, Machine unlearning, targeted knowledge, trained model

Comments: Accepted to the CVPR 2026 Workshop on Machine Unlearning for Vision (MUV)

Click to view abstract

Abstract:Machine unlearning aims to remove targeted knowledge from a trained model without the cost of retraining from scratch. In class unlearning, however, reducing accuracy on forget classes does not necessarily imply true forgetting: forgotten information can remain encoded in internal representations, and apparent forgetting may arise from classifier-head suppression rather than representational removal. We show that existing class-unlearning methods often exhibit weak or negative selectivity, preserve forget-class structure in deep representations, or rely heavily on final-layer bias shifts. We then introduce DAMP (Depth-Aware Modulation by Projection), a one-shot, closed-form weight-surgery method that removes forget-specific directions from a pretrained network without gradient-based optimization. At each stage, DAMP computes class prototypes in the input space of the next learnable operator, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. To preserve utility, DAMP uses a parameter-free depth-aware scaling rule derived from probe separability, applying smaller edits in early layers and larger edits in deeper layers. The method naturally extends to multi-class forgetting through low-rank subspace removal. Across MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet, and across convolutional and transformer architectures, DAMP more closely resembles the retraining gold standard than some of the prior methods, improving selective forgetting while better preserving retain-class performance and reducing residual forget-class structure in deep layers.
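
A projection-based weight edit of the kind described above can be illustrated with a small sketch. It is not DAMP itself. The assumptions are: the forget direction is the forget-class prototype minus the mean retain-class prototype, normalized to unit length, and the update is W' = W - alpha * (W d) d^T, where `alpha` is a hand-set stand-in for the depth-aware scale that the paper derives from probe separability. All names are hypothetical.

```python
import math


def forget_direction(forget_proto, retain_protos):
    """Unit vector from the mean retain prototype toward the forget prototype."""
    dim = len(forget_proto)
    mean_retain = [sum(p[i] for p in retain_protos) / len(retain_protos) for i in range(dim)]
    residual = [f - m for f, m in zip(forget_proto, mean_retain)]
    norm = math.sqrt(sum(x * x for x in residual))
    if norm == 0:
        raise ValueError("forget prototype coincides with the retain mean")
    return [x / norm for x in residual]


def project_out(W, d, alpha=1.0):
    """Reduce each row's sensitivity to direction d: W' = W - alpha * (W d) d^T."""
    edited = []
    for row in W:
        dot = sum(w * x for w, x in zip(row, d))
        edited.append([w - alpha * dot * x for w, x in zip(row, d)])
    return edited


if __name__ == "__main__":
    d = forget_direction([2.0, 0.0], [[0.0, 0.0]])
    W = [[3.0, 4.0], [1.0, 2.0]]
    print(project_out(W, d, alpha=1.0))   # full removal of the direction
    print(project_out(W, d, alpha=0.5))   # partial edit, e.g. for an early layer
```

With `alpha=1.0` the edited layer's output no longer changes when its input moves along `d`. Smaller values leave part of that sensitivity in place.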

22. 【2604.15141】KVNN: Learnable Multi-Kernel Volterra Neural Networks

Link: https://arxiv.org/abs/2604.15141

Authors: Haoyu Yun, Hamid Krim, Yufang Bao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: exploiting compositional features, fundamentally rooted, rooted in exploiting, exploiting compositional, Volterra Neural Network

Comments:

Click to view abstract

Abstract:Higher-order learning is fundamentally rooted in exploiting compositional features. It clearly hinges on enriching the representation by more elaborate interactions of the data which, in turn, tends to increase the model complexity of conventional large-scale deep learning models. In this paper, a kernelized Volterra Neural Network (kVNN) is proposed. The key to the achieved efficiency lies in using a learnable multi-kernel representation, where different interaction orders are modeled by distinct polynomial-kernel components with compact, learnable centers, yielding an order-adaptive parameterization. Features are learned by the composition of layers, each of which consists of parallel branches of different polynomial orders, enabling kVNN filters to directly replace standard convolutional kernels within existing architectures. The theoretical results are substantiated by experiments on two representative tasks: video action recognition and image denoising. The results demonstrate favorable performance-efficiency trade-offs: kVNN consistently yields reduced model (parameters) and computational (GFLOPs) complexity with competitive and often improved performance. These results are maintained even when trained from scratch without large-scale pretraining. In summary, we substantiate that structured kernelized higher-order layers offer a practical path to balancing expressivity and computational cost in modern deep networks.

23. 【2604.15134】How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos

Link: https://arxiv.org/abs/2604.15134

Authors: Olga Loginova, Frank Keller

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reliable procedural monitoring, video requires exposure, naturally occurring human, Reliable procedural, Psychologically Inspired Error

Comments:

Click to view abstract

Abstract:Reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. In egocentric recordings, mistakes are often partially occluded by hands and revealed through subtle object state changes, while existing procedural datasets provide limited and inconsistent mistake and correction traces. We present PIE-V (Psychologically Inspired Error injection for Videos), a framework for constructing and benchmarking mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. PIE-V combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence and repairs failures. For video segment edits, PIE-V synthesizes replacement clips with text-guided video generation and stitches them into the episode to preserve visual plausibility. Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. For benchmarking, we introduce a unified taxonomy and a human rubric with nine metrics that cover step-level and procedure-level quality, including plausibility, procedure logic with annotator confidence, state change coherence, and grounding between text and video. Using this protocol, we audit several existing resources and compare PIE-V against a freeform LLM generation baseline under the same criteria. Together, the framework and rubric support post-completion verification for egocentric procedural mistake detection and correction.

24. 【2604.15096】Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography

Link: https://arxiv.org/abs/2604.15096

Authors: Simon Böhi, Irene Cannistraci, Sergio Muñoz Gonzalez, Moritz Vandenhirtz, Sonia Laguna, Samuel Ruiperez-Campillo, Max Krähenmann, Andrea Agostini, Ece Ozkan, Thomas M. Sutter, Julia E. Vogt

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: cardiac assessment due, pose distinct challenges, heart pose distinct, heterogeneous spatiotemporal views, Attention Masked Autoencoder

Comments: Accepted as a workshop paper at the ICLR 2026 Workshop on Foundation Models for Science

Click to view abstract

Abstract:Echocardiography is a widely used modality for cardiac assessment due to its non-invasive and cost-effective nature, but the sparse and heterogeneous spatiotemporal views of the heart pose distinct challenges. Existing masked autoencoder (MAE) approaches typically process images or short clips independently, failing to capture the inherent multi-view structure required for coherent cardiac representation. We introduce Latent Attention Masked Autoencoder (LAMAE), a foundation model architecture tailored to the multi-view nature of medical imaging. LAMAE augments the standard MAE with a latent attention module that enables information exchange across frames and views directly in latent space. This allows the model to aggregate variable-length sequences and distinct views, reconstructing a holistic representation of cardiac function from partial observations. We pretrain LAMAE on MIMIC-IV-ECHO, a large-scale, uncurated dataset reflecting real-world clinical variability. To the best of our knowledge, we present the first results for predicting ICD-10 codes from MIMIC-IV-ECHO videos. Furthermore, we empirically demonstrate that representations learned from adult data transfer effectively to pediatric cohorts despite substantial anatomical differences. These results provide evidence that incorporating structural priors, such as multi-view attention, yields significantly more robust and transferable representations.

25. 【2604.15093】OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

Link: https://arxiv.org/abs/2604.15093

Authors: Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, Anh Tuan Luu, Jianbing Zhang, Lewei Lu, Dahua Lin

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: demonstrated impressive capabilities, recent leading models, leading models achieving, marked performance leap, automating mobile tasks

Comments: Work in progress

Abstract:Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) a scalable task synthesis pipeline that constructs a global environment memory from exploration and then leverages it to generate diverse and grounded instructions, and (2) a policy-switching strategy for trajectory rollout that alternates between learner and expert models to capture essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at this https URL to bridge the data gap and facilitate broader mobile agent research.

26. 【2604.15090】Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID

Link: https://arxiv.org/abs/2604.15090

Authors: Jiaxuan Li, Xin Wen, Zhihang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Any-Time Person Re-identification, Person Re-identification, Semantic-driven Expert Routing, Expert Routing, daytime and nighttime

Comments: none

Abstract:Any-Time Person Re-identification (AT-ReID) necessitates the robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods rely heavily on pure visual features, which are prone to change due to environmental and time factors, resulting in significant performance deterioration under scenarios involving illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants and provides semantic guidance to the model. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results. Moreover, the model trained on AT-USTC was evaluated across 5 widely used ReID benchmarks, demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.

27. 【2604.15088】Building Extraction from Remote Sensing Imagery under Hazy and Low-light Conditions: Benchmark and Baseline

Link: https://arxiv.org/abs/2604.15088

Authors: Feifei Sang, Wei Lu, Hongruixuan Chen, Sibao Chen, Bin Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: optical Remote Sensing, Remote Sensing, optical Remote, imagery suffers, suffers from performance

Comments: 14 pages, 12 figures, 9 tables

Abstract:Building extraction from optical Remote Sensing (RS) imagery suffers from performance degradation under real-world hazy and low-light conditions. However, existing optical methods and benchmarks focus primarily on ideal clear-weather conditions. While SAR offers all-weather sensing, its side-looking geometry causes geometric distortions. To address these challenges, we introduce HaLoBuilding, the first optical benchmark specifically designed for building extraction under hazy and low-light conditions. By leveraging a same-scene multitemporal pairing strategy, we ensure pixel-level label alignment and high fidelity even under extreme degradation. Building upon this benchmark, we propose HaLoBuild-Net, a novel end-to-end framework for building extraction in adverse RS scenarios. At its core, we develop a Spatial-Frequency Focus Module (SFFM) to effectively mitigate meteorological interference on building features by coupling large receptive field attention with frequency-aware channel reweighting guided by stable low-frequency anchors. Additionally, a Global Multi-scale Guidance Module (GMGM) provides global semantic constraints to anchor building topologies, while a Mutual-Guided Fusion Module (MGFM) implements bidirectional semantic-spatial calibration to suppress shallow noise and sharpen weather-induced blurred boundaries. Extensive experiments demonstrate that HaLoBuild-Net significantly outperforms state-of-the-art methods and conventional cascaded restoration-segmentation paradigms on the HaLoBuilding dataset, while maintaining robust generalization on WHU, INRIA, and LoveDA datasets. The source code and datasets are publicly available at: this https URL.

28. 【2604.15086】ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

Link: https://arxiv.org/abs/2604.15086

Authors: Jianxuan Yang, Xinyue Guo, Zhi Cheng, Kai Wang, Lipan Zhang, Jinjie Hu, Qiang Ji, Yihua Cao, Yihao Meng, Zhaoyue Cui, Mengmei Liu, Meng Meng, Jian Luan

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: controllability remains challenging, Recent advances, fine-grained controllability remains, high-quality audio synthesis, remains challenging

Comments: none

Abstract:Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict. Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system. Code, models, datasets, and demos are available at: this https URL.

29. 【2604.15065】Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection

Link: https://arxiv.org/abs/2604.15065

Authors: Yangchen Zeng, Zhenyu Yu, Dongming Jiang, Wenbo Zhang, Yifan Hong, Zhanhua Hu, Jiao Luo, Kangning Cui

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced small-object detection, refine low-quality queries, Transformer-based detectors, background-induced query noise, Embedding Learning Paradigm

Comments: Accepted to ACM ICMR 2026; 14 pages, 6 figures, and 4 tables

Abstract:Transformer-based detectors have advanced small-object detection, but they often remain inefficient and vulnerable to background-induced query noise, which motivates deep decoders to refine low-quality queries. We present HELP (Heatmap-guided Embedding Learning Paradigm), a noise-aware positional-semantic fusion framework that studies where to embed positional information by selectively preserving positional encodings in foreground-salient regions while suppressing background clutter. Within HELP, we introduce Heatmap-guided Positional Embedding (HPE) as the core embedding mechanism and visualize it with a heatbar for interpretable diagnosis and fine-tuning. HPE is integrated into both the encoder and decoder: it guides noise-suppressed feature encoding by injecting heatmap-aware positional encoding, and it enables high-quality query retrieval by filtering background-dominant embeddings via a gradient-based mask filter before decoding. To address feature sparsity in complex small targets, we integrate Linear-Snake Convolution to enrich retrieval-relevant representations. The gradient-based heatmap supervision is used during training only, incurring no additional gradient computation at inference. As a result, our design reduces decoder layers from eight to three and achieves a 59.4% parameter reduction (66.3M vs. 163M) while maintaining consistent accuracy gains under a reduced compute budget across benchmarks. Code Repository: this https URL

30. 【2604.15059】Attention-Gated Convolutional Networks for Scanner-Agnostic Quality Assessment

Link: https://arxiv.org/abs/2604.15059

Authors: Chinmay Bakhale, Anil Sao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale automated analysis, compromising clinical diagnostics, automated analysis, present a significant, significant challenge

Comments: none

Abstract:Motion artifacts present a significant challenge in structural MRI (sMRI), often compromising clinical diagnostics and large-scale automated analysis. While manual quality control (QC) remains the gold standard, it is increasingly unscalable for massive longitudinal studies. To address this, we propose a hybrid CNN-Attention framework designed for robust, site-invariant MRI quality assessment. Our architecture integrates a hierarchical 2D CNN encoder for local spatial feature extraction with a multi-head cross-attention mechanism to model global dependencies. This synergy enables the model to prioritize motion-relevant artifact signatures, such as ringing and blurring, while dynamically filtering out site-specific intensity variations and background noise. The framework was trained end-to-end on the MR-ART dataset using a balanced cohort of 200 subjects. Performance was evaluated across two tiers: Seen Site Evaluation on a held-out MR-ART partition and Unseen Site Evaluation using 200 subjects from 17 heterogeneous sites in the ABIDE archive. On seen sites, the model achieved a scan-level accuracy of 0.9920 and an F1-score of 0.9919. Crucially, it maintained strong generalization across unseen ABIDE sites (Acc = 0.755) without any retraining or fine-tuning, demonstrating high resilience to domain shift. These results indicate that attention-based feature re-weighting successfully captures universal artifact descriptors, bridging the performance gap between diverse imaging environments and scanner manufacturers.

31. 【2604.15047】Implicit Neural Representations: A Signal Processing Perspective

Link: https://arxiv.org/abs/2604.15047

Authors: Dhananjaya Jayasundara, Vishal M. Patel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Implicit neural representations, Implicit neural, mark a fundamental, fundamental shift, continuous functional representations

Comments: none

Abstract:Implicit neural representations (INRs) mark a fundamental shift in signal modeling, moving from discrete sampled data to continuous functional representations. By parameterizing signals as neural networks, INRs provide a unified framework for representing images, audio, video, 3D geometry, and beyond as continuous functions of their coordinates. This functional viewpoint enables signal operations such as differentiation to be carried out analytically through automatic differentiation rather than through discrete approximations. In this article, we examine the evolution of INRs from a signal processing perspective, emphasizing spectral behavior, sampling theory, and multiscale representation. We trace the progression from standard coordinate-based networks, which exhibit a spectral bias toward low-frequency components, to more advanced designs that reshape the approximation space through specialized activations, including periodic, localized, and adaptive functions. We also discuss structured representations, such as hierarchical decompositions and hash grid encodings, that improve spatial adaptivity and computational efficiency. We further highlight the utility of INRs across a broad range of applications, including inverse problems in medical and radar imaging, compression, and 3D scene representation. By interpreting INRs as learned signal models whose approximation spaces adapt to the underlying data, this article clarifies the field's core conceptual developments and outlines open challenges in theoretical stability, weight-space interpretability, and large-scale generalization.

32. 【2604.15038】When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

Link: https://arxiv.org/abs/2604.15038

Authors: Khalid Adnan Alsayed

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: automated risk assessment, healthcare decision-making, high-stakes applications, fairness, central concern

Comments: 15 pages, 4 figures, 5 tables

Abstract:The evaluation of fairness in machine learning systems has become a central concern in high-stakes applications, including biometric recognition, healthcare decision-making, and automated risk assessment. Existing approaches typically rely on a small number of fairness metrics to assess model behaviour across group partitions, implicitly assuming that these metrics provide consistent and reliable conclusions. However, different fairness metrics capture distinct statistical properties of model performance and may therefore produce conflicting assessments when applied to the same system. In this work, we investigate the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. Using face recognition as a controlled experimental setting, we evaluate model performance across multiple group partitions under a range of commonly used fairness metrics, including error-rate disparities and performance-based measures. Our results demonstrate that fairness assessments can vary significantly depending on the choice of metrics, leading to contradictory conclusions regarding model bias. To quantify this phenomenon, we introduce the Fairness Disagreement Index (FDI), a measure designed to capture the degree of inconsistency across fairness metrics. We further show that disagreement remains high across thresholds and model configurations. These findings highlight a critical limitation in current fairness evaluation practices and suggest that single-metric reporting is insufficient for reliable bias assessment.

33. 【2604.15027】Quality-Aware Calibration for AI-Generated Image Detection in the Wild

Link: https://arxiv.org/abs/2604.15027

Authors: Fabrizio Guillaro, Vincenzo De Rosa, Davide Cozzolino, Luisa Verdoliva

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: existing approaches operate, Significant progress, lose quality due, detecting synthetic images, resizing and cropping

Comments: Accepted at the APAI Workshop at CVPR 2026

Abstract:Significant progress has been made in detecting synthetic images; however, most existing approaches operate on a single image instance and overlook a key characteristic of real-world dissemination: as viral images circulate on the web, multiple near-duplicate versions appear and lose quality due to repeated operations like recompression, resizing and cropping. As a consequence, the same image may yield inconsistent forensic predictions depending on which version is analyzed. In this work, to address this issue, we propose QuAD (Quality-Aware calibration with near-Duplicates), a novel framework that makes decisions based on all available near-duplicates of the same image. Given a query, we retrieve its online near-duplicates and feed them to a detector: the resulting scores are then aggregated based on the estimated quality of the corresponding instance. By doing so, we take advantage of all pieces of information while accounting for the reduced reliability of images impaired by multiple processing steps. To support large-scale evaluation, we introduce two datasets: AncesTree, an in-lab dataset of 136k images organized in stochastic degradation trees that simulate online reposting dynamics, and ReWIND, a real-world dataset of nearly 10k near-duplicate images collected from viral web content. Experiments on several state-of-the-art detectors show that our quality-aware fusion improves their performance consistently, with an average gain of around 8% in terms of balanced accuracy compared to plain averaging. Our results highlight the importance of jointly processing all the images available online to achieve reliable detection of AI-generated content in real-world applications. Code and data are publicly available at this https URL
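
As a rough illustration of the idea of aggregating detector scores by estimated quality, the sketch below contrasts a plain average with a quality-weighted average. The weighting rule, the function name `quality_weighted_score`, and all numbers are assumptions made for illustration; the abstract does not specify how QuAD calibrates or fuses scores.

```python
import numpy as np

def quality_weighted_score(scores, qualities):
    """Fuse per-image detector scores using normalized quality weights.

    Hypothetical fusion rule: weights are the quality estimates divided by
    their sum. This is NOT claimed to be the rule used by QuAD.
    """
    scores = np.asarray(scores, dtype=np.float64)
    qualities = np.asarray(qualities, dtype=np.float64)
    if scores.shape != qualities.shape or scores.size == 0:
        raise ValueError("scores and qualities must be non-empty and aligned")
    if np.any(qualities < 0) or qualities.sum() == 0:
        raise ValueError("qualities must be non-negative with a positive sum")
    weights = qualities / qualities.sum()
    return float(np.dot(weights, scores))

# Made-up detector scores for three near-duplicates of one image,
# ordered from least to most degraded, with made-up quality estimates.
detector_scores = [0.9, 0.6, 0.3]
quality_estimates = [1.0, 0.5, 0.1]

plain = float(np.mean(detector_scores))
fused = quality_weighted_score(detector_scores, quality_estimates)
```

In this toy case the heavily degraded copy contributes little to the fused score, so the result stays closer to the prediction on the best-preserved version.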

34. 【2604.15003】Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

Link: https://arxiv.org/abs/2604.15003

Authors: Yuzhuo Chen, Zehua Ma, Han Fang, Hengyi Wang, Guanjie Wang, Weiming Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enables realistic videos, generation enables realistic, rapid rise, enables realistic, realistic videos

Comments: none

Abstract:The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present Flow of Truth, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as the motion of pixels through time rather than the synthesis of frames. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.

35. 【2604.14973】Robustness of Vision Foundation Models to Common Perturbations

Link: https://arxiv.org/abs/2604.14973

Authors: Hongbin Liu, Zhengyuan Jiang, Cheng Hong, Neil Zhenqiang Gong

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: JPEG compression, common editing operations, contrast adjustments, editing operations, JPEG

Comments: Accepted by CVPR 2026 Workshop

Abstract:A vision foundation model outputs an embedding vector for an image, which can be affected by common editing operations (e.g., JPEG compression, brightness and contrast adjustments). These common perturbations alter embedding vectors and may impact the performance of downstream tasks using these embeddings. In this work, we present the first systematic study on foundation models' robustness to such perturbations. We propose three robustness metrics and formulate five desired mathematical properties for these metrics, analyzing which properties they satisfy or violate. Using these metrics, we evaluate six industry-scale foundation models (OpenAI, Meta) across nine common perturbation categories, finding them generally non-robust. We also show that common perturbations degrade downstream application performance (e.g., classification accuracy) and that robustness values can predict performance impacts. Finally, we propose a fine-tuning approach to improve robustness without sacrificing utility.

36. 【2604.14967】UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Link: https://arxiv.org/abs/2604.14967

Authors: Jun Wang, Shuo Tan, Zelong Sun, Tiancheng Gu, Yongle Zhao, Ziyong Feng, Kaicheng Yang, Cewu Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: extends Large Vision-Language, Large Vision-Language Models, Retrieval-Augmented Generation, extends Large, Large Vision-Language

Comments: 17 pages, 11 figures

Abstract:Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
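
The abstract notes that GRPO works without a separate value network. A minimal sketch of the group-relative advantage computation usually associated with GRPO is given below: rewards of several rollouts for the same query are standardized within the group. The function name and the reward values are illustrative assumptions, and the paper's dense multi-reward scheme is not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same query.

    The group mean acts as the baseline, which is why no learned value
    network is needed. `eps` guards against a zero standard deviation.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Made-up scalar rewards for four rollouts of a single query.
example_rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(example_rewards)
```

Rollouts that score above the group mean receive positive advantages and the rest receive negative ones; a group with identical rewards yields all-zero advantages.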

37. 【2604.14958】Frequency-Enhanced Dual-Subspace Networks for Few-Shot Fine-Grained Image Classification

Link: https://arxiv.org/abs/2604.14958

Authors: Meijia Wang, Guochao Wang, Haozhen Chu, Bin Yao, Weichuan Zhang, Yuan Wang, Junpo Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image classification aims, annotated samples, aims to recognize, recognize subcategories, limited number

Comments: none

Abstract:Few-shot fine-grained image classification aims to recognize subcategories with high visual similarity using only a limited number of annotated samples. Existing metric learning-based methods typically rely solely on spatial domain features. Confined to this single perspective, models inevitably suffer from inherent texture biases, entangling essential structural details with high-frequency background noise. Furthermore, lacking cross-view geometric constraints, single-view metrics tend to overfit this noise, resulting in structural instability under few-shot conditions. To address these issues, this paper proposes the Frequency-Enhanced Dual-Subspace Network (FEDSNet). Specifically, FEDSNet utilizes the Discrete Cosine Transform (DCT) and a low-pass filtering mechanism to explicitly isolate low-frequency global structural components from spatial features, thereby suppressing background interference. Truncated Singular Value Decomposition (SVD) is employed to construct independent, low-rank linear subspaces for both spatial texture and frequency structural features. An adaptive gating mechanism is designed to dynamically fuse the projection distances from these dual views. This strategy leverages the structural stability of the frequency subspace to prevent the spatial subspace from overfitting to background features. Extensive experiments on four benchmark datasets - CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC-Aircraft - demonstrate that FEDSNet exhibits excellent classification performance and robustness, achieving highly competitive results compared to existing metric learning algorithms. Complexity analysis further confirms that the proposed network achieves a favorable balance between high accuracy and computational efficiency, providing an effective new paradigm for few-shot fine-grained visual recognition.
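
The two building blocks named in the abstract, DCT low-pass filtering and truncated-SVD subspaces with projection distances, can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names, the square low-frequency mask, and the uncentered subspace are choices made here, and the adaptive gating that fuses the two views is omitted. It is not the authors' implementation.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] /= np.sqrt(2.0)
    return mat

def low_pass_dct(feature_map, keep):
    """Zero all 2D DCT coefficients outside the top-left keep x keep block."""
    h, w = feature_map.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeffs = dh @ feature_map @ dw.T          # forward 2D DCT
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0                  # assumed square low-pass mask
    return dh.T @ (coeffs * mask) @ dw        # inverse 2D DCT

def subspace_basis(support, rank):
    """Return a (dim, rank) orthonormal basis from support rows via truncated SVD."""
    u, _, _ = np.linalg.svd(support.T, full_matrices=False)
    return u[:, :rank]

def projection_distance(query, basis):
    """Squared norm of the component of `query` outside the subspace."""
    residual = query - basis @ (basis.T @ query)
    return float(np.linalg.norm(residual) ** 2)

# Toy class subspace spanned by two support features in a 3-D feature space.
toy_support = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
toy_basis = subspace_basis(toy_support, rank=2)
```

A query is assigned to the class whose subspace gives the smallest projection distance; in the toy example any vector in the plane of the two support features has distance zero.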

38. 【2604.14953】Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation

Link: https://arxiv.org/abs/2604.14953

Authors: Hassan Ali, Doreen Jirak, Luca Müller, Stefan Wermter

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image processing approaches, unlike NLP, Gesture recognition research, acute data scarcity, costly human recordings

Comments: Accepted at 2026 International Conference on Automatic Face and Gesture Recognition (FG)

Abstract:Gesture recognition research, unlike NLP, continues to face acute data scarcity, with progress constrained by the need for costly human recordings or image processing approaches that cannot generate authentic variability in the gestures themselves. Recent advancements in image-to-video foundation models have enabled the generation of photorealistic, semantically rich videos guided by natural language. These capabilities open up new possibilities for creating effort-free synthetic data, raising the critical question of whether video Generative AI models can augment and complement traditional human-generated gesture data. In this paper, we introduce and analyze prompt-based video generation to construct a realistic deictic gestures dataset and rigorously evaluate its effectiveness for downstream tasks. We propose a data generation pipeline that produces deictic gestures from a small number of reference samples collected from human participants, providing an accessible approach that can be leveraged both within and beyond the machine learning community. Our results demonstrate that the synthetic gestures not only align closely with real ones in terms of visual fidelity but also introduce meaningful variability and novelty that enrich the original data, further supported by superior performance of various deep models using a mixed dataset. These findings highlight that image-to-video techniques, even in their early stages, offer a powerful zero-shot approach to gesture synthesis with clear benefits for downstream tasks.

39. 【2604.14951】RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Link: https://arxiv.org/abs/2604.14951

Authors: Gabriele Mattioli, Evelyn Turri, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: Large Language Models, standalone language generation, invoke external resources, Multimodal Large Language, foundation models aims

Comments: ICPR 2026

Abstract:Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.

40. 【2604.14944】HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps

Link: https://arxiv.org/abs/2604.14944

Authors: Jongbin Lim, Taeyun Ha, Mingi Choi, Jisoo Kim, Byungjun Kim, Subin Jeon, Hanbyul Joo

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: grasping sequences featuring, sequences featuring, diverse robotic hands, high-fidelity dexterous grasping, dexterous grasping sequences

Comments: none

Abstract:We present HRDexDB, a large-scale, multi-modal dataset of high-fidelity dexterous grasping sequences featuring both human and diverse robotic hands. Unlike existing datasets, HRDexDB provides a comprehensive collection of grasping trajectories across human hands and multiple robot hand embodiments, spanning 100 diverse objects. Leveraging state-of-the-art vision methods and a new dedicated multi-camera system, our HRDexDB offers high-precision spatiotemporal 3D ground-truth motion for both the agent and the manipulated object. To facilitate the study of physical interaction, HRDexDB includes high-resolution tactile signals, synchronized multi-view video, and egocentric video streams. The dataset comprises 1.4K grasping trials, encompassing both successes and failures, each enriched with visual, kinematic, and tactile modalities. By providing closely aligned captures of human dexterity and robotic execution on the same target objects under comparable grasping motions, HRDexDB serves as a foundational benchmark for multi-modal policy learning and cross-domain dexterous manipulation.

41. 【2604.14933】Generative Data Augmentation for Skeleton Action Recognition

Link: https://arxiv.org/abs/2604.14933

Authors: Xu Dong, Wanqing Li, Anthony Adeyemi-Ejeye, Andrew Gilbert

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understanding human behaviour, Skeleton-based human action, human action recognition, understanding human, human behaviour

Comments: Accepted at IEEE FG 2026

Abstract: Skeleton-based human action recognition is a powerful approach for understanding human behaviour from pose data, but collecting large-scale, diverse, and well-annotated 3D skeleton datasets is both expensive and labor-intensive. To address this challenge, we propose a conditional generative pipeline for data augmentation in skeleton action recognition. Our method learns the distribution of real skeleton sequences under the constraint of action labels, enabling the synthesis of diverse and high-fidelity data. Even with limited training samples, it can effectively generate skeleton sequences and achieve competitive recognition performance in low-data scenarios, demonstrating strong generalisation in downstream tasks. Specifically, we introduce a Transformer-based encoder-decoder architecture, combined with a generative refinement module and a dropout mechanism, to balance fidelity and diversity during sampling. Experiments on HumanAct12 and the refined NTU-RGBD (NTU-VIBE) dataset show that our approach consistently improves the accuracy of multiple skeleton-based action recognition models, validating its effectiveness in both few-shot and full-data settings. The source code can be found here.

42. 【2604.14928】Hybrid Latents -- Geometry-Appearance-Aware Surfel Splatting

Link: https://arxiv.org/abs/2604.14928

Authors: Neel Kelkar, Simon Niedermayr, Klaus Engel, Rüdiger Westermann

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: introduce a hybrid, radiance representation, representation for reconstructing, multi-view images, Gaussian scene models

Comments: 22 pages, 9 figures

Abstract:We introduce a hybrid Gaussian-hash-grid radiance representation for reconstructing 2D Gaussian scene models from multi-view images. Similar to NeST splatting, our approach reduces the entanglement between geometry and appearance common in NeRF-based models, but adds per-Gaussian latent features alongside hash-grid features to bias the optimizer toward a separation of low- and high-frequency scene components. This explicit frequency-based decomposition reduces the tendency of high-frequency texture to compensate for geometric errors. Encouraging Gaussians with hard opacity falloffs further strengthens the separation between geometry and appearance, improving both geometry reconstruction and rendering efficiency. Finally, probabilistic pruning combined with a sparsity-inducing BCE opacity loss allows redundant Gaussians to be turned off, yielding a minimal set of Gaussians sufficient to represent the scene. Using both synthetic and real-world datasets, we compare against the state of the art in Gaussian-based novel-view synthesis and demonstrate superior reconstruction fidelity with an order of magnitude fewer primitives.

43. 【2604.14927】STEP-Parts: Geometric Partitioning of Boundary Representations for Large-Scale CAD Processing

Link: https://arxiv.org/abs/2604.14927

Authors: Shen Fan, Mikołaj Kida, Przemyslaw Musialski

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: discretize Boundary Representations, consistent instance-level analysis, weakening consistent instance-level, Boundary Representations, pipelines discretize Boundary

Abstract: Many CAD learning pipelines discretize Boundary Representations (B-Reps) into triangle meshes, discarding analytic surface structure and topological adjacency and thereby weakening consistent instance-level analysis. We present STEP-Parts, a deterministic CAD-to-supervision toolchain that extracts geometric instance partitions directly from raw STEP B-Reps and transfers them to tessellated carriers through retained source-face correspondence, yielding instance labels and metadata for downstream learning and evaluation. The construction merges adjacent B-Rep faces only when they share the same analytic primitive type and satisfy a near-tangent continuity criterion. On ABC, same-primitive dihedral angles are strongly bimodal, yielding a threshold-insensitive low-angle regime for part extraction. Because the partition is defined on intrinsic B-Rep topology rather than on a particular triangulation, the resulting boundaries remain stable under changes in tessellation. Applied to the DeepCAD subset of ABC, the pipeline processes approximately 180,000 models in under six hours on a consumer CPU. We release code and precomputed labels, and show that STEP-Parts serves both as a tessellation-robust geometric reference and as a useful supervision source in two downstream probes: an implicit reconstruction-segmentation network and a dataset-level point-based backbone.
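The merge rule in this abstract (join adjacent faces only when they share a primitive type and meet at a near-tangent angle) maps naturally onto a union-find pass over the face adjacency graph. The sketch below assumes faces are given as a list of primitive-type strings and adjacency as `(face_a, face_b, dihedral_angle_deg)` tuples; that input format and the 10-degree threshold are illustrative assumptions, not values from the paper.

```python
def partition_faces(face_types, adjacency, max_angle_deg=10.0):
    """Group faces into parts; returns one integer part label per face.

    face_types: list of primitive-type names, one per face (assumed format).
    adjacency:  iterable of (face_a, face_b, dihedral_angle_deg) tuples.
    """
    parent = list(range(len(face_types)))

    def find(i):
        # Path-halving find: walk up to the root while shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, angle in adjacency:
        # Merge only same-primitive neighbours that meet almost tangentially.
        if face_types[a] == face_types[b] and angle <= max_angle_deg:
            parent[find(a)] = find(b)

    # Relabel roots as consecutive part ids in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(face_types))]
```

Since the labels are computed on face adjacency rather than on triangles, re-tessellating the model at a different resolution would not change them, which matches the stability argument in the abstract.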

44. 【2604.14914】Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

Link: https://arxiv.org/abs/2604.14914

Authors: Victoria Yue Chen, Emery Pierson, Léopold Maillard, Maks Ovsjanikov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unlocking numerous applications, Text-driven inversion, style transfer, generative models, paradigm for manipulating

Abstract: Text-driven inversion of generative models is a core paradigm for manipulating 2D or 3D content, unlocking numerous applications such as text-based editing, style transfer, or inverse problems. However, it relies on the assumption that generative models remain sensitive to natural language prompts. We demonstrate that for state-of-the-art native text-to-3D generative models, this assumption often collapses. We identify a critical failure mode where generation trajectories are drawn into latent "sink traps": regions where the model becomes insensitive to prompt modifications. In these regimes, changes to the input text fail to alter internal representations in a way that alters the output geometry. Crucially, we observe that this is not a limitation of the model's geometric expressivity; the same generative models possess the ability to produce a vast diversity of shapes but, as we demonstrate, become insensitive to out-of-distribution text guidance. We investigate this behavior by analyzing the sampling trajectories of the generative model, and find that complex geometries can still be represented and produced by leveraging the model's unconditional generative prior. This leads to a more robust framework for text-based 3D shape editing that bypasses latent sinks by decoupling a model's geometric representation power from its linguistic sensitivity. Our approach addresses the limitations of current 3D pipelines and enables high-fidelity semantic manipulation of out-of-distribution 3D shapes. Project webpage: this https URL

45. 【2604.14910】Reward-Aware Trajectory Shaping for Few-step Visual Generation

Link: https://arxiv.org/abs/2604.14910

Authors: Rui Li, Bingyu Li, Yuanzhi Liang, HuangHai Bin, Chi Zhang, XueLong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Achieving high-fidelity generation, Achieving high-fidelity, generative modeling, extremely few sampling, sampling steps

Abstract: Achieving high-fidelity generation in extremely few sampling steps has long been a central goal of generative modeling. Existing approaches largely rely on distillation-based frameworks to compress the original multi-step denoising process into a few-step generator. However, such methods inherently constrain the student to imitate a stronger multi-step teacher, imposing the teacher as an upper bound on student performance. We argue that introducing preference alignment awareness enables the student to optimize toward reward-preferred generation quality, potentially surpassing the teacher instead of being restricted to rigid teacher imitation. To this end, we propose Reward-Aware Trajectory Shaping (RATS), a lightweight framework for preference-aligned few-step generation. Specifically, teacher and student latent trajectories are aligned at key denoising stages through horizon matching, while a reward-aware gate is introduced to adaptively regulate teacher guidance based on their relative reward performance. Trajectory shaping is strengthened when the teacher achieves higher rewards, and relaxed when the student matches or surpasses the teacher, thereby enabling continued reward-driven improvement. By seamlessly integrating trajectory distillation, reward-aware gating, and preference alignment, RATS effectively transfers preference-relevant knowledge from high-step generators without incurring additional test-time computational overhead. Experimental results demonstrate that RATS substantially improves the efficiency-quality trade-off in few-step visual generation, significantly narrowing the gap between few-step students and stronger multi-step generators.
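The abstract does not give the gate's functional form, so the following is only one plausible reading of "strengthen shaping when the teacher scores higher, relax it otherwise". The `tanh` shape, the `temperature` parameter, and the way the gate combines the two loss terms are all assumptions made for illustration.

```python
import math

def reward_aware_gate(teacher_reward, student_reward, temperature=1.0):
    """Weight in [0, 1) on the teacher-imitation term (hypothetical form).

    Zero when the student matches or beats the teacher; grows smoothly
    with the teacher's reward margin otherwise.
    """
    margin = (teacher_reward - student_reward) / temperature
    if margin <= 0:
        return 0.0
    return math.tanh(margin)

def shaped_loss(trajectory_loss, reward_loss, teacher_reward, student_reward):
    """Combine a gated trajectory-matching term with a reward-driven term."""
    gate = reward_aware_gate(teacher_reward, student_reward)
    return gate * trajectory_loss + reward_loss
```

With this form, a student that already outscores the teacher is trained on the reward term alone, which is one way to avoid the teacher acting as an upper bound.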

46. 【2604.14902】ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

Link: https://arxiv.org/abs/2604.14902

Authors: Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen, Min Sun, Winston Hsu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: involve unexpected conditions, Intelligent embodied agents, simply follow instructions, Intelligent embodied, conditions and exceptions

Abstract:Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.

47. 【2604.14888】Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models

Link: https://arxiv.org/abs/2604.14888

Authors: Danae Sánchez Villegas, Samuel Lewis-Lim, Nikolaos Aletras, Desmond Elliott

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: information remains unclear, Recent advances, offer reasoning capabilities, vision language models, textual information remains

Abstract:Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two different model families. We track confidence over Chain-of-Thought (CoT), measure the corrective effect of reasoning, and evaluate the contribution of intermediate reasoning steps. We find that models are prone to answer inertia, in which early commitments to a prediction are reinforced, rather than revised during reasoning steps. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions, from text-dominant to vision-only settings. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient, and assess whether this influence is recoverable from CoT. Although this influence can appear in the CoT, its detectability varies across models and depends on what is being monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer and fluent CoTs can still appear visually grounded while actually following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. Taken together, these findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.

48. 【2604.14884】FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection

Link: https://arxiv.org/abs/2604.14884

Authors: Jianchao Huang, Fengming Zhang, Haibo Zhu, Tao Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Small object detection, complex background interference, object detection remains, significant challenge due, Small object

Comments: 6 pages, 6 figures, accepted to IJCNN 2026

Abstract: Small object detection remains a significant challenge due to feature degradation from downsampling, mutual occlusion in dense clusters, and complex background interference. To address these issues, this paper proposes FSDETR, a frequency-spatial feature enhancement framework built upon the RT-DETR baseline. By establishing a collaborative modeling mechanism, the method effectively leverages complementary structural information. Specifically, a Spatial Hierarchical Attention Block (SHAB) captures both local details and global dependencies to strengthen semantic representation. Furthermore, to mitigate occlusion in dense scenes, the Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) focuses on informative regions via dynamic sampling. Finally, the Frequency-Spatial Feature Pyramid Network (FSFPN) integrates frequency filtering with spatial edge extraction via the Cross-domain Frequency-Spatial Block (CFSB) to preserve fine-grained details. Experimental results show that with only 14.7M parameters, FSDETR achieves 13.9% AP_S on VisDrone 2019 and 48.95% AP_50^tiny on TinyPerson, indicating strong performance on small-object benchmarks. The code and models are available at this https URL.

49. 【2604.14874】Open-Set Vein Biometric Recognition with Deep Metric Learning

Link: https://arxiv.org/abs/2604.14874

Authors: Paweł Pilarek, Marcel Musiałek, Anna Górska

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep Metric Learning, recognition methods rely, complete model retraining, methods rely, inherently limits

Comments: This preprint has not undergone peer review (when applicable) or any post-submission improvements or corrections. The Version of Record of this contribution is published in International Conference on Computational Science (ICCS 2026), and is available online at [this https URL](https://doi.org/) [pending]

Abstract:Most state-of-the-art vein recognition methods rely on closed-set classification, which inherently limits their scalability and prevents the adaptive enrollment of new users without complete model retraining. We rigorously evaluate the computational boundaries of Deep Metric Learning (DML) under strict open-set constraints. Unlike standard closed-set approaches, we analyze the impact of data scarcity and domain shift on recognition performance. Our approach learns discriminative L2-normalised embeddings and employs prototype-based matching with a calibrated similarity threshold to effectively distinguish between enrolled users and unseen impostors. We evaluate the framework under a strict subject-disjoint protocol across four diverse datasets covering finger, wrist, and dorsal hand veins (MMCBNU 6000, UTFVP, FYO, and a dorsal hand-vein dataset). On the large-scale MMCBNU 6000 benchmark, our best model (ResNet50-CBAM) achieves an OSCR of 0.9945, AUROC of 0.9974, and EER of 1.57%, maintaining high identification accuracy (99.6% Rank-1) while robustly rejecting unknown subjects. Cross-dataset experiments evaluate the framework's generalisation across different acquisition setups, confirming that while the model handles large-scale data robustly, performance remains sensitive to domain shifts in low-data regimes. Ablation studies demonstrate that triplet-based objectives combined with a simple 1-NN classifier offer an optimal trade-off between accuracy and efficiency, enabling real-time deployment on commodity hardware.
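The matching stage described here (L2-normalised embeddings, per-user prototypes, and a similarity threshold that rejects impostors) can be sketched in a few lines. The embeddings below are toy 2-D vectors and the threshold of 0.8 is an arbitrary placeholder; in practice the embeddings would come from the trained network and the threshold would be calibrated on held-out data.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def build_prototypes(enrolled):
    """Average each user's normalised embeddings into one unit-length prototype.

    enrolled: dict mapping user id -> list of embedding vectors.
    """
    prototypes = {}
    for user, embeddings in enrolled.items():
        normed = [l2_normalize(e) for e in embeddings]
        mean = [sum(col) / len(normed) for col in zip(*normed)]
        prototypes[user] = l2_normalize(mean)
    return prototypes

def identify(embedding, prototypes, threshold=0.8):
    """Return the best-matching enrolled user, or None for an unknown subject."""
    query = l2_normalize(embedding)
    best_user, best_sim = None, -1.0
    for user, proto in prototypes.items():
        sim = sum(a * b for a, b in zip(query, proto))  # cosine similarity
        if sim > best_sim:
            best_user, best_sim = user, sim
    return best_user if best_sim >= threshold else None
```

Enrolling a new user only adds an entry to the prototype dictionary, so no retraining is needed, which is the scalability argument the abstract makes against closed-set classifiers.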

50. 【2604.14866】MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Link: https://arxiv.org/abs/2604.14866

Authors: Meng-Xun Li, Wen-Hui Deng, Zhi-Xing Wu, Chun-Xiao Jin, Jia-Min Wu, Yue Han, James Kit Hon Tsoi, Gui-Song Xia, Cui Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: demonstrated significant potential, remains largely underexplored, largely underexplored due, photography remains largely, intraoral photography remains

Comments: Project website: [this https URL](https://menxli.github.io/metadent)

Abstract:Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.

51. 【2604.14849】Efficient Search of Implantable Adaptive Cells for Medical Image Segmentation

Link: https://arxiv.org/abs/2604.14849

Authors: Emil Benedykciuk, Marcin Denkowski, Grzegorz M. Wójcik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Adaptive skip modules, medical image segmentation, computationally costly, improve medical image, Implantable Adaptive Cells

Comments: 20 pages, 7 figures

Abstract:Purpose: Adaptive skip modules can improve medical image segmentation, but searching for them is computationally costly. Implantable Adaptive Cells (IACs) are compact NAS modules inserted into U-Net skip connections, reducing the search space compared with full-network NAS. However, the original IAC framework still requires a 200-epoch differentiable search for each backbone and dataset. Methods: We analyzed the temporal behavior of operations and edges within IAC cells during differentiable search on public medical image segmentation benchmarks. We found that operations selected in the final discrete cell typically emerge among the strongest candidates early in training, and their architecture parameters stabilize well before the final epoch. Based on this, we propose a Jensen--Shannon-divergence-based stability criterion that tracks per-edge operation-importance distributions and progressively prunes low-importance operations during search. The accelerated framework is called IAC-LTH. Results: Across four public benchmarks (ACDC, BraTS, KiTS, AMOS), several 2-D U-Net backbones, and a 2-D nnU-Net pipeline, IAC-LTH discovers IAC cells whose patient-level segmentation performance matches and sometimes slightly exceeds that of cells found by the original full-length search, while reducing wall-clock NAS cost by 3.7x to 16x across datasets and backbones. These results are consistent across architectures, benchmarks, and both non-augmented and augmented training settings, while preserving the gains of IAC-equipped U-Nets over strong attention-based and dense-skip baselines. Conclusion: Competitive IAC architectures can be identified from early-stabilizing operations without running the full search, making adaptive skip-module design more practical for medical image segmentation under realistic computational constraints.
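A stability criterion of the kind described in the Methods part can be sketched as follows: track the operation-importance distribution on one edge across epochs, and treat the edge as stable once consecutive distributions differ by less than a small Jensen-Shannon divergence. The window length, tolerance, and number of operations kept are illustrative assumptions, not the paper's settings.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the result lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def is_stable(history, window=3, tol=1e-3):
    """True if the last `window` epoch-to-epoch changes are all below `tol`.

    history: per-epoch operation-importance distributions for one edge.
    """
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return all(js_divergence(a, b) < tol for a, b in zip(recent, recent[1:]))

def prune_candidates(distribution, keep=2):
    """Indices of the `keep` most important operations, in ascending order."""
    ranked = sorted(range(len(distribution)),
                    key=lambda i: distribution[i], reverse=True)
    return sorted(ranked[:keep])
```

Once `is_stable` fires for an edge, `prune_candidates` would drop its low-importance operations so later epochs search a smaller space, which is where the reported wall-clock savings would come from.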

52. 【2604.14846】Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems

Link: https://arxiv.org/abs/2604.14846

Authors: Haileab Yagersew

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: require expensive custom, retail theft detection, Retail theft, custom model training, Retail theft costs

Comments: 16 pages, 3 figures, Code to be released at [this https URL](https://github.com/xHaileab/Paza-AI)

Abstract: Retail theft costs the global economy over $100 billion annually, yet existing AI-based detection systems require expensive custom model training on proprietary datasets and charge $200-500/month per store. We present Paza, a zero-shot retail theft detection framework that achieves practical concealment detection without training any model. Our approach orchestrates multiple existing models in a layered pipeline: cheap object detection and pose estimation run continuously, and an expensive vision-language model (VLM) is invoked only when behavioral pre-filters trigger. A multi-signal suspicion pre-filter (requiring dwell time plus at least one behavioral signal) reduces VLM invocations by 240x compared to per-frame analysis, bounding calls to ≤10/minute and enabling a single GPU to serve 10-20 stores. The architecture is model-agnostic: the VLM component accepts any OpenAI-compatible endpoint, enabling operators to swap between models such as Gemma 4, Qwen3.5-Omni, GPT-4o, or future releases without code changes, so the system improves as the VLM landscape evolves. We evaluate the VLM component on the DCSASS synthesized shoplifting dataset (169 clips, controlled environment), achieving 89.5% precision and 92.8% specificity at 59.3% recall zero-shot. The recall gap is attributable to sparse frame sampling in offline evaluation rather than VLM reasoning failures, and precision and specificity are the operationally critical metrics determining false alarm rates. We present a detailed cost model showing viability at $50-100/month per store (3-10x cheaper than commercial alternatives), and introduce a privacy-preserving design that obfuscates faces in the detection pipeline. The source code is available at this https URL.
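The layered gating described in this abstract (dwell time plus at least one behavioral signal, with a cap on VLM calls per minute) can be sketched as a small stateful filter. The class name, the dwell threshold, and the signal names are hypothetical; only the overall rule and the per-minute cap come from the abstract.

```python
from collections import deque

class SuspicionPreFilter:
    """Decide whether a tracked person warrants an expensive VLM call."""

    def __init__(self, min_dwell_s=10.0, max_calls_per_min=10):
        self.min_dwell_s = min_dwell_s          # assumed dwell threshold
        self.max_calls_per_min = max_calls_per_min
        self._calls = deque()                   # timestamps of recent VLM calls

    def should_invoke_vlm(self, now_s, dwell_s, signals):
        """signals: dict of behavioral flags, e.g. {"hand_to_pocket": True}."""
        # Rule from the abstract: dwell time AND at least one behavioral signal.
        if dwell_s < self.min_dwell_s or not any(signals.values()):
            return False
        # Sliding one-minute window keeps the call rate bounded.
        while self._calls and now_s - self._calls[0] >= 60.0:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls_per_min:
            return False
        self._calls.append(now_s)
        return True
```

The cheap detectors would feed `dwell_s` and `signals` on every frame, while the VLM is called only when this method returns `True`, which is how a single GPU could plausibly be shared across many camera feeds.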

53. 【2604.14837】Improved Multiscale Structural Mapping with Supervertex Vision Transformer for the Detection of Alzheimer's Disease Neurodegeneration

Link: https://arxiv.org/abs/2604.14837

Authors: Geonwoo Baek, David H. Salat, Ikbeom Jang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: positron emission tomography, Alzheimer disease, MSSM, emission tomography, cerebrospinal fluid

Comments: Submitted to Human Brain Mapping

Abstract: Alzheimer's disease (AD) confirmation often relies on positron emission tomography (PET) or cerebrospinal fluid (CSF) analysis, which are costly and invasive. Consequently, structural MRI biomarkers such as cortical thickness (CT) are widely used for non-invasive AD screening. Multiscale structural mapping (MSSM) was recently proposed to integrate gray-white matter contrasts (GWCs) with CT from a single T1-weighted MRI (T1w) scan. Building on this framework, we propose MSSM+, together with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT). 3D T1w images from individuals with AD and cognitively normal (CN) controls were analyzed. MSSM+ extends MSSM by incorporating sulcal depth and cortical curvature at the vertex level. SSVM partitions the cortical surface into supervertices (surface patches) that effectively represent inter- and intra-regional spatial relationships. SV-ViT is a Vision Transformer architecture operating on these supervertices, enabling anatomically informed learning from surface mesh representations. Compared with MSSM, MSSM+ identified more spatially extensive and statistically significant group differences between AD and CN. In AD vs. CN classification, MSSM+ achieved an area under the precision-recall curve 3 percentage points higher than MSSM. Vendor-specific analyses further demonstrated reduced signal variability and consistently improved classification performance across MR manufacturers relative to CT, GWCs, and MSSM. These findings suggest that MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection prior to CSF/PET confirmation.

54. 【2604.14816】NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results

Link: https://arxiv.org/abs/2604.14816

Authors: Andrey Moskalenko, Alexey Bryncev, Ivan Kosmynin, Kira Shilovskaya, Mikhail Erofeev, Dmitry Vatolin, Radu Timofte, Kun Wang, Yupeng Hu, Zhiran Li, Hao Liu, Qianlong Xiang, Liqiang Nie, Konstantinos Chaldaiopoulos, Niki Efthymiou, Athanasia Zlatintsi, Panagiotis Filntisis, Katerina Pastra, Petros Maragos, Li Yang, Gen Zhan, Yiting Liao, Yabin Zhang, Yuxin Liu, Xu Wu, Yunheng Zheng, Linze Li, Kun He, Cong Wu, Xuefeng Zhu, Tianyang Xu, Xiaojun Wu, Wenzhuo Zhao, Keren Fu, Gongyang Li, Shixiang Shi, Jianlin Chen, Haibin Ling, Yaoxin Jiang, Guoyi Xu, Jiajia Liu, Yaokun Shi, Jiachen Tu

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)

Keywords: Video Saliency Prediction, saliency map prediction, paper presents, presents an overview, map prediction methods

Comments: CVPRW 2026

Abstract:This paper presents an overview of the NTIRE 2026 Challenge on Video Saliency Prediction. The goal of the challenge participants was to develop automatic saliency map prediction methods for the provided video sequences. The novel dataset of 2,000 diverse videos with an open license was prepared for this challenge. The fixations and corresponding saliency maps were collected using crowdsourced mouse tracking and contain viewing data from over 5,000 assessors. Evaluation was performed on a subset of 800 test videos using generally accepted quality metrics. The challenge attracted over 20 teams making submissions, and 7 teams passed the final phase with code review. All data used in this challenge is made publicly available - this https URL.

55. 【2604.14805】From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation

Link: https://arxiv.org/abs/2604.14805

Authors: Yili Ren, Shiqi Wen, Li Hou, Dingwen Xiao, Weiming Zhang, Caleb Chen Cao, Lin Wang, Zilu Zheng, Qianxiao Su, Mingjun Zhao, Lei Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: quantifying rock fabric, lithology semantic segmentation, Grain-edge segmentation, fabric and composition, lithology semantic

Abstract: Grain-edge segmentation (GES) and lithology semantic segmentation (LSS) are two pivotal tasks for quantifying rock fabric and composition. However, these two tasks are often treated separately, and segmentation quality remains unsatisfactory even though expensive, time-consuming, expert-annotated datasets have been used. Recently, foundation models, especially the Segment Anything Model (SAM), have demonstrated impressive robustness for boundary alignment. However, directly adapting SAM to joint GES and LSS is nontrivial due to 1) a severe domain gap induced by extinction-dependent color variations and ultra-fine grain boundaries, and 2) the lack of modules for joint learning on multi-angle petrographic image stacks. In this paper, we propose Petro-SAM, a novel two-stage, multi-task framework that can achieve high-quality joint GES and LSS on petrographic images. Specifically, based on SAM, we introduce a Merge Block to integrate seven polarized views, effectively solving the extinction issue. Moreover, we introduce multi-scale feature fusion and color-entropy priors to refine the detection.

56. 【2604.14799】Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

Link: https://arxiv.org/abs/2604.14799

Authors: Nishanth Madhusudhan, Vikas Yadav, Alexandre Lacoste

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: recognizing evidence insufficiency, reliable multimodal systems, refraining from answering, insufficiency and refraining, critical for reliable

Comments: 10 pages and 4 figures (excluding appendix)

Abstract:Effective abstention (EA), recognizing evidence insufficiency and refraining from answering, is critical for reliable multimodal systems. Yet existing evaluation paradigms for vision-language models (VLMs) and multi-agent systems (MAS) assume answerability, pushing models to always respond. Abstention has been studied in text-only settings but remains underexplored multimodally; current benchmarks either ignore unanswerability or rely on coarse methods that miss realistic failure modes. We introduce MM-AQA, a benchmark that constructs unanswerable instances from answerable ones via transformations along two axes: visual modality dependency and evidence sufficiency. Evaluating three frontier VLMs spanning closed and open-source models and two MAS architectures across 2079 samples, we find: (1) under standard prompting, VLMs rarely abstain; even simple confidence baselines outperform this setup, (2) MAS improves abstention but introduces an accuracy-abstention trade-off, (3) sequential designs match or exceed iterative variants, suggesting the bottleneck is miscalibration rather than reasoning depth, and (4) models abstain when image or text evidence is absent, but attempt reconciliation with degraded or contradictory evidence. Effective multimodal abstention requires abstention-aware training rather than better prompting or more agents.

57. 【2604.14782】One-shot Compositional 3D Head Avatars with Deformable Hair

Link: https://arxiv.org/abs/2604.14782

Authors: Yuan Sun, Xuan Wang, WeiLi Zhang, Wenxuan Zhang, Yu Guo, Fei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: constructing a complete, propose a compositional, hair, image, compositional method

Comments: project page: [this https URL](https://yuansun-xjtu.github.io/CompHairHead.io)

Abstract:We propose a compositional method for constructing a complete 3D head avatar from a single image. Prior one-shot holistic approaches frequently fail to produce realistic hair dynamics during animation, largely due to inadequate decoupling of hair from the facial region, resulting in entangled geometry and unnatural deformations. Our method explicitly decouples hair from the face, modeling these components using distinct deformation paradigms while integrating them into a unified rendering pipeline. Furthermore, by leveraging image-to-3D lifting techniques, we preserve fine-grained textures from the input image to the greatest extent possible, effectively mitigating the common issue of high-frequency information loss in generalized models. Specifically, given a frontal portrait image, we first perform hair removal to obtain a bald image. Both the original image and the bald image are then lifted to dense, detail-rich 3D Gaussian Splatting (3DGS) representations. For the bald 3DGS, we rig it to a FLAME mesh via non-rigid registration with a prior model, enabling natural deformation that follows the mesh triangles during animation. For the hair component, we employ semantic label supervision combined with a boundary-aware reassignment strategy to extract a clean and isolated set of hair Gaussians. To control hair deformation, we introduce a cage structure that supports Position-Based Dynamics (PBD) simulation, allowing realistic and physically plausible transformations of the hair Gaussian primitives under head motion, gravity, and inertial effects. Striking qualitative results, including dynamic animations under diverse head motions, gravity effects, and expressions, showcase substantially more realistic hair behavior alongside faithfully preserved facial details, outperforming state-of-the-art one-shot methods in perceptual realism.

58. 【2604.14781】Integrating Object Detection, LiDAR-Enhanced Depth Estimation, and Segmentation Models for Railway Environments

Link: https://arxiv.org/abs/2604.14781

Authors: Enrico Francesco Giannico, Federico Nesti, Gianluca D'Amico, Mauro Marinoni, Edoardo Carosio, Filippo Salotti, Salvatore Sabina, Giorgio Buttazzo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: ensuring safety, railway environments, environments is crucial, crucial for ensuring, estimate obstacle distances

Notes: Under submission for publication

Abstract:Obstacle detection in railway environments is crucial for ensuring safety. However, very few studies address the problem using a complete, modular, and flexible system that can both detect objects in the scene and estimate their distance from the vehicle. Most works focus solely on detection, others attempt to identify the track, and only a few estimate obstacle distances. Additionally, evaluating these systems is challenging due to the lack of ground truth data. In this paper, we propose a modular and flexible framework that identifies the rail track, detects potential obstacles, and estimates their distance by integrating three neural networks for object detection, track segmentation, and monocular depth estimation with LiDAR point clouds. To enable a reliable and quantitative evaluation, the proposed framework is assessed using a synthetic dataset (SynDRA), which provides accurate ground truth annotations, allowing for direct performance comparison with existing methods. The proposed system achieves a mean absolute error (MAE) as low as 0.63 meters by integrating monocular depth maps with LiDAR, enabling not only accurate distance estimates but also spatial perception of the scene.

59. 【2604.14779】AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning

Link: https://arxiv.org/abs/2604.14779

Authors: Peifeng Zhang, Zice Qiu, Donghua Yu, Shilei Cao, Juepeng Zheng, Yutong Lu, Haohuan Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: visual question answering, existing Continual Learning, unimodal architectures, question answering, built for symmetric

Notes: 18 pages, 9 figures. Submitted to ACM MM 2026

Abstract:In continual visual question answering (VQA), existing Continual Learning (CL) methods are mostly built for symmetric, unimodal architectures. However, modern Vision-Language Models (VLMs) violate this assumption, as their trainable components are inherently asymmetric. This structural mismatch renders VLMs highly prone to catastrophic forgetting when learning from continuous data streams. Specifically, the asymmetry causes standard global regularization to favor the massive language decoder during optimization, leaving the smaller but critical visual projection layers highly vulnerable to interference. Consequently, this localized degradation leads to a severe loss of compositional reasoning capabilities. To address this, we propose Asymmetric Information Masking (AIM), which balances stability and plasticity by applying targeted masks based on modality-specific sensitivity. Experiments on VQA v2 and GQA under continual VQA settings show that AIM achieves state-of-the-art performance in both Average Performance (AP) and Average Forgetting (AF), while better preserving generalization to novel skill-concept compositions.

60. 【2604.14762】OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism

Link: https://arxiv.org/abs/2604.14762

Authors: Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang, Clinton Fookes

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generalized Category Discovery, Generalized Category, partially labeled data, Category Discovery

Notes: Accepted to CVPR 2026 Findings

Abstract: Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our OmniGCD leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a GCD latent space, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a zero-shot GCD setting where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. Trained once on synthetic data, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of +6.2, +17.9, +1.5 and +12.7 for vision, text, audio and remote sensing). This highlights the importance of strong encoders while decoupling representation learning from category discovery. Improving modality-agnostic methods will propagate across modalities, enabling encoder development independent of GCD. Our work serves as a benchmark for future modality-agnostic GCD works, paving the way for scalable, human-inspired category discovery. All code is available at this https URL.

61. 【2604.14755】ASGNet: Adaptive Spectrum Guidance Network for Automatic Polyp Segmentation

Link: https://arxiv.org/abs/2604.14755

Authors: Yanguang Sun, Hengmin Zhang, Jianjun Qian, Jian Yang, Lei Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: developing colorectal cancer, Early identification, colorectal cancer, polyp segmentation, identification and removal

Notes: Accepted at TCSVT 2026

Abstract: Early identification and removal of polyps can reduce the risk of developing colorectal cancer. However, the diverse morphologies, complex backgrounds and often concealed nature of polyps make polyp segmentation in colonoscopy images highly challenging. Despite the promising performance of existing deep learning-based polyp segmentation methods, their perceptual capabilities remain biased toward local regions, mainly because of the strong spatial correlations between neighboring pixels in the spatial domain. This limitation makes it difficult to capture complete polyp structures, ultimately leading to sub-optimal segmentation results. In this paper, we propose a novel adaptive spectrum guidance network, called ASGNet, which addresses the limitations of spatial perception by integrating spectral features with global attributes. Specifically, we first design a spectrum-guided non-local perception module that jointly aggregates local and global information, thereby enhancing the discriminability of polyp structures and refining their boundaries. Moreover, we introduce a multi-source semantic extractor that integrates rich high-level semantic information to assist in the preliminary localization of polyps. Furthermore, we construct a dense cross-layer interaction decoder that effectively integrates diverse information from different layers and strengthens it to generate high-quality representations for accurate polyp segmentation. Extensive quantitative and qualitative results demonstrate the superiority of our ASGNet approach over 21 state-of-the-art methods across five widely-used polyp segmentation benchmarks. The code will be publicly available at: this https URL.

62. 【2604.14747】Efficient closed-form approaches for pose estimation using Sylvester forms

Link: https://arxiv.org/abs/2604.14747

Authors: Jana Vráblíková (AROMATH), Ezio Malis (ACENTAURI), Laurent Busé (AROMATH)

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Solving non-linear least-squares, computer vision applications, real-time computer vision, Solving non-linear, non-linear least-squares problem

Abstract: Solving the non-linear least-squares problem for pose estimation (rotation and translation) is often a time-consuming yet fundamental step in several real-time computer vision applications. With an adequate rotation parametrization, the optimization problem can be reduced to the solution of a system of polynomial equations and solved in closed form. Recent advances in efficient closed-form solvers utilizing resultant matrices have shown a promising research direction to decrease the computation time while preserving the estimation accuracy. In this paper, we propose a new class of resultant-based solvers that exploit Sylvester forms to further reduce the complexity of the resolution. We demonstrate that our proposed methods are numerically as accurate as the state-of-the-art solvers, and outperform them in terms of computational time. We show that this approach can be applied to pose estimation in two different types of problems: estimating a pose from 3D-to-3D correspondences, and estimating a pose from 3D-to-2D point correspondences.

63. 【2604.14734】Find the Differences: Differential Morphing Attack Detection vs Face Recognition

Link: https://arxiv.org/abs/2604.14734

Authors: Una M. Kelly, Luuk J. Spreeuwers, Raymond N.J. Veldhuis

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: morphing attack detection, attack detection solutions, morphing attacks, face recognition, Morphing

Abstract:Morphing is a challenge to face recognition (FR) for which several morphing attack detection solutions have been proposed. We argue that face recognition and differential morphing attack detection (D-MAD) in principle perform very similar tasks, which we support by comparing an FR system with two existing D-MAD approaches. We also show that currently used decision thresholds inherently lead to FR systems being vulnerable to morphing attacks and that this explains the tradeoff between performance on normal images and vulnerability to morphing attacks. We propose using FR systems that are already in place for morphing detection and introduce a new evaluation threshold that guarantees an upper limit to the vulnerability to morphing attacks - even of unknown types.

64. 【2604.14724】HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet

Link: https://arxiv.org/abs/2604.14724

Authors: Badri N. Patro, Vijay S. Agneeswaran

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: Vision State Space, State Space Models, Vision State, Space Models, State Space

Abstract: Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization: a single Gaussian-initialized complex kernel replacing traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN): an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU): magnitude-based gating for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2× faster inference than transformers (4.2 ms vs 9.2 ms for DeiT-S) and 1.4-1.9× speedup over scanning-based SSMs, while using less memory (2.1 GB vs 3.2-4.5 GB) and energy (12.5 J vs 18-25 J). HAMSA demonstrates strong generalization across transfer learning and dense prediction tasks.
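To illustrate why FFT-based convolution gives O(L log L) cost, here is a minimal, dependency-free sketch. It checks that multiplying spectra element-wise matches a direct O(L²) circular convolution. It does not reproduce HAMSA's complex kernel, SPN, or SAGU; the function names and the radix-2 FFT are illustrative assumptions.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT (length must be a power of two); unnormalized."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def spectral_conv(x, kernel):
    """Circular convolution via FFT: O(L log L) instead of O(L^2)."""
    n = len(x)
    assert n == len(kernel) and (n & (n - 1)) == 0, "equal power-of-two lengths required"
    spectrum = [a * b for a, b in zip(fft(x), fft(kernel))]
    # Inverse transform, then normalize by n and keep the real part.
    return [v.real / n for v in fft(spectrum, invert=True)]

def direct_circular_conv(x, kernel):
    """Reference O(L^2) circular convolution for comparison."""
    n = len(x)
    return [sum(x[j] * kernel[(i - j) % n] for j in range(n)) for i in range(n)]

# Usage: both routes agree up to floating-point error.
signal = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 0.5, 2.0]
kernel = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
fast = spectral_conv(signal, kernel)
slow = direct_circular_conv(signal, kernel)
```

A spectral-domain model can skip the forward transform of the kernel by learning it directly in frequency space, which appears to be the idea behind the single complex kernel described in the abstract.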

65. 【2604.14720】Data Synthesis Improves 3D Myotube Instance Segmentation

Link: https://arxiv.org/abs/2604.14720

Authors: David Exler, Nils Friederich, Martin Krüger, John Jbeily, Mario Vitacolonna, Rüdiger Rudolf, Ralf Mikut, Markus Reischl

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: studying muscle physiology, multinucleated muscle fibers, muscle fibers serving, disease mechanisms, key model systems

Notes: 4 pages, 4 figures, submitted to BMT (VDE) 2026 Conference

Abstract:Myotubes are multinucleated muscle fibers serving as key model systems for studying muscle physiology, disease mechanisms, and drug responses. Mechanistic studies and drug screening thereby rely on quantitative morphological readouts such as diameter, length, and branching degree, which in turn require precise three-dimensional instance segmentation. Yet established pretrained biomedical segmentation models fail to generalize to this domain due to the absence of large annotated myotube datasets. We introduce a geometry-driven synthesis pipeline that models individual myotubes via polynomial centerlines, locally varying radii, branching structures, and ellipsoidal end caps derived from real microscopy observations. Synthetic volumes are rendered with realistic noise, optical artifacts, and CycleGAN-based Domain Adaptation (DA). A compact 3D U-Net with self-supervised encoder pretraining, trained exclusively on synthetic data, achieves a mean IPQ of 0.22 on real data, significantly outperforming three established zero-shot segmentation models, demonstrating that biophysics-driven synthesis enables effective instance segmentation in annotation-scarce biomedical domains.

66. 【2604.14711】MS-SSE-Net: A Multi-Scale Spatial Squeeze-and-Excitation Network for Structural Damage Detection in Civil and Geotechnical Engineering

Link: https://arxiv.org/abs/2604.14711

Authors: Saif ur Rehman Khan, Imad Ahmed Waqar, Arooj Zaib, Saad Ahmed, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Structural damage detection, Structural damage, civil infrastructure, detection is essential, essential for maintaining

Abstract:Structural damage detection is essential for maintaining the safety and reliability of civil infrastructure. However, accurately identifying different types of structural damage from images remains challenging due to variations in damage patterns and environmental conditions. To address these challenges, this paper proposes MS-SSE-Net, a novel deep learning (DL) framework for structural damage classification. The proposed model is built upon the DenseNet201 backbone and integrates novel multi-scale feature extraction with channel and spatial attention mechanisms (MS-SSE-Net). Specifically, parallel depthwise convolutions capture both local and contextual features, while squeeze-and-excitation style channel attention and spatial attention emphasize informative regions and suppress irrelevant noise. The refined features are then processed through global average pooling and a fully connected classification layer to generate the final predictions. Experiments are conducted on the StructDamage dataset containing multiple structural damage categories. The proposed MS-SSE-Net demonstrates superior performance compared with the baseline DenseNet201 and other comparative approaches. Specifically, the proposed method achieves 99.31% precision, 99.25% recall, 99.27% F1-score, and 99.26% accuracy, outperforming the baseline model which achieved 98.62% precision, 98.53% recall, 98.58% F1-score, and 98.53% accuracy.

67. 【2604.14710】G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

Link: https://arxiv.org/abs/2604.14710

Authors: Jiyoung Lim, Heejae Yang, Jee-Hyong Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: retrieve target images, Large Language Models, Composed Image Retrieval, Multimodal Large Language, aims to retrieve

Notes: CVPR 2026 Accepted

Abstract:Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). G-MIXER constructs composed query features that reflect the implicit semantics of reference image-text pairs through geodesic mixup over a range of mixup ratios, and builds a diverse candidate set. The generated candidates are then re-ranked using explicit semantics derived from MLLMs, improving both retrieval diversity and accuracy. Our proposed G-MIXER achieves state-of-the-art performance across multiple ZS-CIR benchmarks, effectively handling both implicit and explicit semantics without additional training. Our code will be available at this https URL.
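The abstract does not give the exact formula, so the sketch below assumes that "geodesic mixup" means spherical linear interpolation (slerp) between unit-normalized image and text features, evaluated over several mixup ratios. The function names and the ratio values are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def geodesic_mixup(image_feat, text_feat, lam):
    """Interpolate along the great circle between two unit vectors.

    lam = 0 returns the image feature, lam = 1 the text feature.
    """
    u, v = normalize(image_feat), normalize(text_feat)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    theta = math.acos(dot)
    s = math.sin(theta)
    if s < 1e-8:
        # Vectors are (anti)parallel: the geodesic is degenerate, fall back to u.
        return u
    w_u = math.sin((1.0 - lam) * theta) / s
    w_v = math.sin(lam * theta) / s
    return [w_u * a + w_v * b for a, b in zip(u, v)]

def composed_queries(image_feat, text_feat, ratios=(0.2, 0.4, 0.6, 0.8)):
    """Build one composed query per mixup ratio (ratios are illustrative)."""
    return [geodesic_mixup(image_feat, text_feat, r) for r in ratios]

# Usage: each composed query stays on the unit sphere.
queries = composed_queries([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Each composed query would then retrieve its own top candidates, and the union would form the diverse candidate set that the MLLM-derived explicit semantics re-rank.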

68. 【2604.14706】NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation

Link: https://arxiv.org/abs/2604.14706

Authors: Yi He, Tao Wang, Yi Jin, Congyan Lang, Yidong Li, Haibin Ling

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, Gaussian Splatting, enabled highly efficient, view synthesis, enabled highly

Notes: Accepted to CVPR 2026 (Highlight)

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG-GS, a novel framework for high-quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi-resolution hash encoding for efficient multi-scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS, LERF-OVS, and ScanNet benchmarks demonstrate that our method achieves state-of-the-art performance, with significant gains in boundary mIoU. Code is available at this https URL.

69. 【2604.14703】The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment

Link: https://arxiv.org/abs/2604.14703

Authors: Songlin Li, Zhiqing Guo, Dan Ma, Changtao Miao, Gaobo Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: incorporate authenticity-related supervision, auxiliary training signal, image manipulation localization, localization evidence opposing, existing image manipulation

Abstract: Although some existing image manipulation localization (IML) methods incorporate authenticity-related supervision, this information is typically utilized merely as an auxiliary training signal to enhance the model's sensitivity to manipulation artifacts, rather than being explicitly modeled as localization evidence opposing the manipulated regions. Consequently, when manipulation traces are subtle or degraded by post-processing and noise, these methods struggle to explicitly compare manipulated and authentic evidence, resulting in unreliable predictions in ambiguous areas. To address these issues, we propose a courtroom-style adjudication framework that regards the IML task as a confrontation of evidence followed by judgment. The framework comprises a prosecution stream, a defense stream, and a judge model. We first build a dual-hypothesis segmentation architecture on a shared multi-scale encoder, in which the prosecution stream asserts manipulation and the defense stream asserts authenticity. Guided by edge priors, it produces evidence for manipulated and authentic regions through cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. We further develop a reinforcement learning judge model that performs strategic re-inference and refinement on uncertain regions, yielding a manipulated-region mask. The judge model is trained with advantage-based rewards and a soft-IoU objective, and reliability is calibrated via entropy and cross-hypothesis consistency. Experimental results show that our model achieves superior average performance compared with SOTA IML methods.

70. 【2604.14692】Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Link: https://arxiv.org/abs/2604.14692

Authors: Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understanding requires identifying, existing object-agnostic solutions, object-agnostic solutions struggle, effectively handle substantial, Video understanding requires

Abstract: Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

71. 【2604.14684】DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

Link: https://arxiv.org/abs/2604.14684

Authors: Bo Qian, Dahu Shi, Xing Wei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: detection enables interactive, facilitating open-vocabulary detection, Visual, enables interactive, interactive and flexible

Notes: Published as a conference paper at ICLR 2026

Abstract: Visual-prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual-prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text-prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual-prompted detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.

72. 【2604.14656】Rethinking Patient Education as Multi-turn Multi-modal Interaction

Link: https://arxiv.org/abs/2604.14656

Authors: Zonghai Yao, Zhipeng Tang, Chengtao Lin, Xiong Luo, Benlu Wang, Juncheng Huang, Chin Siang Ong, Hong Yu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: focus on static, static tasks, image question answering, Patient education, multimodal benchmarks focus

Notes: Equal contribution for the first two authors

Abstract:Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: systems must identify relevant evidence across images, show patients where to look, explain findings in accessible language, and handle confusion or distress. Yet most patient education work remains text-only, even though combined image-and-text explanations may better support understanding. We introduce MedImageEdu, a benchmark for multi-turn, evidence-grounded radiology patient education. Each case provides a radiology report with report text and case images. A DoctorAgent interacts with a PatientAgent, conditioned on a hidden profile that captures factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions grounded in the report, case images, and the current question to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation. MedImageEdu contains 150 cases from three sources and evaluates both the consultation process and the final multimodal response along five dimensions: Consultation, Safety and Scope, Language Quality, Drawing Quality, and Image-Text Response Quality. Across representative open- and closed-source vision-language model agents, we find three consistent gaps: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder than low education or low health literacy. MedImageEdu provides a controlled testbed for assessing whether multimodal agents can teach from evidence rather than merely answer from text.

73. 【2604.14648】Seen-to-Scene: Keep the Seen, Generate the Unseen for Video Outpainting

Link: https://arxiv.org/abs/2604.14648

Authors: Inseok Jeon, Minhyeok Lee, Seunghoon Lee, Minseok Kang, Suhwan Cho, Sangyoun Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: preserving spatial fidelity, original frame boundaries, aims to expand, expand the visible, boundaries while preserving

Notes: 8 pages, 8 figures (main paper); 9 pages, 10 figures (supplementary). Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, Findings

Abstract: Video outpainting aims to expand the visible content of a video beyond the original frame boundaries while preserving spatial fidelity and temporal coherence across frames. Existing methods primarily rely on large-scale generative models, such as diffusion models. However, generation-based approaches suffer from implicit temporal modeling and limited spatial context. These limitations lead to intra-frame and inter-frame inconsistencies, which become particularly pronounced in dynamic scenes and large outpainting scenarios. To overcome these challenges, we propose Seen-to-Scene, a novel framework that unifies propagation-based and generation-based paradigms for video outpainting. Specifically, Seen-to-Scene leverages flow-based propagation with a flow completion network pre-trained for video inpainting, which is fine-tuned in an end-to-end manner to bridge the domain gap and reconstruct coherent motion fields. To further improve the efficiency and reliability of propagation, we introduce a reference-guided latent propagation that effectively propagates source content across frames. Extensive experiments demonstrate that our method achieves superior temporal coherence and visual realism with efficient inference, surpassing even prior state-of-the-art methods that require input-specific adaptation.

74. 【2604.14645】Chaotic CNN for Limited Data Image Classification

Link: https://arxiv.org/abs/2604.14645

Authors: Anusree M, Akhila Henry, Pramod P Nair

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Chaotic Dynamics (nlin.CD)

Keywords: Convolutional neural networks, exhibit poor generalisation, insufficient feature diversity, Convolutional neural, training data scenarios

Abstract:Convolutional neural networks (CNNs) often exhibit poor generalisation in limited training data scenarios due to overfitting and insufficient feature diversity. In this work, a simple and effective chaos-based feature transformation is proposed to enhance CNN performance without increasing model complexity. The method applies nonlinear transformations using logistic, skew tent, and sine maps to normalised feature vectors before the classification layer, thereby reshaping the feature space and improving class separability. The approach is evaluated on greyscale datasets (MNIST and Fashion-MNIST) and an RGB dataset (CIFAR-10) using CNN architectures of varying depth under limited data conditions. The results show consistent improvement over the standalone (SA) CNN across all datasets. Notably, a maximum performance gain of 5.43% is achieved on MNIST using the skew tent map with a 3-layer CNN at 40 samples per class. A higher gain of 9.11% is observed on Fashion-MNIST using the sine map with a 3-layer CNN at 50 samples per class. Additionally, a strong gain of 7.47% is obtained on CIFAR-10 using the skew tent map at 200 samples per class. The consistent improvements across different chaotic maps indicate that the performance gain is driven by the shared nonlinear and dynamical properties of chaotic systems. The proposed method is computationally efficient, requires no additional trainable parameters, and can be easily integrated into existing CNN architectures, making it a practical solution for data-scarce image classification tasks.
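As a rough sketch of the transformation described above, the snippet below min-max normalizes a feature vector and passes it through one of the three named chaotic maps. The map parameters (`r = 4.0`, `b = 0.499`), the single iteration, and the min-max normalization are assumptions for illustration; the abstract does not state the values the authors use.

```python
import math

def min_max_normalize(features):
    """Rescale a feature vector to [0, 1], the domain of the maps below."""
    lo, hi = min(features), max(features)
    if hi == lo:
        return [0.0 for _ in features]
    return [(f - lo) / (hi - lo) for f in features]

def logistic_map(x, r=4.0):
    """Logistic map x -> r * x * (1 - x); r = 4.0 is an assumed setting."""
    return r * x * (1.0 - x)

def skew_tent_map(x, b=0.499):
    """Skew tent map with breakpoint b (assumed value)."""
    return x / b if x < b else (1.0 - x) / (1.0 - b)

def sine_map(x):
    """Sine map x -> sin(pi * x)."""
    return math.sin(math.pi * x)

def chaotic_transform(features, chaotic_map, iterations=1):
    """Apply a chaotic map element-wise to normalized features.

    Adds no trainable parameters; the output feeds the classification layer.
    """
    out = min_max_normalize(features)
    for _ in range(iterations):
        out = [chaotic_map(x) for x in out]
    return out

# Usage: transform a hypothetical penultimate-layer feature vector.
penultimate = [0.3, -1.2, 2.5, 0.0, 1.1]
transformed = chaotic_transform(penultimate, skew_tent_map)
```

All three maps send [0, 1] back into [0, 1], so the classifier sees inputs in the same range with or without the transform.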

75. 【2604.14643】Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification

Link: https://arxiv.org/abs/2604.14643

Authors: Weiwei Zhuang, Wangze Xie, Qi Zhang, Xia Du, Zihan Lin, Zheng Lin, Hanlin Cai, Jizhe Zhou, Zihan Fang, Chi-man Pun, Wei Ni, Jun Luo

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Adversarial attacks pose, remote sensing, deep learning models, attacks pose, pose a severe

Notes: 14 pages, 11 figures

Abstract: Adversarial attacks pose a severe threat to the reliability of deep learning models in remote sensing (RS) image classification. Most existing methods rely on direct pixel-wise perturbations, failing to exploit the inherent atmospheric characteristics of RS imagery or survive real-world image degradations. In this paper, we propose FogFool, a physically plausible adversarial framework that generates fog-based perturbations by iteratively optimizing atmospheric patterns based on Perlin noise. By modeling fog formations with natural, irregular structures, FogFool generates adversarial examples that are not only visually consistent with authentic RS scenes but also deceptive. By leveraging the spatial coherence and mid-to-low-frequency nature of atmospheric phenomena, FogFool embeds adversarial information into structural features shared across diverse architectures. Extensive experiments on two benchmark RS datasets demonstrate that FogFool achieves superior performance: it not only excels in white-box settings, but also exhibits exceptional black-box transferability (reaching 83.74% TASR) and robustness against common preprocessing-based defenses such as JPEG compression and filtering. Detailed analyses, including confusion matrices and Class Activation Map (CAM) visualizations, reveal that our atmospheric-driven perturbations induce a universal shift in model attention. These results indicate that FogFool represents a practical, stealthy, and highly persistent threat to RS classification systems, providing a robust benchmark for evaluating model reliability in complex environments.

76. 【2604.14632】High-Speed Full-Color HDR Imaging via Unwrapping Modulo-Encoded Spike Streams

Link: https://arxiv.org/abs/2604.14632

Authors: Chu Zhou,Siqi Yang,Kailong Zhang,Heng Guo,Zhaofei Yu,Boxin Shi,Imari Sato

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Conventional RGB-based high, irreversible information loss, Conventional RGB-based, single-shot techniques, faces a fundamental

Comments: TPAMI under review

Abstract:Conventional RGB-based high dynamic range (HDR) imaging faces a fundamental trade-off between motion artifacts in multi-exposure captures and irreversible information loss in single-shot techniques. Modulo sensors offer a promising alternative by encoding theoretically unbounded dynamic range into wrapped measurements. However, existing modulo solutions remain bottlenecked by iterative unwrapping overhead and hardware constraints limiting them to low-speed, grayscale capture. In this work, we present a complete modulo-based HDR imaging system that enables high-speed, full-color HDR acquisition by synergistically advancing both the sensing formulation and the unwrapping algorithm. At the core of our approach is an exposure-decoupled formulation of modulo imaging that allows multiple measurements to be interleaved in time, preserving a clean, observation-wise measurement model. Building upon this, we introduce an iteration-free unwrapping algorithm that integrates diffusion-based generative priors with the physical least absolute remainder property of modulo images, supporting highly efficient, physics-consistent HDR reconstruction. Finally, to validate the practical viability of our system, we demonstrate a proof-of-concept hardware implementation based on modulo-encoded spike streams. This setup preserves the native high temporal resolution of spike cameras, achieving 1000 FPS full-color imaging while reducing output data bandwidth from approximately 20 Gbps to 6 Gbps. Extensive evaluations indicate that our coordinated approach successfully overcomes key systemic bottlenecks, demonstrating the feasibility of deploying modulo imaging in dynamic scenarios.
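
To make the idea of wrapped measurements concrete, here is a minimal 1D sketch. It is not the paper's algorithm: it shows a classical Itoh-style unwrapper that assumes neighbouring samples differ by less than half the wrap period, so each wrapped difference can be replaced by its least-absolute remainder. The function names, the period of 256, and the toy scanline are all assumptions made for illustration; the paper instead pairs a remainder property of this kind with diffusion-based priors.

```python
def wrap(signal, period):
    """Simulate a modulo sensor: keep only the remainder of each sample."""
    return [s % period for s in signal]


def least_abs_remainder(d, period):
    """Map a difference d to the equivalent value in [-period/2, period/2)."""
    return (d + period / 2) % period - period / 2


def unwrap_1d(wrapped, period):
    """Itoh-style unwrapping: valid only if true neighbouring samples
    differ by less than period/2 and the first sample is not wrapped."""
    out = [wrapped[0]]
    for prev, cur in zip(wrapped, wrapped[1:]):
        out.append(out[-1] + least_abs_remainder(cur - prev, period))
    return out


signal = [10, 60, 110, 160, 210, 260, 300, 330, 310, 270]  # toy HDR scanline
period = 256                                               # hypothetical 8-bit wrap
wrapped = wrap(signal, period)       # values above 255 fold back into range
recovered = unwrap_1d(wrapped, period)
```

The smoothness assumption is exactly what breaks in real scenes with sharp edges, which is one reason learned priors are attractive for this problem.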

77. 【2604.14630】CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

Link: https://arxiv.org/abs/2604.14630

Authors: Inseok Jeon,Suhwan Cho,Minhyeok Lee,Seunghoon Lee,Minseok Kang,Jungho Lee,Chaewon Park,Donghyeong Kim,Sangyoun Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: unsupervised video object, video object segmentation, Recent advances, motion cues, advances in unsupervised

Comments: 6 pages, 5 figures. Accepted to IEEE ICIP 2025

Abstract:Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.

78. 【2604.14629】Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

Link: https://arxiv.org/abs/2604.14629

Authors: Haoyi Sun,Xiaoxiao Wang,Ning Mao,Qian Wang,Lifu Mu,Wen Zheng,Tao Wei,Wei Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large scale poses, scale poses significant, poses significant challenges, shown remarkable capabilities, joint vision-language understanding

Comments: 11 pages, 3 figures

Abstract:Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.
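
For readers unfamiliar with logit distillation, the sketch below shows the classic temperature-scaled KL objective between teacher and student output distributions. This is the standard baseline formulation, not the DBiLD loss proposed in the paper, whose exact form the abstract does not give. The logits, the temperature of 2.0, and the function names are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard logit distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]   # toy next-token logits from a hypothetical teacher
student = [2.5, 1.5, 0.2]   # toy logits from a hypothetical student
loss = kd_loss(teacher, student)
```

The loss is zero when the two distributions match and grows as the student's distribution drifts from the teacher's.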

79. 【2604.14622】Multigrain-aware Semantic Prototype Scanning and Tri-Token Prompt Learning Embraced High-Order RWKV for Pan-Sharpening

Link: https://arxiv.org/abs/2604.14622

Authors: Junfeng Li,Wenyang Zhou,Xueheng Li,Xuanhua He,Jianhou Gan,Wenqi Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multigrain-aware Semantic Prototype, high-order RWKV architecture, Semantic Prototype Scanning, Prototype Scanning paradigm, Multigrain-aware Semantic

Abstract:In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering. Specifically, our method contains three key components: 1) Multigrain-aware Semantic Prototype Scanning. Although RWKV offers an efficient linear-complexity alternative to Transformers, its conventional bidirectional raster scanning is still semantic-agnostic and prone to positional bias. To address this issue, we introduce a semantic-driven scanning strategy that leverages locality-sensitive hashing to group semantically related regions and construct multi-grain semantic prototypes, enabling context-aware token reordering and more coherent global interaction. 2) Tri-token Prompt Learning. We design a tri-token prompting mechanism consisting of a global token, cluster-derived prototype tokens, and a learnable register token. The global and prototype tokens provide complementary semantic priors for RWKV modeling, while the register token helps suppress noisy and artifact-prone intermediate representations. 3) Invertible Q-Shift. To counteract the loss of spatial details, we apply center difference convolution on the value pathway to inject high-frequency information, and introduce an invertible multi-scale Q-shift operation for efficient and lossless feature transformation without parameter-heavy receptive field expansion. Experimental results demonstrate the superiority of our method.

80. 【2604.14605】Towards Design Compositing

Link: https://arxiv.org/abs/2604.14605

Authors: Abhinav Mahajan,Abhikhya Tripathy,Sudeeksha Reddy Pala,Vaibhav Methi,K J Joseph,Balaji Vasan Srinivasan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphic design creation, design creation involves, creation involves harmoniously, involves harmoniously assembling, harmoniously assembling multimodal

Comments: Accepted at CVPR 2026 Workshop on CVEU

Abstract:Graphic design creation involves harmoniously assembling multimodal components such as images, text, logos, and other visual assets collected from diverse sources, into a visually-appealing and cohesive design. Recent methods have largely focused on layout prediction or complementary element generation, while retaining input elements exactly, implicitly assuming that provided components are already stylistically harmonious. In practice, inputs often come from disparate sources and exhibit visual mismatch, making this assumption limiting. We argue that identity-preserving stylization and compositing of input elements is a critical missing ingredient for truly harmonized components-to-design pipelines. To this end, we propose GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation, and can be plugged into any existing components-to-design or design-refining pipeline without modification. We demonstrate this by integrating GIST with two substantially different existing methods, LaDeCo and Design-o-meter. GIST shows significant improvements in visual harmony and aesthetic quality across both pipelines, as validated by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference over naive pasting. Project Page: this http URL.

81. 【2604.14591】Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models

Link: https://arxiv.org/abs/2604.14591

Authors: Amir El-Ghoussani,Marc Hölle,Gustavo Carneiro,Vasileios Belagiannis

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual autoregressive models, source image, prompt-guided image editing, address the problem, problem of prompt-guided

Comments: Accepted at the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Findings (CVPRF)

Abstract:We address the problem of prompt-guided image editing in visual autoregressive models. Given a source image and a target text prompt, we aim to modify the source image according to the target prompt, while preserving all regions which are unrelated to the requested edit. To this end, we present Masked Logit Nudging, which uses the source image token maps to introduce a guidance step that aligns the model's predictions under the target prompt with these source token maps. Specifically, we convert the fixed source encodings into logits using the VAR encoding, nudging the model's predicted logits towards the targets along a semantic trajectory defined by the source-target prompts. Edits are applied only within spatial masks obtained through a dedicated masking scheme that leverages cross-attention differences between the source and edited prompts. Then, we introduce a refinement to correct quantization errors and improve reconstruction quality. Our approach achieves the best image editing performance on the PIE benchmark at 512px and 1024px resolutions. Beyond editing, our method delivers faithful reconstructions and outperforms previous methods on COCO at 512px and OpenImages at 1024px. Overall, our method outperforms VAR-related approaches and achieves comparable or even better performance than diffusion models, while being much faster. Code is available at this https URL.

82. 【2604.14582】MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models

Link: https://arxiv.org/abs/2604.14582

Authors: Ruiqi Wang,Qi Yu,Jie Ma,Hanlin Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: High-resolution, high cost, land-cover products, land-cover, land-cover mapping

Abstract:High-resolution (HR) land-cover mapping is often constrained by the high cost of dense HR annotations. We revisit this problem from the perspective of map super-resolution, which enhances coarse low-resolution (LR) land-cover products into HR maps at the resolution of the input imagery. Existing weakly supervised methods can leverage LR labels, but they typically use them to retrain dense predictors with substantial computational cost. We propose MapSR, a prompt-driven framework that decouples supervision from model training. MapSR uses LR labels once to extract class prompts from frozen vision foundation model features through a lightweight linear probe, after which HR mapping proceeds via training-free metric inference and graph-based prediction refinement. Specifically, class prompts are estimated by aggregating high-confidence HR features identified by the linear probe, and HR predictions are obtained by cosine-similarity matching followed by graph-based propagation for spatial refinement. Experiments on the Chesapeake Bay dataset show that MapSR achieves 59.64% mIoU without any HR labels, remaining competitive with the strongest weakly supervised baseline and surpassing a fully supervised baseline. Notably, MapSR reduces trainable parameters by four orders of magnitude and shortens training time from hours to minutes, enabling scalable HR mapping under limited annotation and compute budgets. The code is available at this https URL.
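
The prompt-then-match step described above can be sketched as prototype classification: average the features of each class into a class prompt, then label a new feature by its highest cosine similarity. This is a minimal illustration under assumptions, not the authors' implementation: the 2-D feature vectors, class names, and function names are hypothetical, the linear probe that selects high-confidence features is assumed to have already run, and the graph-based refinement is omitted.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def class_prompts(features, labels):
    """Average the unit-normalized features of each class into one prototype."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        f = normalize(f)
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, x in enumerate(f):
            acc[i] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: normalize([x / counts[y] for x in acc]) for y, acc in sums.items()}


def predict(feature, prompts):
    """Assign the class whose prototype has the highest cosine similarity."""
    f = normalize(feature)
    return max(prompts, key=lambda y: sum(a * b for a, b in zip(f, prompts[y])))


# Toy 2-D features standing in for frozen foundation-model features (hypothetical)
feats = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
labels = ["water", "water", "forest", "forest"]
prompts = class_prompts(feats, labels)
pred = predict([0.7, 0.3], prompts)
```

Because inference is only a similarity lookup against a handful of prototypes, nothing is trained at this stage, which is consistent with the abstract's description of the matching step as training-free.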

83. 【2604.14580】TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

Link: https://arxiv.org/abs/2604.14580

Authors: Xiangyu Liu,Feng Gao,Xiaomei Zhang,Yong Zhang,Xiaoming Wei,Zhen Lei,Xiangyu Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)

Keywords: substantial computational overhead, Existing audio-driven video, video digital human, Existing audio-driven, digital human generation

Abstract:Existing audio-driven video digital human generation models rely on multi-step denoising, resulting in substantial computational overhead that severely limits their deployment in real-world settings. While one-step distillation approaches can significantly accelerate inference, they often suffer from training instability. To address this challenge, we propose TurboTalk, a two-stage progressive distillation framework that effectively compresses a multi-step audio-driven video diffusion model into a single-step generator. We first adopt Distribution Matching Distillation to obtain a strong and stable 4-step student, and then progressively reduce the denoising steps from 4 to 1 through adversarial distillation. To ensure stable training under extreme step reduction, we introduce a progressive timestep sampling strategy and a self-compare adversarial objective that provides an intermediate adversarial reference to stabilize progressive distillation. Our method achieves single-step generation of video talking avatars, boosting inference speed by 120 times while maintaining high generation quality.

84. 【2604.14574】M3D-Net: Multi-Modal 3D Facial Feature Reconstruction Network for Deepfake Detection

Link: https://arxiv.org/abs/2604.14574

Authors: Haotian Wu,Yue Cheng,Shan Bian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved unprecedented realism, facial forgery techniques, unprecedented realism, posing serious threats, information authenticity

Abstract:With the rapid advancement of deep learning in image generation, facial forgery techniques have achieved unprecedented realism, posing serious threats to cybersecurity and information authenticity. Most existing deepfake detection approaches rely on the reconstruction of isolated facial attributes without fully exploiting the complementary nature of multi-modal feature representations. To address these challenges, this paper proposes a novel Multi-Modal 3D Facial Feature Reconstruction Network (M3D-Net) for deepfake detection. Our method leverages an end-to-end dual-stream architecture that reconstructs fine-grained facial geometry and reflectance properties from single-view RGB images via a self-supervised 3D facial reconstruction module. The network further enhances detection performance through a 3D Feature Pre-fusion Module (PFM), which adaptively adjusts multi-scale features, and a Multi-modal Fusion Module (MFM) that effectively integrates RGB and 3D-reconstructed features using attention mechanisms. Extensive experiments on multiple public datasets demonstrate that our approach achieves state-of-the-art performance in terms of detection accuracy and robustness, significantly outperforming existing methods while exhibiting strong generalization across diverse scenarios.

85. 【2604.14570】Deepfake Detection Generalization with Diffusion Noise

Link: https://arxiv.org/abs/2604.14570

Authors: Hongyuan Qi,Wenjin Hou,Hehe Fan,Jun Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthesis techniques emerge, face growing challenges, detectors face growing, image synthesis techniques, Deepfake detectors face

Comments: 17 pages

Abstract:Deepfake detectors face growing challenges in generalization as new image synthesis techniques emerge. In particular, deepfakes generated by diffusion models are highly photorealistic and often evade detectors trained on GAN-based forgeries. This paper addresses the generalization problem in deepfake detection by leveraging diffusion noise characteristics. We propose an Attention-guided Noise Learning (ANL) framework that integrates a pre-trained diffusion model into the deepfake detection pipeline to guide the learning of more robust features. Specifically, our method uses the diffusion model's denoising process to expose subtle artifacts: the detector is trained to predict the noise contained in an input image at a given diffusion step, forcing it to capture discrepancies between real and synthetic images, while an attention-guided mechanism derived from the predicted noise is introduced to encourage the model to focus on globally distributed discrepancies rather than local patterns. By harnessing the frozen diffusion model's learned distribution of natural images, the ANL method acts as a form of regularization, improving the detector's generalization to unseen forgery types. Extensive experiments demonstrate that ANL significantly outperforms existing methods on multiple benchmarks, achieving state-of-the-art accuracy in detecting diffusion-generated deepfakes. Notably, the proposed framework boosts generalization performance (e.g., improving ACC/AP by a substantial margin on unseen models) without introducing additional overhead during inference. Our results highlight that diffusion noise provides a powerful signal for generalizable deepfake detection.

86. 【2604.14568】Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Link: https://arxiv.org/abs/2604.14568

Authors: Yixu Huang,Tinghui Zhu,Muhao Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: recently shown strong, shown strong cross-modal, strong cross-modal reasoning, cross-modal reasoning capabilities, Visual reasoning

Abstract:Visual reasoning models (VRMs) have recently shown strong cross-modal reasoning capabilities by integrating visual perception with language reasoning. However, they often suffer from overthinking, producing unnecessarily long reasoning chains for all tasks. We attribute this issue to Reasoning Path Redundancy in visual reasoning: many visual questions do not require the full reasoning process. To address this, we propose AVR, an adaptive visual reasoning framework that decomposes visual reasoning into three cognitive functions: visual perception, logical reasoning, and answer application. It further enables models to dynamically choose among three response formats: Full Format, Perception-Only Format, and Direct Answer. AVR is trained with FS-GRPO, an adaptation of Group Relative Policy Optimization that encourages the model to select the most efficient reasoning format while preserving correctness. Experiments on multiple vision-language benchmarks show that AVR reduces token usage by 50-90% while maintaining overall accuracy, especially in perception-intensive tasks. These results demonstrate that adaptive visual reasoning can effectively mitigate overthinking in VRMs. Code and data are available at: this https URL.

87. 【2604.14563】Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

Link: https://arxiv.org/abs/2604.14563

Authors: Mingqian Ji,Shanshan Zhang,Jian Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Transformer, based sparse multi-view, heavy token processing, achieved remarkable accuracy, high inference latency

Comments: Accepted by CVPR 2026

Abstract:Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy. Code is available at this https URL.

88. 【2604.14560】DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration

Link: https://arxiv.org/abs/2604.14560

Authors: Zheng Chen,Bowen Chai,Rongjun Gao,Mingtao Nie,Xi Li,Bingnan Duan,Jianping Fang,Xiaohong Liu,Linghe Kong,Yulun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aims to enhance, high-quality results, Video face restoration, Video face, face

Comments: Code is available at: [this https URL](https://github.com/zhengchen1999/DVFace)

Abstract:Video face restoration aims to enhance degraded face videos into high-quality results with realistic facial details, stable identity, and temporal coherence. Recent diffusion-based methods have brought strong generative priors to restoration and enabled more realistic detail synthesis. However, existing approaches for face videos still rely heavily on generic diffusion priors and multi-step sampling, which limit both facial adaptation and inference efficiency. These limitations motivate the use of one-step diffusion for video face restoration, yet achieving faithful facial recovery alongside temporally stable outputs remains challenging. In this paper, we propose DVFace, a one-step diffusion framework for real-world video face restoration. Specifically, we introduce a spatio-temporal dual-codebook design to extract complementary spatial and temporal facial priors from degraded videos. We further propose an asymmetric spatio-temporal fusion module to inject these priors into the diffusion backbone according to their distinct roles. Evaluation on various benchmarks shows that DVFace delivers superior restoration quality, temporal consistency, and identity preservation compared to recent methods. Code: this https URL.

89. 【2604.14558】The Fourth Challenge on Image Super-Resolution ($\times$4) at NTIRE 2026: Benchmark Results and Method Overview

Link: https://arxiv.org/abs/2604.14558

Authors: Zheng Chen,Kai Liu,Jingkai Wang,Xianglong Yan,Jianze Li,Ziqing Zhang,Jue Gong,Jiatong Li,Lei Sun,Xiaoyang Liu,Radu Timofte,Yulun Zhang,Jihye Park,Yoonjin Im,Hyungju Chun,Hyunhee Park,MinKyu Park,Zheng Xie,Xiangyu Kong,Weijun Yuan,Zhan Li,Qiurong Song,Luen Zhu,Fengkai Zhang,Xinzhe Zhu,Junyang Chen,Congyu Wang,Yixin Yang,Zhaorun Zhou,Jiangxin Dong,Jinshan Pan,Shengwei Wang,Jiajie Ou,Baiang Li,Sizhuo Ma,Qiang Gao,Jusheng Zhang,Jian Wang,Keze Wang,Yijiao Liu,Yingsi Chen,Hui Li,Yu Wang,Congchao Zhu,Saeed Ahmad,Ik Hyun Lee,Jun Young Park,Ji Hwan Yoon,Kainan Yan,Zian Wang,Weibo Wang,Shihao Zou,Chao Dong,Wei Zhou,Linfeng Li,Jaeseong Lee,Jaeho Chae,Jinwoo Kim,Seonjoo Kim,Yucong Hong,Zhenming Yan,Junye Chen,Ruize Han,Song Wang,Yuxuan Jiang,Chengxi Zeng,Tianhao Peng,Fan Zhang,David Bull,Tongyao Mu,Qiong Cao,Yifan Wang,Youwei Pan,Leilei Cao,Xiaoping Peng,Wei Deng,Yifei Chen,Wenbo Xiong,Xian Hu,Yuxin Zhang,Xiaoyun Cheng,Yang Ji,Zonghao Chen,Zhihao Xue,Junqin Hu,Nihal Kumar,Snehal Singh Tomar,Klaus Mueller,Surya Vashisth,Prateek Shaily,Jayant Kumar,Hardik Sharma,Ashish Negi,Sachin Chaudhary,Akshay Dudhane,Praful Hambarde,Amit Shukla,Shijun Shi,Jiangning Zhang,Yong Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Workshop at CVPR, presents the NTIRE, NTIRE, paper presents, image super-resolution

Comments: NTIRE 2026 webpage: [this https URL](https://cvlai.net/ntire/2026) . Code: [this https URL](https://github.com/zhengchen1999/NTIRE2026_ImageSR_x4)

Abstract:This paper presents the NTIRE 2026 image super-resolution ($\times$4) challenge, one of the associated competitions of the NTIRE 2026 Workshop at CVPR 2026. The challenge aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective super-resolution solutions and analyze recent advances in the field. To reflect the evolving objectives of image super-resolution, the challenge includes two tracks: (1) a restoration track, which emphasizes pixel-wise fidelity and ranks submissions based on PSNR; and (2) a perceptual track, which focuses on visual realism and evaluates results using a perceptual score. A total of 194 participants registered for the challenge, with 31 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and methods of participating teams. The challenge provides a unified benchmark and offers insights into current progress and future directions in image super-resolution.
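
Since the restoration track ranks submissions by PSNR, a small sketch of the metric may help. It uses the textbook definition, PSNR = 10 log10(MAX^2 / MSE), on flat lists of 8-bit intensities. The pixel values and function name are illustrative assumptions, and the challenge's exact evaluation protocol (colour channel, border cropping) is not stated in the abstract and is not reproduced here.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images,
    given here as flat lists of pixel intensities."""
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


hr = [52, 55, 61, 59, 79, 61, 76, 61]   # toy ground-truth pixels
sr = [54, 55, 60, 59, 80, 63, 76, 60]   # toy super-resolved pixels
score = psnr(hr, sr)
```

Higher is better. Because PSNR rewards pixel-wise fidelity only, the challenge scores visual realism separately in its perceptual track.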

90. 【2604.14556】Controllable Video Object Insertion via Multiview Priors

Link: https://arxiv.org/abs/2604.14556

Authors: Xia Qi,Peishan Cong,Yichen Yao,Ziyi Wang,Yaoqin Ye,Yuexin Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Video object insertion, Video object, object insertion, critical task, task for dynamically

Abstract:Video object insertion is a critical task for dynamically inserting new objects into existing environments. Previous video generation methods focus primarily on synthesizing entire scenes while struggling with ensuring consistent object appearance, spatial alignment, and temporal coherence when inserting objects into existing videos. In this paper, we propose a novel solution for Video Object Insertion, which integrates multi-view object priors to address the common challenges of appearance inconsistency and occlusion handling in dynamic environments. By lifting 2D reference images into multi-view representations and leveraging a dual-path view-consistent conditioning mechanism, our framework ensures stable identity guidance and robust integration across diverse viewpoints. A quality-aware weighting mechanism is also employed to adaptively handle noisy or imperfect inputs. Additionally, we introduce an Integration-Aware Consistency Module that guarantees spatial realism, effectively resolving occlusion and boundary artifacts while maintaining temporal continuity across frames. Experimental results show that our solution significantly improves the quality of video object insertion, providing stable and realistic integration.

91. 【2604.14541】Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars

Link: https://arxiv.org/abs/2604.14541

Authors: Yicheng Gong,Jiawei Zhang,Liqiang Liu,Yanwen Wang,Lei Chu,Jiahao Li,Hao Pan,Hao Zhu,Yan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: explicit emotion control, emotion, first-class control signal, explicit emotion, Abstract

Abstract:We present a framework for explicit emotion control in feed-forward, single-image 3D head avatar reconstruction. Unlike existing pipelines where emotion is implicitly entangled with geometry or appearance, we treat emotion as a first-class control signal that can be manipulated independently and consistently across identities. Our method injects emotion into existing feed-forward architectures via a dual-path modulation mechanism without modifying their core design. Geometry modulation performs emotion-conditioned normalization in the original parametric space, disentangling emotional state from speech-driven articulation, while appearance modulation captures identity-aware, emotion-dependent visual cues beyond geometry. To enable learning under this setting, we construct a time-synchronized, emotion-consistent multi-identity dataset by transferring aligned emotional dynamics across identities. Integrated into multiple state-of-the-art backbones, our framework preserves reconstruction and reenactment fidelity while enabling controllable emotion transfer, disentangled manipulation, and smooth emotion interpolation, advancing expressive and scalable 3D head avatars.

92. 【2604.14540】WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms

Link: https://arxiv.org/abs/2604.14540

Authors: Yucheng Pan,Heping Li,Zhangle Liu,Sajid Hussain,Bin Pan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthetic Aperture Radar, Interferometric Synthetic Aperture, efficient geohazard monitoring, complex coherence noise, Aperture Radar

Abstract:Detecting slow-moving landslides directly from wrapped Interferometric Synthetic Aperture Radar (InSAR) interferograms is crucial for efficient geohazard monitoring, yet it remains fundamentally challenged by severe phase ambiguity and complex coherence noise. While the Segment Anything Model (SAM) offers a powerful foundation for segmentation, its direct transfer to wrapped phase data is hindered by a profound spectral domain shift, which suppresses the high-frequency fringes essential for boundary delineation. To bridge this gap, we propose WILD-SAM, a novel parameter-efficient fine-tuning framework specifically designed to adapt SAM for high-precision landslide detection on wrapped interferograms. Specifically, the architecture integrates a Phase-Aware Mixture-of-Experts (PA-MoE) Adapter into the frozen encoder to align spectral distributions and introduces a Wavelet-Guided Subband Enhancement (WGSE) strategy to generate frequency-aware dense prompts. The PA-MoE Adapter exploits a dynamic routing mechanism across heterogeneous convolutional experts to adaptively aggregate multi-scale spectral-textural priors, effectively aligning the distribution discrepancy between natural images and interferometric phase data. Meanwhile, the WGSE strategy leverages discrete wavelet transforms to explicitly disentangle high-frequency subbands and refine directional phase textures, injecting these structural cues as dense prompts to ensure topological integrity along sharp landslide boundaries. Extensive experiments on the ISSLIDE and ISSLIDE+ benchmarks demonstrate that WILD-SAM achieves state-of-the-art performance, significantly outperforming existing methods in both target completeness and contour fidelity.

93. 【2604.14527】Design and Validation of a Low-Cost Smartphone Based Fluorescence Detection Platform Compared with Conventional Microplate Readers

Link: https://arxiv.org/abs/2604.14527

Authors: Zhendong Cao,Katrina G. Salvante,Ash Parameswaran,Pablo A. Nepomnaschy,Hongji Dai

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)

Keywords: low cost fluorescence-based, cost fluorescence-based optical, fluorescence-based optical system, Perkin Elmer Victor, low cost

Comments: 4 pages

Abstract:A low-cost fluorescence-based optical system is developed for detecting the presence of certain microorganisms and molecules within a diluted sample. A specifically designed device setup compatible with conventional 96-well plates is chosen to create an ideal environment in which a smartphone camera can be used as the optical detector. Unlike conventional microplate readers such as the Perkin Elmer Victor machine, the device presented in this paper is not equipped with expensive elements such as an exciter filter, a barrier filter, or a photomultiplier; instead, a phone camera is all that is needed to detect fluorescence within the sample. The strategy is to determine the relationship between the image color of the sample in RGB color space and the molar concentration of the fluorescent specimen in that sample. This manuscript is a preprint version of work related to a publication in IEEE. The final version may differ from this manuscript.
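
One way to picture the calibration step described in this abstract is a least-squares line relating a color-channel reading to known concentrations. The sketch below is only an illustration: the choice of the green channel, the linear model, the helper names, and every numeric value are assumptions made for the example, not details from the paper.

```python
# Hypothetical calibration: fit intensity = slope * concentration + intercept
# on wells of known concentration, then invert the line for an unknown well.


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_concentration(intensity, slope, intercept):
    """Invert the calibration line to recover a concentration."""
    return (intensity - intercept) / slope


# Made-up calibration data: concentrations and mean green-channel values.
known_conc = [0.0, 1.0, 2.0, 4.0]
green_mean = [10.0, 30.0, 50.0, 90.0]

slope, intercept = fit_line(known_conc, green_mean)
unknown = estimate_concentration(70.0, slope, intercept)
```

With these made-up points the fitted line is exact, so a reading of 70 maps back to a concentration of 3. Real camera data would be noisy and might call for a non-linear model.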

94. 【2604.14526】FreqTrack: Frequency Learning based Vision Transformer for RGB-Event Object Tracking

Link: https://arxiv.org/abs/2604.14526

Authors: Jinlin You, Muyu Li, Xudong Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing single-modal RGB, single-modal RGB trackers, event sensors offers, Existing single-modal, enhancing tracking capabilities

Notes

Abstract:Existing single-modal RGB trackers often face performance bottlenecks in complex dynamic scenes, while the introduction of event sensors offers new potential for enhancing tracking capabilities. However, most current RGB-event fusion methods, primarily designed in the spatial domain using convolutional, Transformer, or Mamba architectures, fail to fully exploit the unique temporal response and high-frequency characteristics of event data. To address this, we propose FreqTrack, a frequency-aware RGBE tracking framework that establishes complementary inter-modal correlations through frequency-domain transformations for more robust feature fusion. We design a Spectral Enhancement Transformer (SET) layer that incorporates multi-head dynamic Fourier filtering to adaptively enhance and select frequency-domain features. Additionally, we develop a Wavelet Edge Refinement (WER) module, which leverages learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, effectively improving modeling capability in high-speed and low-light scenarios. Extensive experiments on the COESOT and FE108 datasets demonstrate that FreqTrack achieves highly competitive performance, particularly attaining leading precision of 76.6% on the COESOT benchmark, validating the effectiveness of frequency-domain modeling for RGBE tracking.

95. 【2604.14520】Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs

Link: https://arxiv.org/abs/2604.14520

Authors: Ziyang Luo, Nian Liu, Junwei Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Omni-modal Large Language, Large Language Models, Omni-modal Large, Large Language, diverse sensory streams

Notes

Abstract:Omni-modal Large Language Models (Omni-MLLMs) promise a unified integration of diverse sensory streams. However, recent evaluations reveal a critical performance paradox: unimodal baselines frequently outperform joint multimodal inference. We trace this perceptual fragility to the static fusion topologies universally employed by current models, identifying two structural pathologies: positional bias in sequential inputs and alignment traps in interleaved formats, which systematically distort attention regardless of task semantics. To resolve this functional rigidity, we propose Chain of Modality (CoM), an agentic framework that transitions multimodal fusion from passive concatenation to dynamic orchestration. CoM adaptively orchestrates input topologies, switching among parallel, sequential, and interleaved pathways to neutralize structural biases. Furthermore, CoM bifurcates cognitive execution into two task-aligned pathways: a streamlined "Direct-Decide" path for direct perception and a structured "Reason-Decide" path for analytical auditing. Operating in either a training-free or a data-efficient SFT setting, CoM achieves robust and consistent generalization across diverse benchmarks.

96. 【2604.14519】CI-CBM: Class-Incremental Concept Bottleneck Model for Interpretable Continual Learning

Link: https://arxiv.org/abs/2604.14519

Authors: Amirhosein Javadi, Tuomas Oikarinen, Tara Javidi, Tsui-Wei Weng

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Catastrophic forgetting remains, forget previous knowledge, remains a fundamental, knowledge when fine-tuned, Catastrophic forgetting

Notes: 31 pages, 6 figures. Published in Transactions on Machine Learning Research (TMLR), 04/2026

Abstract:Catastrophic forgetting remains a fundamental challenge in continual learning, in which models often forget previous knowledge when fine-tuned on a new task. This issue is especially pronounced in class-incremental learning (CIL), which is the most challenging setting in continual learning. Existing methods to address catastrophic forgetting often sacrifice either model interpretability or accuracy. To address this challenge, we introduce the Class-Incremental Concept Bottleneck Model (CI-CBM), which leverages effective techniques, including concept regularization and pseudo-concept generation, to maintain interpretable decision processes throughout incremental learning phases. Through extensive evaluation on seven datasets, CI-CBM achieves comparable performance to black-box models and outperforms previous interpretable approaches in CIL, with an average 36% accuracy gain. CI-CBM provides interpretable decisions on individual inputs and understandable global decision rules, as shown in our experiments, thereby demonstrating that human-understandable concepts can be maintained during incremental learning without compromising model performance. Our approach is effective in both pretrained and non-pretrained scenarios; in the latter, the backbone is trained from scratch during the first learning phase. Code is publicly available at this http URL.

97. 【2604.14507】H2VLR: Heterogeneous Hypergraph Vision-Language Reasoning for Few-Shot Anomaly Detection

Link: https://arxiv.org/abs/2604.14507

Authors: Jianghong Huang, Luping Ji, Weiwei Duan, Mao Ye

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: classic vision task, classic vision, widely applied, vision task, FSAD

Notes: 9 pages, 5 figures

Abstract:As a classic vision task, anomaly detection has been widely applied in industrial inspection and medical imaging. In this task, data scarcity is a frequently faced issue. To address it, few-shot anomaly detection (FSAD) is attracting increasing attention. In recent years, beyond the traditional visual paradigm, Vision-Language Models (VLMs) have been extensively explored to advance this field. However, almost all existing VLM-based FSAD schemes perform anomaly inference only by pairwise feature matching, ignoring structural dependencies and global consistency. To further advance FSAD via VLMs, we propose a Heterogeneous Hypergraph Vision-Language Reasoning (H2VLR) framework. It reformulates FSAD as a high-order inference problem over visual-semantic relations by jointly modeling visual regions and semantic concepts in a unified hypergraph. Experimental comparisons verify the effectiveness and advantages of H2VLR, which often achieves state-of-the-art (SOTA) performance on representative industrial and medical benchmarks. Our code will be released upon acceptance.

98. 【2604.14506】Co-distilled attention guided masked image modeling with noisy teacher for self-supervised learning on medical images

Link: https://arxiv.org/abs/2604.14506

Authors: Jue Jiang, Aneesh Rangnekar, Harini Veeraraghavan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Masked image modeling, Masked image, highly effective self-supervised, unannotated data, extract useful feature

Notes: Accepted at MIDL 2025

Abstract:Masked image modeling (MIM) is a highly effective self-supervised learning (SSL) approach to extract useful feature representations from unannotated data. The predominantly used random masking methods make SSL less effective for medical images because of the contextual similarity of neighboring patches, which leads to information leakage and simplifies the SSL task. The hierarchical shifted window (Swin) transformer, a highly effective approach for medical images, cannot use advanced masking methods as it lacks a global [CLS] token. Hence, we introduced an attention-guided masking mechanism for Swin within a co-distillation learning framework to selectively mask semantically co-occurring and discriminative patches, reducing information leakage and increasing the difficulty of SSL pretraining. However, attention-guided masking inevitably reduces the diversity of attention heads, which negatively impacts downstream task performance. To address this, we integrate, for the first time, a noisy teacher into the co-distillation framework (termed DAGMaN) that performs attentive masking while preserving high attention head diversity. We demonstrate the capability of DAGMaN on multiple tasks including full- and few-shot lung nodule classification, immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering.

99. 【2604.14454】CooperDrive: Enhancing Driving Decisions Through Cooperative Perception

Link: https://arxiv.org/abs/2604.14454

Authors: Deyuan Qu, Qi Chen, Takayuki Shimizu, Onur Altintas

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autonomous vehicles equipped, robust onboard perception, increase collision risk, Autonomous vehicles, collision risk

Notes: Accepted at ICRA 2026

Abstract:Autonomous vehicles equipped with robust onboard perception, localization, and planning still face limitations in occlusion and non-line-of-sight (NLOS) scenarios, where delayed reactions can increase collision risk. We propose CooperDrive, a cooperative perception framework that augments situational awareness and enables earlier, safer driving decisions. CooperDrive offers two key advantages: (i) each vehicle retains its native perception, localization, and planning stack, and (ii) a lightweight object-level sharing and fusion strategy bridges perception and planning. Specifically, CooperDrive reuses detector Bird's-Eye View (BEV) features to estimate accurate vehicle poses without additional heavy encoders, thereby reconstructing BEV representations and feeding the planner with low latency. On the planning side, CooperDrive leverages the expanded object set to anticipate potential conflicts earlier and adjust speed and trajectory proactively, thereby transforming reactive behaviors into predictive and safer driving decisions. Real-world closed-loop tests at occlusion-heavy NLOS intersections demonstrate that CooperDrive increases reaction lead time, minimum time-to-collision (TTC), and stopping margin, while requiring only 90 kbps bandwidth and maintaining an average end-to-end latency of 89 ms.
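
Minimum time-to-collision (TTC), one of the metrics named in this abstract, is commonly computed as the gap to an object divided by the closing speed. The sketch below shows that standard definition on made-up samples; the function names and values are assumptions for illustration and say nothing about how CooperDrive itself measures TTC.

```python
import math


def time_to_collision(gap_m, closing_speed_mps):
    """Return TTC in seconds; infinite when the gap is not closing."""
    if closing_speed_mps <= 0:
        return math.inf
    return gap_m / closing_speed_mps


def min_ttc(samples):
    """Smallest TTC over a sequence of (gap, closing speed) samples."""
    return min(time_to_collision(gap, speed) for gap, speed in samples)


# Made-up trace: (gap in metres, closing speed in metres per second).
samples = [(40.0, 10.0), (25.0, 10.0), (12.0, 4.0), (10.0, 0.0)]
worst = min_ttc(samples)
```

A larger minimum TTC means the vehicle kept more time in reserve at its closest call, which is why earlier detection of an occluded object tends to raise it.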

100. 【2604.14449】Crowdsourcing of Real-world Image Annotation via Visual Properties

Link: https://arxiv.org/abs/2604.14449

Authors: Xiaolei Diao, Fausto Giunchiglia

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: data-centric artificial intelligence, artificial intelligence highlight, intelligence highlight inherent, highlight inherent limitations, Recent advances

Notes

Abstract:Recent advances in data-centric artificial intelligence highlight inherent limitations in object recognition datasets. One of the primary issues stems from the semantic gap problem, which results in complex many-to-many mappings between visual data and linguistic descriptions. This bias adversely affects performance in computer vision tasks. This paper proposes an image annotation methodology that integrates knowledge representation, natural language processing, and computer vision techniques, aiming to reduce annotator subjectivity by applying visual property constraints. We introduce an interactive crowdsourcing framework that dynamically asks questions based on a predefined object category hierarchy and annotator feedback, guiding image annotation by visual properties. Experiments demonstrate the effectiveness of this methodology, and annotator feedback is discussed to optimize the crowdsourcing setup.

101. 【2604.14433】Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

Link: https://arxiv.org/abs/2604.14433

Authors: Felipe Parodi, Jordan Matelsky, Melanie Segado

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: probe token function, replacing token activations, replacing token, vision transformers, probe token

Notes: 12 pages, 10 figures, to be published in CVPR 2026 HOW Vision Interpretability Workshop Proceedings

Abstract:Zero-ablation (replacing token activations with zero vectors) is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to -36.6 pp classification, -30.9 pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls (mean-substitution, noise-substitution, and cross-image register-shuffling) preserve performance across classification, correspondence, and segmentation, remaining within ~1 pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from [CLS] dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.
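
The contrast between zeroing and mean-substitution can be pictured with a toy calculation: when activations sit far from the origin, an all-zero vector is a much larger displacement than the dataset mean. The sketch below uses made-up 3-dimensional vectors and an L2 distance as an assumed stand-in; the paper itself reports per-patch cosine similarity on real model activations.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy "register" activations from three images (hypothetical values).
registers = [
    [4.0, 1.0, -2.0],
    [5.0, 0.0, -3.0],
    [3.0, 2.0, -1.0],
]

dataset_mean = mean_vector(registers)
zeros = [0.0] * len(dataset_mean)

# How far each replacement moves the second image's register.
original = registers[1]
shift_mean = l2_distance(original, dataset_mean)
shift_zero = l2_distance(original, zeros)
```

In this toy case the zero vector lands several times farther from the original activation than the mean does, which matches the abstract's point that zeroing is the outlier among the replacements.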

102. 【2604.14388】FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images

Link: https://arxiv.org/abs/2604.14388

Authors: Sabab Ishraq(1), Aarushi Aarushi(2), Juncai Jiang(2), Chen Chen(3) ((1) University of Central Florida, College of Engineering and Computer Science, Orlando, FL, USA, (2) University of Central Florida, College of Business Administration, Orlando, FL, USA, (3) University of Central Florida, Institute of Artificial Intelligence, Orlando, FL, USA)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Humans routinely infer, routinely infer taste, routinely infer, phenomenon well studied, food images

Notes

Abstract:Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision-language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces. A large language model generates visual justifications conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision-language benchmark model, to produce both multisensory ratings and grounded explanations directly from food images. This work connects cognitive science findings on cross-sensory perception with modern instruction tuning for multimodal models and shows that many popular evaluation metrics are insufficient for visually sensory inference.

103. 【2604.14379】Step-level Denoising-time Diffusion Alignment with Multiple Objectives

Link: https://arxiv.org/abs/2604.14379

Authors: Qi Zhang, Dawei Wang, Shaofeng Zou

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reinforcement learning, single reward function, typically by optimizing, regularization constraint, human preferences

Notes

Abstract:Reinforcement learning (RL) has emerged as a powerful tool for aligning diffusion models with human preferences, typically by optimizing a single reward function under a KL regularization constraint. In practice, however, human preferences are inherently pluralistic, and aligned models must balance multiple downstream objectives, such as aesthetic quality and text-image consistency. Existing multi-objective approaches either rely on costly multi-objective RL fine-tuning or on fusing separately aligned models at denoising time, but they generally require access to reward values (or their gradients) and/or introduce approximation error in the resulting denoising objectives. In this paper, we revisit the problem of RL fine-tuning for diffusion models and address the intractability of identifying the optimal policy by introducing a step-level RL formulation. Building on this, we further propose Multi-objective Step-level Denoising-time Diffusion Alignment (MSDDA), a retraining-free framework for aligning diffusion models with multiple objectives, obtaining the optimal reverse denoising distribution in closed form, with mean and variance expressed directly in terms of single-objective base models. We prove that this denoising-time objective is exactly equivalent to the step-level RL fine-tuning, introducing no approximation error. Moreover, we provide numerical results, which indicate our method outperforms existing denoising-time approaches.

104. 【2604.14373】SatBLIP: Context Understanding and Feature Identification from Satellite Imagery with Vision-Language Learning

Link: https://arxiv.org/abs/2604.14373

Authors: Xue Wu, Shengting Cao, Jiaqi Gong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: provide limited insight, standard vulnerability indices, Social Vulnerability Index, housing quality, land-surface patterns

Notes

Abstract:Rural environmental risks are shaped by place-based conditions (e.g., housing quality, road access, land-surface patterns), yet standard vulnerability indices are coarse and provide limited insight into risk contexts. We propose SatBLIP, a satellite-specific vision-language framework for rural context understanding and feature identification that predicts county-level Social Vulnerability Index (SVI). SatBLIP addresses limitations of prior remote sensing pipelines (handcrafted features, manual virtual audits, and natural-image-trained VLMs) by coupling contrastive image-text alignment with bootstrapped captioning tailored to satellite semantics. We use GPT-4o to generate structured descriptions of satellite tiles (roof type/condition, house size, yard attributes, greenery, and road context), then fine-tune a satellite-adapted BLIP model to generate captions for unseen images. Captions are encoded with CLIP and fused with LLM-derived embeddings via attention for SVI estimation under spatial aggregation. Using SHAP, we identify salient attributes (e.g., roof form/condition, street width, vegetation, cars/open space) that consistently drive robust predictions, enabling interpretable mapping of rural risk environments.

105. 【2604.14363】The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

Link: https://arxiv.org/abs/2604.14363

Authors: Akshay Paruchuri, Ishan Chatterjee, Henry Fuchs, Ehsan Adeli, Piotr Didyk

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains poorly understood, failure remains poorly, models systematically underperform, poorly understood, systematically underperform

Notes: 29 pages, 9 figures, 19 tables

Abstract:Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than preference-optimized models (+1.5% on average). Our findings suggest that modal competition is structurally localized, correctable at inference time without retraining, and quantifiable as a diagnostic signal to guide future multimodal training.
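
The centroid replacement probe, as described in this abstract, swaps each token representation for its nearest K-means centroid. The sketch below shows only that replacement step on toy 2-dimensional vectors. It assumes the centroids have already been fitted, uses squared Euclidean distance as an assumed metric, and all names and values are hypothetical; which tokens and layers the paper operates on is not reproduced here.

```python
def nearest_centroid(token, centroids):
    """Return the centroid with the smallest squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((t - x) ** 2 for t, x in zip(token, c)),
    )


def centroid_replace(tokens, centroids):
    """Collapse every token onto its nearest centroid."""
    return [list(nearest_centroid(tok, centroids)) for tok in tokens]


# Hypothetical pre-fitted centroids and token vectors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
tokens = [[1.0, -1.0], [9.0, 12.0], [4.0, 4.0]]

replaced = centroid_replace(tokens, centroids)
```

After replacement, tokens that shared a centroid become identical, so any within-cluster detail is erased while the coarse cluster structure survives.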

106. 【2604.14329】Interpretable Human Activity Recognition for Subtle Robbery Detection in Surveillance Videos

Link: https://arxiv.org/abs/2604.14329

Authors: Bryan Jhoan Cazáres Leyva, Ulises Gachuz Davila, José Juan González Fonseca, Juan Irving Vasquez, Vanessa A. Camacho-Vázquez, Sergio Isahí Garrido-Castañeda

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Non-violent street robberies, unconstrained surveillance footage, benign human interactions, Non-violent street, street robberies

Notes: submitted to MCPR

Abstract:Non-violent street robberies (snatch-and-run) are difficult to detect automatically because they are brief, subtle, and often indistinguishable from benign human interactions in unconstrained surveillance footage. This paper presents a hybrid, pose-driven approach for detecting snatch-and-run events that combines real-time perception with an interpretable classification stage suitable for edge deployment. The system uses a YOLO-based pose estimator to extract body keypoints for each tracked person and computes kinematic and interaction features describing hand speed, arm extension, proximity, and relative motion between an aggressor-victim pair. A Random Forest classifier is trained on these descriptors, and a temporal hysteresis filter is applied to stabilize frame-level predictions and reduce spurious alarms. We evaluate the method on a staged dataset and on a disjoint test set collected from internet videos, demonstrating promising generalization across different scenes and camera viewpoints. Finally, we implement the complete pipeline on an NVIDIA Jetson Nano and report real-time performance, supporting the feasibility of proactive, on-device robbery detection.
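
A temporal hysteresis filter of the kind mentioned in this abstract can be sketched with two thresholds: the alarm switches on at a high score and only switches off once the score falls below a lower one. The two-threshold design, the threshold values, and the frame scores below are all assumptions chosen for illustration; the paper's exact filter is not specified in the abstract.

```python
def hysteresis_filter(scores, on_threshold=0.7, off_threshold=0.4):
    """Turn per-frame scores into a stable alarm signal.

    The alarm switches on when a score reaches on_threshold and stays on
    until a score drops below off_threshold.
    """
    alarm = False
    states = []
    for score in scores:
        if not alarm and score >= on_threshold:
            alarm = True
        elif alarm and score < off_threshold:
            alarm = False
        states.append(alarm)
    return states


# Made-up per-frame robbery probabilities from a classifier.
frame_scores = [0.2, 0.75, 0.5, 0.45, 0.3, 0.6, 0.8]
alarm_states = hysteresis_filter(frame_scores)
```

The scores 0.5 and 0.45 keep the alarm on because they stay above the lower threshold, while the later 0.6 does not trigger it because it is below the upper one. This gap is what suppresses flicker in frame-level predictions.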

107. 【2604.14314】DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

Link: https://arxiv.org/abs/2604.14314

Authors: Gabriel Pimenta de Freitas Cardoso, Caio Lucas da Silva Chacon, Jonas Felipe da Fonseca Oliveira, Paulo Henrique de Medeiros Araujo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: specialized small language, jointly optimize transcription, optimize transcription quality, small language models, introduces DharmaOCR Full

Notes

Abstract:This manuscript introduces DharmaOCR Full and Lite, a pair of specialized small language models (SSLMs) for structured OCR that jointly optimize transcription quality, generation stability, and inference cost. It also presents DharmaOCR-Benchmark, a benchmark that covers printed, handwritten, and legal/administrative documents, and proposes a unified evaluation protocol that measures fidelity and structure while explicitly tracking text degeneration as a first-class benchmark metric (alongside unit cost). Beyond reporting degeneration rates, the manuscript empirically shows that degeneration is not merely a quality failure, since it materially worsens production performance by increasing response time, reducing throughput, and inflating computational cost due to abnormally long generations. As a methodological contribution, and to the best of the authors' knowledge, this is the first application of Direct Preference Optimization (DPO) to OCR, explicitly using degenerate generations as rejected examples to penalize looping behavior. Combined with Supervised Fine-Tuning (SFT) for enforcing a strict JSON schema (header, margin, footer, and text), DPO consistently reduces the degeneration rate across model families (up to 87.6% relative) while preserving or improving extraction quality. The resulting models, DharmaOCR Full (7B) and DharmaOCR Lite (3B), set a new state of the art on DharmaOCR-Benchmark, outperforming every evaluated open-source and commercial baseline in extraction quality and reaching scores of 0.925 and 0.911 with degeneration rates of 0.40% and 0.20%, respectively. AWQ quantization reduces per-page cost by up to 22% with negligible quality loss, enabling a strong quality-cost trade-off in comparison to proprietary OCR APIs and open-source alternatives.
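
The strict JSON schema mentioned in this abstract (header, margin, footer, and text) suggests a simple output check. The sketch below is a hypothetical validator: the four field names come from the abstract, while the rules that exactly those four keys must be present and that every value is a string are assumptions made for the example.

```python
import json

# Field names taken from the abstract; everything else is assumed.
REQUIRED_KEYS = {"header", "margin", "footer", "text"}


def is_valid_page(raw):
    """Return True when raw parses to an object with exactly the four
    expected keys and only string values."""
    try:
        page = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(page, dict) or set(page) != REQUIRED_KEYS:
        return False
    return all(isinstance(value, str) for value in page.values())


good = '{"header": "Form 12", "margin": "", "footer": "Page 1", "text": "Body"}'
bad = '{"header": "Form 12", "text": "Body"}'

results = (is_valid_page(good), is_valid_page(bad), is_valid_page("not json"))
```

A check like this catches missing fields and unparseable output. Detecting looping or degenerate text would need a separate measure, which is not sketched here.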

108. 【2604.14302】Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

Link: https://arxiv.org/abs/2604.14302

Authors: Ahmed Bourouis, Savas Ozkan, Andrea Maracani, Yi-Zhe Song, Mete Ozay

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generating geometrically consistent, single freehand sketch, generating geometrically, freehand sketch, geometrically consistent multi-view

Notes

Abstract:We tackle a new problem: generating geometrically consistent multi-view scenes from a single freehand sketch. Freehand sketches are the most geometrically impoverished input one could offer a multi-view generator. They convey scene intent through abstract strokes while introducing spatial distortions that actively conflict with any consistent 3D interpretation. No prior method attempts this; existing multi-view approaches require photographs or text, while sketch-to-3D methods need multiple views or costly per-scene optimisation. We address three compounding challenges (absent training data, the need for geometric reasoning from distorted 2D input, and cross-view consistency) through three mutually reinforcing contributions: (i) a curated dataset of ~9k sketch-to-multiview samples, constructed via an automated generation and filtering pipeline; (ii) Parallel Camera-Aware Attention Adapters (CA3) that inject geometric inductive biases into the video transformer; and (iii) a Sparse Correspondence Supervision Loss (CSL) derived from Structure-from-Motion reconstructions. Our framework synthesizes all views in a single denoising process without requiring reference images, iterative refinement, or per-scene optimization. Our approach significantly outperforms state-of-the-art two-stage baselines, improving realism (FID) by over 60% and geometric consistency (Corr-Acc) by 23%, while providing up to a 3.7× inference speedup.

109. 【2604.14268】HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

Link: https://arxiv.org/abs/2604.14268

Authors: Team HY-World, Chenjie Cao, Xuhui Zuo, Zhenwei Wang, Yisu Zhang, Junta Wu, Zhenyang Liu, Yuning Gong, Yang Liu, Bo Yuan, Chao Zhang, Coopers Li, Dongyuan Guo, Fan Yang, Haiyu Zhang, Hang Cao, Jianchen Zhu, Jiaxin Lin, Jie Xiao, Jihong Zhang, Junlin Yu, Lei Wang, Lifu Wang, Lilin Wang, Linus, Minghui Chen, Peng He, Penghao Zhao, Qi Chen, Rui Chen, Rui Shao, Sicong Liu, Wangchen Qin, Xiaochuan Niu, Xiang Yuan, Yi Sun, Yifei Tang, Yifu Sun, Yihang Lian, Yonghao Tan, Yuhong Liu, Yuyang Yin, Zhiyuan Min, Tengfei Wang, Chunchao Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: prior project HY-World, framework that advances, advances our prior, prior project, project HY-World

Notes: Project Page: [this https URL](https://3d-models.hunyuan.tencent.com/world/) ; Code: [this https URL](https://github.com/Tencent-Hunyuan/HY-World-2.0)

Abstract:We introduce HY-World 2.0, a multi-modal world model framework that advances our prior project HY-World 1.0. HY-World 2.0 accommodates diverse input modalities, including text prompts, single-view images, multi-view images, and videos, and produces 3D world representations. With text or single-view image inputs, the model performs world generation, synthesizing high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes. This is achieved through a four-stage method: a) Panorama Generation with HY-Pano 2.0, b) Trajectory Planning with WorldNav, c) World Expansion with WorldStereo 2.0, and d) World Composition with WorldMirror 2.0. Specifically, we introduce key innovations to enhance panorama fidelity, enable 3D scene understanding and planning, and upgrade WorldStereo, our keyframe-based view generation model with consistent memory. We also upgrade WorldMirror, a feed-forward model for universal 3D prediction, by refining model architecture and learning strategy, enabling world reconstruction from multi-view images or videos. Also, we introduce WorldLens, a high-performance 3DGS rendering platform featuring a flexible engine-agnostic architecture, automatic IBL lighting, efficient collision detection, and training-rendering co-design, enabling interactive exploration of 3D worlds with character support. Extensive experiments demonstrate that HY-World 2.0 achieves state-of-the-art performance on several benchmarks among open-source approaches, delivering results comparable to the closed-source model Marble. We release all model weights, code, and technical details to facilitate reproducibility and support further research on 3D world models.

110. 【2604.14216】Neuro-Oracle: A Trajectory-Aware Agentic RAG Framework for Interpretable Epilepsy Surgical Prognosis

Link: https://arxiv.org/abs/2604.14216

Authors: Aizierjiang Aiersilan, Mohamad Koubeissi

Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: Predicting post-surgical seizure, post-surgical seizure outcomes, Predicting post-surgical, post-surgical seizure, seizure outcomes

Notes

Abstract:Predicting post-surgical seizure outcomes in pharmacoresistant epilepsy is a clinical challenge. Conventional deep-learning approaches operate on static, single-timepoint pre-operative scans, omitting longitudinal morphological changes. We propose Neuro-Oracle, a three-stage framework that: (i) distils pre-to-post-operative MRI changes into a compact 512-dimensional trajectory vector using a 3D Siamese contrastive encoder; (ii) retrieves historically similar surgical trajectories from a population archive via nearest-neighbour search; and (iii) synthesises a natural-language prognosis grounded in the retrieved evidence using a quantized Llama-3-8B reasoning agent. Evaluations are conducted on the public EPISURG dataset (N = 268 longitudinally paired cases) using five-fold stratified cross-validation. Since ground-truth seizure-freedom scores are unavailable, we utilize a clinical proxy label based on the resection type. We acknowledge that the network representations may potentially learn the anatomical features of the resection cavities (i.e., temporal versus non-temporal locations) rather than true prognostic morphometry. Our current evaluation thus serves mainly as a proof-of-concept for the trajectory-aware retrieval architecture. Trajectory-based classifiers achieve AUC values between 0.834 and 0.905, compared with 0.793 for a single-timepoint ResNet-50 baseline. The Neuro-Oracle agent (M5) matches the AUC of purely discriminative trajectory classifiers (0.867) while producing structured justifications with zero observed hallucinations under our audit protocol. A Siamese Diversity Ensemble (M6) of trajectory-space classifiers attains an AUC of 0.905 without language-model overhead.
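
Stage (ii) of the framework, retrieving similar trajectories by nearest-neighbour search, can be sketched as a ranking over stored vectors. The example below is a toy illustration: it uses 3-dimensional vectors instead of the 512-dimensional ones in the abstract, cosine similarity as an assumed metric, and hypothetical case identifiers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, archive, k=2):
    """Return the ids of the k archive entries most similar to the query."""
    ranked = sorted(
        archive,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [case_id for case_id, _ in ranked[:k]]


# Hypothetical archive of (case id, trajectory vector) pairs.
archive = [
    ("case_a", [1.0, 0.0, 0.0]),
    ("case_b", [0.9, 0.1, 0.0]),
    ("case_c", [0.0, 1.0, 0.0]),
]
query = [1.0, 0.05, 0.0]

top = retrieve(query, archive)
```

The retrieved ids would then be passed, together with their records, to whatever component writes the prognosis. A brute-force scan like this is fine for a few hundred cases, which is the scale the abstract reports.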

111. 【2604.14193】QualiaNet: An Experience-Before-Inference Network

Link: https://arxiv.org/abs/2604.14193

Authors: Paul Linton

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)

Keywords: Inference Module, distinct stages, relative to fixation, involves two distinct, depth is extracted

Comments: none

Abstract: Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although our experience of stereo vision does not provide us with distance information, it does affect our inferences about visual scale. We propose that the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.

112. 【2604.14800】Generative Modeling of Complex-Valued Brain MRI Data

Link: https://arxiv.org/abs/2604.14800

Authors: Marco Schlimbach, Moritz Rempe, Jessica Mnischek, Lukas T. Rotkopf, Jens Weingarten, Jens Kleesiek, Kevin Kröninger

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: Magnetic Resonance Imaging, Standard Magnetic Resonance, Resonance Imaging, Magnetic Resonance, MRI

Comments: 16 pages, 8 figures

Abstract:Objective. Standard Magnetic Resonance Imaging (MRI) reconstruction pipelines discard phase information captured during acquisition, despite evidence that it encodes tissue properties relevant to tumor diagnosis. Current machine learning approaches inherit this limitation by operating exclusively on reconstructed magnitude images. The aim of this study is to build a generative framework which is capable of jointly modeling magnitude and phase information of complex-valued MRI scans. Approach. The proposed generative framework combines a conditional variational autoencoder, which compresses complex-valued MRI scans into compact latent representations while preserving phase coherence, with a flow-matching-based generative model. Synthetic sample quality is assessed via a real-versus-synthetic classifier and by training downstream classifiers on synthetic data for abnormal tissue detection. Main results. The autoencoder preserves phase coherence above 0.997. Real-versus-synthetic classification yields low AUROC values between 0.50 and 0.66 across all acquisition sequences, indicating generated samples are nearly indistinguishable from real data. In downstream normal-versus-abnormal classification, classifiers trained entirely on synthetic data achieve an AUROC of 0.880, surpassing the real-data baseline of 0.842 on a publicly available dataset (fastMRI). This advantage persists on an independent external test set from a different institution with biopsy-confirmed labels. Significance. The proposed framework demonstrates the feasibility of jointly modeling magnitude and phase information for normal and abnormal complex-valued brain MRI data. Beyond synthetic data generation, it establishes a foundation for the usage of complete brain MRI information in future diagnostic applications and enables systematic investigation of how magnitude and phase jointly encode pathology-specific features.

113. 【2604.14451】FAIR Universe Weak Lensing ML Uncertainty Challenge: Handling Uncertainties and Distribution Shifts for Precision Cosmology

Link: https://arxiv.org/abs/2604.14451

Authors: Biwei Dai, Po-Wen Chang, Wahid Bhimji, Paolo Calafiura, Ragansu Chakkappai, Yuan-Tang Chou, Sascha Diefenbacher, Jordan Dudley, Ibrahim Elsharkawy, Steven Farrell, Isabelle Guyon, Chris Harris, Elham E Khoda, Benjamin Nachman, David Rousseau, Uroš Seljak, Ihsan Ullah, Yulei Zhang

Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Data Analysis, Statistics and Probability (physics.data-an)

Keywords: background galaxy shapes, weak lensing, Weak gravitational lensing, weak lensing data, Weak Lensing Machine

Comments: Whitepaper for the FAIR Universe Weak Lensing ML Uncertainty Challenge Competition. More info is available at our GitHub repository (https://github.com/FAIR-Universe/Cosmology_Challenge). 13 pages, 5 figures, 1 table

Abstract: Weak gravitational lensing, the correlated distortion of background galaxy shapes by foreground structures, is a powerful probe of the matter distribution in our universe and allows accurate constraints on the cosmological model. In recent years, high-order statistics and machine learning (ML) techniques have been applied to weak lensing data to extract the nonlinear information beyond traditional two-point analysis. However, these methods typically rely on cosmological simulations, which poses several challenges: simulations are computationally expensive, limiting most realistic setups to a low training data regime; inaccurate modeling of systematics in the simulations creates distribution shifts that can bias cosmological parameter constraints; and varying simulation setups across studies make method comparison difficult. To address these difficulties, we present the first weak lensing benchmark dataset with several realistic systematics and launch the FAIR Universe Weak Lensing Machine Learning Uncertainty Challenge. The challenge focuses on measuring the fundamental properties of the universe from weak lensing data with a limited training set and potential distribution shifts, while providing a standardized benchmark for rigorous comparison across methods. Organized in two phases, the challenge will bring together the physics and ML communities to advance the methodologies for handling systematic uncertainties, data efficiency, and distribution shifts in weak lensing analysis with ML, ultimately facilitating the deployment of ML approaches into upcoming weak lensing survey analysis.

114. 【2604.14263】A deep learning framework for glomeruli segmentation with boundary attention

Link: https://arxiv.org/abs/2604.14263

Authors: Behnaz Elhaminia, Catherine King, Jiaqi Lv, Lorraine Harper, Paul Moss, Owen Cain, Dimitrios Chanouzas, Shan E Ahmed Raza

Categories: Tissues and Organs (q-bio.TO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Accurate detection, diagnostic applications, kidney tissue, tissue are essential, essential for diagnostic

Comments: none

Abstract: Accurate detection and segmentation of glomeruli in kidney tissue are essential for diagnostic applications. Traditional deep learning methods primarily rely on semantic segmentation, which often fails to precisely delineate adjacent glomeruli. To address this challenge, we propose a novel glomerulus detection and segmentation model that emphasises boundary separation. Leveraging pathology foundation models, the proposed U-Net-based architecture incorporates a specialised attention decoder designed to highlight critical regions and improve instance-level segmentation. Experimental evaluations demonstrate that our approach surpasses state-of-the-art methods in both Dice score and Intersection over Union, indicating superior performance in glomerular delineation.
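For readers unfamiliar with the two metrics named in the abstract above, here is a minimal sketch of how the Dice score and Intersection over Union (IoU) are computed for a pair of binary masks. The function name `dice_and_iou` and the flattened toy masks are hypothetical and chosen for illustration; the paper's own evaluation code is not shown in the abstract and may differ, for example in how empty masks or multiple instances are handled.

```python
def dice_and_iou(pred, target):
    """Dice score and IoU for two flat binary masks (sequences of 0/1) of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_sum = sum(pred)
    target_sum = sum(target)
    union = pred_sum + target_sum - intersection
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


# Toy 1D "masks": 3 predicted pixels, 3 true pixels, 2 of them shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank a single prediction the same way, but averages over many images can differ, which is one reason papers commonly report both.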

115. 【2603.27118】Quantitative measurements of biological/chemical concentrations using smartphone cameras

Link: https://arxiv.org/abs/2603.27118

Authors: Zhendong Cao, Hongji Dai, Zhida Li, Ash Parameswaran

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Systems and Control (eess.SY)

Keywords: chemical assay samples, chemical assay sample, smartphone-based imaging system, imaging system capable, chemical assay

Comments: none

Abstract: This paper presents a smartphone-based imaging system capable of quantifying the concentration of an assortment of biological/chemical assay samples. The main objective is to construct an image database which characterizes the relationship between color information and concentrations of the biological/chemical assay sample. To this end, a designated optical setup combined with image processing and data analysis techniques was implemented. A series of experiments conducted on selected assays, including fluorescein, RNA Mango, homogenized milk, and yeast, has demonstrated that the proposed system estimates the concentration of fluorescent materials and colloidal mixtures comparably to currently used commercial and laboratory instruments. Furthermore, by utilizing the camera and computational power of smartphones, eventual development can be directed toward extremely compact, inexpensive, and portable analysis and diagnostic systems which will allow experiments and tests to be conducted in remote or impoverished areas.