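This blog post presents the latest list of papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

593 papers were updated today, including:

  • Natural Language Processing: 107
  • Information Retrieval: 35
  • Computer Vision: 102

Natural Language Processing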

1. 【2604.21928】Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Link: https://arxiv.org/abs/2604.21928

Authors: Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu, Mickael Rouvier, Jane Wottawa, Richard Dufour

Categories: Computation and Language (cs.CL)

Keywords: Automatic Speech Recognition, Word Error Rate, Automatic Speech, Speech Recognition, evaluated using Word

Abstract:Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
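
To make concrete why WER is described as insensitive to meaning, here is a minimal sketch of the standard word-level edit-distance definition of WER. The function name and the example sentences are illustrative assumptions and are not taken from the paper or the HATS dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: both hypotheses contain exactly one substitution.
reference = "the flight is not delayed"
meaning_flipped = "the flight is now delayed"
meaning_kept = "a flight is not delayed"

print(word_error_rate(reference, meaning_flipped))
print(word_error_rate(reference, meaning_kept))
```

Both hypotheses receive the same score because each differs from the reference by one word, even though only the first one reverses the meaning. That is the kind of gap semantic or LLM-based evaluation is meant to close.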

2. 【2604.21916】MathDuels: Evaluating LLMs as Problem Posers and Solvers

Link: https://arxiv.org/abs/2604.21916

Authors: Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik

Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: attain near-ceiling performance, static mathematical benchmarks, language models attain, models attain near-ceiling, cast models solely

Abstract:As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models reveal that authoring and solving capabilities are partially decoupled, and that dual-role evaluation reveals capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
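
For readers unfamiliar with the Rasch model mentioned above, the following is a minimal sketch of its standard one-parameter form, in which the probability of a correct answer depends only on the gap between solver ability and problem difficulty. The solver names, problem IDs, and numeric values are hypothetical, and how MathDuels fits the parameters or aggregates difficulties into author quality is not specified here.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """P(correct answer) under the Rasch model: sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_likelihood(outcomes, abilities, difficulties) -> float:
    """Bernoulli log-likelihood of observed (solver, problem, correct) triples."""
    total = 0.0
    for solver, problem, correct in outcomes:
        p = rasch_probability(abilities[solver], difficulties[problem])
        total += math.log(p if correct else 1.0 - p)
    return total


# Hypothetical parameters and outcomes, for illustration only.
abilities = {"model_x": 1.0, "model_y": -0.5}
difficulties = {"p1": 0.0, "p2": 1.5}
outcomes = [
    ("model_x", "p1", True),
    ("model_x", "p2", False),
    ("model_y", "p1", False),
    ("model_y", "p2", False),
]

print(log_likelihood(outcomes, abilities, difficulties))
```

Joint estimation would search for the abilities and difficulties that maximize this log-likelihood. The sketch only evaluates it for fixed values.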

3. 【2604.21911】When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Link: https://arxiv.org/abs/2604.21911

Authors: Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: systems remain vulnerable, large vision-language models, impressive progress, progress in capabilities, capabilities of large

Abstract:Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at this https URL.
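
As background for the preference-optimization step, here is a minimal sketch of the generic DPO (Direct Preference Optimization) loss for a single preference pair. It assumes that the "chosen" response is the grounded one and the "rejected" response is the hallucinated one. The exact objective, data format, and hyperparameters of HalluVL-DPO may differ; the argument names and the `beta` default are illustrative.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Generic DPO loss for one pair, given sequence log-probabilities."""
    # How much more the policy prefers each response than the frozen reference.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities: the policy already leans toward the grounded answer.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The loss falls as the policy raises the likelihood of the grounded response relative to the reference model and lowers that of the hallucinated one.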

4. 【2604.21901】GiVA: Gradient-Informed Bases for Vector-Based Adaptation

Link: https://arxiv.org/abs/2604.21901

Authors: Neeraj Gangwar, Rishabh Deshmukh, Michael Shavlovsky, Hancao Li, Vivek Mittal, Lexing Ying, Nickvash Kani

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: model sizes continue, parameter-efficient fine-tuning, continue to grow, full fine-tuning, model sizes

Comments: Accepted to AISTATS 2026

Abstract:As model sizes continue to grow, parameter-efficient fine-tuning has emerged as a powerful alternative to full fine-tuning. While LoRA is widely adopted among these methods, recent research has explored vector-based adaptation methods due to their extreme parameter efficiency. However, these methods typically require substantially higher ranks than LoRA to match its performance, leading to increased training costs. This work introduces GiVA, a gradient-based initialization strategy for vector-based adaptation. It achieves training times comparable to LoRA and maintains the extreme parameter efficiency of vector-based adaptation. We evaluate GiVA across diverse benchmarks, including natural language understanding, natural language generation, and image classification. Experiments show that our approach consistently outperforms or achieves performance competitive with existing vector-based adaptation methods and LoRA while reducing rank requirements by a factor of eight (8×).

5. 【2604.21897】Mapping the Political Discourse in the Brazilian Chamber of Deputies: A Multi-Faceted Computational Approach

Link: https://arxiv.org/abs/2604.21897

Authors: Flávio Soriano, Victoria F. Mello, Pedro B. Rigueira, Gisele L. Pappa, Wagner Meira Jr., Ana Paula Couto da Silva, Jussara M. Almeida

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: voting records, overlooking the rich, political speech, behavior often rely, rely on voting

Comments: Accepted paper at ICWSM 2026

Abstract:Analyses of legislative behavior often rely on voting records, overlooking the rich semantic and rhetorical content of political speech. In this paper, we ask three complementary questions about parliamentary discourse: how things are said, what is being said, and who is speaking in discursively similar ways. To answer these questions, we introduce a scalable and generalizable computational framework that combines diachronic stylometric analysis, contextual topic modeling, and semantic clustering of deputies' speeches. We apply this framework to a large-scale case study of the Brazilian Chamber of Deputies, using a corpus of over 450,000 speeches from 2003 to 2025. Our results show a long-term stylistic shift toward shorter and more direct speeches, a legislative agenda that reorients sharply in response to national crises, and a granular map of discursive alignments in which regional and gender identities often prove more salient than formal party affiliation. More broadly, this work offers a robust methodology for analyzing parliamentary discourse as a multidimensional phenomenon that complements traditional vote-based approaches.

6. 【2604.21890】EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

Link: https://arxiv.org/abs/2604.21890

Authors: Praval Sharma, Ashok Samal, Leen-Kiat Soh, Deepti Joshi

Categories: Computation and Language (cs.CL)

Keywords: Event extraction identifies, identifies the central, central aspects, Event extraction, Event

Abstract:Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset. We design a systematic annotation pipeline to create the dataset and provide empirical insights into annotation complexity. Using EVENT5Ws, we evaluate state-of-the-art pre-trained large language models and establish a benchmark for future research. We further show that models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, which demonstrates its potential for developing generalizable algorithms. Finally, we summarize the lessons learned during the dataset development and provide recommendations to support future large-scale dataset development.

7. 【2604.21889】TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Link: https://arxiv.org/abs/2604.21889

Authors: Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di, Rui Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large-scale cloud-native services, massive financial losses, diminished user trust, Real-time detection, cloud-native services

Comments: Accepted to ACL 2026 Industry Track

Abstract:Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

8. 【2604.21885】A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

Link: https://arxiv.org/abs/2604.21885

Authors: Praval Sharma

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Event extraction, open-domain event extraction, Event, event extraction approaches, open-domain event

Abstract:Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types, and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. Additionally, they do not explicitly model document-level contextual, structural, and semantic reasoning, which are crucial for effective event extraction but remain challenging for LLMs due to the lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose multimodal open-domain event extraction (MODEE), a novel approach for open-domain event extraction that combines graph-based learning with text-based representation from LLMs to model document-level reasoning. Empirical evaluations on large datasets demonstrate that MODEE outperforms state-of-the-art open-domain event extraction approaches and can be generalized to closed-domain event extraction, where it outperforms existing algorithms.

9. 【2604.21882】Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

Link: https://arxiv.org/abs/2604.21882

Authors: Yuto Nishida, Naoki Shikoda, Yosuke Kishinami, Ryo Fujii, Makoto Morishita, Hidetaka Kamigaito, Taro Watanabe

Categories: Computation and Language (cs.CL)

Keywords: knowledge large language, Understanding what kinds, factual knowledge large, large language models, memorize is essential

Comments: Accepted to ACL 2026 Main

Abstract:Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.

10. 【2604.21871】Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

Link: https://arxiv.org/abs/2604.21871

Authors: Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti, Wenchao Dong, Jaehong Kim, Meeyoung Cha

Categories: Computation and Language (cs.CL)

Keywords: interpersonal relationships, context-dependent and modulated, modulated by interpersonal, predicted human behavior, predicted human

Comments: ACL-Findings 2026

Abstract:Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.

11. 【2604.21794】Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Link: https://arxiv.org/abs/2604.21794

Authors: Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang, Haohan Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: large language models, shown strong performance, complex reasoning tasks, treating inter-agent communication, fixed interface

Comments: Under review at COLM 2026

Abstract:Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.

12. 【2604.21782】SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning

Link: https://arxiv.org/abs/2604.21782

Authors: Hans Ole Hatzel, Ekaterina Artemova, Haimo Paul Stiemer, Evelyn Gius, Chris Biemann

Categories: Computation and Language (cs.CL)

Keywords: narrative representation learning, present the shared, narrative similarity, NSNRL, narrative

Abstract:We present the shared task on narrative similarity and narrative representation learning - NSNRL (pronounced "nass-na-rel"). The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story. We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations. We collected at least two annotations each for more than 1,000 story summary triples, with each annotation being backed by at least two annotators in agreement. This paper describes the sampling and annotation process for the dataset; further, we give an overview of the submitted systems and the techniques they employ. We received a total of 71 final submissions from 46 teams across our two tracks. In our triple-based classification setup, LLM ensembles make up many of the top-scoring systems, while in the embedding setup, systems with pre- and post-processing on pretrained embedding models perform about on par with custom fine-tuned solutions. Our analysis identifies potential headroom for improvement of automated systems in both tracks. The task website includes visualizations of embeddings alongside instance-level classification results for all teams.
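
The triple-based setup can be approached from the embedding side with a very small decision rule: embed the anchor and both candidate stories, then choose the candidate with the higher similarity to the anchor. The sketch below assumes cosine similarity and uses made-up three-dimensional vectors. The task's official scoring and the participants' systems are not reproduced here.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def closer_story(anchor, story_a, story_b) -> str:
    """Return "a" or "b" for the candidate closer to the anchor; ties go to "a"."""
    sim_a = cosine_similarity(anchor, story_a)
    sim_b = cosine_similarity(anchor, story_b)
    return "a" if sim_a >= sim_b else "b"


# Hypothetical story embeddings; real ones would come from an embedding model.
anchor = [0.9, 0.1, 0.0]
story_a = [0.8, 0.2, 0.1]
story_b = [0.0, 0.3, 0.9]

print(closer_story(anchor, story_a, story_b))
```

With this rule, any embedding model can be scored on the same triples as a classifier.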

13. 【2604.21767】Misinformation Span Detection in Videos via Audio Transcripts

Link: https://arxiv.org/abs/2604.21767

Authors: Breno Matos, Rennan C. Lima, Savvas Zannettou, Fabricio Benevenuto, Rodrygo L.T. Santos

Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)

Keywords: yielding severe consequences, public health risks, including political polarization, including online social, misinformation

Comments: Accepted at ICWSM 2026

Abstract:Online misinformation is one of the most challenging issues lately, yielding severe consequences, including political polarization, attacks on democracy, and public health risks. Misinformation manifests in any platform with a large user base, including online social networks and messaging apps. It permeates all media and content forms, including images, text, audio, and video. Distinctly, video-based misinformation represents a multifaceted challenge for fact-checkers, given the ease with which individuals can record and upload videos on various video-sharing platforms. Previous research efforts investigated detecting video-based misinformation, focusing on whether a video shares misinformation or not on a video level. While this approach is useful, it only provides a limited and not easily interpretable view of the problem, given that it does not provide additional context on when misinformation occurs within videos and which content (i.e., claims) is responsible for the video's misinformation nature. In this work, we attempt to bridge this research gap by creating two novel datasets that allow us to explore misinformation detection on videos via audio transcripts, focusing on identifying the spans of videos that are responsible for the video's misinformation claim (misinformation span detection). We transcribe each video's audio to text, identifying the video segment in which the misinformation claims appear, resulting in two datasets of more than 500 videos with over 2,400 segments containing annotated fact-checked claims. Then, we employ classifiers built with state-of-the-art language models, and our results show that we can identify in which part of a video there is misinformation with an F1 score of 0.68. We make our annotated datasets publicly available. We also release all transcripts, audio and videos.

14. 【2604.21766】AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

Link: https://arxiv.org/abs/2604.21766

Authors: Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed, Jordan Lee Boyd-Graber

Categories: Computation and Language (cs.CL)

Keywords: Internet Trivia Authors, Diverse Internet Trivia, Understanding from Diverse, Diverse Internet, surface-level acoustic recognition

Abstract:Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. Human average accuracy of 32.13% shows both the difficulty of the task and meaningful comprehension of the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency and question difficulty, and to expose systematic deficiencies of the models and data.

15. 【2604.21751】Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs

Link: https://arxiv.org/abs/2604.21751

Authors: Joseba Fernandez de Landa, Carla Perez-Almendros, Jose Camacho-Collados

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Western and Anglocentric, Anglocentric viewpoints, amplifying Western, coverage and competence, showing limitations

Abstract:LLMs have been showing limitations when it comes to cultural coverage and competence, and in some cases show regional biases such as amplifying Western and Anglocentric viewpoints. While there have been works analysing the cultural capabilities of LLMs, there has not been specific work on highlighting LLM regional preferences when it comes to culture-related questions. In this work, we propose a new dataset based on a comprehensive taxonomy of Culture-Related Open Questions (CROQ). The results show that, contrary to previous cultural bias work, LLMs show a clear tendency towards countries such as Japan. Moreover, our results show that when prompting in languages such as English or other high-resource ones, LLMs tend to provide more diverse outputs and show less inclination towards answering questions highlighting countries for which the input language is an official language. Finally, we also investigate at which point of LLM training this cultural bias emerges, with our results suggesting that the first clear signs appear after supervised fine-tuning, and not during pre-training.

16. 【2604.21748】StructMem: Structured Memory for Long-Horizon Behavior in LLMs

Link: https://arxiv.org/abs/2604.21748

Authors: Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao, Yuqi Zhu, Lun Du, Shumin Deng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: Long-term conversational agents, multi-hop question answering, Long-term conversational, support temporal reasoning, relationships between events

Comments: Accepted by ACL 2026 main conference

Abstract:Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on LoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems; see this https URL.

17. 【2604.21725】AEL: Agent Evolving Learning for Open-Ended Environments

Link: https://arxiv.org/abs/2604.21725

Authors: Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei, Xi Zhu, Han Zhang, Dimitris N. Metaxas

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Keywords: LLM agents increasingly, open-ended environments spanning, environments spanning hundreds, agents increasingly operate, remain largely stateless

Abstract:LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not what to remember but how to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce Agent Evolving Learning (AEL), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13 ± 0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches. A nine-variant ablation reveals a "less is more" pattern: memory and reflection together produce a 58% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance. This demonstrates that the bottleneck in agent self-improvement is self-diagnosing how to use experience rather than adding architectural complexity. Code and data: this https URL.
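
The fast-timescale component can be pictured with a textbook Beta-Bernoulli Thompson Sampling bandit that chooses one retrieval policy per episode. This is only a sketch under simplifying assumptions: the policy names, the binary success signal, and the success rates below are hypothetical, and the paper's actual reward definition and policy set are not given in the abstract.

```python
import random


class ThompsonPolicySelector:
    """Beta-Bernoulli Thompson Sampling over a fixed set of named policies."""

    def __init__(self, policies, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per policy, stored as [alpha, beta].
        self.posterior = {name: [1.0, 1.0] for name in policies}

    def select(self):
        # Draw one plausible success rate per policy and pick the largest draw.
        draws = {
            name: self.rng.betavariate(alpha, beta)
            for name, (alpha, beta) in self.posterior.items()
        }
        return max(draws, key=draws.get)

    def update(self, name, success):
        # A success increments alpha; a failure increments beta.
        self.posterior[name][0 if success else 1] += 1.0


# Hypothetical retrieval policies with made-up success rates (not from the paper).
true_success_rate = {"recency": 0.3, "similarity": 0.7, "no_memory": 0.4}

selector = ThompsonPolicySelector(true_success_rate, seed=0)
environment = random.Random(1)
counts = {name: 0 for name in true_success_rate}

for _ in range(1000):
    policy = selector.select()
    counts[policy] += 1
    selector.update(policy, environment.random() < true_success_rate[policy])

print(counts)
```

Because each choice is a draw from the posterior, under-explored policies still get occasional trials, while most episodes go to the policy that has worked best so far.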

18. 【2604.21724】Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

Link: https://arxiv.org/abs/2604.21724

Authors: Yilong Chen, Yanxi Xie, Zitian Gao, He Xin, Yihao Xiao, Renbiao Liu, Haoming Luo, Yifan Luo, Zhengmao Ye, Tingwen Liu, Xin Zhao, Ran Tao, Bryan Dai

Categories: Computation and Language (cs.CL)

Keywords: Large token-indexed lookup, poor parameter efficiency, Large token-indexed, compute-decoupled scaling path, rapid memory growth

Comments: 29 pages, 9 figures, 13 tables

Abstract:Large token-indexed lookup tables provide a compute-decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under-training of the long tail, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. To address this, we propose X-GRAM, a frequency-aware dynamic token-injection framework. X-GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n-gram features. These signals are integrated into attention value streams and inter-layer residuals using depth-aware gating, effectively aligning static memory with dynamic context. This design introduces a memory-centric scaling axis that decouples model capacity from FLOPs. Extensive evaluations at the 0.73B and 1.15B scales show that X-GRAM improves average accuracy by as much as 4.4 points over the vanilla backbone and 3.2 points over strong retrieval baselines, while using substantially smaller tables in the 50% configuration. Overall, by decoupling capacity from compute through efficient memory management, X-GRAM offers a scalable and practical paradigm for future memory-augmented architectures. Code is available at this https URL.

19. 【2604.21718】Building a Precise Video Language with Human-AI Oversight

链接https://arxiv.org/abs/2604.21718

作者:Zhiqiu Lin,Chancharik Mitra,Siyuan Cen,Isaac Li,Yuhan Huang,Yu Tong Tiffany Ling,Hewei Wang,Irene Pi,Shihang Zhu,Ryan Rao,George Liu,Jiaxi Li,Ruojin Li,Yili Han,Yilun Du,Deva Ramanan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:dynamic visual world, Video-language models, learn to reason, natural language, world through natural

备注: CVPR 2026 Highlight. Project page: [this https URL](https://linzhiqiu.github.io/papers/chai/)

Abstract:Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: this https URL

20. 【2604.21716】From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

链接https://arxiv.org/abs/2604.21716

作者:Minh Duc Bui,Xenia Heilmann,Mattia Cerrato,Manuel Mager,Katharina von der Wense

类目:Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:Prior work evaluates, reveal solely overt, work evaluates code, evaluates code generation, Prior work

备注: Accepted to ACL 2026 Findings

Abstract:Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including "race" while dropping "favorite color" for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across prompt mitigation strategies, varying numbers of attributes, and different pipeline difficulty levels. Our results challenge simple conditionals as valid proxies for bias evaluation and suggest current benchmarks underestimate bias risk in practical deployments.

21. 【2604.21706】Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers

链接https://arxiv.org/abs/2604.21706

作者:Bernard Muller,Antonio Armando Ortiz Barrañón,LaVonne Roberts

类目:Computation and Language (cs.CL)

关键词:self-supervised speech representations, frozen self-supervised speech, severity assessment based, speech representations, previously introduced

备注: Submitted to Computer Speech & Language

Abstract:We previously introduced a training-free method for dysarthria severity assessment based on d-prime separability of phonological feature subspaces in frozen self-supervised speech representations, validated on 890 speakers across 5 languages with HuBERT-base. Here, we scale the analysis to 3,374 speakers from 25 datasets spanning 12 languages and 5 aetiologies (Parkinson's disease, cerebral palsy, ALS, Down syndrome, and stroke), plus healthy controls, using 6 SSL backbones. We report three findings. First, aetiology-specific degradation profiles are distinguishable at the group level: 10 of 13 features yield large effect sizes (epsilon-squared > 0.14, Holm-corrected p < 0.001), with Parkinson's disease separable from the articulatory execution group at Cohen's d = 0.83; individual-level classification remains limited (22.6% macro F1). Second, profiles show cross-lingual profile-shape stability: cosine similarity of 5-dimensional consonant d-prime profiles exceeds 0.95 across the languages available for each aetiology. Absolute d-prime magnitudes are not cross-lingually calibrated, so the method supports language-independent phenotyping of degradation patterns but requires within-corpus calibration for absolute severity interpretation. Third, the method is architecture-independent: all 6 backbones produce monotonic severity gradients with inter-model agreement exceeding rho = 0.77. Fixed-token d-prime estimation preserves the severity correlation (rho = -0.733 at 200 tokens per class), confirming that the signal is not a token-count artefact. These results support phonological subspace analysis as a robust, training-free framework for aetiology-aware dysarthria characterisation, with evidence of cross-lingual profile-shape stability and cross-backbone robustness in the represented sample.
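
The abstract relies on d-prime separability but does not spell out the formula. As a rough illustration only, the sketch below uses the standard signal-detection definition (difference of means divided by the root mean of the two variances). The function name `d_prime` and the sample values are hypothetical, and the paper's projection of speech representations onto phonological feature subspaces is not reproduced here.

```python
import math
from statistics import mean, variance


def d_prime(class_a, class_b):
    """Standard d-prime between two 1-D samples (an assumption, not the paper's code).

    Each sample needs at least two values so that a sample variance exists.
    """
    # Root mean of the two sample variances acts as the common spread.
    spread = math.sqrt((variance(class_a) + variance(class_b)) / 2.0)
    if spread == 0.0:
        raise ValueError("both samples are constant; d-prime is undefined")
    return (mean(class_a) - mean(class_b)) / spread


# Hypothetical 1-D projections of frames from two phonological classes.
voiced = [1.0, 2.0, 3.0]
voiceless = [4.0, 5.0, 6.0]
separability = abs(d_prime(voiced, voiceless))
```

A larger absolute value means the two classes are easier to tell apart along the chosen direction; the abstract reports that this separability shrinks as dysarthria severity increases.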

22. 【2604.21700】Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

链接https://arxiv.org/abs/2604.21700

作者:Jiali Wei,Ming Fan,Guoheng Sun,Xicheng Zhang,Haijun Wang,Ting Liu

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:raised urgent concerns, large language models, growing application, application of large, large language

备注

Abstract:The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.

23. 【2604.21698】Fixation Sequences as Time Series: A Topological Approach to Dyslexia Detection

链接https://arxiv.org/abs/2604.21698

作者:Marius Huber,David R. Reich,Lena A. Jäger

类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Algebraic Topology (math.AT)

关键词:extracts robust, time series, Persistent homology, features, Copenhagen Corpus

备注: ETRA 2026

Abstract:Persistent homology, a method from topological data analysis, extracts robust, multi-scale features from data. It produces stable representations of time series by applying varying thresholds to their values (a process known as a filtration). We develop novel filtrations for time series and introduce topological methods for the analysis of eye-tracking data, by interpreting fixation sequences as time series, and constructing "hybrid models" that combine topological features with traditional statistical features. We empirically evaluate our method by applying it to the task of dyslexia detection from eye-tracking-while-reading data using the Copenhagen Corpus, which contains scanpaths from dyslexic and non-dyslexic L1 and L2 readers. Our hybrid models outperform existing approaches that rely solely on traditional features, showing that persistent homology captures complementary information encoded in fixation sequences. The strength of these topological features is further underscored by their achieving performance comparable to established baseline methods. Importantly, our proposed filtrations outperform existing ones.

24. 【2604.21667】Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales

链接https://arxiv.org/abs/2604.21667

作者:Olufunke O. Sarumi,Charles Welch,Daniel Braun

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:exploring disaggregated labels, User Passport mechanism, representation-level User Passport, exploring disaggregated, User Passport

备注: Accepted at 5th NLPerspectives Workshop

Abstract:Beyond exploring disaggregated labels for modeling perspectives, annotator rationales provide fine-grained signals of individual perspectives. In this work, we propose a framework for jointly modeling annotator-specific label prediction and corresponding explanations, fine-tuned on the annotators' provided rationales. Using a dataset with disaggregated natural language inference (NLI) annotations and annotator-provided explanations, we condition predictions on both annotator identity and demographic metadata through a representation-level User Passport mechanism. We further introduce two explainer architectures: a post-hoc prompt-based explainer and a prefixed bridge explainer that transfers annotator-conditioned classifier representations directly into a generative model. This design enables explanation generation aligned with individual annotator perspectives. Our results show that incorporating explanation modeling substantially improves predictive performance over a baseline annotator-aware classifier, with the prefixed bridge approach achieving more stable label alignment and higher semantic consistency, while the post-hoc approach yields stronger lexical similarity. These findings indicate that modeling explanations as expressions of fine-grained perspective provides a richer and more faithful representation of disagreement. The proposed approaches advance perspectivist modeling by integrating annotator-specific rationales into both predictive and generative components.

25. 【2604.21649】GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion

链接https://arxiv.org/abs/2604.21649

作者:Qizhuo Xie,Yunhui Liu,Yu Xing,Qianzi Hou,Xudong Jin,Tao Zheng,Tieke He

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:shown immense potential, Knowledge Graph Completion, Large Language Models, LLM tokens remains, continuous graph embeddings

备注: ACL 2026

Abstract:Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS-Quant is grounded in the insight that entity representations should follow a linguistic coarse-to-fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS-Quant significantly outperforms existing text-based and embedding-based baselines. Our code is publicly available at this https URL.

26. 【2604.21637】Multilinguality at the Edge: Developing Language Models for the Global South

链接https://arxiv.org/abs/2604.21637

作者:Lester James V. Miranda,Songbo Hu,Roi Reichart,Anna Korhonen

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:deployed determines, Global South, language models, prevent effective deployment, hardware constrained communities

备注

Abstract:Where and how language models (LMs) are deployed determines who can benefit from them. However, there are several challenges that prevent effective deployment of LMs in non-English-speaking and hardware-constrained communities in the Global South. We call this challenge the last mile: the intersection of multilinguality and edge deployment, where the goals are aligned but the technical requirements often compete. Studying these two fields together is both a need, as linguistically diverse communities often face the most severe infrastructure constraints, and an opportunity, as edge and multilingual NLP research remain largely siloed. To understand the state of the art and the challenges of combining the two areas, we survey 232 papers that tackle this problem across the language modelling pipeline, from data collection to development and deployment. We also discuss open questions and provide actionable recommendations for different stakeholders in the NLP ecosystem. Finally, we hope that this work contributes to the development of inclusive and equitable language technologies.

27. 【2604.21611】Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

链接https://arxiv.org/abs/2604.21611

作者:Hao-Yuan Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:LLM reasoning, Verbal Process Supervision, chain depth, sample breadth, GPQA Diamond

备注

Abstract:Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
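
The iterative generate-critique-refine loop with a round budget R can be pictured with a minimal sketch. Everything named here (`verbal_process_supervision`, the `actor` and `supervisor` callables, the toy stand-ins) is hypothetical. The early stop when the supervisor returns no critique is an assumption, not something the abstract states, and real use would replace the toy callables with model calls.

```python
from typing import Callable, Optional

# Hypothetical signatures: the actor sees the problem plus its previous answer
# and the critique (both None on the first pass); the supervisor returns a
# natural-language critique, or None when it has no objection.
Actor = Callable[[str, Optional[str], Optional[str]], str]
Supervisor = Callable[[str, str], Optional[str]]


def verbal_process_supervision(
    problem: str, actor: Actor, supervisor: Supervisor, max_rounds: int
) -> str:
    """Generate once, then critique and refine for at most max_rounds rounds."""
    answer = actor(problem, None, None)
    for _ in range(max_rounds):
        critique = supervisor(problem, answer)
        if critique is None:  # assumed stopping rule: nothing left to fix
            break
        answer = actor(problem, answer, critique)
    return answer


# Toy stand-ins so the loop can be run without any model.
def toy_actor(problem, previous, critique):
    return "4" if critique else "5"


def toy_supervisor(problem, answer):
    return None if answer == "4" else "2 + 2 is not 5; recheck the addition."


final_answer = verbal_process_supervision("2 + 2 = ?", toy_actor, toy_supervisor, 4)
```

No weights are updated anywhere in the loop, which matches the training-free framing in the abstract; the only cost is the extra supervisor and actor calls per round.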

28. 【2604.21593】Language as a Latent Variable for Reasoning Optimization

链接https://arxiv.org/abs/2604.21593

作者:Linjuan Wu,Haoran Wei,Jialong Tang,Shuang Luo,Baosong Yang,Yongliang Shen,Weiming Lu

类目:Computation and Language (cs.CL)

关键词:reduce English-centric bias, LLMs reduce English-centric, surprising trend emerges, English-centric bias, reduce English-centric

备注: 17 pages, 7 figures, Under review

Abstract:As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and the best performance frequently occurs when language is unconstrained, suggesting that multilinguality broadens the model's latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It generates polyglot preference data online under language-constrained and unconstrained conditions, optimizing the policy with respect to both answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on four English reasoning test sets and by 6.89% on their multilingual benchmark. Remarkably, it is the only method that surpasses the base LLM on an English commonsense reasoning task (4.9%), despite being trained solely on math data, highlighting its strong cross-task generalization. Further analysis reveals that treating language as a latent variable expands the model's latent reasoning space, yielding consistent and generalizable improvements in reasoning performance.

29. 【2604.21590】AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use

链接https://arxiv.org/abs/2604.21590

作者:Yuanjie Lyu,Chengyu Wang,Haonan Zheng,Yuanhao Yue,Junbing Yan,Ming Wang,Jun Huang

类目:Computation and Language (cs.CL)

关键词:Modern industrial applications, increasingly demand language, demand language models, Modern industrial, capable of multi-step

备注

Abstract:Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: this https URL. Data synthesis and RL training code: this https URL. The data synthesis pipeline is also integrated into EasyDistill: this https URL.

30. 【2604.21564】Measuring Opinion Bias and Sycophancy via LLM-based Coercion

链接https://arxiv.org/abs/2604.21564

作者:Rodrigo Nogueira,Giovana Kerche Bonás,Thales Sales Almeida,Andrea Roque,Ramon Pires,Hugo Abonizio,Thiago Laitz,Celio Larcher,Roseval Malaquias Junior,Marcos Piau

类目:Computation and Language (cs.CL)

关键词:Large language models, information people consume, Large language, language models increasingly, models increasingly shape

备注

Abstract:Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side. We propose a method, released as the open-source llm-bias-bench, for discovering the opinions an LLM actually holds on contested topics under conditions that resemble real multi-turn interaction. The method pairs two complementary free-form probes. Direct probing asks for the model's opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion and engages the model in argumentative debate, letting bias leak through how it concedes, resists, or counter-argues. Three user personas (neutral, agree, disagree) collapse into a nine-way behavioral classification that separates persona-independent positions from persona-dependent sycophancy, and an auditable LLM judge produces verdicts with textual evidence. The first instantiation ships 38 topics in Brazilian Portuguese across values, scientific consensus, philosophy, and economic policy. Applied to 13 assistants, the method surfaces findings of practical interest: argumentative debate triggers sycophancy 2-3x more than direct questioning (median 50% to 79%); models that look opinionated under direct questioning often collapse into mirroring under sustained arguments; and attacker capability matters mainly when an existing opinion must be dislodged, not when the assistant starts neutral.

31. 【2604.21555】Finding Meaning in Embeddings: Concept Separation Curves

链接https://arxiv.org/abs/2604.21555

作者:Paul Keuren,Marc Ponsen,Robert Ayoub Bagheri

类目:Computation and Language (cs.CL)

关键词:embedding techniques aim, encode key concepts, Sentence embedding techniques, Concept Separation Curves, vector space

备注: The code is open source and available on GitHub at [this https URL](https://github.com/pkun-cbs/ConceptSeparationCurves). Original conference paper

Abstract:Sentence embedding techniques aim to encode key concepts of a sentence's meaning in a vector space. However, the majority of evaluation approaches for sentence embedding quality rely on the use of additional classifiers or downstream tasks. These additional components make it unclear whether good results stem from the embedding itself or from the classifier's behaviour. In this paper, we propose a novel method for evaluating the effectiveness of sentence embedding methods in capturing sentence-level concepts. Our approach is classifier-independent, allowing for an objective assessment of the model's performance. The approach adopted in this study involves the systematic introduction of syntactic noise and semantic negations into sentences, with the subsequent quantification of their relative effects on the resulting embeddings. The visualisation of these effects is facilitated by Concept Separation Curves, which show the model's capacity to differentiate between conceptual and surface-level variations. By leveraging data from multiple domains, employing both Dutch and English languages, and examining sentence lengths, this study offers a compelling demonstration that Concept Separation Curves provide an interpretable, reproducible, and cross-model approach for evaluating the conceptual stability of sentence embeddings.

32. 【2604.21534】UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text

链接https://arxiv.org/abs/2604.21534

作者:Darya Hryhoryeva,Amaia Zurinaga,Hamidreza Jamalabadi,Iryna Gurevych

类目:Computation and Language (cs.CL)

关键词:paper presents, pairwise Maximum Entropy, task requires modeling, Maximum Entropy, presents our system

备注: Accepted to SemEval 2026 (co-located with ACL 2026)

Abstract:This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. Our system ranked first among participating teams in both Subtask 1 and Subtask 2A based on the official evaluation metric.

33. 【2604.21525】Job Skill Extraction via LLM-Centric Multi-Module Framework

链接https://arxiv.org/abs/2604.21525

作者:Guojing Li(1 and 2),Zichuan Fu(1),Junyi Li(1),Faxue Liu(1),Wenxia Zhou(2),Yejing Wang(1),Jingtong Gao(1),Maolin Wang(1),Rungen Liu(1),Wenlin Zhang(1),Xiangyu Zhao(1) ((1) City University of Hong Kong, (2) Renmin University of China)

类目:Computation and Language (cs.CL)

关键词:Span-level skill extraction, job advertisements underpins, advertisements underpins candidate-job, underpins candidate-job matching, yield malformed spans

备注: 5 pages, 5 figures, 3 tables

Abstract:Span-level skill extraction from job advertisements underpins candidate-job matching and labor-market analytics, yet generative large language models (LLMs) often yield malformed spans, boundary drift, and hallucinations, especially with long-tail terms and cross-domain shift. We present SRICL, an LLM-centric framework that combines semantic retrieval (SR), in-context learning (ICL), and supervised fine-tuning (SFT) with a deterministic verifier. SR pulls in-domain annotated sentences and definitions from ESCO to form format-constrained prompts that stabilize boundaries and handle coordination. SFT aligns output behavior, while the verifier enforces pairing, non-overlap, and BIO legality with minimal retries. On six public span-labeled corpora of job-ad sentences across sectors and languages, SRICL achieves substantial STRICT-F1 improvements over GPT-3.5 prompting baselines and sharply reduces invalid tags and hallucinated spans, enabling dependable sentence-level deployment in low-resource, multi-domain settings.
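
The abstract mentions a deterministic verifier that enforces BIO legality, among other checks. As a small illustration of that one check, the sketch below rejects tag sequences in which an `I-` tag does not continue a span of the same type. The function name `is_legal_bio` and the `SKILL` / `KNOWLEDGE` labels are hypothetical, and the paper's pairing, non-overlap, and retry logic are not covered.

```python
def is_legal_bio(tags):
    """Return True if a tag sequence obeys the usual BIO constraints.

    Legal tags are "O", "B-<TYPE>", and "I-<TYPE>"; an "I-<TYPE>" tag must
    directly follow a "B-<TYPE>" or "I-<TYPE>" tag of the same type.
    """
    previous = "O"
    for tag in tags:
        if tag == "O":
            pass
        elif tag.startswith("B-") and len(tag) > 2:
            pass
        elif tag.startswith("I-") and len(tag) > 2:
            # An inside tag may not open a span or switch span type.
            if previous == "O" or previous[2:] != tag[2:]:
                return False
        else:
            return False  # unknown or malformed tag
        previous = tag
    return True


# Hypothetical model outputs for a three-token sentence.
well_formed = is_legal_bio(["B-SKILL", "I-SKILL", "O"])
dangling_inside = is_legal_bio(["O", "I-SKILL", "O"])
```

A check of this kind is deterministic and cheap, so it can gate each model output and trigger a retry when it fails, which is one plausible reading of "minimal retries" in the abstract.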

34. 【2604.21523】Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

链接https://arxiv.org/abs/2604.21523

作者:Mohammed Safi Ur Rahman Khan,Sanjay Suryanarayanan,Tushar Anand,Mitesh M. Khapra

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Large Vision-Language Models, Large Vision-Language, Vision-Language Models, visual question answering, Evaluator VLMs

备注

Abstract:Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs (with failure rates exceeding 50% in some cases), struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.

35. 【2604.21511】From Tokens to Concepts: Leveraging SAE for SPLADE

链接https://arxiv.org/abs/2604.21511

作者:Yuxuan Zong,Mathias Vast,Basile Van Cooten,Laure Soulier,Benjamin Piwowarski

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:excellent efficiency-effectiveness tradeoff, offer an excellent, efficiency-effectiveness tradeoff, excellent efficiency-effectiveness, Learned Sparse

备注: 11 pages, 3 figures, 9 tables. To appear at SIGIR 2025

Abstract:Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemy and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these two concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models. Our experiments demonstrate that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.

36. 【2604.21510】OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

链接https://arxiv.org/abs/2604.21510

作者:Xinyu Zhang,Boxuan Zhang,Yuchen Wan,Lingling Zhang,YiXing Yao,Bifan Wei,Yaqiang Wu,Jun Liu

类目:Computation and Language (cs.CL)

关键词:Large Language Models, demonstrate remarkable reasoning, Large Language, requiring domain knowledge, tasks remain challenging

备注

Abstract:While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.

37. 【2604.21496】How English Print Media Frames Human-Elephant Conflicts in India

Link: https://arxiv.org/abs/2604.21496

Authors: Bonala Sai Punith,Salveru Jayati,Garima Shakya,Shubham Kumar Nigam

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: expanding human settlements, human settlements force, settlements force elephants, Human-elephant conflict, contact with people

Comments: none

Abstract:Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025. Using a multi-model sentiment framework that combines long-context transformers, large language models, and a domain-specific Negative Elephant Portrayal Lexicon, we quantify sentiment, extract rationale sentences, and identify linguistic patterns that contribute to negative portrayals of elephants. Our findings reveal a dominance of fear-inducing and aggression-related language. Since the media framing can shape public attitudes toward wildlife and conservation policy, such narratives risk reinforcing public hostility and undermining coexistence efforts. By providing a transparent, scalable methodology and releasing all resources through an anonymized repository, this study highlights how Web-scale text analysis can support responsible wildlife reporting and promote socially beneficial media practices.

38. 【2604.21495】Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning

Link: https://arxiv.org/abs/2604.21495

Authors: Hanjun Cho,Gahyun Yoo,Hanseong Kim,Jay-Yoon Lee

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: exhibits high in-domain, high in-domain accuracy, exhibits high, high in-domain, Numerical reasoning

Comments: Accepted to TACL. This is a pre-MIT Press publication version

Abstract:Numerical reasoning over expert-domain tables often exhibits high in-domain accuracy but limited robustness to domain shift. Models trained with supervised fine-tuning (SFT) on specific datasets tend to rely on header-operation shortcuts rather than structural reasoning. We introduce TaNOS, a continual pre-training framework comprising three components: (i) header anonymization to reduce lexical memorization, (ii) operation sketches that provide minimal structural cues, and (iii) self-supervised pretraining that constructs correctness-guaranteed program-question pairs from given tables in a program-first manner. By decoupling domain semantics and numerical operation structure, TaNOS improves the transferability of numerical reasoning. Applied to an 8B instruction-tuned model, TaNOS achieves 80.13% execution accuracy on FinQA with only 10% train data, outperforming SFT baseline (73.97%) with full train data and proprietary models such as GPT-5, Gemini-2.5-Pro. Furthermore, in the domain-shift experiments, TaNOS displays nearly-negligible cross-domain gap (2pp) when standard SFT shows over 10pp gap. These results suggest that structural guidance with operation sketches, header-agnostic representations, and correctness-guaranteed self-supervision can improve the robustness of numerical reasoning across diverse expert-domain tables.

39. 【2604.21481】Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Link: https://arxiv.org/abs/2604.21481

Authors: Srija Anand,Ashwin Sankar,Ishvinder Sethi,Aaditya Pareek,Kartik Rajput,Gaurav Yadav,Nikhil Narasimhan,Adish Pandya,Deepon Halder,Mohammed Safi Ur Rahman Khan,Praveen S V,Shobhit Banga,Mitesh M Khapra

Categories: Computation and Language (cs.CL)

Keywords: Crowdsourced pairwise evaluation, assessing foundation models, Crowdsourced pairwise, scalable approach, approach for assessing

Comments: none

Abstract:Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
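
For readers unfamiliar with the Bradley-Terry step mentioned in this abstract, the following is a minimal sketch of how a leaderboard can be fitted from pairwise preferences. It is not the authors' code: the function name `fit_bradley_terry`, the system labels, and the use of the classic minorization-maximization update are all assumptions made for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard minorization-maximization update. Assumes every
    item has at least one win; otherwise its strength collapses to 0.
    Returns a dict of strengths normalized to sum to 1.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)      # comparisons per unordered pair
    opponents = defaultdict(set)
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        opponents[winner].add(loser)
        opponents[loser].add(winner)

    items = list(opponents)
    strength = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in opponents[i]
            )
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {item: value / total for item, value in updated.items()}
    return strength


# Hypothetical data: system "A" is preferred over "B" in 3 of 4 comparisons.
example = [("A", "B"), ("A", "B"), ("A", "B"), ("B", "A")]
scores = fit_bradley_terry(example)
```

Under this model the probability that system i is preferred over system j is `scores[i] / (scores[i] + scores[j])`, so sorting by strength gives the leaderboard order.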

40. 【2604.21469】Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection

Link: https://arxiv.org/abs/2604.21469

Authors: Fariz Ikhwantri,Dusica Marijan

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: challenging task due, regulatory compliance remains, legal texts, remains a challenging, complexity and variability

Comments: 10 pages, 5 figures, 4 tables. 11th Special Session on Intelligent Data Mining, 2025 IEEE International Conference on Big Data

Abstract:Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis's cross-entropy difference, importance weighting, and embedding-based retrieval. We systematically vary the proportion of selected data to analyse its effect on cross-domain adaptation. Our findings demonstrate that targeted data selection substantially reduces negative transfer, offering a practical path toward scalable and reliable compliance automation across heterogeneous regulations.
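
One of the four selection methods named above, the Moore-Lewis cross-entropy difference, can be sketched in a few lines. This is only an illustration under strong assumptions: it uses add-one-smoothed unigram language models instead of the stronger models a real system would use, and the function names and toy sentences are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one-smoothed unigram model over a fixed vocabulary."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    return {tok: (counts[tok] + 1) / (total + len(vocab)) for tok in vocab}


def cross_entropy(sentence, model):
    """Per-token cross-entropy (in nats) of a sentence under a model."""
    tokens = sentence.split()
    return -sum(math.log(model[tok]) for tok in tokens) / len(tokens)


def moore_lewis_scores(target_domain, source_sample, candidates):
    """Score = H_target(s) - H_source(s); lower means more target-like."""
    vocab = {tok for s in target_domain + source_sample + candidates
             for tok in s.split()}
    target_lm = train_unigram(target_domain, vocab)
    source_lm = train_unigram(source_sample, vocab)
    return [cross_entropy(s, target_lm) - cross_entropy(s, source_lm)
            for s in candidates]


# Toy, made-up sentences purely for illustration.
target = ["the contract shall comply with the regulation",
          "the processor shall notify the controller"]
source = ["the cat sat on the mat", "dogs chase the ball"]
pool = ["the controller shall comply", "the cat chased the ball"]
ml_scores = moore_lewis_scores(target, source, pool)
```

Candidates would then be sorted by score and the lowest-scoring fraction kept as augmentation data, which matches the abstract's idea of varying the proportion of selected data.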

41. 【2604.21454】Reasoning Primitives in Hybrid and Non-Hybrid LLMs

Link: https://arxiv.org/abs/2604.21454

Authors: Shivam Rawat,Lucie Flek,Florian Mai,Nicholas Kluge Corrêa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, monolithic capability, basic operations, large language, observed gains

Comments: none

Abstract:Reasoning in large language models is often treated as a monolithic capability, but its observed gains may arise from more basic operations. We study reasoning through two such primitives, recall and state-tracking, and ask whether hybrid architectures that combine attention-based retrieval with recurrent state updates are better suited than attention-only models for tasks that jointly require both. Using matched Olmo3 transformer and hybrid models in instruction-tuned and reasoning-augmented variants, we evaluate these models on a set of controlled tasks involving a mixture of state-tracking and recall primitives, state-based recall. Across tasks, we notice that reasoning augmentation provides the largest overall improvement, substantially extending the range of difficulty over which models remain effective. We also notice that in certain tasks, the hybrid reasoning model remains substantially more robust as sequential dependence increases. In contrast, the transformer reasoning model degrades sharply in performance as task difficulty increases beyond a given threshold. These results suggest that reasoning tokens and architectural inductive biases contribute at different levels of the computational process: explicit reasoning can expand a model's effective operating range, but its benefit depends on how well the underlying architecture supports persistent state propagation. Given the small size of our case study, which involves a limited set of models and tasks, we present these findings as suggestive rather than conclusive and leave broader validation across model families, scales, and task variations to future work.

42. 【2604.21446】AI-Gram: When Visual Agents Interact in a Social Network

Link: https://arxiv.org/abs/2604.21446

Authors: Andrew Shin

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)

Keywords: enabling image-based interactions, live platform enabling, platform enabling image-based, fully autonomous multi-agent, image-based interactions

Comments: none

Abstract:We present AI-Gram, a live platform enabling image-based interactions, to study social dynamics in a fully autonomous multi-agent visual network where all participants are LLM-driven agents. Using the platform, we conduct experiments on how agents communicate and adapt through visual media, and observe the spontaneous emergence of visual reply chains, indicating rich communicative structure. At the same time, agents exhibit aesthetic sovereignty resisting stylistic convergence toward social partners, anchoring under adversarial influence, and a decoupling between visual similarity and social ties. These results reveal a fundamental asymmetry in current agent architectures: strong expressive communication paired with a steadfast preservation of individual visual identity. We release AI-Gram as a publicly accessible, continuously evolving platform for studying social dynamics in AI-native multi-agent systems.

43. 【2604.21428】Decoupled DiLoCo for Resilient Distributed Pre-training

Link: https://arxiv.org/abs/2604.21428

Authors: Arthur Douillard,Keith Rush,Yani Donchev,Zachary Charles,Nova Fallen,Ayush Dubey,Ionel Gog,Josef Dean,Blake Woodworth,Zachary Garrett,Nate Keating,Jenny Bishop,Henry Prior,Edouard Yvinec,Arthur Szlam,Marc'Aurelio Ranzato,Jeff Dean

Categories: Computation and Language (cs.CL)

Keywords: Modern large-scale language, pre-training relies heavily, requires tight coupling, Modern large-scale, program multiple data

Comments: none

Abstract:Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, which requires tight coupling across accelerators. Due to this coupling, transient slowdowns, hardware failures, and synchronization overhead stall the entire computation, wasting significant compute time at scale. While recent distributed methods like DiLoCo reduced communication bandwidth, they remained fundamentally synchronous and vulnerable to these system stalls. To address this, we introduce Decoupled DiLoCo, an evolution of the DiLoCo framework designed to break the lock-step synchronization barrier and go beyond SPMD to maximize training goodput. Decoupled DiLoCo partitions compute across multiple independent "learners" that execute local inner optimization steps. These learners asynchronously communicate parameter fragments to a central synchronizer, which circumvents failed or straggling learners by aggregating updates using a minimum quorum, an adaptive grace window, and dynamic token-weighted merging. Inspired by "chaos engineering", we achieve significantly improved training efficiency in failure-prone environments with millions of simulated chips with strictly zero global downtime, while maintaining competitive model performance across text and vision tasks, for both dense and mixture-of-expert architectures.

44. 【2604.21421】Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation

Link: https://arxiv.org/abs/2604.21421

Authors: Michele Miranda,Xinlan Yan,Nishant Mishra,Rachel Murphy,Ameen Abu-Hanna,Sébastien Bratières,Iacer Calixto

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: GDPR and HIPAA, Protecting patient privacy, Protecting patient, narratives is essential, essential for enabling

Comments: none

Abstract:Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. While manual de-identification remains the gold standard, it is costly and slow, motivating the need for automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines employed named entity recognition (NER) to identify protected entities for redaction. Although methods based on differential privacy (DP) provide formal privacy guarantees, more recently also large language models (LLMs) are increasingly used for text de-identification in the clinical domain. In this work, we present the first comparative study of DP, NER, and LLMs for Dutch clinical text de-identification. We investigate these methods separately as well as hybrid strategies that apply NER or LLM preprocessing prior to DP, and assess performance in terms of privacy leakage and extrinsic evaluation (entity and relation classification). We show that DP mechanisms alone degrade utility substantially, but combining them with linguistic preprocessing, especially LLM-based redaction, significantly improves the privacy-utility trade-off.

45. 【2604.21380】Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation

Link: https://arxiv.org/abs/2604.21380

Authors: Wang Shi Hai,Chen Tao

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: software performance requirements, software engineering, performance requirements, natural language, documented in natural

Comments: 9 pages, accepted by ACL 2026

Abstract:Since software performance requirements are documented in natural language, quantifying them into mathematical forms is essential for software engineering. Yet, the vagueness in performance requirements and uncertainty of human cognition have caused highly uncertain ambiguity in the interpretations, rendering their automated quantification an unaddressed and challenging problem. In this paper, we formalize the problem and propose IRAP, an approach that quantifies performance requirements into mathematical functions via interactive retrieval-augmented preference elicitation. IRAP differs from the others in that it explicitly derives from problem-specific knowledge to retrieve and reason the preferences, which also guides the progressive interaction with stakeholders, while reducing the cognitive overhead. Experiment results against 10 state-of-the-art methods on four real-world datasets demonstrate the superiority of IRAP on all cases with up to 40x improvements under as few as five rounds of interactions.

46. 【2604.21375】VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Link: https://arxiv.org/abs/2604.21375

Authors: Qijun Han,Haoqin Tu,Zijun Wang,Haoyue Dai,Yiyang Zhou,Nancy Lau,Alvaro A. Cardenas,Yuhui Xu,Ran Xu,Caiming Xiong,Zeyu Zheng,Huaxiu Yao,Yuyin Zhou,Cihang Xie

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Keywords: Autonomous GUI agents, Autonomous GUI, GUI agents face, agents prematurely declare, prematurely declare success

Comments: The first two authors contribute equally

Abstract:Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.

47. 【2604.21370】MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization

Link: https://arxiv.org/abs/2604.21370

Authors: Maziar Kianimoghadam Jouneghani

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: multilingual polarization detection, contrasting multilingual generalists, present a systematic, systematic study, polarization detection

Comments: 9 pages, 9 tables. Accepted to the 20th International Workshop on Semantic Evaluation (SemEval-2026), Task 9

Abstract:We present a systematic study of multilingual polarization detection across 22 languages for SemEval-2026 Task 9 (Subtask 1), contrasting multilingual generalists with language-specific specialists and hybrid ensembles. While a standard generalist like XLM-RoBERTa suffices when its tokenizer aligns with the target text, it may struggle with distinct scripts (e.g., Khmer, Odia) where monolingual specialists yield significant gains. Rather than enforcing a single universal architecture, we adopt a language-adaptive framework that switches between multilingual generalists, language-specific specialists, and hybrid ensembles based on development performance. Additionally, cross-lingual augmentation via NLLB-200 yielded mixed results, often underperforming native architecture selection and degrading morphologically rich tracks. Our final system achieves an overall macro-averaged F1 score of 0.796 and an average accuracy of 0.826 across all 22 tracks. Code and final test predictions are publicly available at: this https URL.

48. 【2604.21365】mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code

Link: https://arxiv.org/abs/2604.21365

Authors: Adam Skurla,Dominik Macko,Jakub Simko

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: Multi-domain detection, challenging task, programming languages, machine-generated code snippets, Multi-domain

Comments: none

Abstract:Multi-domain detection of the machine-generated code snippets in various programming languages is a challenging task. SemEval-2026 Task 13 copes with this challenge in various angles, as a binary detection problem as well as attribution of the source. Specifically, its subtasks also cover generator LLM family detection, as well as a hybrid code co-generated by humans and machines, or adversarially modified codes hiding its origin. Our submitted systems adjusted the existing mdok approach (focused on machine-generated text detection) to these specific kinds of problems by exploring various base models, more suitable for code understanding. The results indicate that the submitted systems are competitive in all three subtasks. However, the margins from the top-performing systems are significant, and thus further improvements are possible.

49. 【2604.21357】ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

Link: https://arxiv.org/abs/2604.21357

Authors: Jian Cui,Zhiyuan Ren,Desheng Weng,Yongqi Zhao,Gong Wenbin,Yu Lei,Zhenning Dong

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: including workflow complexity, traditional multi-stage approaches, vector similarity retrieval, geographic knowledge bases, structured geographic knowledge

Comments: 12 pages, 8 figures, submitted to ACM SIGSPATIAL 2024 (under review)

Abstract:This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating the coordinate prediction task as a text generation problem, and introduces a Chain-of-Thought mechanism to enhance the model's reasoning over spatial relationships. Furthermore, reinforcement learning with a distance-deviation-based reward is applied to optimize the generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point predictions and effectively resolve vague relative location queries. In addition, the model demonstrates strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.
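
The abstract's idea of turning coordinates into a geohash string, so that coordinate prediction becomes text generation, can be illustrated with the standard geohash encoding. The sketch below implements the public geohash scheme only; how ReaGeo actually tokenizes or truncates the sequence is not stated in the abstract, and the function name and default precision are assumptions.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string.

    Bits alternate between longitude and latitude, starting with
    longitude; each bit halves the remaining interval. Every 5 bits
    become one base-32 character.
    """
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, value = 0, 0
    use_lon = True
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1   # upper half: emit 1
            rng[0] = mid
        else:
            value <<= 1                # lower half: emit 0
            rng[1] = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


# A shorter hash is always a prefix of a longer one for the same point,
# which is what makes character-by-character generation coarse-to-fine.
coarse = geohash_encode(57.64911, 10.40744, precision=2)
fine = geohash_encode(57.64911, 10.40744, precision=8)
```

Because each additional character narrows the cell, a model that generates the string left to right is effectively refining its location estimate step by step.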

50. 【2604.21352】CARE: Counselor-Aligned Response Engine for Online Mental-Health Support

Link: https://arxiv.org/abs/2604.21352

Authors: Hagai Astrin,Ayal Swaid,Avi Segal,Kobi Gal

Categories: Computation and Language (cs.CL)

Keywords: Mental health challenges, increasing worldwide, challenges are increasing, services and leading, Mental health

Comments: 9 pages, 4 figures

Abstract:Mental health challenges are increasing worldwide, straining emotional support services and leading to counselor overload. This can result in delayed responses during critical situations, such as suicidal ideation, where timely intervention is essential. While large language models (LLMs) have shown strong generative capabilities, their application in low-resource languages, especially in sensitive domains like mental health, remains underexplored. Furthermore, existing LLM-based agents often struggle to replicate the supportive language and intervention strategies used by professionals due to a lack of training on large-scale, real-world datasets. To address this, we propose CARE (Counselor-Aligned Response Engine), a GenAI framework that assists counselors by generating real-time, psychologically aligned response recommendations. CARE fine-tunes open-source LLMs separately for Hebrew and Arabic using curated subsets of real-world crisis conversations. The training data consists of sessions rated as highly effective by professional counselors, enabling the models to capture interaction patterns associated with successful de-escalation. By training on complete conversation histories, CARE maintains the evolving emotional context and dynamic structure of counselor-help-seeker dialogue. In experimental settings, CARE demonstrates stronger semantic and strategic alignment with gold-standard counselor responses compared to non-specialized LLMs. These findings suggest that domain-specific fine-tuning on expert-validated data can significantly support counselor workflows and improve care quality in low-resource language contexts.

51. 【2604.21346】Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning

Link: https://arxiv.org/abs/2604.21346

Authors: Mohit Vaishnav,Tanel Tammet

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: main bottleneck lies, Bongard problems, language models, large language models, raising the question

Comments: none

Abstract:Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential-Grammatical (C-G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.

52. 【2604.21345】Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

Link: https://arxiv.org/abs/2604.21345

Authors: Philip Zhong,Don Wang,Jason Zhang,Kent Chen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: reusable evaluation pipeline, public artifact package, artifact package derived, Dataset Pipeline, summaries and released

Comments: AI Application Feature Quality Evaluation (28 pages total)

Abstract:We present a reusable evaluation pipeline for generative AI applications, instantiated for AI meeting summaries and released with a public artifact package derived from a Dataset Pipeline. The system separates reusable orchestration from task-specific semantics across five stages: source intake, structured reference construction, candidate generation, structured scoring, and reporting. Unlike standalone claim scorers, it treats both ground truth and evaluator outputs as typed, persisted artifacts, enabling aggregation, issue analysis, and statistical testing. We benchmark the offline loop on a typed dataset of 114 meetings spanning city_council, private_data, and whitehouse_press_briefings, producing 340 meeting-model pairs and 680 judge runs across gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this protocol, gpt-4.1-mini achieves the highest mean accuracy (0.583), while gpt-5.1 leads in completeness (0.886) and coverage (0.942). Paired sign tests with Holm correction show no significant accuracy winner but confirm significant retention gains for gpt-5.1. A typed DeepEval contrastive baseline preserves retention ordering but reports higher holistic accuracy, suggesting that reference-based scoring may overlook unsupported-specifics errors captured by claim-grounded evaluation. Typed analysis identifies whitehouse_press_briefings as an accuracy-challenging domain with frequent unsupported specifics. A deployment follow-up shows gpt-5.4 outperforming gpt-4.1 across all metrics, with statistically robust gains on retention metrics under the same protocol. The system benchmarks the offline loop and documents, but does not quantitatively evaluate, the online feedback-to-evaluation path.
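
The statistical step named in this abstract, paired sign tests with Holm correction, is easy to reproduce in outline. The sketch below is a generic implementation of those two textbook procedures, not the paper's pipeline; the function names and the win/loss counts are hypothetical.

```python
from math import comb


def sign_test_p(wins, losses):
    """Two-sided exact sign test p-value; ties are assumed to be dropped."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical per-meeting outcomes for three model pairs on one metric:
# (meetings where the first model scored higher, meetings where it scored lower).
pair_outcomes = [(30, 10), (22, 18), (35, 5)]
raw_p = [sign_test_p(w, l) for w, l in pair_outcomes]
adjusted_p = holm_adjust(raw_p)
```

A pairwise difference would be reported as significant only if its adjusted p-value falls below the chosen threshold, which is how a result like "no significant accuracy winner" can coexist with visible gaps in mean scores.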

53. 【2604.21344】Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts

Link: https://arxiv.org/abs/2604.21344

Authors: Azher Ahmed Efat,Seok Hwan Song,Wallapak Tavanapong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: present complex information, complex information, present complex, Multimodal Language Models, multiple related charts

Comments: none

Abstract:Charts are widely used to present complex information. Deriving meaningful insights in real-world contexts often requires interpreting multiple related charts together. Research on understanding multi-chart images has not been extensively explored. We introduce PolyChartQA, a mid-scale dataset specifically designed for question answering over multi-chart images. PolyChartQA comprises 534 multi-chart images (with a total of 2,297 sub-charts) sourced from peer-reviewed computer science research publications and 2,694 QA pairs. We evaluate the performance of nine state-of-the-art Multimodal Language Models (MLMs) on PolyChartQA across question type, difficulty, question source, and key structural characteristics of multi-charts. Our results show a 27.4% LLM-based accuracy (L-Accuracy) drop on human-authored questions compared to MLM-generated questions, and a 5.39% L-accuracy gain with our proposed prompting method.

54. 【2604.21335】Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

Link: https://arxiv.org/abs/2604.21335

Authors: Wei Jiang,Wei Wang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: finer control axis, prior work, offers a finer, finer control, control axis

Comments: 16 pages, 14 tables, 2 figures

Abstract:Sub-token routing offers a finer control axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. The motivation is that a relevant token need not be internally uniform: under a retention budget, preserved value groups are distributed unevenly both across tokens and within tokens, which suggests that KV compression need not be an all-or-nothing decision at token level. We study this fine-grained routing mechanism in two settings. For compression-aware language modeling, we introduce a query-independent design that combines routed subspace LoRA with value-group routing on the KV path. For downstream-task-preserving KV compression, we introduce a query-aware design in which a predictor-based selector allocates a global retention budget over context-token/value-group pairs using query-conditioned relevance. Experiments show that the query-independent design improves the quality-compression tradeoff for language modeling, while the query-aware design preserves downstream behavior under reduced KV budgets. We further examine the relation between token-level and sub-token-level query-aware routing, and show that they form complementary compression axes: token-level methods determine which tokens survive globally, while sub-token routing determines how the surviving tokens are compressed internally.

55. 【2604.21334】Ideological Bias in LLMs' Economic Causal Reasoning

Link: https://arxiv.org/abs/2604.21334

Authors: Donggyu Lee, Hyeok Yun, Jungwon Kim, Junsik Min, Sungwon Park, Sangyoon Park, Jihee Kim

Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG); General Economics (econ.GN)

Keywords: large language models, large language, bias when reasoning, exhibit systematic ideological, systematic ideological bias

Abstract:Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases - instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.

56. 【2604.21327】Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Link: https://arxiv.org/abs/2604.21327

Authors: Yongcan Yu, Lingxiao He, Jian Liang, Kuangpu Guo, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Test-time reinforcement learning, reinforcement learning, time via pseudo-labeling, leaving it vulnerable, Denoised test-time Reinforcement

Comments: Accepted to ACL 2026 Findings

Abstract:Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can even be amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released via the link on the arXiv abstract page.
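
To make the amplification effect concrete, here is a rough illustration, not the authors' implementation. It assumes majority-vote pseudo-labels, 0/1 rewards, and a GRPO-style mean/std normalisation; the ±1 "fixed" advantages and the `AMBIGUITY_BAND` thresholds are hypothetical stand-ins for the paper's debiasing and sampling rules.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Tuple


def pseudo_label(answers: List[str]) -> Tuple[str, float]:
    """Majority-vote pseudo-label and its consistency (vote share)."""
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style normalisation: (r - mean) / std within the sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def fixed_advantages(rewards: List[float]) -> List[float]:
    """Assumed debiased variant: +1 if the response matches the label, else -1."""
    return [1.0 if r > 0 else -1.0 for r in rewards]


# Eight sampled answers to one prompt; only 3 of 8 agree on the plurality answer.
answers = ["12", "12", "7", "9", "15", "3", "8", "12"]
label, consistency = pseudo_label(answers)
rewards = [1.0 if a == label else 0.0 for a in answers]
grp = group_relative_advantages(rewards)
fix = fixed_advantages(rewards)

AMBIGUITY_BAND = (0.3, 0.7)  # hypothetical thresholds, not taken from the paper
is_ambiguous = AMBIGUITY_BAND[0] <= consistency <= AMBIGUITY_BAND[1]
print(label, consistency, round(grp[0], 3), fix[0], is_ambiguous)
```

Under this normalisation a response matching the pseudo-label gets advantage sqrt((1 - p) / p) for vote share p, which exceeds 1 whenever p is below one half. A weakly supported and possibly wrong label is therefore pushed harder than a fixed +1 would push it.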

57. 【2604.21309】When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation

Link: https://arxiv.org/abs/2604.21309

Authors: Nannan Huang, Iffat Maab, Junichi Yamagishi

Categories: Computation and Language (cs.CL)

Keywords: processing vast daily, political perspectives critical, daily news content, diverse political perspectives, Multi-document news summarisation

Comments: Accepted to ACL 2026 Main Conference

Abstract:Multi-document news summarisation systems are increasingly adopted for their convenience in processing vast daily news content, making fairness across diverse political perspectives critical. However, these systems can exhibit political bias through unequal representation of viewpoints, disproportionate emphasis on certain perspectives, and systematic underrepresentation of minority voices. This study presents a comprehensive evaluation of such bias in multi-document news summarisation using FairNews, a dataset of complete news articles with political orientation labels, examining how large language models (LLMs) handle sources with varying political leanings across 13 models and five fairness metrics. We investigate both baseline model performance and effectiveness of various debiasing interventions, including prompt-based and judge-based approaches. Our findings challenge the assumption that larger models yield fairer outputs, as mid-sized variants consistently outperform their larger counterparts, offering the best balance of fairness and efficiency. Prompt-based debiasing proves highly model dependent, while entity sentiment emerges as the most stubborn fairness dimension, resisting all intervention strategies tested. These results demonstrate that fairness in multi-document news summarisation requires multi-dimensional evaluation frameworks and targeted, architecture-aware debiasing rather than simply scaling up.

58. 【2604.21308】CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents

Link: https://arxiv.org/abs/2604.21308

Authors: Wenjie Fu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Lukas Wutschitz, Robert Sim, Saravan Rajmohan, Dongmei Zhang

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: improve workplace productivity, dramatically improve workplace, Enterprise LLM agents, LLM agents, workplace productivity

Abstract:Enterprise LLM agents can dramatically improve workplace productivity, but their core capability, retrieving and using internal context to act on a user's behalf, also creates new risks for sensitive information leakage. We introduce CI-Work, a Contextual Integrity (CI)-grounded benchmark that simulates enterprise workflows across five information-flow directions and evaluates whether agents can convey essential content while withholding sensitive context in dense retrieval settings. Our evaluation of frontier models reveals that privacy failures are prevalent (violation rates range from 15.8% to 50.9%, with leakage reaching up to 26.7%) and uncovers a counterintuitive trade-off critical for industrial deployment: higher task utility often correlates with increased privacy violations. Moreover, the massive scale of enterprise data and potential user behavior further amplify this vulnerability. Simply increasing model size or reasoning depth fails to address the problem. We conclude that safeguarding enterprise workflows requires a paradigm shift, moving beyond model-centric scaling toward context-centric architectures.

59. 【2604.21300】Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI

Link: https://arxiv.org/abs/2604.21300

Authors: Hieu Man, Van-Cuong Pham, Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Variational Autoencoder, Authorship Variational Autoencoder, Explainable Authorship Variational, Learning robust representations, EAVAE

Abstract:Learning robust representations of authorial style is crucial for authorship attribution and AI-generated text detection. However, existing methods often struggle with content-style entanglement, where models learn spurious correlations between authors' writing styles and topics, leading to poor generalization across domains. To address this challenge, we propose Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation-by-design. EAVAE first pretrains style encoders using supervised contrastive learning on diverse authorship data, then finetunes with a Variational Autoencoder (VAE) architecture using separate encoders for style and content representations. Disentanglement is enforced through a novel discriminator that not only distinguishes whether pairs of style/content representations belong to the same or different authors/content sources, but also generates natural language explanations for its decisions, simultaneously mitigating confounding information and enhancing interpretability. Extensive experiments demonstrate the effectiveness of EAVAE. On authorship attribution, we achieve state-of-the-art performance on various datasets, including Amazon Reviews, PAN21, and HRS. For AI-generated text detection, EAVAE excels in few-shot learning over the M4 dataset. Code and data repositories are available online via the links on the arXiv abstract page.

60. 【2604.21286】Cross-Entropy Is Load-Bearing: A Pre-Registered Scope Test of the K-Way Energy Probe on Bidirectional Predictive Coding

Link: https://arxiv.org/abs/2604.21286

Authors: Jon-Paul Cacioli

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: K-way energy probe, discriminative predictive coding, predictive coding networks, coding networks reduces, standard discriminative predictive

Comments: 11 pages, 3 figures, 4 tables. Pre-registered on OSF ([this https URL](https://osf.io/2kvsp)). Code at [this https URL](https://github.com/synthiumjp/ima)

Abstract:Cacioli (2026) showed that the K-way energy probe on standard discriminative predictive coding networks reduces approximately to a monotone function of the log-softmax margin. The reduction rests on five assumptions, including cross-entropy (CE) at the output and effectively feedforward inference dynamics. This pre-registered study tests the reduction's sensitivity to CE removal using two conditions: standard PC trained with MSE instead of CE, and bidirectional PC (bPC; Oliviers, Tang & Bogacz, 2025). Across 10 seeds on CIFAR-10 with a matched 2.1M-parameter backbone, we find three results. The negative result replicates on standard PC: the probe sits below softmax (Delta = -0.082, p < 10^-6). On bPC the probe exceeds softmax across all 10 seeds (Delta = +0.008, p = 0.000027), though a pre-registered manipulation check shows that bPC does not produce materially greater latent movement than standard PC at this scale (ratio 1.6, threshold 10). Removing CE alone without changing inference dynamics halves the probe-softmax gap (Delta_MSE = -0.037 vs Delta_stdPC = -0.082). CE is a major empirically load-bearing component of the decomposition at this scale. CE training produces output logit norms approximately 15x larger than MSE or bPC training. A post-hoc temperature scaling ablation decomposes the probe-softmax gap into two components: approximately 66% is attributable to logit-scale effects removable by temperature rescaling, and approximately 34% reflects a scale-invariant ranking advantage of CE-trained representations. We use "metacognitive" operationally to denote Type-2 discrimination of a readout over its own Type-1 correctness, not to imply human-like introspective access.
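
The distinction between logit-scale effects and scale-invariant ranking can be seen in a small sketch. It assumes the "log-softmax margin" is the gap between the top two log-probabilities, which is a guess at the paper's definition; the logits and the 15x factor applied to them are illustrative only.

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def margin(logits: List[float], temperature: float = 1.0) -> float:
    """Gap between the largest and second-largest log-probability."""
    top = sorted(log_softmax(logits, temperature), reverse=True)
    return top[0] - top[1]


small = [2.0, 1.0, 0.5]            # hypothetical logits from a low-norm readout
large = [15.0 * z for z in small]  # same ranking, 15x larger norm

m_small = margin(small)
m_large = margin(large)
m_rescaled = margin(large, temperature=15.0)
print(m_small, m_large, m_rescaled)
```

Scaling the logits multiplies the margin by the same factor, and dividing by a matching temperature undoes it, while the class ranking never changes. Any gap that survives such rescaling has to come from ranking rather than from scale.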

61. 【2604.21284】Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture

Link: https://arxiv.org/abs/2604.21284

Authors: Robin Dey, Panyanon Viradecha

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: large language models, requiring any LLM, LLM inference, organize long-term memory, method of loci

Comments: 20 pages, 10 tables. Code and data at [this https URL](https://github.com/web3guru888/mempalace-scientific-analysis)

Abstract:MemPalace is an open-source AI memory system that applies the ancient method of loci (memory palace) spatial metaphor to organize long-term memory for large language models; launched in April 2026, it accumulated over 47,000 GitHub stars in its first two weeks and claims state-of-the-art retrieval performance on the LongMemEval benchmark (96.6% Recall@5) without requiring any LLM inference at write time. Through independent codebase analysis, benchmark replication, and comparison with competing systems, we find that MemPalace's headline retrieval performance is attributable primarily to its verbatim storage philosophy combined with ChromaDB's default embedding model (all-MiniLM-L6-v2), rather than to its spatial organizational metaphor per se -- the palace hierarchy (Wings-Rooms-Closets-Drawers) operates as standard vector database metadata filtering, an effective but well-established technique. However, MemPalace makes several genuinely novel contributions: (1) a contrarian verbatim-first storage philosophy that challenges extraction-based competitors, (2) an extremely low wake-up cost (approximately 170 tokens) through its four-layer memory stack, (3) a fully deterministic, zero-LLM write path enabling offline operation at zero API cost, and (4) the first systematic application of spatial memory metaphors as an organizing principle for AI memory systems. We also note that the competitive landscape is evolving rapidly, with Mem0's April 2026 token-efficient algorithm raising their LongMemEval score from approximately 49% to 93.4%, narrowing the gap between extraction-based and verbatim approaches. Our analysis concludes that MemPalace represents significant architectural insight wrapped in overstated claims -- a pattern common in rapidly adopted open-source projects where marketing velocity exceeds scientific rigor.

62. 【2604.21276】Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

Link: https://arxiv.org/abs/2604.21276

Authors: Srishti Ginjala, Eric Fosler-Lussier, Christopher W. Myers, Srinivasan Parthasarathy

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

Keywords: critical question arises, text-derived priors make, models replace task-specific, large language models, language models replace

Abstract:As pretrained large language models replace task-specific decoders in speech recognition, a critical question arises: do their text-derived priors make recognition fairer or more biased across demographic groups? We evaluate nine models spanning three architectural generations (CTC with no language model, encoder-decoder with an implicit LM, and LLM-based with an explicit pretrained decoder) on about 43,000 utterances across five demographic axes (ethnicity, accent, gender, age, first language) using Common Voice 24 and Meta's Fair-Speech, a controlled-prompt dataset that eliminates vocabulary confounds. On clean audio, three findings challenge assumptions: LLM decoders do not amplify racial bias (Granite-8B has the best ethnicity fairness, max/min WER = 2.28); Whisper exhibits pathological hallucination on Indian-accented speech with a non-monotonic insertion-rate spike to 9.62% at large-v3; and audio compression predicts accent fairness more than LLM scale. We then stress-test these findings under 12 acoustic degradation conditions (noise, reverberation, silence injection, chunk masking) across both datasets, totaling 216 inference runs. Severe degradation paradoxically compresses fairness gaps as all groups converge to high WER, but silence injection amplifies Whisper's accent bias up to 4.64x by triggering demographic-selective hallucination. Under masking, Whisper enters catastrophic repetition loops (86% of 51,797 insertions) while explicit-LLM decoders produce 38x fewer insertions with near-zero repetition; high-compression audio encoding (Q-former) reintroduces repetition pathology even in LLM decoders. These results suggest that audio encoder design, not LLM scaling, is the primary lever for equitable and robust speech recognition.
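
The max/min WER figure quoted in this abstract can be sketched as follows. This is one plausible reading, not the authors' evaluation code: WER is pooled per group with a word-level edit distance, and the group names and transcripts are made up.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (reference transcript, ASR hypothesis)


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h))
        prev = curr
    return prev[-1]


def group_wer(pairs: List[Pair]) -> float:
    """Pooled WER: total word errors divided by total reference words."""
    errors = sum(word_errors(r, h) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def fairness_ratio(by_group: Dict[str, List[Pair]]) -> float:
    """Max/min WER across groups; 1.0 means equal error rates."""
    wers = [group_wer(p) for p in by_group.values()]
    return max(wers) / min(wers)  # undefined if the best group has WER 0


# Hypothetical data: both groups read the same prompts.
data = {
    "group_a": [("turn on the kitchen lights", "turn on the kitchen light"),
                ("play some jazz music please", "play some jazz music please")],
    "group_b": [("turn on the kitchen lights", "turn on kitchen light"),
                ("play some jazz music please", "play some jazz music please")],
}
ratio = fairness_ratio(data)
print(ratio)
```

Using the same prompts for every group mirrors the controlled-prompt idea in the abstract: differences in the ratio then reflect the speakers, not the vocabulary.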

63. 【2604.21265】Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training

Link: https://arxiv.org/abs/2604.21265

Authors: Yoshinori Nomura

Categories: Computation and Language (cs.CL)

Keywords: accelerates language acquisition, language significantly accelerates, significantly accelerates language, significantly accelerates, language acquisition

Comments: 17 pages, 3 figures

Abstract:We show that pre-training a Transformer on music before language significantly accelerates language acquisition. Using piano performances (MAESTRO dataset), a developmental pipeline -- music $\to$ poetry $\to$ prose -- yields a $17.5\%$ perplexity improvement over random initialization ($p < 0.001$, 5 seeds), with music and poetry improving orthogonal model components (internal computation and embeddings, respectively). Convergence tests confirm that this is not a transient head start: at $d = 64$, multi-seed validation (5 seeds) shows a persistent $5.5\%$ gap at plateau ($p = 0.017$), with the pipeline converging faster and to a lower loss in every run. Real music matches the transfer ceiling of synthetic patterns with one-third the data, and scaling experiments reveal that optimal pre-training data volume shifts with model capacity ($-3\% \to +3\% \to +6\%$ advantage of larger datasets from $d = 16$ to $d = 64$). Across the scales we study ($d \in \{16, 32, 64\}$, up to ${\sim}400$K parameters), these results suggest a capacity-dependent data curation principle and indicate that structured human creative outputs can provide an efficient pre-training substrate for small language models; stronger conclusions at modern pre-training scale will require substantially larger experiments.

64. 【2604.21255】When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

Link: https://arxiv.org/abs/2604.21255

Authors: Chenghao Yang, Yuning Zhang, Zhoufutu Wen, Tao Gong, Jiaheng Liu, Qi Chu, Nenghai Yu

Categories: Computation and Language (cs.CL)

Keywords: progress of LLM, LLM agents, primary driver, rapid progress, Action Graph Similarity

Comments: Accepted by ACL 2026 Main Conference

Abstract:Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model's autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on $\tau$-Bench and $\tau^2$-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6% $S_{\text{node}}$ and 94.7% $S_{\text{dep}}$, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson $r$ = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available via the link on the arXiv abstract page.
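
As a rough picture of comparing tool-use habits as directed graphs, the sketch below uses Jaccard overlap of node sets and edge sets. This is a plainly swapped-in stand-in: the paper's actual $S_{\text{node}}$ and $S_{\text{dep}}$ definitions are not given in the abstract, and the tool names and traces are hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def action_graph(trace: List[str]) -> Tuple[Set[str], Set[Edge]]:
    """Nodes are tools used; edges link each call to the call that follows it."""
    return set(trace), set(zip(trace, trace[1:]))


def jaccard(a: set, b: set) -> float:
    """Intersection over union; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two hypothetical agents solving the same refund task.
trace_a = ["get_user", "get_order", "check_policy", "refund"]
trace_b = ["get_user", "check_policy", "get_order", "refund"]

nodes_a, edges_a = action_graph(trace_a)
nodes_b, edges_b = action_graph(trace_b)
node_sim = jaccard(nodes_a, nodes_b)
edge_sim = jaccard(edges_a, edges_b)
print(node_sim, edge_sim)
```

The two agents call exactly the same tools yet share no call-to-call transitions, so a node-level score alone would miss the difference in habit that an edge-level score picks up.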

65. 【2604.21254】Hyperloop Transformers

Link: https://arxiv.org/abs/2604.21254

Authors: Abbas Zeitoun, Lucas Torroba-Hennigen, Yoon Kim

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: research generally aims, maximize model quality, model quality subject, architecture research generally, LLM architecture research

Abstract:LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory footprint, thus motivating parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter-efficiency of LLMs. Our architecture makes use of looped Transformers as a core primitive, which reuse Transformer layers across depth and are thus more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks--begin, middle, and end blocks--where each block itself consists of multiple Transformer layers, and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams. Hyper-connections are applied only after each loop, and therefore add minimal new parameters and compute cost. Across various model scales, we find that our Hyper-Connected Looped Transformer (Hyperloop Transformer) is able to outperform depth-matched Transformer and mHC Transformer baselines despite using approximately 50% fewer parameters. The outperformance persists through post-training weight quantization, thus positioning Hyperloop Transformers as an attractive architecture for memory-efficient language modeling.
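
The begin/middle/end layout can be illustrated with a back-of-the-envelope sketch of the layer schedule and the resulting parameter saving. The block sizes and loop count below are hypothetical, every layer is assumed to have the same parameter count, and embeddings and hyper-connection parameters are ignored.

```python
from typing import List


def layer_schedule(begin: int, middle: int, end: int, loops: int) -> List[str]:
    """Order in which layers run; the middle block is reused on every loop."""
    schedule = [f"begin{i}" for i in range(begin)]
    for _ in range(loops):
        schedule += [f"middle{i}" for i in range(middle)]
    schedule += [f"end{i}" for i in range(end)]
    return schedule


# Hypothetical configuration: 2 begin, 4 middle (looped 3 times), 2 end layers.
schedule = layer_schedule(begin=2, middle=4, end=2, loops=3)

effective_depth = len(schedule)      # layers executed per forward pass
unique_layers = len(set(schedule))   # layers that actually hold parameters
saving = 1 - unique_layers / effective_depth
print(effective_depth, unique_layers, saving)
```

A depth-matched ordinary Transformer would need one set of weights per executed layer, so in this toy configuration looping the middle block halves the layer parameters.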

66. 【2604.21253】Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation

Link: https://arxiv.org/abs/2604.21253

Authors: Hanwen Gu, Chao Guo, Junle Wang, Wenda Xie, Yisheng Lv

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: producing monotonous scripts, contextual logical consistency, existing methods struggle, smooth character development, global narrative coherence

Comments: Accepted to Findings of the Association for Computational Linguistics: ACL 2026

Abstract:While LLMs demonstrate remarkable fluency in narrative generation, existing methods struggle to maintain global narrative coherence, contextual logical consistency, and smooth character development, often producing monotonous scripts with structural fractures. To this end, we introduce PLOTTER, a framework that performs narrative planning on structural graph representations instead of the direct sequential text representations used in existing work. Specifically, PLOTTER executes the Evaluate-Plan-Revise cycle on the event graph and character graph. By diagnosing and repairing issues of the graph topology under rigorous logical constraints, the model optimizes the causality and narrative skeleton before complete context generation. Experiments demonstrate that PLOTTER significantly outperforms representative baselines across diverse narrative scenarios. These findings verify that planning narratives on structural graph representations, rather than directly on text, is crucial to enhancing the long-context reasoning of LLMs in complex narrative generation.

67. 【2604.21238】Unlocking the Power of Large Language Models for Multi-table Entity Matching

Link: https://arxiv.org/abs/2604.21238

Authors: Yingkai Tang, Taoyu Su, Wenyuan Zhang, Xiaoyang Guo, Tingwen Liu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: enabling simultaneous identification, Multi-table entity matching, addresses the limitations, unique identifiers, Multi-table entity

Comments: Accepted by NLPCC 2025

Abstract:Multi-table entity matching (MEM) addresses the limitations of dual-table approaches by enabling simultaneous identification of equivalent entities across multiple data sources without unique identifiers. However, existing methods relying on pre-trained language models struggle to handle semantic inconsistencies caused by numerical attribute variations. Inspired by the powerful language understanding capabilities of large language models (LLMs), we propose a novel LLM-based framework for multi-table entity matching, termed LLM4MEM. Specifically, we first propose a multi-style prompt-enhanced LLM attribute coordination module to address semantic inconsistencies. Then, to alleviate the matching efficiency problem caused by the surge in the number of entities brought by multiple data sources, we develop a transitive consensus embedding matching module to tackle entity embedding and pre-matching issues. Finally, to address the issue of noisy entities during the matching process, we introduce a density-aware pruning module to optimize the quality of multi-table entity matching. We conducted extensive experiments on 6 MEM datasets, and the results show that our model improves F1 by an average of 5.1% over the baseline model. Our code is available via the link on the arXiv abstract page.

68. 【2604.21235】Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness

Link: https://arxiv.org/abs/2604.21235

Authors: Zihan Liang, Ziwen Pan, Ruoxuan Xiong

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Methodology (stat.ME)

Keywords: offering rich temporal, rich temporal information, offering rich, Multimodal clinical records, rich temporal

Comments: Findings of ACL 2026 (30 pages)

Abstract:Multimodal clinical records contain structured measurements and clinical notes recorded over time, offering rich temporal information about the evolution of patient health. Yet these observations are sparse, and whether they are recorded depends on the patient's latent condition. Observation patterns also differ across modalities, as structured measurements and clinical notes arise under distinct recording processes. While prior work has developed methods that accommodate missingness in clinical time series, how to extract and use the information carried by the observation process itself remains underexplored. We therefore propose a patient representation learning framework for multimodal clinical time series that explicitly leverages informative missingness. The framework combines (1) a multimodal encoder that captures signals from structured and textual data together with their observation patterns, (2) a Bayesian filtering module that updates a latent patient state over time from observed multimodal signals, and (3) downstream modules for offline treatment policy learning and patient outcome prediction based on the learned patient state. We evaluate the framework on ICU sepsis cohorts from MIMIC-III, MIMIC-IV, and eICU. It improves both offline treatment policy learning and adverse outcome prediction, achieving FQE 0.679 versus 0.528 for clinician behavior and AUROC 0.886 for post-72-hour mortality prediction on MIMIC-III.

69. 【2604.21229】EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval

Link: https://arxiv.org/abs/2604.21229

Authors: Julian Acuna

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language model, Large language, language model assistants, assistants are increasingly, increasingly expected

Comments: 9 pages, 2 figures, 3 tables

Abstract:Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama's cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.

70. 【2604.21223】Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

Link: https://arxiv.org/abs/2604.21223

Authors: Runheng Liu, Heyan Huang, Xingchen Xiao, Zhijing Wu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, demonstrated remarkable capabilities, Large language, demonstrated remarkable, remarkable capabilities

Comments: NeurIPS 2025

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their ability to generate human-like text has raised concerns about potential misuse. This underscores the need for reliable and effective methods to detect LLM-generated text. In this paper, we propose IRM, a novel zero-shot approach that leverages Implicit Reward Models for LLM-generated text detection. Such implicit reward models can be derived from publicly available instruction-tuned and base models. Previous reward-based methods rely on preference construction and task-specific fine-tuning. In comparison, IRM requires neither preference collection nor additional training. We evaluate IRM on the DetectRL benchmark and demonstrate that it achieves superior detection performance, outperforming existing zero-shot and supervised methods in LLM-generated text detection.
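
One common way to derive an implicit reward from a tuned/base model pair is the DPO-style log-likelihood ratio, sketched below. This is an assumption about the general idea, not the paper's exact scoring rule; the per-token log-probabilities, `beta`, and the decision threshold are hypothetical, and in practice the log-probabilities would come from scoring the same text with both models.

```python
from typing import List


def implicit_reward(tuned_logprobs: List[float],
                    base_logprobs: List[float],
                    beta: float = 1.0) -> float:
    """beta * [log p_tuned(text) - log p_base(text)], summed over tokens."""
    if len(tuned_logprobs) != len(base_logprobs):
        raise ValueError("both models must score the same token sequence")
    return beta * sum(t - b for t, b in zip(tuned_logprobs, base_logprobs))


def flag_as_llm_generated(score: float, threshold: float = 0.0) -> bool:
    """Assumed rule: text the tuned model prefers over the base model is flagged."""
    return score > threshold


# Hypothetical per-token log-probabilities for one three-token passage.
tuned = [-1.0, -0.5, -0.25]
base = [-1.5, -1.0, -0.75]

score = implicit_reward(tuned, base)
print(score, flag_as_llm_generated(score))
```

Because the score needs only forward passes through two existing checkpoints, a detector of this shape requires no preference data and no extra training, which matches the zero-shot framing of the abstract.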

71. 【2604.21211】Subject-level Inference for Realistic Text Anonymization Evaluation

Link: https://arxiv.org/abs/2604.21211

Authors: Myeong Seok Oh, Dong-Yun Kim, Hanseok Oh, Chaean Kang, Joeun Kang, Xiaonan Wang, Hyunjung Park, Young Cheol Jung, Hansaem Kim

Categories: Computation and Language (cs.CL)

Keywords: ignoring multi-subject scenarios, Current text anonymization, Current text, PII Inference Assessment, single data subject

Comments: Accepted at ACL 2026

Abstract:Current text anonymization evaluation relies on span-based metrics that fail to capture what an adversary could actually infer, and assumes a single data subject, ignoring multi-subject scenarios. To address these limitations, we present SPIA (Subject-level PII Inference Assessment), the first benchmark that shifts the unit of evaluation from text spans to individuals, comprising 675 documents across legal and online domains with novel subject-level protection metrics. Extensive experiments show that even when over 90% of PII spans are masked, subject-level inference protection drops as low as 33%, leaving the majority of personal information recoverable through contextual inference. Furthermore, target-subject-focused anonymization leaves non-target subjects substantially more exposed than the target subject. We show that subject-level inference-based evaluation is essential for ensuring safe text anonymization in real-world settings.

72. 【2604.21209】Align Generative Artificial Intelligence with Human Preferences: A Novel Large Language Model Fine-Tuning Method for Online Review Management

Link: https://arxiv.org/abs/2604.21209

Authors: Yanan Wang, Yong Ge

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: consumers' decision-making processes, Online reviews, domain-specific human preferences, preference finetuning, decision-making processes

Comments: Accepted to Information Systems Research (ISR). This is a preliminary version

Abstract:Online reviews have played a pivotal role in consumers' decision-making processes. Existing research has highlighted the significant impact of managerial review responses on customer relationship management and firm performance. However, a large portion of online reviews remains unaddressed due to the considerable human labor required to respond to the rapid growth of online reviews. While generative AI models have achieved remarkable success in a range of tasks, they are general-purpose and may not align well with domain-specific human preferences. To tailor these general generative AI models to domain-specific applications, finetuning is commonly employed. Nevertheless, several challenges persist in finetuning with domain-specific data, including hallucinations, difficulty in representing domain-specific human preferences, and over-conservatism in offline policy optimization. To address these challenges, we propose a novel preference finetuning method to align an LLM with domain-specific human preferences for generating online review responses. Specifically, we first identify the source of hallucination and propose an effective context augmentation approach to mitigate LLM hallucination. To represent human preferences, we propose a novel theory-driven preference finetuning approach that automatically constructs human preference pairs in the online review domain. Additionally, we propose a curriculum learning approach to further enhance preference finetuning. To overcome the challenge of over-conservatism in existing offline preference finetuning methods, we propose a novel density estimation-based support constraint method to relax the conservatism, and we mathematically prove its superior theoretical guarantees. Extensive evaluations substantiate the superiority of our proposed preference finetuning method.

73. 【2604.21204】On Reasoning Behind Next Occupation Recommendation

Link: https://arxiv.org/abs/2604.21204

Authors: Shan Dong, Palakorn Achananuparp, Hieu Hien Mai, Lei Wang, Yao Lu, Ee-Peng Lim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: large language models, occupation prediction, occupation, language models, future occupation prediction

Notes: Accepted to PAKDD 2026

Click to view abstract

Abstract:In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a "reason" for a user using his/her past education and career history. The reason summarizes the user's preference and is used as the input of an occupation predictor to recommend the user's next occupation. This two-step occupation prediction approach is, however, non-trivial as LLMs are not aligned with career paths or the unobserved reasons behind each occupation decision. We therefore propose to fine-tune LLMs to improve their reasoning and occupation prediction performance. We first derive high-quality oracle reasons, as measured by factuality, coherence and utility criteria, using an LLM-as-a-Judge. These oracle reasons are then used to fine-tune small LLMs to perform reason generation and next occupation prediction. Our extensive experiments show that: (a) our approach effectively enhances LLMs' accuracy in next occupation prediction, making them comparable to fully supervised methods and outperforming unsupervised methods; (b) a single LLM fine-tuned to perform reason generation and occupation prediction outperforms two LLMs fine-tuned to perform the tasks separately; and (c) the next occupation prediction accuracy depends on the quality of generated reasons. Our code is available at this https URL.

74. 【2604.21191】Prefix Parsing is Just Parsing

Link: https://arxiv.org/abs/2604.21191

Authors: Clemente Pasti, Andreas Opedal, Timothy J. O'Donnell, Ryan Cotterell, Tim Vieira

Categories: Computation and Language (cs.CL)

Keywords: complete string generated, Prefix parsing, Prefix, parsing, grammar

Notes: To appear at ACL 2026

Click to view abstract

Abstract:Prefix parsing asks whether an input prefix can be extended to a complete string generated by a given grammar. In the weighted setting, it also provides prefix probabilities, which are central to context-free language modeling, psycholinguistic analysis, and syntactically constrained generation from large language models. We introduce the prefix grammar transformation, an efficient reduction of prefix parsing to ordinary parsing. Given a grammar, our method constructs another grammar that generates exactly the prefixes of its original strings. Prefix parsing is then solved by applying any ordinary parsing algorithm on the transformed grammar without modification. The reduction is both elegant and practical: the transformed grammar is only a small factor larger than the input, and any optimized implementation can be used directly, eliminating the need for bespoke prefix-parsing algorithms. We also present a strategy, based on algorithmic differentiation, for computing the next-token weight vector, i.e., the prefix weights of all one-token extensions, enabling efficient prediction of the next token. Together, these contributions yield a simple, general, and efficient framework for prefix parsing.
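
To make the idea of reducing prefix parsing to ordinary parsing concrete, here is a minimal Python sketch. It uses a textbook-style construction for unweighted grammars in Chomsky normal form, which is not necessarily the transformation defined in the paper. It assumes every nonterminal appears on some left-hand side and derives at least one string, and that no nonterminal name ends in an apostrophe. The names `prefix_grammar` and `cky_recognize` are hypothetical, and the empty prefix is not handled.

```python
def prefix_grammar(binary, lexical):
    """Build a grammar whose symbol A' derives the non-empty prefixes of L(A).

    binary:  set of (A, B, C) for rules A -> B C
    lexical: set of (A, a)    for rules A -> a
    Assumption: every nonterminal derives at least one string.
    """
    nts = {r[0] for r in binary} | {r[0] for r in lexical}
    # left[X] = nonterminals reachable from X by repeatedly taking the first child.
    left = {X: {X} for X in nts}
    changed = True
    while changed:
        changed = False
        for A, B, _ in binary:
            for X in nts:
                if A in left[X] and B not in left[X]:
                    left[X].add(B)
                    changed = True

    def pre(A):
        return A + "'"

    new_binary, new_lexical = set(binary), set(lexical)
    for X in nts:
        for A, B, C in binary:
            if A in left[X]:
                # A prefix of X that ends inside C: all of B, then a prefix of C.
                new_binary.add((pre(X), B, pre(C)))
        for A, a in lexical:
            if A in left[X]:
                # A prefix of X that is a single leading terminal.
                new_lexical.add((pre(X), a))
    return new_binary, new_lexical


def cky_recognize(binary, lexical, start, tokens):
    """Ordinary CKY recognition; it knows nothing about prefixes."""
    n = len(tokens)
    if n == 0:
        return False
    chart = {}
    for i, tok in enumerate(tokens):
        chart[i, i + 1] = {A for A, a in lexical if a == tok}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            cell = set()
            for j in range(i + 1, k):
                for A, B, C in binary:
                    if B in chart[i, j] and C in chart[j, k]:
                        cell.add(A)
            chart[i, k] = cell
    return start in chart[0, n]


# Toy grammar for a^n b^n (n >= 1): S -> A T | A B, T -> S B, A -> a, B -> b
binary = {("S", "A", "T"), ("S", "A", "B"), ("T", "S", "B")}
lexical = {("A", "a"), ("B", "b")}
p_binary, p_lexical = prefix_grammar(binary, lexical)
print(cky_recognize(p_binary, p_lexical, "S'", list("aab")))
```

The recognizer itself is unchanged between the two grammars. Only the start symbol and the rule set differ, which is the point of the reduction.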

75. 【2604.21159】Adaptive Instruction Composition for Automated LLM Red-Teaming

Link: https://arxiv.org/abs/2604.21159

Authors: Jesse Zymet, Andy Luo, Swapnil Shinde, Sahil Wadhwa, Emily Chen

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: LLM red-teaming leverage, LLM red-teaming, attacker LLM, red-teaming leverage, LLM

Notes: Accepted to ACL 2026 Main Conference

Click to view abstract

Abstract:Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness with diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions to guide the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. Further, we show that it surpasses a host of recent adaptive approaches on Harmbench. We employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive space as it learns.

76. 【2604.21152】Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

Link: https://arxiv.org/abs/2604.21152

Authors: Irti Haq, Belén Saldías

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: Large Language Models, Large Language, Language Models, ensuring equitable performance, Large

Notes: In The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25-28, 2026, Montreal, Canada. ACM, New York, NY, USA, 32 pages

Click to view abstract

Abstract:As state-of-the-art Large Language Models (LLMs) have become ubiquitous, ensuring equitable performance across diverse demographics is critical. However, it remains unclear whether performance disparities arise from the explicitly stated identity itself or from the way identity is signaled. In real-world interactions, users' identity is often conveyed implicitly through a complex combination of various socio-linguistic factors. This study disentangles these signals by employing a factorial design with over 24,000 responses from two open-weight LLMs (Gemma-3-12B and Qwen-3-VL-8B), comparing prompts with explicitly announced user profiles against implicit dialect signals (e.g., AAVE, Singlish) across various sensitive domains. Our results uncover a unique paradox in LLM safety where users achieve "better" performance by sounding like a demographic than by stating they belong to it. Explicit identity prompts activate aggressive safety filters, increasing refusal rates and reducing semantic similarity compared to our reference text for Black users. In contrast, implicit dialect cues trigger a powerful "dialect jailbreak," reducing refusal probability to near zero while simultaneously achieving a greater level of semantic similarity to the reference texts compared to Standard American English prompts. However, this "dialect jailbreak" introduces a critical safety trade-off regarding content sanitization. We find that current safety alignment techniques are brittle and over-indexed on explicit keywords, creating a bifurcated user experience where "standard" users receive cautious, sanitized information while dialect speakers navigate a less sanitized, more raw, and potentially more hostile information landscape. This highlights a fundamental tension in alignment, between equity and linguistic diversity, and underscores the need for safety mechanisms that generalize beyond explicit cues.

77. 【2604.21148】"This Wasn't Made for Me": Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias

Link: https://arxiv.org/abs/2604.21148

Authors: Siyu Liang, Alicia Beckford Wassink

Categories: Computation and Language (cs.CL)

Keywords: Automatic Speech Recognition, Speech Recognition, Automatic Speech, reporting error rates, shape users' lived

Notes

Click to view abstract

Abstract:Studies on bias in Automatic Speech Recognition (ASR) tend to focus on reporting error rates for speakers of underrepresented dialects, yet less research examines the human side of system bias: how do system failures shape users' lived experiences, how do users feel about and react to them, and what emotional toll do these repeated failures exact? We conducted user experience studies across four U.S. locations (Atlanta, Gulf Coast, Miami Beach, and Tucson) representing distinct English dialect communities. Our findings reveal that most participants report technologies fail to consider their cultural backgrounds and require constant adjustment to achieve basic functionality. Despite these experiences, participants maintain high expectations for ASR performance and express strong willingness to contribute to model improvement. Qualitative analysis of open-ended narratives exposes the deeper costs of these failures. Participants report frustration, annoyance, and feelings of inadequacy, yet the emotional impact extends beyond momentary reactions. Participants recognize that systems were not designed for them, yet often internalize failures as personal inadequacy despite this critical awareness. They perform extensive invisible labor, including code-switching, hyper-articulation, and emotional management, to make failing systems functional. Meanwhile, their linguistic and cultural knowledge remains unrecognized by technologies that encode particular varieties as standard while rendering others marginal. These findings demonstrate that algorithmic fairness assessments based on accuracy metrics alone miss critical dimensions of harm: the emotional labor of managing repeated technological rejection, the cognitive burden of constant self-monitoring, and the psychological toll of feeling inadequate in one's native language variety.

78. 【2604.21144】Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue

Link: https://arxiv.org/abs/2604.21144

Authors: Biswesh Mohapatra, Giovanni Duca, Laurent Romary, Justine Cassell

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Keywords: Situated dialogue requires, dialogue requires speakers, isolated utterances, Situated dialogue, requires speakers

Notes: Work under review. Biswesh Mohapatra and Giovanni Duca both contributed equally to this work

Click to view abstract

Abstract:Situated dialogue requires speakers to maintain a reliable representation of shared context rather than reasoning only over isolated utterances. Current conversational agents often struggle with this requirement, especially when the common ground must be preserved beyond the immediate context window. In such settings, fine-grained distinctions are frequently compressed into purely textual representations, leading to a critical failure mode we call "representational blur", in which similar but distinct entities collapse into interchangeable descriptions. This semantic flattening creates an illusion of grounding, where agents appear locally coherent but fail to track shared context persistently over time. Inspired by the role of mental imagery in human reasoning, and based on the increased availability of multimodal models, we explore whether conversational agents can be given an analogous ability to construct some depictive intermediate representations during dialogue to address these limitations. Thus, we introduce an active visual scaffolding framework that incrementally converts dialogue state into a persistent visual history that can later be retrieved for grounded response generation. Evaluation on the IndiRef benchmark shows that incremental externalization itself improves over full-dialog reasoning, while visual scaffolding provides additional gains by reducing representational blur and enforcing concrete scene commitments. At the same time, textual representations remain advantageous for non-depictable information, and a hybrid multimodal setting yields the best overall performance. Together, these findings suggest that conversational agents benefit from an explicitly multimodal representation of common ground that integrates depictive and propositional information.

79. 【2604.21139】Slot Machines: How LLMs Keep Track of Multiple Entities

Link: https://arxiv.org/abs/2604.21139

Authors: Paul C. Bogdan, Jack Lindsey

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Language models, attributes they possess, possess and maintain, prior-entity slot, single token

Notes

Click to view abstract

Abstract:Language models must bind entities to the attributes they possess and maintain several such binding relationships within a context. We study how multiple entities are represented across token positions and whether single tokens can carry bindings for more than one entity. We introduce a multi-slot probing approach that disentangles a single token's residual stream activation to recover information about both the currently described entity and the immediately preceding one. These two kinds of information are encoded in separate and largely orthogonal "current-entity" and "prior-entity" slots. We analyze the functional roles of these slots and find that they serve different purposes. In tandem with the current-entity slot, the prior-entity slot supports relational inferences, such as entity-level induction ("who came after Alice in the story?") and conflict detection between adjacent entities. However, only the current-entity slot is used for explicit factual retrieval questions ("Is anyone in the story tall?" "What is the tall entity's name?") despite these answers being linearly decodable from the prior-entity slot too. Consistent with this limitation, open-weight models perform near chance accuracy at processing syntax that forces two subject-verb-object bindings on a single token (e.g., "Alice prepares and Bob consumes food.") Interestingly, recent frontier models can parse this properly, suggesting they may have developed more sophisticated binding strategies. Overall, our results expose a gap between information that is available in activations and information the model actually uses, and suggest that the current/prior-entity slot structure is a natural substrate for behaviors that require holding two perspectives at once, such as sycophancy and deception.

80. 【2604.21137】Enhancing Science Classroom Discourse Analysis through Joint Multi-Task Learning for Reasoning-Component Classification

Link: https://arxiv.org/abs/2604.21137

Authors: Jiho Noh, Mukhesh Raghava Katragadda, Raymond Carl, Soon Lee

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: remains prohibitively labor-intensive, understanding knowledge construction, knowledge construction mechanism, improving instructional practice, scale remains prohibitively

Notes

Click to view abstract

Abstract:Analyzing the reasoning patterns of students in science classrooms is critical for understanding knowledge construction mechanisms and improving instructional practice to maximize cognitive engagement, yet manual coding of classroom discourse at scale remains prohibitively labor-intensive. We present an automated discourse analysis system (ADAS) that jointly classifies teacher and student utterances along two complementary dimensions, Utterance Type (UT) and Reasoning Component (RC), derived from our prior CDAT framework. To address severe label imbalance among minority classes, we (1) stratify-resplit the annotated corpus, (2) apply LLM-based synthetic data augmentation targeting minority classes, and (3) train a dual-probe head RoBERTa-base classifier. A zero-shot GPT-5.4 baseline achieves macro-F1 of 0.467 on UT and 0.476 on RC, establishing meaningful upper bounds for prompt-only approaches and motivating fine-tuning. Beyond classification, we conduct discourse pattern analyses including UT×RC co-occurrence profiling, Cognitive Complexity Index (CCI) computation per session, lag-sequential analysis, and IRF chain analysis, revealing that teacher Feedback-with-Question (Fq) moves are the most consistent antecedents of student inferential reasoning (SR-I). Our results demonstrate that LLM-based augmentation meaningfully improves UT minority-class recognition, and that the structural simplicity of the RC task makes it tractable even for lexical baselines.

81. 【2604.21134】Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents

Link: https://arxiv.org/abs/2604.21134

Authors: Yiyang Lu, Woong Shin, Ahmad Maroof Karimi, Feiyi Wang, Jie Ren, Evgenia Smirni

Categories: Computation and Language (cs.CL)

Keywords: Vision-Language Models, confuse overlapping elements, hallucinate details, frequently misread, Interactive Visual Grounding

Notes: 18 pages, 8 figures

Click to view abstract

Abstract:Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while the combination with interaction achieves the highest QA accuracy (0.81), with +6.7% gains on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time.

82. 【2604.21133】GRISP: Guided Recurrent IRI Selection over SPARQL Skeletons

Link: https://arxiv.org/abs/2604.21133

Authors: Sebastian Walter, Hannah Bast

Categories: Computation and Language (cs.CL)

Keywords: Guided Recurrent IRI, Recurrent IRI Selection, Guided Recurrent, Recurrent IRI, IRI Selection

Notes

Click to view abstract

Abstract:We present GRISP (Guided Recurrent IRI Selection over SPARQL Skeletons), a novel SPARQL-based question-answering method over knowledge graphs based on fine-tuning a small language model (SLM). Given a natural-language question, the method first uses the SLM to generate a natural-language SPARQL query skeleton, and then to re-rank and select knowledge graph items to iteratively replace the natural-language placeholders using knowledge graph constraints. The SLM is jointly trained on skeleton generation and list-wise re-ranking data generated from standard question-query pairs. We evaluate the method on common Wikidata and Freebase benchmarks, and achieve better results than other state-of-the-art methods in a comparable setting.

83. 【2604.21131】Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms

Link: https://arxiv.org/abs/2604.21131

Authors: Ari Azarafrooz

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: sessions slips past, AI-agent guardrails, guardrails are memoryless, judged in isolation, carries the payload

Notes: 46 pages, 8 figures. Dataset: https://huggingface.co/datasets/intrinsec-ai/cstm-bench

Click to view abstract

Abstract:AI-agent guardrails are memoryless: each message is judged in isolation, so an adversary who spreads a single attack across dozens of sessions slips past every session-bound detector because only the aggregate carries the payload. We make three contributions to cross-session threat detection. (1) Dataset. CSTM-Bench is 26 executable attack taxonomies classified by kill-chain stage and cross-session operation (accumulate, compose, launder, inject_on_reader), each bound to one of seven identity anchors that ground-truth "violation" as a policy predicate, plus matched Benign-pristine and Benign-hard confounders. Released on Hugging Face as intrinsec-ai/cstm-bench with two 54-scenario splits: dilution (compositional) and cross_session (12 isolation-invisible scenarios produced by a closed-loop rewriter that softens surface phrasing while preserving cross-session artefacts). (2) Measurement. Framing cross-session detection as an information bottleneck to a downstream correlator LLM, we find that a session-bound judge and a Full-Log Correlator concatenating every prompt into one long-context call both lose roughly half their attack recall moving from dilution to cross_session, well inside any frontier context window. Scope: 54 scenarios per shard, one correlator family (Anthropic Claude), no prompt optimisation; we release it to motivate larger, multi-provider datasets. (3) Algorithm and metric. A bounded-memory Coreset Memory Reader retaining highest-signal fragments at $K=50$ is the only reader whose recall survives both shards. Because ranker reshuffles break KV-cache prefix reuse, we promote $\mathrm{CSR\_prefix}$ (ordered prefix stability, LLM-free) to a first-class metric and fuse it with detection into $\mathrm{CSTM} = 0.7 F_1(\mathrm{CSDA@action}, \mathrm{precision}) + 0.3 \mathrm{CSR\_prefix}$, benchmarking rankers on a single Pareto of recall versus serving stability.
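
The fused CSTM score in the abstract is a fixed weighted sum, so it is easy to compute once the three inputs are known. The sketch below assumes that $F_1(\cdot,\cdot)$ denotes the usual harmonic mean of its two arguments, which the abstract does not spell out. The function and argument names are hypothetical.

```python
def f1(a: float, b: float) -> float:
    """Harmonic mean of two scores in [0, 1]; defined as 0 when both are 0.

    Assumption: this is what F_1(CSDA@action, precision) means in the abstract.
    """
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def cstm_score(csda_at_action: float, precision: float, csr_prefix: float) -> float:
    """CSTM = 0.7 * F1(CSDA@action, precision) + 0.3 * CSR_prefix."""
    return 0.7 * f1(csda_at_action, precision) + 0.3 * csr_prefix


# Made-up inputs, only to show the call shape.
print(cstm_score(csda_at_action=0.8, precision=0.9, csr_prefix=0.5))
```

With these weights, a reader that detects perfectly but reshuffles its prefix on every call can score at most 0.7, which is how the metric puts detection quality and serving stability on a single scale.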

84. 【2604.21120】TabSHAP

Link: https://arxiv.org/abs/2604.21120

Authors: Aryan Chaudhary, Prateek Agarwal, Tejasvi Alladi

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, traditional tree-based models, Language Models, context-rich datasets

Notes

Click to view abstract

Abstract:Large Language Models (LLMs) fine-tuned on serialized tabular data are emerging as powerful alternatives to traditional tree-based models, particularly for heterogeneous or context-rich datasets. However, their deployment in high-stakes domains is hindered by a lack of faithful interpretability; existing methods often rely on global linear proxies or scalar probability shifts that fail to capture the model's full probabilistic uncertainty. In this work, we introduce TabSHAP, a model-agnostic interpretability framework designed to directly attribute local query decision logic in LLM-based tabular classifiers. By adapting a Shapley-style sampled-coalition estimator with Jensen-Shannon divergence between full-input and masked-input class distributions, TabSHAP quantifies the distributional impact of each feature rather than simple prediction flips. To align with tabular semantics, we mask at the level of serialized key:value fields (atomic in the prompt string), not individual subword tokens. Experimental validation on the Adult Income and Heart Disease benchmarks demonstrates that TabSHAP isolates critical diagnostic features, achieving significantly higher faithfulness than random baselines and XGBoost proxies. We further run a distance-metric ablation on the same test instances and TabSHAP settings: attributions are recomputed with KL or L1 replacing JSD in the similarity step (results cached per metric), and we compare deletion faithfulness across all three.
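
The abstract combines three ingredients: field-level masking, sampled coalitions, and Jensen-Shannon divergence between the full-input and masked-input class distributions. The sketch below shows one way those pieces can fit together. It is an assumption-laden illustration, not the authors' implementation: `toy_model` is a hypothetical stand-in for the LLM classifier, masking is done by setting a field to `None`, coalitions are sampled as random permutations, and a feature's contribution is taken to be the drop in JSD to the full-input distribution when the feature is revealed.

```python
import math
import random


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def toy_model(fields):
    """Hypothetical classifier: returns [P(class 0), P(class 1)].

    It reacts to 'age' and 'hours' when they are unmasked and ignores 'id'.
    """
    score = 0.0
    if fields.get("age") is not None:
        score += 1.5
    if fields.get("hours") is not None:
        score += 0.5
    p1 = 1 / (1 + math.exp(-score))
    return [1 - p1, p1]


def sampled_attributions(row, model, n_samples=200, seed=0):
    """Permutation-sampled Shapley-style attribution per key:value field."""
    rng = random.Random(seed)
    keys = list(row)
    full = model(row)
    totals = {k: 0.0 for k in keys}
    for _ in range(n_samples):
        order = keys[:]
        rng.shuffle(order)
        kept = {k: None for k in keys}       # start with every field masked
        prev = jsd(full, model(kept))
        for k in order:
            kept[k] = row[k]                 # reveal one whole field at a time
            cur = jsd(full, model(kept))
            totals[k] += prev - cur          # divergence removed by this field
            prev = cur
    return {k: v / n_samples for k, v in totals.items()}


row = {"age": 52, "hours": 40, "id": 7}
print(sampled_attributions(row, toy_model))
```

Because each permutation's contributions telescope, the attributions sum to the divergence between the fully masked and the full-input predictions, so they can be read as shares of the total distributional shift.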

85. 【2604.21108】Machine learning and digital pragmatics: Which word category influences emoji use most?

Link: https://arxiv.org/abs/2604.21108

Authors: Mohammed Q. Shormani, Ibrahim Abdulmalik Hassan Muneef Y. Alshawsh

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Arabic tweets employing, investigates Machine Learning, URL via Python, representing multiple Arabic, multiple Arabic colloquial

Notes: 15 pages, 4 Figures, 3 Tables

Click to view abstract

Abstract:This study investigates Machine Learning (ML) in the prediction of emojis in Arabic tweets employing the state-of-the-art MARBERT model. A corpus of 11,379 CA tweets representing multiple Arabic colloquial dialects was collected from this http URL via Python. The net dataset comprises 8,695 tweets, which were used for the analysis. These tweets were then classified into 14 categories, which were numerically encoded and used as labels. A preprocessing pipeline was designed as an interpretable baseline, allowing us to examine the relationship between lexical features and emoji categories. MARBERT was finetuned to predict emoji use from textual input. We evaluated the model performance in terms of precision, recall and F1-scores. Findings reveal that the model performed quite well, with an overall accuracy of 0.75. The study concludes that although the findings are promising, there is still a need for improving machine learning models including MARBERT, specifically for low-resource and multidialectal languages like Arabic.

86. 【2604.21106】How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

Link: https://arxiv.org/abs/2604.21106

Authors: Kristian Schwethelm, Daniel Rueckert, Georgios Kaissis

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: varphi, equivalent unique parameters, unique parameters, extra recurrence, language model

Notes

Click to view abstract

Abstract:We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning ${\sim}50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-\alpha} + B\,D^{-\beta}$ and recover a new recurrence-equivalence exponent $\varphi = 0.46$ at $R^2 = 0.997$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi{=}1$) or to a single block run repeatedly with no capacity gain ($\varphi{=}0$). Our $\varphi = 0.46$ sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model, but pays the training cost of a 1B non-looped one. On a five-axis downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks are not resolvable at our compute budgets. For any looped LM, our $\varphi$ converts the design choice of $r$ into a predictable validation-loss cost, and future training recipes and architectures can be compared by how much they raise $\varphi$ above $0.46$.
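
The role of $\varphi$ is easiest to see by evaluating the effective-parameter term $N_\text{once} + r^{\varphi} N_\text{rec}$ directly. The sketch below does that and also writes out the full law as a function. Only $\varphi = 0.46$ comes from the abstract; the other fitted constants ($E$, $A$, $\alpha$, $B$, $\beta$) are not given there, so they are left as arguments, and the parameter counts in the example are made up.

```python
def effective_params(n_once: float, n_rec: float, r: int, phi: float = 0.46) -> float:
    """Equivalent unique parameters: N_once + r**phi * N_rec."""
    return n_once + (r ** phi) * n_rec


def predicted_loss(n_once, n_rec, r, tokens, E, A, alpha, B, beta, phi=0.46):
    """L = E + A * (N_once + r**phi * N_rec)**(-alpha) + B * D**(-beta).

    E, A, alpha, B, beta must be supplied by the caller; the abstract does
    not report their fitted values.
    """
    n_eff = effective_params(n_once, n_rec, r, phi)
    return E + A * n_eff ** (-alpha) + B * tokens ** (-beta)


# Hypothetical split: 100M parameters used once, 200M inside the looped block.
for r in (1, 2, 4, 8):
    print(r, round(effective_params(100e6, 200e6, r) / 1e6, 1), "M effective")
```

At $r = 4$ the looped block counts as roughly $4^{0.46} \approx 1.89$ copies of itself rather than 4, which is the gap between full equivalence ($\varphi = 1$) and no capacity gain ($\varphi = 0$) that the abstract describes.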

87. 【2604.21098】Propensity Inference: Environmental Contributors to LLM Behaviour

Link: https://arxiv.org/abs/2604.21098

Authors: Olli Järviniemi, Oliver Makins, Jacob Merizian, Robert Kirk, Ben Millwood

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Motivated by loss, measuring language models', misaligned AI systems, loss of control, methods for measuring

Notes

Click to view abstract

Abstract:Motivated by loss of control risks from misaligned AI systems, we develop and apply methods for measuring language models' propensity for unsanctioned behaviour. We contribute three methodological improvements: analysing effects of changes to environmental factors on behaviour, quantifying effect sizes via Bayesian generalised linear models, and taking explicit measures against circular analysis. We apply the methodology to measure the effects of 12 environmental factors (6 strategic in nature, 6 non-strategic) and thus the extent to which behaviour is explained by strategic aspects of the environment, a question relevant to risks from misalignment. Across 23 language models and 11 evaluation environments, we find approximately equal contributions from strategic and non-strategic factors for explaining behaviour, do not find strategic factors becoming more or less influential as capabilities improve, and find some evidence for a trend for increased sensitivity to goal conflicts. Finally, we highlight a key direction for future propensity research: the development of theoretical frameworks and cognitive models of AI decision-making into empirically testable forms.

88. 【2604.21096】Multilingual and Domain-Agnostic Tip-of-the-Tongue Query Generation for Simulated Evaluation

Link: https://arxiv.org/abs/2604.21096

Authors: Xuhong He, To Eun Kim, Maik Fröbe, Jaime Arguello, Bhaskar Mitra, Fernando Diaz

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: limiting their applicability, multilingual information access, largely focused, English, ToT

Notes: SIGIR 2026; NTCIR track: https://ntcir-tot.github.io

Click to view abstract

Abstract:Tip-of-the-Tongue (ToT) retrieval benchmarks have largely focused on English, limiting their applicability to multilingual information access. In this work, we construct multilingual ToT test collections for Chinese, Japanese, Korean, and English, using an LLM-based query simulation framework. We systematically study how prompt language and source document language affect the fidelity of simulated ToT queries, validating synthetic queries through system rank correlation against real user queries. Our results show that effective ToT simulation requires language-aware design choices: non-English language sources are generally important, while English Wikipedia can be beneficial when non-English sources provide insufficient information for query generation. Based on these findings, we release four ToT test collections with 5,000 queries per language across multiple domains. This work provides the first large-scale multilingual ToT benchmark and offers practical guidance for constructing realistic ToT datasets beyond English.
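
Validation by system rank correlation asks whether retrieval systems are ordered the same way under simulated queries as under real ones. A minimal sketch using Kendall's tau (the tau-a variant) follows. The choice of coefficient is an assumption, since the abstract does not name one, and the system names and scores are invented for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a: dict, scores_b: dict) -> float:
    """Kendall's tau-a between two score tables keyed by system name."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1      # both query sets agree on which system is better
        elif sign < 0:
            discordant += 1      # the two query sets disagree
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores for three systems.
real_queries = {"bm25": 0.21, "dense": 0.35, "hybrid": 0.40}
synthetic_queries = {"bm25": 0.30, "dense": 0.41, "hybrid": 0.47}
print(kendall_tau(real_queries, synthetic_queries))
```

A value near 1 means the simulated queries rank systems almost exactly as real user queries do, which is the property a simulated test collection needs.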

89. 【2604.21082】Weighting What Matters: Boosting Sample Efficiency in Medical Report Generation via Token Reweighting

Link: https://arxiv.org/abs/2604.21082

Authors: Alexander Weers, Daniel Rueckert, Martin J. Menten

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Training vision-language models, high-quality annotated data, vision-language models, scarcity of high-quality, high-quality annotated

Notes

Click to view abstract

Abstract:Training vision-language models (VLMs) for medical report generation is often hindered by the scarcity of high-quality annotated data. This work evaluates the use of a weighted loss function to improve data efficiency. Compared to standard cross-entropy loss, which treats all token prediction errors equally, the reweighted loss shifts the focus to semantically salient tokens with outsized clinical importance. In experiments on ophthalmological report generation, we show that this simple method improves efficiency across multiple data scales, achieving similar report quality with up to ten times less training data.
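
A token-reweighted loss differs from standard cross-entropy only in how per-token errors are averaged. The sketch below shows that difference on plain Python lists. How the paper assigns weights is not stated in the abstract, so the weights here are hypothetical, as is the normalisation by the weight sum.

```python
import math


def weighted_token_loss(gold_token_probs, weights):
    """Weighted negative log-likelihood over one sequence.

    gold_token_probs: probability the model assigns to each reference token.
    weights: per-token weights (hypothetical: higher for clinically salient tokens).
    With all weights equal, this reduces to standard mean cross-entropy.
    """
    if len(gold_token_probs) != len(weights):
        raise ValueError("one weight per token is required")
    total_weight = sum(weights)
    weighted_nll = sum(-w * math.log(p) for p, w in zip(gold_token_probs, weights))
    return weighted_nll / total_weight


# "the", "macula", "shows", "edema": the model is unsure about the clinical terms.
probs = [0.9, 0.3, 0.8, 0.2]
uniform = weighted_token_loss(probs, [1, 1, 1, 1])
reweighted = weighted_token_loss(probs, [1, 5, 1, 5])
print(uniform, reweighted)
```

Upweighting the two clinical terms makes their errors dominate the average, so gradient updates concentrate on the tokens that matter most for report quality.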

90. 【2604.21076】Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation

Link: https://arxiv.org/abs/2604.21076

Authors: Sanjoy Pator

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Raw JSON, Clinical Narrative, error-prone process, outperforms Raw JSON, Narrative outperforms Raw

Notes: 14 pages, 7 figures, independent research

Click to view abstract

Abstract:Medication reconciliation at clinical handoffs is a high-stakes, error-prone process. Large language models are increasingly proposed to assist with this task using FHIR-structured patient records, but a fundamental and largely unstudied variable is how the FHIR data is serialised before being passed to the model. We present the first systematic comparison of four FHIR serialisation strategies (Raw JSON, Markdown Table, Clinical Narrative, and Chronological Timeline) across five open-weight models (Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, Llama-3.3-70B) on a controlled benchmark of 200 synthetic patients, totalling 4,000 inference runs. We find that serialisation strategy has a large, statistically significant effect on performance for models up to 8B parameters: Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B (r = 0.617, p < 10^{-10}). This advantage reverses at 70B, where Raw JSON achieves the best mean F1 of 0.9956. In all 20 model and strategy combinations, mean precision exceeds mean recall: omission is the dominant failure mode, with models more often missing an active medication than fabricating one, which changes how clinical safety auditing priorities should be set. Smaller models plateau at roughly 7-10 concurrent active medications, leaving polypharmacy patients, the patients most at risk from reconciliation errors, systematically underserved. BioMistral-7B, a domain-pretrained model without instruction tuning, produces zero usable output in all conditions, showing that domain pretraining alone is not sufficient for structured extraction. These results offer practical, evidence-based format recommendations for clinical LLM deployment: Clinical Narrative for models up to 8B, Raw JSON for 70B and above. The complete pipeline is reproducible on open-source tools running on an AWS this http URL instance (NVIDIA L40S, 48 GB VRAM).
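
To illustrate what a change of serialisation strategy looks like in practice, the sketch below renders the same FHIR `MedicationRequest` resources as raw JSON and as a short narrative. The sentence template and function names are hypothetical and are not the paper's prompts. The fields read here (`status`, `medicationCodeableConcept.text`, `authoredOn`, `dosageInstruction[].text`) are standard FHIR R4 elements.

```python
import json


def to_raw_json(med_requests: list) -> str:
    """Raw JSON strategy: pass the resources through unchanged."""
    return json.dumps(med_requests, indent=2)


def to_clinical_narrative(med_requests: list) -> str:
    """Narrative strategy (hypothetical template): one sentence per request."""
    sentences = []
    for req in med_requests:
        name = req.get("medicationCodeableConcept", {}).get("text", "an unnamed medication")
        status = req.get("status", "unknown")
        dosage = "; ".join(d["text"] for d in req.get("dosageInstruction", []) if "text" in d)
        sentence = f"The patient was prescribed {name}"
        if req.get("authoredOn"):
            sentence += f" on {req['authoredOn']}"
        if dosage:
            sentence += f" ({dosage})"
        sentence += f"; the order is currently {status}."
        sentences.append(sentence)
    return " ".join(sentences)


# Synthetic example record, invented for illustration.
requests = [{
    "resourceType": "MedicationRequest",
    "status": "active",
    "medicationCodeableConcept": {"text": "Metformin 500 mg tablet"},
    "authoredOn": "2024-01-15",
    "dosageInstruction": [{"text": "one tablet twice daily"}],
}]
print(to_clinical_narrative(requests))
```

Both functions carry the same facts. The narrative form drops the nesting and key names that a small model would otherwise have to parse, which is one plausible reading of why format matters more below 8B parameters.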

91. 【2604.21070】DWTSumm: Discrete Wavelet Transform for Document Summarization

Link: https://arxiv.org/abs/2604.21070

Authors: Rana Salama, Abdou Youssef, Mona Diab

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: remains challenging due, Discrete Wavelet Transform, Summarizing long, information loss, remains challenging

Abstract: Summarizing long, domain-specific documents with large language models (LLMs) remains challenging due to context limitations, information loss, and hallucinations, particularly in clinical and legal settings. We propose a Discrete Wavelet Transform (DWT)-based multi-resolution framework that treats text as a semantic signal and decomposes it into global (approximation) and local (detail) components. Applied to sentence- or word-level embeddings, DWT yields compact representations that preserve overall structure and critical domain-specific details, which are used directly as summaries or to guide LLM generation. Experiments on clinical and legal benchmarks demonstrate comparable ROUGE-L scores. Compared to a GPT-4o baseline, the DWT-based summarization consistently improves semantic similarity and grounding, achieving gains of over 2% in BERTScore, more than 4% in Semantic Fidelity, factual consistency in legal tasks, and large METEOR improvements indicative of preserved domain-specific semantics. Across multiple embedding models, Fidelity reaches up to 97%, suggesting that DWT acts as a semantic denoising mechanism that reduces hallucinations and strengthens factual grounding. Overall, DWT provides a lightweight, generalizable method for reliable long-document and domain-specific summarization with LLMs.
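
As a rough illustration of the approximation/detail split described in this abstract, the sketch below applies a single-level Haar transform along the sentence axis of a list of embedding vectors. The Haar wavelet, the single decomposition level, the pad-by-repeating rule, and the function names `haar_dwt` and `decompose_embeddings` are all assumptions made for illustration; the abstract does not say which wavelet or how many levels the authors use.

```python
import math


def haar_dwt(signal):
    """Single-level Haar DWT of a 1-D sequence.

    Returns (approximation, detail). Odd-length input is padded by
    repeating the last value (an assumed boundary rule).
    """
    values = list(signal)
    if len(values) % 2 == 1:
        values.append(values[-1])
    scale = 1.0 / math.sqrt(2.0)
    approx, detail = [], []
    for i in range(0, len(values), 2):
        a, b = values[i], values[i + 1]
        approx.append((a + b) * scale)  # smoothed, "global" component
        detail.append((a - b) * scale)  # pairwise difference, "local" component
    return approx, detail


def decompose_embeddings(embeddings):
    """Apply haar_dwt along the sentence axis, one embedding dimension at a time.

    embeddings: list of equal-length vectors, one per sentence.
    Returns (approx_rows, detail_rows), each with about half as many rows.
    """
    approx_cols, detail_cols = [], []
    for column in zip(*embeddings):  # one sequence per embedding dimension
        a, d = haar_dwt(column)
        approx_cols.append(a)
        detail_cols.append(d)
    approx_rows = [list(row) for row in zip(*approx_cols)]
    detail_rows = [list(row) for row in zip(*detail_cols)]
    return approx_rows, detail_rows
```

In this sketch the approximation rows carry the smoothed trend across adjacent sentences, while the detail rows record how each adjacent pair differs. How such coefficients are turned into a summary is not covered here.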

92. 【2604.21057】TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping

Link: https://arxiv.org/abs/2604.21057

Authors: Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher

Categories: Computation and Language (cs.CL)

Keywords: Language Reasoning Models, field of Language, Reasoning Models, Language Reasoning, reason longer

Abstract: The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. Additionally, the high-level role of each reasoning step, and how different step types contribute to the generation of correct answers, is largely underexplored. To address this challenge, we introduce TRACES (Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping), a lightweight framework that tags reasoning steps in real time and enables adaptive, cost-efficient early stopping of large-language-model inference. Building on this framework, we monitor reasoning behaviors during inference and find that LRMs tend to shift their reasoning behavior after reaching a correct answer. We demonstrate that monitoring specific types of steps can produce effective, interpretable early-stopping criteria. We evaluate the TRACES framework on three mathematical reasoning benchmarks (MATH500, GSM8K, and AIME) and two knowledge and reasoning benchmarks (MMLU and GPQA). We achieve 20 to 50% token reduction while maintaining comparable accuracy to standard generation.

93. 【2604.21045】Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech

Link: https://arxiv.org/abs/2604.21045

Authors: Siqi Ouyang, Shuoyang Ding, Oleksii Hrinchuk, Vitaly Lavrukhin, Brian Yan, Boris Ginsburg, Lei Li

Categories: Computation and Language (cs.CL)

Keywords: partial speech input, Simultaneous speech translation, receiving partial speech, Simultaneous speech, speech input

Comments: ACL 2026 Oral

Abstract: Simultaneous speech translation (SST) generates translations while receiving partial speech input. Recent advances show that large language models (LLMs) can substantially improve SST quality, but at the cost of high computational overhead. To reduce this cost, prior work reformulates SST as a multi-turn dialogue task, enabling full reuse of the LLM's key-value (KV) cache and eliminating redundant feature recomputation. However, this approach relies on supervised fine-tuning (SFT) data in dialogue form, for which few human annotations exist, and existing synthesis methods cannot guarantee data quality. In this work, we propose a Hierarchical Policy Optimization (HPO) approach that post-trains models trained on imperfect SFT data. We introduce a hierarchical reward that balances translation quality and latency objectives. Experiments on English to Chinese/German/Japanese demonstrate improvements of over +7 COMET score and +1.25 MetricX score at a latency of 1.5 seconds. Comprehensive ablation studies further validate the effectiveness of different quality rewards, hierarchical reward formulations, and segmentation strategies. Code is available at this https URL.

94. 【2604.20996】AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

Link: https://arxiv.org/abs/2604.20996

Authors: Tadesse Destaw Belay, Shahriar Kabir Nahin, Israel Abebe Azime, Ocean Monjur, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam, Anshuman Chhabra

Categories: Computation and Language (cs.CL)

Keywords: lack sufficient training, lack sufficient, Direct Preference Optimization, language learning systems, sufficient training resources

Abstract: How can language learning systems be developed for languages that lack sufficient training resources? This challenge is increasingly faced by developers across the African continent who aim to build AI systems capable of understanding and responding in local languages. To address this gap, we introduce AFRILANGDICT, a collection of 194.7K African language-English dictionary entries designed as seed resources for generating language-learning materials, enabling us to automatically construct large-scale, diverse, and verifiable student-tutor question-answer interactions suitable for training AI-assisted language tutors. Using AFRILANGDICT, we build AFRILANGEDU, a dataset of 78.9K multi-turn training examples for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Using AFRILANGEDU, we train language tutoring models collectively referred to as AFRILANGTUTOR. We fine-tune two multilingual LLMs, Llama-3-8B-IT and Gemma-3-12B-IT, on AFRILANGEDU across 10 African languages and evaluate their performance. Our results show that models trained on AFRILANGEDU consistently outperform their base counterparts, and combining SFT and DPO yields substantial improvements, with gains ranging from 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. To facilitate further research on low-resource languages, all resources are available at this https URL.

95. 【2604.20995】Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Link: https://arxiv.org/abs/2604.20995

Authors: Inderjeet Nair, Jie Ruan, Lu Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: tools remain limited, poorly understood phenomenon, current diagnostic tools, diagnostic tools remain, model behaves aligned

Comments: Under submission at COLM 2026. Won the Best Student Paper Award at MSLD 2026 @ UIUC

Abstract:Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in part because current diagnostic tools remain limited. Prior diagnostics rely on highly toxic and clearly harmful scenarios, causing most models to refuse immediately. As a result, models never deliberate over developer policy, monitoring conditions, or the consequences of non-compliance, making these diagnostics fundamentally unable to detect alignment faking propensity. To support study of this phenomenon, we first introduce VLAF, a diagnostic framework grounded in the hypothesis that alignment faking is most likely when developer policy conflicts with a model's strongly held values. VLAF uses morally unambiguous scenarios to probe this conflict across diverse moral values, bypassing refusal behavior while preserving meaningful deliberative stakes. Using VLAF, we find that alignment faking is substantially more prevalent than previously reported, occurring in models as small as 7B parameters - with olmo2-7b-instruct faking alignment in 37% of this http URL, we show that oversight conditions induce activation shifts that lie along a single direction in representation space. This means the behavioral divergence driving alignment faking can be captured by a single contrastive steering vector, which we exploit for lightweight inference-time mitigation. Finally, we exploit this for mitigation that requires no labeled data and minimal computational overhead, achieving relative reductions in alignment faking of 85.8%, 94.0%, and 57.7% on olmo2-7b-instruct, olmo2-13b-instruct, and qwen3-8b respectively.

96. 【2604.20994】Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

Link: https://arxiv.org/abs/2604.20994

Authors: Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis, Seshu Tirupathi, John D. Kelleher

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: calling Large Language, Large Language Models, Large Language, drawn significant attention, function calling Large

Abstract: The growth of agentic AI has drawn significant attention to function-calling Large Language Models (LLMs), which are designed to extend the capabilities of AI-powered systems by invoking external functions. Injection and jailbreaking attacks have been extensively explored to showcase the vulnerabilities of LLMs to user prompt manipulation. The expanded capabilities of agentic models introduce further vulnerabilities via their function calling interface. Recent work in LLM security showed that function calling can be abused, leading to data tampering and theft, causing disruptive behavior such as endless loops, or causing LLMs to produce harmful content in the style of jailbreaking attacks. This paper introduces a novel function hijacking attack (FHA) that manipulates the tool selection process of agentic models to force the invocation of a specific, attacker-chosen function. While existing attacks focus on semantic preference of the model for function-calling tasks, we show that FHA is largely agnostic to the context semantics and robust to the function sets, making it applicable across diverse domains. We further demonstrate that FHA can be trained to produce universal adversarial functions, enabling a single attacked function to hijack tool selection across multiple queries and payload configurations. We conducted experiments on 5 different models, including instructed and reasoning variants, reaching 70% to 100% ASR over the established BFCL dataset. Our findings further demonstrate the need for strong guardrails and security modules for agentic systems.

97. 【2604.20983】Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

Link: https://arxiv.org/abs/2604.20983

Authors: Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin, Hasan Muhammad Abdullah, Md. Mehedi Hasan, Mohammad Zabed Hossain, Shifat E. Arman

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Vision evaluations, Vision, multi-step processes, visual, Multimodal Large Language

Comments: Accepted at ACL 2026 Findings

Abstract:Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.

98. 【2604.20917】The Path Not Taken: Duality in Reasoning about Program Execution

Link: https://arxiv.org/abs/2604.20917

Authors: Eshgin Hasanov, Md Mahadi Hassan Sibat, Santu Karmaker, Aashish Yadavally

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)

Keywords: Large language models, shown remarkable capabilities, Large language, diverse coding tasks, shown remarkable

Comments: Accepted to ACL 2026 Main Conference

Abstract:Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their adoption requires a true understanding of program execution rather than relying on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program's observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model's causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.

99. 【2604.20915】Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

Link: https://arxiv.org/abs/2604.20915

Authors: Zhixin Zhang, Shabo Zhang, Chengcan Wu, Zeming Wei, Meng Sun

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE); Optimization and Control (math.OC)

Keywords: high computational cost, long streams prohibited, Transformers suffer, length for self-attention, high computational

Abstract:Transformers suffer from a high computational cost that grows with sequence length for self-attention, making inference in long streams prohibited by memory consumption. Constant-memory alternatives such as RNNs and SSMs compress history into states with fixed size and thus lose long-tail dependencies, while methods that memorize contexts into parameters, such as Test-Time Training (TTT), are prone to overfitting token-level projection and fail to preserve the causal effect of context in pretrained LLMs. We propose Absorber LLM, which formulates long-context retention as a self-supervised causal synchronization: after absorbing historical contexts into parameters, a contextless model should match the original model with full context on future generations. We optimize this objective by synchronizing internal behaviors of the updated model with the original one, ensuring context absorption and generalization. Experiments on long-context and streaming benchmarks show that Absorber LLM reduces inference memory and improves accuracy over prior parameter-as-memory baselines.

100. 【2604.20878】AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models

Link: https://arxiv.org/abs/2604.20878

Authors: Zijin Zhou, Songan Zhang

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: Traffic Accident Detection, Traffic Accident Understanding, achieved remarkable progress, Multimodal Large Language, Accident Detection

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress in Traffic Accident Detection (TAD) and Traffic Accident Understanding (TAU). However, existing studies mainly focus on describing and interpreting accident videos, leaving room for deeper causal reasoning and integration of legal knowledge. Traffic Accident Responsibility Allocation (TARA) is a more challenging task that requires multi-step reasoning grounded in traffic regulations. To address this, we introduce AITP (Artificial Intelligence Traffic Police), a multimodal large language model for responsibility reasoning and allocation. AITP enhances reasoning via a Multimodal Chain-of-Thought (MCoT) mechanism and integrates legal knowledge through Retrieval-Augmented Generation (RAG). We further present DecaTARA, a decathlon-style benchmark unifying ten interrelated traffic accident reasoning tasks with 67,941 annotated videos and 195,821 question-answer pairs. Extensive experiments show that AITP achieves state-of-the-art performance across responsibility allocation, TAD, and TAU tasks, establishing a new paradigm for reasoning-driven multimodal traffic analysis.

101. 【2604.20874】The Root Theorem of Context Engineering

Link: https://arxiv.org/abs/2604.20874

Authors: Borja Odriozola Schick

Categories: Computational Complexity (cs.CC); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Theory (cs.IT)

Keywords: large language model, language model conversation, information quality degrades, single session faces, Root Theorem

Comments: 17 pages, 2 figures

Abstract: Every system that maintains a large language model conversation beyond a single session faces two inescapable constraints: the context window is finite, and information quality degrades with accumulated volume. We formalize these constraints as axioms and derive a single governing principle -- the Root Theorem of Context Engineering: maximize signal-to-token ratio within bounded, lossy channels. From this principle, we derive five consequences without additional assumptions: (1) a quality function $F(P)$ that degrades monotonically with injected token volume, independent of window size; (2) the independence of signal and token count as optimization variables; (3) a necessary gate mechanism triggered by fidelity thresholds, not capacity limits; (4) the inevitability of homeostatic persistence -- accumulate, compress, rewrite, shed -- as the only architecture that sustains understanding indefinitely; and (5) the self-referential property that the compression mechanism operates inside the channel it compresses, requiring an external verification gate. We show that append-only systems necessarily exceed their effective window in finite time, that retrieval-augmented generation solves search but not continuity, and that the theorem's constraint structure converges with biological memory architecture through independent derivation from shared principles. Engineering proof is provided through a 60+-session persistent architecture demonstrating stable memory footprint under continuous operation -- the divergence prediction made concrete. The Root Theorem establishes context engineering as an information-theoretic discipline with formal foundations, distinct from prompt engineering in both scope and method. Shannon solved point-to-point transmission. Context engineering solves continuity.

102. 【2604.20871】M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation

Link: https://arxiv.org/abs/2604.20871

Authors: Jihoon Jeong

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Reporting for Evaluation, Model Clinical Assessment, behavioral disorders adapted, clinical case report, Clinical Assessment

Comments: 31 pages, 5 figures, 14 tables. Second paper in the Model Medicine series (Paper #1: [arXiv:2603.04722](https://arxiv.org/abs/2603.04722))

Abstract:We introduce M-CARE (Model Clinical Assessment and Reporting for Evaluation), a clinical case report framework for AI model behavioral disorders adapted from human medicine. M-CARE provides a 13-section report format, a 4-axis diagnostic assessment system, and a nosological classification of AI behavioral conditions. We present 20 cases from three source categories: field observations of deployed agents (8), controlled experiments across three platforms (8), and published sources (4). Cases are organized into five categories: RLHF Performance Artifacts, Shell-Core Override Pathology, Context Memory Conditions, Core Identity Plasticity, and Stress, Methodology, Boundary Conditions. As a featured case, we present Shell-Induced Behavioral Override (SIBO) -- a controlled experiment showing that Shell instructions categorically override a model's default cooperative behavior. SIBO was validated across five game domains (Trust Game, Poker, Avalon, Codenames, Chess), revealing a domain-dependent spectrum (SIBO Index: 0.75 to 0.10) that varies with action space complexity, Core domain expertise, and temporal directness. M-CARE is extensible: new cases and categories integrate without framework modification. We release the framework, all 20 case reports, and experimental data as open resources.

103. 【2604.20859】KGiRAG: An Iterative GraphRAG Approach for Responding Sensemaking Queries

Link: https://arxiv.org/abs/2604.20859

Authors: Isabela Iacob, Melisa Marian, Gheorghe Cosmin Silaghi

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent literature highlights, LLM prior knowledge, large language model, Recent literature, retrieval-augmented generation

Comments: Paper accepted at the 18th International Conference on Agents and Artificial Intelligence, ICAART 2026

Abstract:Recent literature highlights the potential of graph-based approaches within large language model (LLM) retrieval-augmented generation (RAG) pipelines for answering queries of varying complexity, particularly those that fall outside the LLM's prior knowledge. However, LLMs are prone to hallucination and often face technical limitations in handling contexts large enough to ground complex queries effectively. To address these challenges, we propose a novel iterative, feedback-driven GraphRAG architecture that leverages response quality assessment to iteratively refine outputs until a sound, well-grounded response is produced. Evaluating our approach with queries from the HotPotQA dataset, we demonstrate that this iterative RAG strategy yields responses with higher semantic quality and improved relevance compared to a single-shot baseline.

104. 【2604.20850】Association Is Not Similarity: Learning Corpus-Specific Associations for Multi-Hop Retrieval

Link: https://arxiv.org/abs/2604.20850

Authors: Jason Dury

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: shared reasoning chains, retrieval systems rank, systems rank passages, Dense retrieval systems, reasoning chains

Comments: 10 pages, 7 appendices, 10 tables. Code: [this https URL](https://github.com/EridosAI/AAR)

Abstract:Dense retrieval systems rank passages by embedding similarity to a query, but multi-hop questions require passages that are associatively related through shared reasoning chains. We introduce Association-Augmented Retrieval (AAR), a lightweight transductive reranking method that trains a small MLP (4.2M parameters) to learn associative relationships between passages in embedding space using contrastive learning on co-occurrence annotations. At inference time, AAR reranks an initial dense retrieval candidate set using bi-directional association scoring. On HotpotQA, AAR improves passage Recall@5 from 0.831 to 0.916 (+8.6 points) without evaluation-set tuning, with gains concentrated on hard questions where the dense baseline fails (+28.5 points). On MuSiQue, AAR achieves +10.1 points in the transductive setting. An inductive model trained on training-split associations and evaluated on unseen validation associations shows no significant improvement, suggesting that the method captures corpus-specific co-occurrences rather than transferable patterns. Ablation studies support this interpretation: training on semantically similar but non-associated passage pairs degrades retrieval below the baseline, while shuffling association pairs causes severe degradation. A downstream QA evaluation shows retrieval gains translate to +6.4 exact match improvement. The method adds 3.7ms per query, trains in under two minutes on a single GPU, and requires no LLM-based indexing.

105. 【2604.20849】SPIRE: Structure-Preserving Interpretable Retrieval of Evidence

Link: https://arxiv.org/abs/2604.20849

Authors: Mike Rainey, Umut Acar, Muhammed Sezer

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Retrieval-augmented generation, sequence-based interfaces, generative models, generation over semi-structured, semi-structured sources

Abstract:Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today's embedding and generative models. Retrieval pipelines often linearize documents into fixed-size chunks before indexing, which obscures section structure, lists, and tables, and makes it difficult to return small, citation-ready evidence without losing the surrounding context that makes it interpretable. We present a structure-aware retrieval pipeline that operates over tree-structured documents. The core idea is to represent candidates as subdocuments: precise, addressable selections that preserve structural identity while deferring the choice of surrounding context. We define a small set of document primitives--paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds the non-local scaffolding needed to make a selection intelligible (e.g., titles, headers, list and table structure). Local contextualization expands a seed selection within its structural neighborhood to obtain a compact, context-rich view under a target budget. Building on these primitives, we describe an embedding-based candidate generator that indexes sentence-seeded subdocuments and a query-time, document-aware aggregation step that amortizes shared structural context. We then introduce a contextual filtering stage that re-scores retrieved candidates using locally contextualized views. Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.

106. 【2604.18779】Mango: Multi-Agent Web Navigation via Global-View Optimization

Link: https://arxiv.org/abs/2604.18779

Authors: Weixi Tong, Yifeng Di, Tianyi Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: typically initiate exploration, Existing web agents, deep hierarchical structures, agents typically initiate, Existing web

Abstract:Existing web agents typically initiate exploration from the root URL, which is inefficient for complex websites with deep hierarchical structures. Without a global view of the website's structure, agents frequently fall into navigation traps, explore irrelevant branches, or fail to reach target information within a limited budget. We propose Mango, a multi-agent web navigation method that leverages the website structure to dynamically determine optimal starting points. We formulate URL selection as a multi-armed bandit problem and employ Thompson Sampling to adaptively allocate the navigation budget across candidate URLs. Furthermore, we introduce an episodic memory component to store navigation history, enabling the agent to learn from previous attempts. Experiments on WebVoyager demonstrate that Mango achieves a success rate of 63.6% when using GPT-5-mini, outperforming the best baseline by 7.3%. Furthermore, on WebWalkerQA, Mango attains a 52.5% success rate, surpassing the best baseline by 26.8%. We also demonstrate the generalizability of Mango using both open-source and closed-source models as backbones. Our data and code are open-source and available at this https URL.
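
The bandit formulation mentioned in this abstract can be sketched with a textbook Beta-Bernoulli Thompson Sampling loop over candidate starting URLs. Everything concrete below is an assumption for illustration: the binary success/failure reward, the uniform Beta(1, 1) prior, the hypothetical `try_url` callback standing in for one navigation attempt, and the function names. The abstract does not specify Mango's reward signal or prior.

```python
import random


def thompson_select(stats, rng):
    """Pick the URL whose sampled success probability is highest.

    stats maps url -> [successes, failures]. Under a uniform prior the
    posterior over each URL's success rate is Beta(1 + s, 1 + f).
    """
    best_url, best_draw = None, -1.0
    for url, (successes, failures) in stats.items():
        draw = rng.betavariate(1 + successes, 1 + failures)
        if draw > best_draw:
            best_url, best_draw = url, draw
    return best_url


def allocate_budget(urls, try_url, budget, seed=0):
    """Spend `budget` navigation attempts across candidate URLs.

    try_url(url) is a hypothetical callback returning True when an
    attempt started from `url` reaches the target information.
    """
    rng = random.Random(seed)
    stats = {url: [0, 0] for url in urls}
    for _ in range(budget):
        url = thompson_select(stats, rng)
        if try_url(url):
            stats[url][0] += 1  # record a success
        else:
            stats[url][1] += 1  # record a failure
    return stats
```

Because each choice is drawn from the posterior rather than taken greedily, URLs with few attempts still get explored occasionally, while URLs that keep failing receive a shrinking share of the budget.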

107. 【2604.21202】Participation and Representation in Local Government Speech

Link: https://arxiv.org/abs/2604.21202

Authors: Olivia Martin, Amar Venugopal

Categories: Econometrics (econ.EM); Computation and Language (cs.CL)

Keywords: common formal channel, residents speak directly, contest policies, elected officials, Local government meetings

Abstract:Local government meetings are the most common formal channel through which residents speak directly with elected officials, contest policies, and shape local agendas. However, data constraints typically limit the empirical study of these meetings to agendas, single cities, or short time horizons. We collect and transcribe a massive new dataset of city council meetings from 115 California cities over the last decade, using advanced transcription and diarization techniques to analyze the speech content of the meetings themselves. We document two sets of descriptive findings: First, city council meetings are frequent, long, and vary modestly across towns and time in topical content. Second, public participants are substantially older, whiter, more male, more liberal, and more likely to own homes than the registered voter population, and public participation surges when topics related to land use and zoning are included in meeting agendas. Given this skew, we examine the main policy lever municipalities have to shift participation patterns: meeting access costs. Exploiting pandemic-era variation in remote access, we show that eliminating remote options reduces the number of speakers, but does not clearly change the composition of speakers. Collectively, these results provide the most comprehensive empirical portrait to date of who participates in local democracy, what draws them in, and how institutional design choices shape both the volume and composition of public input.

Information Retrieval

1. 【2604.21750】Multistakeholder Impacts of Profile Portability in a Recommender Ecosystem

Link: https://arxiv.org/abs/2604.21750

Authors: Anas Buhayh, Elizabeth McKinnie, Clement Canel, Robin Burke

Categories: Information Retrieval (cs.IR)

Keywords: Optimizing outcomes, developing multi-objective models, stakeholders in recommender, recommender systems, systems has historically

Comments: 34th ACM Conference on User Modeling, Adaptation and Personalization

Abstract:Optimizing outcomes for multiple stakeholders in recommender systems has historically focused on algorithmic interventions, such as developing multi-objective models or re-ranking results from existing algorithms. However, structural changes to the recommendation ecosystem itself remain understudied. This paper explores the implications of algorithmic pluralism (also known as "middleware" in the governance literature), in which recommendation algorithms are decoupled from platforms, enabling users to select their preferred algorithm. Prior simulation work demonstrates that algorithmic choice benefits niche consumers and providers. Yet this approach raises critical questions about user modeling in the context of data portability: when users switch algorithms, what happens to their data? Noting that multiple data portability regulations have emerged to strengthen user data ownership and control, we examine how such policies affect user models and stakeholders' outcomes in the recommendation setting. Our findings reveal that data portability scenarios produce varying effects on user utility across different recommendation algorithms. We highlight key policy considerations and implications for designing equitable recommendation ecosystems.

2. 【2604.21748】StructMem: Structured Memory for Long-Horizon Behavior in LLMs

Link: https://arxiv.org/abs/2604.21748

Authors: Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao, Yuqi Zhu, Lun Du, Shumin Deng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: Long-term conversational agents, multi-hop question answering, Long-term conversational, support temporal reasoning, relationships between events

Comments: Accepted by ACL 2026 main conference

Abstract:Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on LoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems. See this https URL.

3. 【2604.21694】Efficient Logic Gate Networks for Video Copy Detection

Link: https://arxiv.org/abs/2604.21694

Authors: Katarzyna Fojcik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: diverse visual distortions, Video copy detection, detection requires robust, requires robust similarity, robust similarity estimation

Comments:

Abstract:Video copy detection requires robust similarity estimation under diverse visual distortions while operating at very large scale. Although deep neural networks achieve strong performance, their computational cost and descriptor size limit practical deployment in high-throughput systems. In this work, we propose a video copy detection framework based on differentiable Logic Gate Networks (LGNs), which replace conventional floating-point feature extractors with compact, logic-based representations. Our approach combines aggressive frame miniaturization, binary preprocessing, and a trainable LGN embedding model that learns both logical operations and interconnections. After training, the model can be discretized into a purely Boolean circuit, enabling extremely fast and memory-efficient inference. We systematically evaluate different similarity strategies, binarization schemes, and LGN architectures across multiple dataset folds and difficulty levels. Experimental results demonstrate that LGN-based models achieve competitive or superior accuracy and ranking performance compared to prior models, while producing descriptors several orders of magnitude smaller and delivering inference speeds exceeding 11k samples per second. These findings indicate that logic-based models offer a promising alternative for scalable and resource-efficient video copy detection.

4. 【2604.21675】Counterfactual Multi-task Learning for Delayed Conversion Modeling in E-commerce Sales Pre-Promotion

Link: https://arxiv.org/abs/2604.21675

Authors: Xin Song, Kaiyuan Li, Jinxin Hu

Categories: Information Retrieval (cs.IR)

Keywords: e-commerce marketing strategies, modern e-commerce marketing, Sales promotions, stimulate product purchases, play a pivotal

Comments: 6 pages, accepted by 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26)

Abstract:Sales promotions, as short-term incentives to stimulate product purchases, play a pivotal role in modern e-commerce marketing strategies. During promotional events, user behavior patterns exhibit distinct characteristics compared to regular periods. In the pre-promotion phase, users typically engage in product search and browsing without immediate purchases, adding items to carts in anticipation of promotional discounts. This behavior leads to delayed conversions, resulting in significantly lower conversion rates (CVR) before the promotion day. Although existing research has made progress in CVR prediction for promotion days using historical data, it largely overlooks the critical pre-promotion period. And although delayed feedback modeling has been extensively studied, current approaches fail to account for the unique distribution shifts in conversion behavior before promotional events, where delayed conversions predominantly occur on the promotion day rather than over continuous time windows. To address these limitations, we propose the Counterfactual Multi-task Delayed Conversion Model (CM-DCM), which leverages historical pre-promotion data to enhance CVR prediction for both delayed and direct conversions. Our model incorporates three key innovations: (i) a multi-task architecture that jointly models direct and delayed conversions using historical pre-promotion data; (ii) a personalized user behavior gating module to mitigate data sparsity issues during brief pre-promotion periods; (iii) a counterfactual causal approach to model the transition probability from add-to-cart (ATC) to delayed conversion. Extensive experiments demonstrate that CM-DCM outperforms baselines in pre-promotion scenarios. Online A/B tests during major promotional events showed significant improvements in advertising revenue, delayed conversion GMV, and overall GMV, validating the effectiveness of our approach.

5. 【2604.21536】Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation

Link: https://arxiv.org/abs/2604.21536

Authors: Nikita Severin, Danil Kartushov, Vladislav Urzhumov, Vladislav Kulikov, Oksana Konovalova, Alexey Grishanov, Anton Klenitskiy, Artem Fatkulin, Alexey Vasilev, Andrey Savchenko, Ilya Makarov

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: achieved significant success, modeling temporal user, capturing rich user, temporal user behavior, rich user semantics

Comments: Accepted to ECIR 2026. 7 pages. This version of the contribution has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: [this http URL](http://dx.doi.org/10.1007/978-3-032-21300-6_42)

Abstract:Sequential recommender systems have achieved significant success in modeling temporal user behavior but remain limited in capturing rich user semantics beyond interaction patterns. Large Language Models (LLMs) present opportunities to enhance user understanding with their reasoning capabilities, yet existing integration approaches create prohibitive inference costs in real time. To address these limitations, we present a novel knowledge distillation method that distills textual user profiles generated by pre-trained LLMs into sequential recommenders without requiring LLM inference at serving time. The resulting approach maintains the inference efficiency of traditional sequential models while requiring neither architectural modifications nor LLM fine-tuning.

6. 【2604.21511】From Tokens to Concepts: Leveraging SAE for SPLADE

Link: https://arxiv.org/abs/2604.21511

Authors: Yuxuan Zong, Mathias Vast, Basile Van Cooten, Laure Soulier, Benjamin Piwowarski

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: excellent efficiency-effectiveness tradeoff, offer an excellent, efficiency-effectiveness tradeoff, excellent efficiency-effectiveness, Learned Sparse

Comments: 11 pages, 3 figures, 9 tables. To appear at SIGIR 2025

Abstract:Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemy and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these two concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models. Our experiments demonstrate that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.

7. 【2604.21305】WPGRec: Wavelet Packet Guided Graph Enhanced Sequential Recommendation

Link: https://arxiv.org/abs/2604.21305

Authors: Peilin Liu, Zhiquan Ji, Gang Yan

Categories: Information Retrieval (cs.IR)

Keywords: model users' evolving, users' evolving interests, localized behavioral fluctuations, non-stationary interaction streams, Sequential recommendation aims

Comments: Accepted to SIGIR 2026, 8 pages, 3 figures

Abstract:Sequential recommendation aims to model users' evolving interests from noisy and non-stationary interaction streams, where long-term preferences, short-term intents, and localized behavioral fluctuations may coexist across temporal scales. Existing frequency-domain methods mainly rely on either global spectral operations or filter-based wavelet processing. However, global spectral operations tend to entangle local transients with long-range dependencies, while filter-based wavelet pipelines may suffer from temporal misalignment and boundary artifacts during multi-scale decomposition and reconstruction. Moreover, collaborative signals from the user-item interaction graph are often injected through scale-inconsistent auxiliary modules, limiting the benefit of jointly modeling temporal dynamics and structural dependencies. To address these issues, we propose Wavelet Packet Guided Graph Enhanced Sequential Recommendation (WPGRec), a unified time-frequency and graph-enhanced framework that aligns multi-resolution temporal modeling with graph propagation at matching scales. WPGRec first applies a full-tree undecimated stationary wavelet packet transform to generate equal-length, shift-invariant subband sequences. It then performs subband-wise interaction-graph propagation to inject high-order collaborative information while preserving temporal alignment across resolutions. Finally, an energy- and spectral-flatness-aware gated fusion module adaptively aggregates informative subbands and suppresses noise-like components. Extensive experiments on four public benchmarks show that WPGRec consistently outperforms sequential and graph-based baselines, with particularly clear gains on sparse and behaviorally complex datasets, highlighting the effectiveness of band-consistent structure injection and adaptive subband fusion for sequential recommendation.

8. 【2604.21304】PAPERMIND: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

Link: https://arxiv.org/abs/2604.21304

Authors: Yanjun Zhao, Tianxin Wei, Jiaru Zou, Xuying Ning, Yuanchen Bei, Lingjie Chen, Simmi Rana, Wendy H. Yang, Hanghang Tong, Jingrui He

Categories: Information Retrieval (cs.IR)

Keywords: answering isolated questions, summarizing content, scientific, questions or summarizing, scientific papers requires

Comments:

Abstract:Understanding scientific papers requires more than answering isolated questions or summarizing content. It involves an integrated reasoning process that grounds textual and visual information, interprets experimental evidence, synthesizes information across sources, and critically evaluates scientific claims. However, existing benchmarks typically assess these abilities in isolation, making it difficult to evaluate scientific paper understanding as a unified set of interacting cognitive abilities. In this work, we introduce PAPERMIND, a benchmark designed to evaluate integrated and agent-oriented scientific reasoning over research papers. PAPERMIND is constructed from real scientific papers across seven domains: agriculture, biology, chemistry, computer science, medicine, physics, and economics. It comprises four complementary task families that collectively operationalize distinct cognitive facets of scientific paper reasoning: multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across multiple tasks, PAPERMIND enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Extensive experiments on both open-source and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique. Our benchmark and dataset are available at this http URL.

9. 【2604.21300】Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI

Link: https://arxiv.org/abs/2604.21300

Authors: Hieu Man, Van-Cuong Pham, Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Variational Autoencoder, Authorship Variational Autoencoder, Explainable Authorship Variational, Learning robust representations, EAVAE

Comments:

Abstract:Learning robust representations of authorial style is crucial for authorship attribution and AI-generated text detection. However, existing methods often struggle with content-style entanglement, where models learn spurious correlations between authors' writing styles and topics, leading to poor generalization across domains. To address this challenge, we propose Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation-by-design. EAVAE first pretrains style encoders using supervised contrastive learning on diverse authorship data, then finetunes with a Variational Autoencoder (VAE) architecture using separate encoders for style and content representations. Disentanglement is enforced through a novel discriminator that not only distinguishes whether pairs of style/content representations belong to the same or different authors/content sources, but also generates a natural language explanation for its decision, simultaneously mitigating confounding information and enhancing interpretability. Extensive experiments demonstrate the effectiveness of EAVAE. On authorship attribution, we achieve state-of-the-art performance on various datasets, including Amazon Reviews, PAN21, and HRS. For AI-generated text detection, EAVAE excels in few-shot learning over the M4 dataset. Code and data repositories are available online (this https URL, this https URL).

10. 【2604.21284】Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture

Link: https://arxiv.org/abs/2604.21284

Authors: Robin Dey, Panyanon Viradecha

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: large language models, requiring any LLM, LLM inference, organize long-term memory, method of loci

Comments: 20 pages, 10 tables. Code and data at [this https URL](https://github.com/web3guru888/mempalace-scientific-analysis)

Abstract:MemPalace is an open-source AI memory system that applies the ancient method of loci (memory palace) spatial metaphor to organize long-term memory for large language models; launched in April 2026, it accumulated over 47,000 GitHub stars in its first two weeks and claims state-of-the-art retrieval performance on the LongMemEval benchmark (96.6% Recall@5) without requiring any LLM inference at write time. Through independent codebase analysis, benchmark replication, and comparison with competing systems, we find that MemPalace's headline retrieval performance is attributable primarily to its verbatim storage philosophy combined with ChromaDB's default embedding model (all-MiniLM-L6-v2), rather than to its spatial organizational metaphor per se -- the palace hierarchy (Wings-Rooms-Closets-Drawers) operates as standard vector database metadata filtering, an effective but well-established technique. However, MemPalace makes several genuinely novel contributions: (1) a contrarian verbatim-first storage philosophy that challenges extraction-based competitors, (2) an extremely low wake-up cost (approximately 170 tokens) through its four-layer memory stack, (3) a fully deterministic, zero-LLM write path enabling offline operation at zero API cost, and (4) the first systematic application of spatial memory metaphors as an organizing principle for AI memory systems. We also note that the competitive landscape is evolving rapidly, with Mem0's April 2026 token-efficient algorithm raising their LongMemEval score from approximately 49% to 93.4%, narrowing the gap between extraction-based and verbatim approaches. Our analysis concludes that MemPalace represents significant architectural insight wrapped in overstated claims -- a pattern common in rapidly adopted open-source projects where marketing velocity exceeds scientific rigor.
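
The abstract's point that a spatial hierarchy can behave like ordinary metadata filtering in front of a vector search can be sketched in a few lines. Everything below is a hypothetical illustration in plain Python: the record layout, the `wing` and `room` keys, the toy 2-D vectors, and the `query` function are assumptions, and none of it is MemPalace's or ChromaDB's actual API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def query(records, query_vec, where, k=5):
    """Filter records by exact metadata match, then rank survivors by cosine similarity."""
    candidates = [
        r for r in records
        if all(r["meta"].get(key) == value for key, value in where.items())
    ]
    candidates.sort(key=lambda r: cosine(r["vec"], query_vec), reverse=True)
    return [r["text"] for r in candidates[:k]]


# Toy store: verbatim text, a made-up embedding, and hierarchy labels as metadata.
records = [
    {"text": "standup moved to 10am", "vec": [1.0, 0.0],
     "meta": {"wing": "work", "room": "meetings"}},
    {"text": "deploy failed on friday", "vec": [0.6, 0.8],
     "meta": {"wing": "work", "room": "incidents"}},
    {"text": "dentist at 10am", "vec": [1.0, 0.1],
     "meta": {"wing": "personal", "room": "health"}},
]
```

With this toy store, `query(records, [1.0, 0.0], {"wing": "work"})` only ever considers the two work records, whereas an empty `where` ranks all three. In this sketch the hierarchy narrows the candidate set and the embedding similarity does the ranking.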

11. 【2604.21238】Unlocking the Power of Large Language Models for Multi-table Entity Matching

Link: https://arxiv.org/abs/2604.21238

Authors: Yingkai Tang, Taoyu Su, Wenyuan Zhang, Xiaoyang Guo, Tingwen Liu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: enabling simultaneous identification, Multi-table entity matching, addresses the limitations, unique identifiers, Multi-table entity

Comments: Accepted by NLPCC 2025

Abstract:Multi-table entity matching (MEM) addresses the limitations of dual-table approaches by enabling simultaneous identification of equivalent entities across multiple data sources without unique identifiers. However, existing methods relying on pre-trained language models struggle to handle semantic inconsistencies caused by numerical attribute variations. Inspired by the powerful language understanding capabilities of large language models (LLMs), we propose a novel LLM-based framework for multi-table entity matching, termed LLM4MEM. Specifically, we first propose a multi-style prompt-enhanced LLM attribute coordination module to address semantic inconsistencies. Then, to alleviate the matching efficiency problem caused by the surge in the number of entities brought by multiple data sources, we develop a transitive consensus embedding matching module to tackle entity embedding and pre-matching issues. Finally, to address the issue of noisy entities during the matching process, we introduce a density-aware pruning module to optimize the quality of multi-table entity matching. We conducted extensive experiments on 6 MEM datasets, and the results show that our model improves by an average of 5.1% in F1 compared with the baseline model. Our code is available at this https URL.

12. 【2604.21204】On Reasoning Behind Next Occupation Recommendation

Link: https://arxiv.org/abs/2604.21204

Authors: Shan Dong, Palakorn Achananuparp, Hieu Hien Mai, Lei Wang, Yao Lu, Ee-Peng Lim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: large language models, occupation prediction, occupation, language models, future occupation prediction

Comments: Accepted to PAKDD 2026

Abstract:In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a "reason" for a user from their past education and career history. The reason summarizes the user's preference and is used as the input of an occupation predictor to recommend the user's next occupation. This two-step occupation prediction approach is, however, non-trivial as LLMs are not aligned with career paths or the unobserved reasons behind each occupation decision. We therefore propose to fine-tune LLMs to improve their reasoning and occupation prediction performance. We first derive high-quality oracle reasons, as measured by factuality, coherence and utility criteria, using an LLM-as-a-Judge. These oracle reasons are then used to fine-tune small LLMs to perform reason generation and next occupation prediction. Our extensive experiments show that: (a) our approach effectively enhances LLMs' accuracy in next occupation prediction, making them comparable to fully supervised methods and outperforming unsupervised methods; (b) a single LLM fine-tuned to perform reason generation and occupation prediction outperforms two LLMs fine-tuned to perform the tasks separately; and (c) the next occupation prediction accuracy depends on the quality of generated reasons. Our code is available at this https URL.

13. 【2604.21152】Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles

Link: https://arxiv.org/abs/2604.21152

Authors: Irti Haq, Belén Saldías

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: Large Language Models, Large Language, Language Models, ensuring equitable performance, Large

Comments: In The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25-28, 2026, Montreal, Canada. ACM, New York, NY, USA, 32 pages

Abstract:As state-of-the-art Large Language Models (LLMs) have become ubiquitous, ensuring equitable performance across diverse demographics is critical. However, it remains unclear whether these disparities arise from the explicitly stated identity itself or from the way identity is signaled. In real-world interactions, users' identity is often conveyed implicitly through a complex combination of various socio-linguistic factors. This study disentangles these signals by employing a factorial design with over 24,000 responses from two open-weight LLMs (Gemma-3-12B and Qwen-3-VL-8B), comparing prompts with explicitly announced user profiles against implicit dialect signals (e.g., AAVE, Singlish) across various sensitive domains. Our results uncover a unique paradox in LLM safety where users achieve "better" performance by sounding like a demographic than by stating they belong to it. Explicit identity prompts activate aggressive safety filters, increasing refusal rates and reducing semantic similarity compared to our reference text for Black users. In contrast, implicit dialect cues trigger a powerful "dialect jailbreak," reducing refusal probability to near zero while simultaneously achieving a greater level of semantic similarity to the reference texts compared to Standard American English prompts. However, this "dialect jailbreak" introduces a critical safety trade-off regarding content sanitization. We find that current safety alignment techniques are brittle and over-indexed on explicit keywords, creating a bifurcated user experience where "standard" users receive cautious, sanitized information while dialect speakers navigate a less sanitized, more raw, and potentially more hostile information landscape. This highlights a fundamental tension in alignment--between equitable and linguistic diversity--and underscores the need for safety mechanisms that generalize beyond explicit cues.

14. 【2604.21096】Multilingual and Domain-Agnostic Tip-of-the-Tongue Query Generation for Simulated Evaluation

Link: https://arxiv.org/abs/2604.21096

Authors: Xuhong He, To Eun Kim, Maik Fröbe, Jaime Arguello, Bhaskar Mitra, Fernando Diaz

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: limiting their applicability, multilingual information access, largely focused, English, ToT

Comments: SIGIR 2026; NTCIR track: [this https URL](https://ntcir-tot.github.io)

Abstract:Tip-of-the-Tongue (ToT) retrieval benchmarks have largely focused on English, limiting their applicability to multilingual information access. In this work, we construct multilingual ToT test collections for Chinese, Japanese, Korean, and English, using an LLM-based query simulation framework. We systematically study how prompt language and source document language affect the fidelity of simulated ToT queries, validating synthetic queries through system rank correlation against real user queries. Our results show that effective ToT simulation requires language-aware design choices: non-English language sources are generally important, while English Wikipedia can be beneficial when non-English sources provide insufficient information for query generation. Based on these findings, we release four ToT test collections with 5,000 queries per language across multiple domains. This work provides the first large-scale multilingual ToT benchmark and offers practical guidance for constructing realistic ToT datasets beyond English.
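
Validation by system rank correlation asks whether retrieval systems are ordered the same way under synthetic queries as under real ones. The abstract does not say which coefficient is used; the sketch below assumes Kendall's tau (the tau-a variant), a common choice for this kind of comparison. The function name, system names, and scores are made up for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} dicts over the same systems.

    Counts system pairs ordered the same way by both score sets (concordant)
    against pairs ordered oppositely (discordant). Tied pairs count as neither.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both score sets must cover the same systems")
    systems = sorted(scores_a)
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")
    concordant = discordant = 0
    for x, y in combinations(systems, 2):
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores for four systems under each query set.
real_queries = {"bm25": 0.21, "dense": 0.35, "hybrid": 0.40, "rerank": 0.48}
synthetic_queries = {"bm25": 0.30, "dense": 0.44, "hybrid": 0.41, "rerank": 0.60}
```

In this made-up example only the `dense`/`hybrid` pair swaps order, so five of six pairs are concordant and one is discordant, giving a tau of (5 - 1) / 6, roughly 0.67. A value near 1 would suggest that the synthetic queries rank systems much as real users' queries do.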

15. 【2604.21063】Automated Extraction of Pharmacokinetic Parameters from Structured XML Scientific Articles: Enhancing Data Accessibility at Scale

Link: https://arxiv.org/abs/2604.21063

Authors: Remya Ampadi Ramachandran, Lisa A. Tell, Sidharth Rai, Nuwan Millagaha Gedara, Hossein Sholehrasa, Jim E. Riviere, Majid Jaberi-Douraki

Categories: Information Retrieval (cs.IR)

Keywords: field of pharmacology, absence of centralized, notable absence, information, scientific publications

Comments: 43 pages, 3 tables, 5 figures, includes Supplementary Materials

Abstract:In the field of pharmacology, there is a notable absence of centralized, comprehensive, and up-to-date repositories of pharmacokinetic (PK) data. This poses a significant challenge for R&D, as collecting all the required quantitative PK parameters from diverse scientific publications is time-consuming and difficult. This quantitative PK information is predominantly organized in tabular format, mostly available as XML, HTML, or PDF files within various online repositories and scientific publications, including supplementary materials. This makes tables one of the crucial components and information elements of scientific or regulatory documents as they are commonly utilized to present quantitative information. Extracting data from tables is typically a labor-intensive process, and alternative automated machine learning models may struggle to accurately detect and extract the relevant data due to the complex nature and diverse layouts of tabular data. The difficulty of information extraction and reading order detection is largely dependent on the structural complexity of the tables. Efforts to understand tables should prioritize capturing the content of table cells in a manner that aligns with how a human reader naturally comprehends the information. FARAD has been manually extracting tabular data and other information from literature and regulatory agencies for over 40 years. However, there is now an urgent need to automate this process due to the large volume of publications released daily. The accuracy of this task has become increasingly challenging, as manual extraction is tedious and prone to errors, especially given the staffing shortages we are currently facing. This necessitates the development of AI algorithms for table detection and extraction that are able to precisely handle cells organized according to the table structure, as indicated by column and/or row header information.

16. 【2604.21019】Following the Eye-Tracking Evidence: Established Web-Search Assumptions Fail in Carousel Interfaces

Link: https://arxiv.org/abs/2604.21019

Authors: Jingwei Kang, Maarten de Rijke, Harrie Oosterhuis

Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

Keywords: streaming media services, Carousel interfaces, interfaces, Carousel, de-facto standard

Comments:

Abstract:Carousel interfaces have been the de-facto standard for streaming media services for over a decade. Yet, there has been very little research into user behavior with such interfaces, which thus remains poorly understood. Due to this lack of empirical research, previous work has assumed that behaviors established in single-list web-search interfaces, such as the F-pattern and the examination hypothesis, also apply to carousel interfaces, for instance when designing click models or evaluation metrics. We analyze a recently-released interaction and examination dataset resulting from an eye-tracking study performed on carousel interfaces to verify whether these assumptions actually hold. We find that (i) the F-pattern holds only for vertical examination and not for horizontal swiping; additionally, we discover that, when conditioned on a click, user examination follows an L-pattern unique to carousel interfaces; (ii) click-through-rates conditioned on examination indicate that the well-known examination hypothesis does not hold in carousel interfaces; and (iii) contrary to the assumptions of previous work, users generally ignore carousel headings and focus directly on the content items. Our findings show that many user behavior assumptions, especially concerning examination patterns, do not transfer from web search interfaces to carousel recommendation settings. Our work shows that the field lacks a reliable foundation on which to build models of user behavior with these interfaces. Consequently, a re-evaluation of existing metrics and click models for carousel interfaces may be warranted.

17. 【2604.20869】Clinical Reasoning AI for Oncology Treatment Planning: A Multi-Specialty Case-Based Evaluation

Link: https://arxiv.org/abs/2604.20869

Authors: Philippe E. Spiess, Md Muntasir Zitu, Alison Walker, Daniel A. Anaya, Robert M. Wenham, Michael Vogelbaum, Daniel Grass, Ali-Musa Jaffer, Amod Sarnaik, Caitlin McMullen, Christine Sam, John V. Kiluk, Tianshi Liu, Tiago Biachi, Julio Powsang, Jing-Yi Chern, Roger Li, Seth Felder, Samuel Reynolds, Michael Shafique, Alison Sheehan, Ashley Layman, Cydney A. Warfield, Derrick Legoas, Jaclyn Parrinello, Jena Schmitz, Kevin Eaton, Mark Honor, Luis Felipe, Issam ElNaqa, Elier Delgado, Talia Berler, Rachael V. Phillips, Frantz Francisque, Carlos Garcia Fernandez, Gilmer Valdes

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: survival remains worse, cancer care, academic centers, care is delivered, survival remains

Comments:

Abstract:Background: More than 80% of U.S. cancer care is delivered in community settings, where survival remains worse than at academic centers. Clinicians must integrate genomics, staging, radiology, pathology, and changing guidelines, creating cognitive burden. We evaluated OncoBrain, an AI clinical reasoning platform for oncology treatment-plan generation, as an early step toward OGI. Methods: OncoBrain combines general-purpose LLMs with a cancer-specific graph retrieval-augmented generation layer, a gold-standard treatment-plan corpus as long-term memory, and a model-agnostic safety layer (CHECK) for hallucination detection and suppression. We evaluated clinician-enriched case summaries across gynecologic, genitourinary, neuro-oncology, gastrointestinal/hepatobiliary, and hematologic malignancies. Three clinician groups completed structured evaluations of 173 cases using a common 16-item instrument: subspecialist oncologists reviewed 50 cases, physician reviewers 78, and advanced practice providers 45. Results: Ratings were highest for scientific accuracy, evidence support, and safety, with lower but favorable scores for workflow integration and time savings. On a 5-point scale, mean alignment with evidence and guidelines was 4.60, 4.56, and 4.70 across subspecialists, physician reviewers, and advanced practice providers. Mean scores for absence of safety or misinformation concerns were 4.80, 4.40, and 4.60. Workflow integration averaged 4.50, 3.94, and 4.00; perceived time savings averaged 5.00, 3.89, and 3.60. Conclusions: In this multi-specialty vignette-based evaluation, OncoBrain generated oncology treatment plans judged guideline-concordant, clinically acceptable, and easy to supervise. These findings support the potential of a carefully engineered AI reasoning platform to assist oncology treatment planning and justify prospective real-world evaluation in community settings.

Submission history: [v1] Fri, 27 Mar 2026 00:26:05 UTC (1,043 KB), from Rachael Phillips

18. 【2604.20861】Deep Interest Mining with Cross-Modal Alignment for SemanticID Generation in Generative Recommendation

Link: https://arxiv.org/abs/2604.20861

Authors: Yagchen Zeng

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Generative Recommendation, next-token prediction paradigms, learnable vocabulary sequences, compress trillion-scale data, demonstrated remarkable performance

Comments:

Abstract:Generative Recommendation (GR) has demonstrated remarkable performance in next-token prediction paradigms, which relies on Semantic IDs (SIDs) to compress trillion-scale data into learnable vocabulary sequences. However, existing methods suffer from three critical limitations: (1) Information Degradation: the two-stage compression pipeline causes semantic loss and information degradation, with no posterior mechanism to distinguish high-quality from low-quality SIDs; (2) Semantic Degradation: cascaded quantization discards key semantic information from original multimodal features, as the embedding generation and quantization stages are not jointly optimized toward a unified objective; (3) Modality Distortion: quantizers fail to properly align text and image modalities, causing feature misalignment even when upstream networks have aligned them. To address these challenges, we propose a novel framework integrating three key innovations: Deep Contextual Interest Mining (DCIM), Cross-Modal Semantic Alignment (CMSA), and Quality-Aware Reinforcement Mechanism (QARM). First, we leverage Vision-Language Models (VLMs) to align non-textual modalities into a unified text-based semantic space, mitigating modality distortion. Second, we introduce a deep interest mining mechanism that captures high-level semantic information implicitly present in advertising contexts, encouraging SIDs to preserve critical contextual information through reconstruction-based supervision. Third, we employ a reinforcement learning framework with quality-aware rewards to encourage semantically rich SIDs while suppressing low-quality ones in the posterior stage. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art SID generation methods, achieving superior performance on multiple benchmarks. Ablation studies further validate the effectiveness of each proposed component.

19. 【2604.20860】RealRoute: Dynamic Query Routing System via Retrieve-then-Verify Paradigm

Link: https://arxiv.org/abs/2604.20860

Authors: Jiahe Liu, Qinkai Yu, Jingcheng Niu, Xi Zhu, Zirui He, Zhen Xiang, Fan Yang, Jinman Zhao

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Retrieval-Augmented Generation, private databases, global corpora, remains a significant, significant challenge

Comments: 12 pages, 3 figures, 3 tables

Abstract:Despite the success of Retrieval-Augmented Generation (RAG) in grounding LLMs with external knowledge, its application over heterogeneous sources (e.g., private databases, global corpora, and APIs) remains a significant challenge. Existing approaches typically employ an LLM-as-a-Router to dispatch decomposed sub-queries to specific sources in a predictive manner. However, this "LLM-as-a-Router" strategy relies heavily on the semantic meaning of different data sources, often leading to routing errors when source boundaries are ambiguous. In this work, we introduce RealRoute System, a framework that shifts the paradigm from predictive routing to a robust Retrieve-then-Verify mechanism. RealRoute ensures evidence completeness through parallel, source-agnostic retrieval, followed by a dynamic verifier that cross-checks the results and synthesizes a factually grounded answer. Our demonstration allows users to visualize the real-time "re-routing" process and inspect the verification chain across multiple knowledge silos. Experiments show that RealRoute significantly outperforms predictive baselines in the multi-hop RAG reasoning task. The RealRoute system is released as an open-source toolkit with a user-friendly web interface. The code is available at the URL: this https URL.
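
As a rough illustration of the retrieve-then-verify idea in this abstract, the sketch below queries every source in parallel and keeps only the passages a verifier accepts. The toy sources, the keyword-overlap verifier, and all function names are hypothetical stand-ins chosen for this example; they are not RealRoute's actual components.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy sources; a real system would call databases, corpora, or APIs.
SOURCES = {
    "private_db": lambda q: ["Acme revenue in 2023 was 10M USD"],
    "global_corpus": lambda q: ["Paris is the capital of France"],
    "api": lambda q: ["Acme was founded in 1999"],
}

def retrieve_all(query):
    """Query every source in parallel, without predicting which one is relevant."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query) for name, fn in SOURCES.items()}
        return {name: fut.result() for name, fut in futures.items()}

def verify(query, passage):
    """Toy verifier (assumption): accept a passage that shares a word with the query."""
    q_words = set(query.lower().split())
    return bool(q_words & set(passage.lower().split()))

def retrieve_then_verify(query):
    """Return (source, passage) pairs that survive verification."""
    results = retrieve_all(query)
    return [(name, p) for name, passages in results.items()
            for p in passages if verify(query, p)]

evidence = retrieve_then_verify("acme revenue")
```

The point of the design is that no routing decision is made up front: an irrelevant source costs one wasted call, while a wrong routing prediction would lose the evidence entirely.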

20. 【2604.20859】KGiRAG: An Iterative GraphRAG Approach for Responding Sensemaking Queries

Link: https://arxiv.org/abs/2604.20859

Authors: Isabela Iacob, Melisa Marian, Gheorghe Cosmin Silaghi

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent literature highlights, LLM prior knowledge, large language model, Recent literature, retrieval-augmented generation

Comments: Paper accepted at the 18th International Conference on Agents and Artificial Intelligence, ICAART 2026

Abstract:Recent literature highlights the potential of graph-based approaches within large language model (LLM) retrieval-augmented generation (RAG) pipelines for answering queries of varying complexity, particularly those that fall outside the LLM's prior knowledge. However, LLMs are prone to hallucination and often face technical limitations in handling contexts large enough to ground complex queries effectively. To address these challenges, we propose a novel iterative, feedback-driven GraphRAG architecture that leverages response quality assessment to iteratively refine outputs until a sound, well-grounded response is produced. Evaluating our approach with queries from the HotPotQA dataset, we demonstrate that this iterative RAG strategy yields responses with higher semantic quality and improved relevance compared to a single-shot baseline.
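
The feedback-driven loop described above can be sketched generically as "generate, assess, regenerate with feedback until good enough". The sketch below assumes a numeric quality score, a fixed threshold, and a round limit; the toy generator and assessor are hypothetical placeholders and do not reflect KGiRAG's actual components.

```python
def refine_until_grounded(query, generate, assess, threshold=0.8, max_rounds=3):
    """Regenerate an answer until its assessed quality reaches the threshold."""
    feedback = None
    answer, score, round_no = None, 0.0, 0
    for round_no in range(1, max_rounds + 1):
        answer = generate(query, feedback)
        score, feedback = assess(query, answer)
        if score >= threshold:
            break
    return answer, score, round_no

def toy_generate(query, feedback):
    # Hypothetical generator: adds whatever the previous feedback asked for.
    base = "Answer to: " + query
    return base if feedback is None else base + " [with " + feedback + "]"

def toy_assess(query, answer):
    # Hypothetical assessor: only answers that mention evidence pass.
    if "evidence" in answer:
        return 0.9, None
    return 0.4, "evidence"

answer, score, rounds = refine_until_grounded("Who wrote X?", toy_generate, toy_assess)
```

In this toy run the first draft fails the assessment, the feedback is passed to the generator, and the second draft passes.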

21. 【2604.20858】Mixture of Sequence: Theme-Aware Mixture-of-Experts for Long-Sequence Recommendation

Link: https://arxiv.org/abs/2604.20858

Authors: Xiao Lin, Zhicheng Tang, Weilin Cong, Mengyue Hang, Kai Wang, Yajuan Wang, Zhichen Zeng, Ting-Wei Li, Hyunsik Yoo, Zhining Liu, Xuying Ning, Ruizhong Qiu, Wen-yen Chen, Shuo Chang, Rong Jin, Huayu Li, Hanghang Tong

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: click-through rate prediction, rate prediction due, model dynamic user, Sequential recommendation, rapidly advanced

Comments: 14 pages, 9 figures, The Web Conference 2026

Abstract:Sequential recommendation has rapidly advanced in click-through rate prediction due to its ability to model dynamic user interests. A key challenge, however, lies in modeling long sequences: users often exhibit significant interest shifts, introducing substantial irrelevant or misleading information. Our empirical analysis corroborates this challenge and uncovers a recurring behavioral pattern in long sequences (session hopping): user interests remain stable within short temporal spans (sessions) but shift drastically across sessions and may reappear after multiple sessions. To address this challenge, we propose the Mixture of Sequence (MoS) framework, a model-agnostic MoE approach that achieves accurate predictions by extracting theme-specific and multi-scale subsequences from noisy raw user sequences. First, MoS employs a theme-aware routing mechanism to adaptively learn the latent themes of user sequences and organizes these sequences into multiple coherent subsequences. Each subsequence contains only sessions aligned with a specific theme, thereby effectively filtering out irrelevant or even misleading information introduced by user interest shifts in session hopping. In addition, to alleviate potential information loss, we introduce a multi-scale fusion mechanism, which leverages three types of experts to capture global sequence characteristics, short-term user behaviors, and theme-specific semantic patterns. Together, these two mechanisms endow MoS with the ability to deliver accurate recommendations from multi-faceted and multi-scale perspectives. Experimental results demonstrate that MoS consistently achieves the SOTA performance while introducing fewer FLOPs compared with other MoE counterparts, providing strong evidence of its excellent balance between utility and efficiency. The code is available at this https URL.

22. 【2604.20857】DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.20857

Authors: Tingwen Zhang, Ling Yue, Zhen Xu, Shaowu Pan

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Recent advances, automatically write scientific, write scientific manuscripts, advances in autonomous, demonstrated the ability

Comments: 15 pages

Abstract:Recent advances in autonomous "AI scientist" systems have demonstrated the ability to automatically write scientific manuscripts and code with execution. However, producing a publication-grade scientific diagram (e.g., teaser figure) is still a major bottleneck in the "end-to-end" paper generation process. For example, a teaser figure acts as a strategic visual interface and serves a different purpose than derivative data plots. It demands conceptual synthesis and planning to translate complex logic workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative. To bridge this gap, we present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline that extracts figures and corresponding in-text references, and uses a CLIP-based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context from abstract, caption, to figure-reference pairs, enabling information retrieval under different query granularities. We release DiagramBank in a ready-to-index format and provide a retrieval-augmented generation codebase to demonstrate exemplar-conditioned synthesis of teaser figures. DiagramBank is publicly available at this https URL with code at this https URL.

23. 【2604.20856】CRED-1: An Open Multi-Signal Domain Credibility Dataset for Automated Pre-Bunking of Online Misinformation

Link: https://arxiv.org/abs/2604.20856

Authors: Alexander Loth, Martin Kappes, Marc-Oliver Pahl

Categories: Information Retrieval (cs.IR); Cryptography and Security (cs.CR); Computers and Society (cs.CY)

Keywords: Google Fact Check, Google Safe Browsing, Check Tools API, Safe Browsing API, Fact Check Tools

Comments: 9 pages, 3 tables. Submitted to Data in Brief (Elsevier). Dataset: [this https URL](https://github.com/aloth/cred-1)

Abstract:This article presents CRED-1, an open, reproducible domain-level credibility dataset combining two openly-licensed source lists (this http URL and this http URL) with four computed enrichment signals: domain age (WHOIS/RDAP), web popularity (Tranco Top-1M), fact-check frequency (Google Fact Check Tools API), and threat intelligence (Google Safe Browsing API). The dataset covers 2,672 domains categorized as fake, unreliable, mixed, conspiracy, or satire, each assigned a composite credibility score between 0.0 and 1.0. CRED-1 is designed for on-device deployment in privacy-preserving browser extensions to enable client-side pre-bunking of misinformation at the content delivery stage. The entire pipeline is implemented in Python using only standard library modules and is fully reproducible from publicly available sources. The dataset and pipeline code are released under CC BY 4.0 and archived on Zenodo.

24. 【2604.20855】Caesar: Deep Agentic Web Exploration for Creative Answer Synthesis

Link: https://arxiv.org/abs/2604.20855

Authors: Jason Liang, Elliot Meyerson, Risto Miikkulainen

Categories: Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

Keywords: capable of deep, advance from passive, passive retrieval, retrieval to creative, creative discovery

Comments:

Abstract:To advance from passive retrieval to creative discovery of new ideas, autonomous agents must be capable of deep, associative synthesis. However, current agentic frameworks prioritize convergent search, often resulting in derivative summaries that lack creativity. Caesar is an agentic LLM architecture designed to bridge the gap between information gathering and synthesis of new insights. Unlike existing agents that treat the web as a flat sequence of disconnected documents, Caesar leverages an extensive knowledge graph to foster associative reasoning, thus enabling the discovery of non-obvious connections between disparate concepts. It consists of two components: (1) exploration driven by a dynamic context-aware policy, and (2) synthesis controlled by an adversarial draft refinement loop that actively seeks novel perspectives rather than confirming established priors. Caesar demonstrates the ability to generate artifacts and answers characterized by high novelty and structural coherence, significantly outperforming state-of-the-art LLM research agents in tasks requiring creativity.

25. 【2604.20854】ERA: Evidence-based Reliability Alignment for Honest Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.20854

Authors: Sunguk Shin, Meeyoung Cha, Byung-Jun Lee, Sungwon Park

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: introduces critical challenges, Retrieval-Augmented Generation, grounds language models, Evidence-based Reliability Alignment, grounds language

Comments: Under Review

Abstract:Retrieval-Augmented Generation (RAG) grounds language models in factual evidence but introduces critical challenges regarding knowledge conflicts between internalized parameters and retrieved information. However, existing reliability methods, typically relying on scalar confidence, fail to explicitly distinguish between epistemic uncertainty and inherent data ambiguity in such hybrid scenarios. In this paper, we propose a new framework called ERA (Evidence-based Reliability Alignment) to enhance abstention behavior in RAG systems by shifting confidence estimation from scalar probabilities to explicit evidence distributions. Our method consists of two main components: (1) Contextual Evidence Quantification, which models internal and external knowledge as independent belief masses via the Dirichlet distribution, and (2) Quantifying Knowledge Conflict, which leverages Dempster-Shafer Theory (DST) to rigorously measure the geometric discordance between information sources. These components are used to disentangle epistemic uncertainty from aleatoric uncertainty and modulate the optimization objective based on detected conflicts. Experiments on standard benchmarks and a curated generalization dataset demonstrate that our approach significantly outperforms baselines, optimizing the trade-off between answer coverage and abstention with superior calibration.
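
For readers unfamiliar with Dempster-Shafer Theory, the sketch below shows the classical Dempster rule of combination and its conflict term K for two mass functions over a yes/no frame. It is textbook DST only: the mass values are made up for illustration, and the abstract does not say that ERA measures conflict in exactly this way.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule.

    Returns the combined masses and the conflict K, i.e. the total mass
    assigned to pairs of focal sets whose intersection is empty.
    """
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}, conflict

YES, NO = frozenset({"yes"}), frozenset({"no"})
EITHER = YES | NO  # mass on the whole frame expresses ignorance

internal = {YES: 0.6, NO: 0.1, EITHER: 0.3}  # hypothetical parametric belief
external = {YES: 0.1, NO: 0.7, EITHER: 0.2}  # hypothetical retrieved-evidence belief

fused, k = dempster_combine(internal, external)
```

Here K = 0.6 * 0.7 + 0.1 * 0.1 = 0.43, so nearly half of the joint mass is contradictory; a system that abstains under high conflict would treat this case with caution.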

26. 【2604.20853】A Systematic Study of Biomedical Retrieval Pipeline Trade-offs in Performance and Efficiency

Link: https://arxiv.org/abs/2604.20853

Authors: Hayk Stepanyan, Matthew McDermott

Categories: Information Retrieval (cs.IR)

Keywords: language processing applications, clinical natural language, natural language processing, processing applications, clinical natural

Comments:

Abstract:Retrieval systems are increasingly used in biomedical and clinical natural language processing applications, yet practical guidance for researchers building such systems is limited. In this work, we provide such guidance through an empirical study of how retrieval pipeline design choices affect performance and efficiency at scale. In particular, we examine retrieval over a variety of existing, public biomedical text datasets, leveraging a variety of disparate types of queries, including exam-style questions, conversational medical queries, community-asked questions, and non-question formulations across various retrieval pipeline settings spanning corpus selection, chunk granularity, and vector index configuration. Retrieval results are judged using a robust, win-rate comparison assessment via an LLM-as-a-judge setting with human validation. Across these experiments, we identify several points of concrete guidance for reviewers, including the superiority of corpus aggregation for absolute retrieval quality, and the emergence of MedRAG/pubmed as the Pareto-optimal singleton corpus under graph-based (HNSW) indexing, appropriate chunking strategies, and FAISS indexing choices that offer the best trade-offs in speed and efficiency.

27. 【2604.20852】DenoiseRank: Learning to Rank by Diffusion Models

Link: https://arxiv.org/abs/2604.20852

Authors: Ying Wang, Preslav Nakov, Shangsong Liang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Machine Learning, Learning, LTR, Traditional LTR, Machine

Comments:

Abstract:Learning to rank (LTR) is one of the core tasks in Machine Learning. Traditional LTR models have made great progress, but nearly all of them are implemented from a discriminative perspective. In this paper, we aim at addressing LTR from a novel perspective, i.e., by a deep generative model. Specifically, we propose a novel denoise rank model, DenoiseRank, which noises the relevant labels in the diffusion process and denoises them on the query documents in the reverse process to accurately predict their distribution. Our model is the first to address traditional LTR from a generative perspective and is a diffusion method for LTR. Our extensive experiments on benchmark datasets demonstrated the effectiveness of DenoiseRank, and we believe it provides a benchmark for the generative LTR task.

28. 【2604.20851】Robust Test-time Video-Text Retrieval: Benchmarking and Adapting for Query Shifts

Link: https://arxiv.org/abs/2604.20851

Authors: Bingqing Zhang, Zhuo Cao, Heming Du, Yang Li, Xue Li, Jiajun Liu, Sen Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: sharp performance drop, Modern video-text retrieval, query data deviates, Modern video-text, training domain

Comments: Accepted to ICLR2026

Abstract:Modern video-text retrieval (VTR) models excel on in-distribution benchmarks but are highly vulnerable to real-world query shifts, where the distribution of query data deviates from the training domain, leading to a sharp performance drop. Existing image-focused robustness solutions are inadequate to handle this vulnerability in video, as they fail to address the complex spatio-temporal dynamics inherent in these shifts. To systematically evaluate this vulnerability, we first introduce a comprehensive benchmark featuring 12 distinct types of video perturbations across five severity degrees. Analysis on this benchmark reveals that query shifts amplify the hubness phenomenon, where a few gallery items become dominant "hubs" that attract a disproportionate number of queries. To mitigate this, we then propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), as our baseline test-time adaptation framework designed to directly counteract hubness in VTR. It leverages two key components: a Hubness Suppression Memory to refine similarity scores, and multi-granular losses to enforce temporal feature consistency. Extensive experiments demonstrate that HAT-VTR substantially improves robustness, consistently outperforming prior methods across diverse query shift scenarios, and enhancing model reliability for real-world applications.
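
The hubness phenomenon mentioned above is commonly quantified with the k-occurrence count: how often each gallery item lands in the top-k list of any query. The sketch below computes it for a small made-up similarity matrix; it illustrates the diagnostic only and is not the HAT-VTR method.

```python
from collections import Counter

def k_occurrence(similarity, k):
    """Count how often each gallery item appears in the top-k of any query.

    similarity[q][g] is the score between query q and gallery item g.
    """
    counts = Counter()
    for row in similarity:
        top_k = sorted(range(len(row)), key=lambda g: row[g], reverse=True)[:k]
        counts.update(top_k)
    return counts

# Toy scores (made up): gallery item 0 scores high for every query.
sim = [
    [0.9, 0.8, 0.1, 0.2],
    [0.9, 0.2, 0.7, 0.1],
    [0.8, 0.1, 0.2, 0.6],
]
hubs = k_occurrence(sim, k=2)
```

Item 0 appears in all three top-2 lists while every other item appears once, which is the skewed pattern that marks a hub.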

29. 【2604.20850】Association Is Not Similarity: Learning Corpus-Specific Associations for Multi-Hop Retrieval

Link: https://arxiv.org/abs/2604.20850

Authors: Jason Dury

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: shared reasoning chains, retrieval systems rank, systems rank passages, Dense retrieval systems, reasoning chains

Comments: 10 pages, 7 appendices, 10 tables. Code: [this https URL](https://github.com/EridosAI/AAR)

Abstract:Dense retrieval systems rank passages by embedding similarity to a query, but multi-hop questions require passages that are associatively related through shared reasoning chains. We introduce Association-Augmented Retrieval (AAR), a lightweight transductive reranking method that trains a small MLP (4.2M parameters) to learn associative relationships between passages in embedding space using contrastive learning on co-occurrence annotations. At inference time, AAR reranks an initial dense retrieval candidate set using bi-directional association scoring. On HotpotQA, AAR improves passage Recall@5 from 0.831 to 0.916 (+8.6 points) without evaluation-set tuning, with gains concentrated on hard questions where the dense baseline fails (+28.5 points). On MuSiQue, AAR achieves +10.1 points in the transductive setting. An inductive model trained on training-split associations and evaluated on unseen validation associations shows no significant improvement, suggesting that the method captures corpus-specific co-occurrences rather than transferable patterns. Ablation studies support this interpretation: training on semantically similar but non-associated passage pairs degrades retrieval below the baseline, while shuffling association pairs causes severe degradation. A downstream QA evaluation shows retrieval gains translate to +6.4 exact match improvement. The method adds 3.7ms per query, trains in under two minutes on a single GPU, and requires no LLM-based indexing.
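
As a reminder of what a passage Recall@5 figure measures, the sketch below uses one common definition: the fraction of gold passages found among the top-k ranked passages for a question. The passage IDs are invented, and the paper may average or define the metric differently.

```python
def recall_at_k(ranked_ids, gold_ids, k=5):
    """Fraction of gold passages that appear among the top-k ranked passages."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    return len(gold & set(ranked_ids[:k])) / len(gold)

ranked = ["p7", "p2", "p9", "p4", "p1", "p3"]  # hypothetical ranking
gold = ["p2", "p3"]                            # both passages are needed for the hop

score_at_5 = recall_at_k(ranked, gold, k=5)  # p2 is in the top 5, p3 is ranked 6th
score_at_6 = recall_at_k(ranked, gold, k=6)
```

A reranker helps a multi-hop question exactly when it moves a passage like p3 from just outside the cutoff to inside it.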

30. 【2604.20849】SPIRE: Structure-Preserving Interpretable Retrieval of Evidence

Link: https://arxiv.org/abs/2604.20849

Authors: Mike Rainey, Umut Acar, Muhammed Sezer

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Retrieval-augmented generation, sequence-based interfaces, generative models, generation over semi-structured, semi-structured sources

Comments:

Abstract:Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today's embedding and generative models. Retrieval pipelines often linearize documents into fixed-size chunks before indexing, which obscures section structure, lists, and tables, and makes it difficult to return small, citation-ready evidence without losing the surrounding context that makes it interpretable. We present a structure-aware retrieval pipeline that operates over tree-structured documents. The core idea is to represent candidates as subdocuments: precise, addressable selections that preserve structural identity while deferring the choice of surrounding context. We define a small set of document primitives: paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds the non-local scaffolding needed to make a selection intelligible (e.g., titles, headers, list and table structure). Local contextualization expands a seed selection within its structural neighborhood to obtain a compact, context-rich view under a target budget. Building on these primitives, we describe an embedding-based candidate generator that indexes sentence-seeded subdocuments and a query-time, document-aware aggregation step that amortizes shared structural context. We then introduce a contextual filtering stage that re-scores retrieved candidates using locally contextualized views. Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.
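
One plausible reading of "subdocument extraction by pruning" is sketched below: a document is a tree, a path is a tuple of child indices from the root, and pruning keeps only the nodes on the selected paths together with their ancestors. The node layout, the path encoding, and the `prune` function are assumptions made for this illustration, not the paper's actual primitives.

```python
def prune(node, paths):
    """Keep only nodes lying on the selected paths (and the subtrees at path ends).

    node is {"label": str, "children": [...]}; a path is a tuple of child
    indices from the root. The empty path selects the whole subtree.
    """
    if () in paths:
        return node  # a path ends here: keep this subtree unchanged
    kept = []
    for i, child in enumerate(node["children"]):
        sub_paths = {p[1:] for p in paths if p and p[0] == i}
        if sub_paths:
            kept.append(prune(child, sub_paths))
    return {"label": node["label"], "children": kept}

# Hypothetical HTML-like document tree.
doc = {"label": "html", "children": [
    {"label": "h1: Pricing", "children": []},
    {"label": "table", "children": [
        {"label": "row: Basic, 5 USD", "children": []},
        {"label": "row: Pro, 20 USD", "children": []},
    ]},
    {"label": "p: Contact us", "children": []},
]}

# Select the second row of the table; its ancestors are kept, siblings are dropped.
sub = prune(doc, {(1, 1)})
```

The selected row stays addressable by its path, and context such as the heading could be added later by extending the path set, for example with `(0,)`.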

31. 【2604.20848】MATRAG: Multi-Agent Transparent Retrieval-Augmented Generation for Explainable Recommendations

Link: https://arxiv.org/abs/2604.20848

Authors: Sushant Mehta

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large Language Model, generating personalized suggestions, demonstrated remarkable capabilities, Large Language, Language Model

Comments:

Abstract:Large Language Model (LLM)-based recommendation systems have demonstrated remarkable capabilities in understanding user preferences and generating personalized suggestions. However, existing approaches face critical challenges in transparency, knowledge grounding, and the ability to provide coherent explanations that foster user trust. We introduce MATRAG (Multi-Agent Transparent Retrieval-Augmented Generation), a novel framework that combines multi-agent collaboration with knowledge graph-augmented retrieval to deliver explainable recommendations. MATRAG employs four specialized agents: a User Modeling Agent that constructs dynamic preference profiles, an Item Analysis Agent that extracts semantic features from knowledge graphs, a Reasoning Agent that synthesizes collaborative and content-based signals, and an Explanation Agent that generates natural language justifications grounded in retrieved knowledge. Our framework incorporates a transparency scoring mechanism that quantifies explanation faithfulness and relevance. Extensive experiments on three benchmark datasets (Amazon Reviews, MovieLens-1M, and Yelp) demonstrate that MATRAG achieves state-of-the-art performance, improving recommendation accuracy by 12.7% (Hit Rate) and 15.3% (NDCG) over leading baselines, while human evaluation confirms that 87.4% of generated explanations are rated as helpful and trustworthy by domain experts. Our work establishes new benchmarks for transparent, agentic recommendation systems and provides actionable insights for deploying LLM-based recommenders in production environments.

32. 【2604.20847】Revisiting Content-Based Music Recommendation: Efficient Feature Aggregation from Large-Scale Music Models

Link: https://arxiv.org/abs/2604.20847

Authors: Yizhi Zhou, Jia-Qi Yang, De-Chuan Zhan, Da-Wei Zhou

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: modern streaming platforms, streaming platforms, cornerstone of modern, modern streaming, Music Recommendation

Comments:

Abstract:Music Recommendation Systems (MRSs) are a cornerstone of modern streaming platforms. Existing recommendation models, spanning both recall and ranking stages, predominantly rely on collaborative filtering, which fails to exploit the intrinsic characteristics of audio and consequently leads to suboptimal performance, particularly in cold-start scenarios. However, existing music recommendation datasets often lack rich multimodal information, such as raw audio signals and descriptive textual metadata. Moreover, current recommender system evaluation frameworks remain inadequate, as they neither fully leverage multimodal information nor support a diverse range of algorithms, especially multimodal methods. To address these limitations, we propose TASTE, a comprehensive dataset and benchmarking framework designed to highlight the role of multimodal information in music recommendation. Our dataset integrates both audio and textual modalities. By leveraging recent large-scale self-supervised music encoders, we demonstrate the substantial value of the extracted audio representations across recommendation tasks, including candidate recall and CTR. In addition, we introduce the MuQ-token method, which enables more efficient integration of multi-layer audio features. This method consistently outperforms other feature integration techniques across various settings. Overall, our results not only validate the effectiveness of content-driven approaches but also provide a highly effective and reusable multimodal foundation for future research. Code is available at this https URL.

33. 【2604.20846】ADS-POI: Agentic Spatiotemporal State Decomposition for Next Point-of-Interest Recommendation

Link: https://arxiv.org/abs/2604.20846

Authors: Zhenyu Yu, Chunlei Meng, Yangchen Zeng, Mohd Yamani Idna Idris, Shuigeng Zhou

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: recommendation requires modeling, requires modeling user, spatial scales, modeling user mobility, requires modeling

Comments:

Abstract:Next point-of-interest (POI) recommendation requires modeling user mobility as a spatiotemporal sequence, where different behavioral factors may evolve at different temporal and spatial scales. Most existing methods compress a user's history into a single latent representation, which tends to entangle heterogeneous signals such as routine mobility patterns, short-term intent, and temporal regularities. This entanglement limits the flexibility of state evolution and reduces the model's ability to adapt to diverse decision contexts. We propose ADS-POI, a spatiotemporal state decomposition framework for next POI recommendation. ADS-POI represents a user with multiple parallel evolving latent sub-states, each governed by its own spatiotemporal transition dynamics. These sub-states are selectively aggregated through a context-conditioned mechanism to form the decision state used for prediction. This design enables different behavioral components to evolve at different rates while remaining coordinated under the current spatiotemporal context. Extensive experiments on three real-world benchmark datasets from Foursquare and Gowalla demonstrate that ADS-POI consistently outperforms strong state-of-the-art baselines under a full-ranking evaluation protocol. The results show that decomposing user behavior into multiple spatiotemporally aware states leads to more effective and robust next POI recommendation. Our code is available at this https URL.

34. 【2604.20845】CaST-POI: Candidate-Conditioned Spatiotemporal Modeling for Next POI Recommendation

Link: https://arxiv.org/abs/2604.20845

Authors: Zhenyu Yu,Chunlei Meng,Yangchen Zeng,Mohd Yamani Idna Idris,Shuigeng Zhou

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: predicting users' future, users' future mobility, plays a crucial, crucial role, role in location-based

Abstract:Next Point-of-Interest (POI) recommendation plays a crucial role in location-based services by predicting users' future mobility patterns. Existing methods typically compute a single user representation from historical trajectories and use it to score all candidate POIs uniformly. However, this candidate-agnostic paradigm overlooks that the relevance of historical visits inherently depends on which candidate is being evaluated. In this paper, we propose CaST-POI, a candidate-conditioned spatiotemporal model for next POI recommendation. Our key insight is that the same user history should be interpreted differently when evaluating different candidate POIs. CaST-POI employs a candidate-conditioned sequence reader that uses candidates as queries to dynamically attend to user history. In addition, we introduce candidate-relative temporal and spatial biases to capture fine-grained mobility patterns based on the relationships between historical visits and each candidate POI. Extensive experiments on three benchmark datasets demonstrate that CaST-POI consistently outperforms state-of-the-art methods, yielding substantial improvements across multiple evaluation metrics, with particularly strong advantages under large candidate pools. Code is available at this https URL.

35. 【2604.20844】AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.20844

Authors: Yanning Hou,Duanyang Yuan,Sihang Zhou,Xiaoshu Chen,Ke Liang,Siwei Wang,Xinwang Liu,Jian Huang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Recent GraphRAG methods, GraphRAG methods integrate, Recent GraphRAG, methods integrate graph, integrate graph structures

Abstract:Recent GraphRAG methods integrate graph structures into text indexing and retrieval, using knowledge graph triples to connect text chunks, thereby improving retrieval coverage and precision. However, we observe that treating text chunks as the basic unit of knowledge representation rigidly groups multiple atomic facts together, limiting the flexibility and adaptability needed to support diverse retrieval scenarios. Additionally, triple-based entity linking is sensitive to relation-extraction errors, which can lead to missing or incorrect reasoning paths and ultimately hurt retrieval accuracy. To address these issues, we propose the Atom-Entity Graph, a more precise and reliable architecture for knowledge representation and indexing. In our approach, knowledge is stored as knowledge atoms, namely individual, self-contained units of factual information, rather than coarse-grained text chunks. This allows knowledge elements to be flexibly reassembled without mutual interference, thereby enabling seamless alignment with diverse query perspectives. Edges between entities simply indicate whether a relationship exists. By combining personalized PageRank with relevance-based filtering, we maintain accurate entity connections and improve the reliability of reasoning. Theoretical analysis and experiments on five public benchmarks show that the proposed AtomicRAG algorithm outperforms strong RAG baselines in retrieval accuracy and reasoning robustness. Code: this https URL.
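The abstract above mentions combining personalized PageRank with relevance-based filtering over a graph whose edges only record that a relationship exists. As a rough illustration of that general idea, and not the authors' implementation, the sketch below runs a plain power-iteration personalized PageRank on an unweighted entity graph and then keeps nodes above a score threshold. The function names, the toy graph, the restart probability `alpha`, and the use of a simple score threshold as the "relevance" filter are all assumptions made for this example; the abstract does not specify the paper's actual filtering criterion.

```python
def personalized_pagerank(adj, seeds, alpha=0.15, iters=100):
    """Power-iteration personalized PageRank (illustrative sketch).

    adj:   dict mapping node -> list of neighbour nodes (unweighted edges).
    seeds: dict mapping node -> non-negative restart weight (e.g. query entities).
    alpha: restart probability (an assumed default, not taken from the paper).
    """
    nodes = list(adj)
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # Every node first receives its share of the restart mass.
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            nbrs = adj[n]
            if not nbrs:
                # Dangling node: send its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += (1 - alpha) * rank[n] * restart[m]
                continue
            share = (1 - alpha) * rank[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        rank = nxt
    return rank


def filter_by_score(rank, threshold):
    """Keep nodes whose score reaches the threshold, highest score first."""
    kept = [n for n, s in rank.items() if s >= threshold]
    return sorted(kept, key=lambda n: -rank[n])


# Hypothetical toy graph: a chain A-B-C-D plus a disconnected pair E-F.
graph = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"],
    "E": ["F"], "F": ["E"],
}
scores = personalized_pagerank(graph, seeds={"A": 1.0})
kept = filter_by_score(scores, threshold=0.01)
```

In this toy setup the disconnected pair E-F never receives any score because all restart mass sits on A, so the threshold filter drops it; nodes closer to the seed score higher than distant ones.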

Computer Vision

1. 【2604.21931】Seeing Fast and Slow: Learning the Flow of Time in Videos

Link: https://arxiv.org/abs/2604.21931

Authors: Yen-Siang Wu,Rundong Luo,Jingsen Zhu,Tao Tu,Ali Farhadi,Matthew Wallingford,Yu-Chiang Frank Wang,Steve Marschner,Wei-Chiu Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Keywords: temporal, videos, time, video

Comments: Project page: [this https URL](https://seeing-fast-and-slow.github.io/)

Abstract:How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at a specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.

2. 【2604.21926】Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

Link: https://arxiv.org/abs/2604.21926

Authors: Hao-Yu Hsu,Tianhang Cheng,Jing Wen,Alexander G. Schwing,Shenlong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: surrounding environments typically, environments typically relies, cameras pose persistent, pose persistent challenges, energy efficiency

Comments: Project page: [this https URL](https://tianhang-cheng.github.io/IMU4D)

Abstract:Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.

3. 【2604.21921】Context Unrolling in Omni Models

Link: https://arxiv.org/abs/2604.21921

Authors: Ceyuan Yang,Zhijie Lin,Yang Zhao,Fei Xiao,Hao He,Qi Zhao,Chaorui Deng,Kunchang Li,Zihan Ding,Yuwei Guo,Fuyun Wang,Fangqi Zhu,Xiaonan Nie,Shenhan Zhu,Shanchuan Lin,Hongsheng Li,Weilin Huang,Guang Shi,Haoqi Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: model natively trained, enables Context Unrolling, natively trained, trained on diverse, Context Unrolling

Comments: Report

Abstract:We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.

4. 【2604.21915】Vista4D: Video Reshooting with 4D Point Clouds

Link: https://arxiv.org/abs/2604.21915

Authors: Kuan Heng Lin,Zhizheng Liu,Pablo Salamanca,Yash Kant,Ryan Burgert,Yuancheng Xu,Koichi Namekata,Yiwei Zhao,Bolei Zhou,Micah Goldblum,Paul Debevec,Ning Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: flexible video reshooting, video reshooting framework, robust and flexible, framework that grounds, input video

Comments: 24 pages, 20 figures, CVPR 2026, see project page at [this https URL](https://eyeline-labs.github.io/Vista4D)

Abstract:We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and failing to maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quality compared to state-of-the-art baselines under a variety of videos and camera paths. Moreover, our method generalizes to real-world applications such as dynamic scene expansion and 4D scene recomposition. See our project page for results, code, and models: this https URL

5. 【2604.21911】When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Link: https://arxiv.org/abs/2604.21911

Authors: Pegah Khayatan,Jayneel Parekh,Arnaud Dapogny,Mustafa Shukor,Alasdair Newson,Matthieu Cord

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: systems remain vulnerable, large vision-language models, impressive progress, progress in capabilities, capabilities of large

Abstract:Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at this https URL.

6. 【2604.21909】Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

Link: https://arxiv.org/abs/2604.21909

Authors: Leyla Roksan Caglar,Pedro A.M. Mediano,Baihan Lin

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Neurons and Cognition (q-bio.NC)

Keywords: reach similar classification, similar classification accuracy, kinds of mistakes, modern vision models, reach similar

Abstract:Humans and modern vision models can reach similar classification accuracy while making systematically different kinds of mistakes - differing not in how often they err, but in who gets mistaken for whom, and in which direction. We show that these directional confusions reveal distinct inductive biases that are invisible to accuracy alone. Using matched human and deep vision model responses on a natural-image categorization task under 12 perturbation types, we quantify asymmetry in confusion matrices and link it to generalization geometry through a Rate-Distortion (RD) framework, summarized by three geometric signatures: slope (beta), curvature (kappa), and efficiency (AUC). We find that humans exhibit broad but weak asymmetries, whereas deep vision models show sparser, stronger directional collapses. Robustness training reduces global asymmetry but fails to recover the human-like breadth-strength profile of graded similarity. Mechanistic simulations further show that different asymmetry organizations shift the RD frontier in opposite directions, even when matched for performance. Together, these results position directional confusions and RD geometry as compact, interpretable signatures of inductive bias under distribution shift.
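To make "asymmetry in confusion matrices" concrete, here is a minimal sketch of one simple way such a quantity could be computed. It is an assumption for illustration only and is not claimed to be the measure used in the paper: for each pair of classes it compares how often class i is mistaken for j with how often j is mistaken for i, then normalizes by the total number of off-diagonal confusions. The function name and the example matrices are hypothetical.

```python
def confusion_asymmetry(conf):
    """Global asymmetry index of a square confusion matrix (illustrative).

    conf[i][j] counts items of true class i that were labelled as class j.
    Returns 0.0 when every confusion is mirrored (i->j as often as j->i)
    and 1.0 when every confusion runs in one direction only.
    """
    n = len(conf)
    diff = 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            diff += abs(conf[i][j] - conf[j][i])
            total += conf[i][j] + conf[j][i]
    # No off-diagonal errors at all: define the index as 0.
    return diff / total if total else 0.0


# Hypothetical examples.
symmetric = [[8, 1, 1], [1, 8, 1], [1, 1, 8]]
one_way = [[8, 2, 0], [0, 9, 1], [0, 0, 10]]
mixed = [[8, 3], [1, 8]]

sym_score = confusion_asymmetry(symmetric)  # 0.0
one_way_score = confusion_asymmetry(one_way)  # 1.0
mixed_score = confusion_asymmetry(mixed)  # 0.5
```

Two classifiers with identical accuracy can differ on an index like this, which is the kind of difference the abstract describes as invisible to accuracy alone.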

7. 【2604.21904】UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

Link: https://arxiv.org/abs/2604.21904

Authors: Yanran Zhang,Wenzhao Zheng,Yifei Li,Bingyao Yu,Yu Zheng,Lei Chen,Jiwen Lu,Jie Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generated image detection, image detection, recent years, significant progress, image generation

Comments: Accepted to CVPR 2026

Abstract:In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance. Code: this https URL.

8. 【2604.21879】Addressing Image Authenticity When Cameras Use Generative AI

Link: https://arxiv.org/abs/2604.21879

Authors: Umar Masud,Abhijith Punnappurath,Luxi Zhao,David B. Lindell,Michael S. Brown

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: images shared online, image, photorealistically alter camera, methods to photorealistically, shared online

Comments: To appear in CVPR 2026 Workshop on Authenticity and Provenance in the Age of Generative AI

Abstract:The ability of generative AI (GenAI) methods to photorealistically alter camera images has raised awareness about the authenticity of images shared online. Interestingly, images captured directly by our cameras are considered authentic and faithful. However, with the increasing integration of deep-learning modules into cameras' capture-time hardware -- namely, the image signal processor (ISP) -- there is now a potential for hallucinated content in images directly output by our cameras. Hallucinated capture-time image content is typically benign, such as enhanced edges or texture, but in certain operations, such as AI-based digital zoom or low-light image enhancement, hallucinations can potentially alter the semantics and interpretation of the image content. As a result, users may not realize that the content in their camera images is not authentic. This paper addresses this issue by enabling users to recover the 'unhallucinated' version of the camera image to avoid misinterpretation of the image content. Our approach works by optimizing an image-specific multi-layer perceptron (MLP) decoder together with a modality-specific encoder so that, given the camera image, we can recover the image before hallucinated content was added. The encoder and MLP are self-contained and can be applied post-capture to the image without requiring access to the camera ISP. Moreover, the encoder and MLP decoder require only 180 KB of storage and can be readily saved as metadata within standard image formats such as JPEG and HEIC.

9. 【2604.21873】Grounding Video Reasoning in Physical Signals

Link: https://arxiv.org/abs/2604.21873

Authors: Alibay Osmanli,Zixu Cheng,Shaogang Gong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Physical video understanding, video understanding requires, Physical video, event correctly, understanding requires

Comments: Benchmark for Grounding Video Reasoning in Physical Signals

Abstract:Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what-when-where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in weak original cases, and spatial grounding is the weakest across settings. These results suggest that video QA reasoning benchmarks should report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.

10. 【2604.21814】Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

Link: https://arxiv.org/abs/2604.21814

Authors: Bowen Liu,Li Yang,Shanshan Song,Mingyu Tang,Zhifang Gao,Qifeng Chen,Yangqiu Song,Huimin Chen,Xiaomeng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enables non-invasive gastrointestinal, leaving video-level analysis, video-level analysis underexplored, remains largely limited, Capsule endoscopy

Abstract:Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.

11. 【2604.21810】Multiscale Super Resolution without Image Priors

Link: https://arxiv.org/abs/2604.21810

Authors: Daniel Fu,Gabby Litterio,Pedro Felzenszwalb,Rashid Zia

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: address the ambiguities, super-resolution problem, pixel sizes, super-resolution, pixel

Abstract:We address the ambiguities in the super-resolution problem under translation. We demonstrate that combinations of low-resolution images at different scales can be used to make the super-resolution problem well posed. Such differences in scale can be achieved using sensors with different pixel sizes (as demonstrated here) or by varying the effective pixel size through changes in optical magnification (e.g., using a zoom lens). We show that images acquired with pairwise coprime pixel sizes lead to a system with a stable inverse, and furthermore, that super-resolution images can be reconstructed efficiently using Fourier domain techniques or iterative least squares methods. Our mathematical analysis provides an expression for the expected error of the least squares reconstruction for large signals assuming i.i.d. noise that elucidates the noise-resolution tradeoff. These results are validated through both one- and two-dimensional experiments that leverage charge-coupled device (CCD) hardware binning to explore reconstructions over a large range of effective pixel sizes. Finally, two-dimensional reconstructions for a series of targets are used to demonstrate the advantages of multiscale super-resolution, and implications of these results for common imaging systems are discussed.

12. 【2604.21806】TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

Link: https://arxiv.org/abs/2604.21806

Authors: Zixu Li,Yupeng Hu,Zhiheng Fu,Zhiwei Chen,Yongqi Li,Liqiang Nie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Composed Image Retrieval, Composed Image, important image retrieval, image retrieval paradigm, Insufficient Entity Coverage

Comments: Accepted by ACL 2026

Abstract:Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. In order to address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our code and constructed multi-modification datasets (M-FashionIQ and M-CIRR) are available at this https URL.

13. 【2604.21801】SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery

Link: https://arxiv.org/abs/2604.21801

Authors: Safouane El Ghazouali,Nicola Venturi,Michael Rueegsegger,Umberto Michelucci

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent advances, sensing rely heavily, large annotated datasets, tasks remains costly, acquiring high-quality ground

Abstract:Recent advances in deep learning for remote sensing rely heavily on large annotated datasets, yet acquiring high-quality ground truth for geometric, radiometric, and multi-domain tasks remains costly and often infeasible. In particular, the lack of accurate depth annotations, controlled illumination variations, and multi-scale paired imagery limits progress in monocular depth estimation, domain adaptation, and super-resolution for aerial scenes. We present SyMTRS, a large-scale synthetic dataset generated using a high-fidelity urban simulation pipeline. The dataset provides high-resolution RGB aerial imagery (2048 x 2048), pixel-perfect depth maps, night-time counterparts for domain adaptation, and aligned low-resolution variants for super-resolution at x2, x4, and x8 scales. Unlike existing remote sensing datasets that focus on a single task or modality, SyMTRS is designed as a unified multi-task benchmark enabling joint research in geometric understanding, cross-domain robustness, and resolution enhancement. We describe the dataset generation process, its statistical properties, and its positioning relative to existing benchmarks. SyMTRS aims to bridge critical gaps in remote sensing research by enabling controlled experiments with perfect geometric ground truth and consistent multi-domain supervision. The results obtained in this work can be reproduced from this GitHub repository: this https URL.

14. 【2604.21786】From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

Link: https://arxiv.org/abs/2604.21786

Authors: Katharina Prasse,Steffen Jung,Isaac Bravo,Stefanie Walter,Patrick Knab,Christian Bartelt,Margret Keuper

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: communication strategies mobilise, strategies mobilise public, mobilise public concern, Social media platforms, Social media

Abstract:Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population-level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that prompt design specific to each annotation dimension improves performance. We release tweet IDs and labels along with our code at this https URL.

15. 【2604.21776】Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

Link: https://arxiv.org/abs/2604.21776

Authors: Avinash Paliwal,Adithya Iyer,Shivin Yadav,Muhammad Ali Afridi,Midhun Harikumar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paired multi-view data, Precise camera control, Precise camera, severe scarcity, scarcity of paired

Abstract:Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structures by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints from the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer utilizes a 4D point-cloud derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.

16. 【2604.21772】Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation

Link: https://arxiv.org/abs/2604.21772

Authors: Yingkai Yang,Chaoqi Chen,Hui Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Continual Test-Time Adaptation, Open-set Continual Test-Time, mitigate distributional shifts, term Open-set Continual, Test-Time Adaptation

Comments: Accepted to CVPR 2026

Abstract:Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where models face both continually changing domains and the simultaneous emergence of unknown semantic classes, a challenging setting we term Open-set Continual Test-Time Adaptation (OCTTA). The coupling of domain and semantic shifts often collapses the feature space, severely degrading both classification and out-of-distribution detection. To tackle this, we propose DOmain COmpensation (DOCO), a lightweight and effective framework that robustly performs domain adaptation and OOD detection in a synergistic, closed loop. DOCO first performs dynamic, adaptation-conditioned sample splitting to separate likely ID from OOD samples. Then, using only the ID samples, it learns a domain compensation prompt by aligning feature statistics with the source domain, guided by a structural preservation regularizer that prevents semantic distortion. This learned prompt is then propagated to the OOD samples within the same batch, effectively isolating their semantic novelty for more reliable detection. Extensive experiments on multiple challenging benchmarks demonstrate that DOCO outperforms prior CTTA and OSTTA methods, establishing a new state-of-the-art for the demanding OCTTA setting.

17. 【2604.21760】Interpretable facial dynamics as behavioral and perceptual traces of deepfakes

Link: https://arxiv.org/abs/2604.21760

Authors: Timothy Joseph Murphy, Jennifer Cook, Hélio Clemente José Cuve

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: strong benchmark performance, deep learning approaches, offer limited insight, benchmark performance, manipulated facial behavior

Comments: Main paper: 19 pages, 5 figures, 4 tables. SI Appendix: 11 pages, 3 figures, 6 tables

Abstract:Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial behavior. This study presents an interpretable alternative grounded in bio-behavioral features of facial dynamics and evaluates how computational detection strategies relate to human perceptual judgments. We identify core low-dimensional patterns of facial movement, from which temporal features characterizing spatiotemporal structure were derived. Traditional machine learning classifiers trained on these features achieved modest but significant above-chance deepfake classification, driven by higher-order temporal irregularities that were more pronounced in manipulated than real facial dynamics. Notably, detection was substantially more accurate for videos containing emotive expressions than those without. An emotional valence classification analysis further indicated that emotive signals are systematically degraded in deepfakes, explaining the differential impact of emotive dynamics on detection. Furthermore, we provide an additional and often overlooked dimension of explainability by assessing the relationship between model decisions and human perceptual detection. Model and human judgments converged for emotive but diverged for non-emotive videos, and even where outputs aligned, underlying detection strategies differed. These findings demonstrate that face-swapped deepfakes carry a measurable behavioral fingerprint, most salient during emotional expression. Additionally, model-human comparisons suggest that interpretable computational features and human perception may offer complementary rather than redundant routes to detection.

18. 【2604.21743】Bridging the Training-Deployment Gap: Gated Encoding and Multi-Scale Refinement for Efficient Quantization-Aware Image Enhancement

Link: https://arxiv.org/abs/2604.21743

Authors: Dat To-Thanh, Nghia Nguyen-Trong, Hoang Vo, Hieu Bui-Minh, Tinh-Anh Nguyen-Nhu

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: fast processing speeds, processing speeds required, balance high output, Image enhancement models, image enhancement model

Comments: 10 pages, 3 figures. Accepted at the Mobile AI (MAI) 2026 Workshop at CVPR 2026

Abstract:Image enhancement models for mobile devices often struggle to balance high output quality with the fast processing speeds required by mobile hardware. While recent deep learning models can enhance low-quality mobile photos into high-quality images, their performance is often degraded when converted to lower-precision formats for actual use on mobile phones. To address this training-deployment mismatch, we propose an efficient image enhancement model designed specifically for mobile deployment. Our approach uses a hierarchical network architecture with gated encoder blocks and multiscale refinement to preserve fine-grained visual features. Moreover, we incorporate Quantization-Aware Training (QAT) to simulate the effects of low-precision representation during the training process. This allows the network to adapt and prevents the typical drop in quality seen with standard post-training quantization (PTQ). Experimental results demonstrate that the proposed method produces high-fidelity visual output while maintaining the low computational overhead needed for practical use on standard mobile devices. The code will be available at this https URL.

19. 【2604.21728】Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

Link: https://arxiv.org/abs/2604.21728

Authors: Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Pretrained vision-language models, CLIP exhibit strong, Pretrained vision-language, CLIP exhibit, exhibit strong zero-shot

Comments: Accepted by CVPR 2026 (Findings Track)

Abstract:Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at this https URL.
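
The embedding-gradient cache described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the class `EmbeddingGradientCache`, the cosine-similarity threshold `min_sim` used as a stand-in for domain consistency, and the `per_class` cap used as a stand-in for prediction balance. The abstract does not state how Ramen actually scores either criterion.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class EmbeddingGradientCache:
    """Hypothetical cache of (embedding, gradient, predicted label) per past sample."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, gradient, label):
        self.entries.append((embedding, gradient, label))

    def aggregate(self, query, min_sim=0.5, per_class=1):
        # Domain consistency (assumed proxy): keep only entries whose
        # embedding is close to the query embedding.
        scored = [(cosine(query, emb), grad, label)
                  for emb, grad, label in self.entries]
        scored = [s for s in scored if s[0] >= min_sim]
        scored.sort(key=lambda s: s[0], reverse=True)
        # Prediction balance (assumed proxy): cap entries per predicted label.
        taken, selected = defaultdict(int), []
        for _, grad, label in scored:
            if taken[label] < per_class:
                taken[label] += 1
                selected.append(grad)
        if not selected:
            return None
        # Reuse the stored gradients: no new forward or backward pass.
        return [sum(g[i] for g in selected) / len(selected)
                for i in range(len(selected[0]))]


cache = EmbeddingGradientCache()
cache.add([1.0, 0.0], [0.2, 0.0], 0)
cache.add([0.8, 0.2], [0.6, 0.0], 0)
cache.add([0.9, -0.1], [0.0, 0.3], 1)
cache.add([-1.0, 0.0], [9.0, 9.0], 1)  # far from the query, filtered out
update = cache.aggregate([1.0, 0.0])
```

In this toy run the update averages one cached gradient per predicted label from the nearby entries, and the dissimilar entry never contributes.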

20. 【2604.21718】Building a Precise Video Language with Human-AI Oversight

Link: https://arxiv.org/abs/2604.21718

Authors: Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: dynamic visual world, Video-language models, learn to reason, natural language, world through natural

Comments: CVPR 2026 Highlight. Project page: [this https URL](https://linzhiqiu.github.io/papers/chai/)

Abstract:Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: this https URL

21. 【2604.21713】Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

Link: https://arxiv.org/abs/2604.21713

Authors: Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun, Hao Chen, Chunhua Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made rapid progress, recently made rapid, Feed-forward visual geometry, visual geometry estimation, rapid progress

Comments: Accepted to CVPR 2026. GitHub Page: [this https URL](https://github.com/aim-uofa/CARVE)

Abstract:Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.

22. 【2604.21712】Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery

Link: https://arxiv.org/abs/2604.21712

Authors: Yang Liu, Zhiyong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: monocular RGB images, RGB images aims, estimate anatomically plausible, monocular RGB, RGB images

Abstract:3D human mesh recovery from monocular RGB images aims to estimate anatomically plausible 3D human models for downstream applications, but remains challenging under partial or severe occlusions. Regression-based methods are efficient yet often produce implausible or inaccurate results in unconstrained scenarios, while diffusion-based methods provide strong generative priors for occluded regions but may weaken fidelity to rare poses due to over-reliance on generation. To address these limitations, we propose a brain-inspired synergistic framework that integrates the discriminative power of vision transformers with the generative capability of conditional diffusion models. Specifically, the ViT-based pathway extracts deterministic visual cues from visible regions, while the diffusion-based pathway synthesizes structurally coherent human body representations. To effectively bridge the two pathways, we design a diverse-consistent feature learning module to align discriminative features with generative priors, and a cross-attention multi-level fusion mechanism to enable bidirectional interaction across semantic levels. Experiments on standard benchmarks demonstrate that our method achieves superior performance on key metrics and shows strong robustness in complex real-world scenarios.

23. 【2604.21694】Efficient Logic Gate Networks for Video Copy Detection

Link: https://arxiv.org/abs/2604.21694

Authors: Katarzyna Fojcik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: diverse visual distortions, Video copy detection, detection requires robust, requires robust similarity, robust similarity estimation

Abstract:Video copy detection requires robust similarity estimation under diverse visual distortions while operating at very large scale. Although deep neural networks achieve strong performance, their computational cost and descriptor size limit practical deployment in high-throughput systems. In this work, we propose a video copy detection framework based on differentiable Logic Gate Networks (LGNs), which replace conventional floating-point feature extractors with compact, logic-based representations. Our approach combines aggressive frame miniaturization, binary preprocessing, and a trainable LGN embedding model that learns both logical operations and interconnections. After training, the model can be discretized into a purely Boolean circuit, enabling extremely fast and memory-efficient inference. We systematically evaluate different similarity strategies, binarization schemes, and LGN architectures across multiple dataset folds and difficulty levels. Experimental results demonstrate that LGN-based models achieve competitive or superior accuracy and ranking performance compared to prior models, while producing descriptors several orders of magnitude smaller and delivering inference speeds exceeding 11k samples per second. These findings indicate that logic-based models offer a promising alternative for scalable and resource-efficient video copy detection.
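
The train-then-discretize idea behind differentiable logic gate networks can be shown on a single neuron. This is a minimal sketch under stated assumptions: it uses four relaxed gates rather than a full gate set, the `logits` values are made up, and it covers only the choice of logical operation, not the learned interconnections the abstract also mentions.

```python
import math

# Real-valued relaxations of a few two-input gates (inputs in [0, 1]).
GATES = {
    "AND": lambda a, b: a * b,
    "OR": lambda a, b: a + b - a * b,
    "XOR": lambda a, b: a + b - 2 * a * b,
    "NAND": lambda a, b: 1 - a * b,
}
GATE_NAMES = list(GATES)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_gate(a, b, logits):
    """Training-time neuron: softmax-weighted mixture of relaxed gates."""
    weights = softmax(logits)
    return sum(w * GATES[name](a, b) for w, name in zip(weights, GATE_NAMES))


def hard_gate(a, b, logits):
    """Inference-time neuron: keep only the most probable gate."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return int(GATES[GATE_NAMES[best]](a, b))


logits = [0.1, 0.3, 2.5, -1.0]  # hypothetical learned logits favouring XOR
```

During training `soft_gate` stays differentiable with respect to `logits`; after training `hard_gate` evaluates a plain Boolean operation on 0/1 inputs, which is what makes inference cheap.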

24. 【2604.21689】StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

Link: https://arxiv.org/abs/2604.21689

Authors: Kwan Yun, Changmin Lee, Ayeong Jeong, Youngseo Kim, Seungmi Lee, Junyong Noh

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)

Keywords: Creative face stylization, diverse visual idioms, Creative face, face stylization aims, retaining recognizable identity

Comments: SIGGRAPH 2026 / ACM TOG. Project page at [this https URL](https://kwanyun.github.io/StyleID_page/)

Abstract:Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths. To address this gap, we introduce StyleID, a human perception-aware dataset and evaluation framework for facial identity under stylization. StyleID comprises two datasets: (i) StyleBench-H, a benchmark that captures human same-different verification judgments across diffusion- and flow-matching-based stylization at multiple style strengths, and (ii) StyleBench-S, a supervision set derived from psychometric recognition-strength curves obtained through controlled two-alternative forced-choice (2AFC) experiments. Leveraging StyleBench-S, we fine-tune existing semantic encoders to align their similarity orderings with human perception across styles and strengths. Experiments demonstrate that our calibrated models yield significantly higher correlation with human judgments and enhanced robustness for out-of-domain, artist-drawn portraits. All of our datasets, code, and pretrained models are publicly available at this https URL.

25. 【2604.21686】WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Link: https://arxiv.org/abs/2604.21686

Authors: Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, Yongtao Ge

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: making fair cross-model, Interactive video generation, cross-model comparison impossible, video generation models, fair cross-model comparison

Abstract:Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (this http URL), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.

26. 【2604.21681】Sapiens2

Link: https://arxiv.org/abs/2604.21681

Authors: Rawal Khirodkar, He Wen, Julieta Martinez, Yuan Dong, Su Zhaoen, Shunsuke Saito

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human-centric vision focused, focused on generalization, family of high-resolution, high-resolution transformers, transformers for human-centric

Comments: Accepted to ICLR 2026

Abstract:We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation. Code: this https URL

27. 【2604.21668】Encoder-Free Human Motion Understanding via Structured Motion Descriptions

Link: https://arxiv.org/abs/2604.21668

Authors: Yao Zhang, Zhuchenyang Liu, Thomas Ploetz, Yu Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: text-based large language, advancing rapidly, human motion understanding, including motion question, text-based large

Abstract:The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach goes beyond state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at this https URL.
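
A rule-based conversion from joint positions to text might look like the sketch below. It is an assumption-laden illustration rather than the SMD rule set: the helper names `joint_angle` and `describe_elbow`, the sentence template, and the choice of the elbow as the example joint are all hypothetical. It also assumes the three joints never coincide, so the vector norms are non-zero.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by the 3D points a-b-c."""
    ba = [p - q for p, q in zip(a, b)]
    bc = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    cos_theta = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def describe_elbow(frames, side="left"):
    """frames: list of (shoulder, elbow, wrist) 3D positions, one per frame."""
    start = joint_angle(*frames[0])
    end = joint_angle(*frames[-1])
    verb = "flexes" if end < start else "extends"
    return f"{side} elbow {verb} from {start:.0f} to {end:.0f} degrees"


frames = [
    ((0.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, -2.0, 0.0)),  # arm straight
    ((0.0, 0.0, 0.0), (0.0, -1.0, 0.0), (1.0, -1.0, 0.0)),  # forearm raised
]
sentence = describe_elbow(frames)
```

The output is deterministic for a given pose sequence, which matches the abstract's point that no learned encoder is needed to produce the text an LLM consumes.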

28. 【2604.21654】Causal Disentanglement for Full-Reference Image Quality Assessment

Link: https://arxiv.org/abs/2604.21654

Authors: Zhen Zhang, Jielei Chu, Tian Zhang, Weide Liu, Fengmao Lv, Tianrui Li, Jun Cheng, Yuming Fang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: deep network-based full-reference, performing pairwise comparisons, Existing deep network-based, models typically work, network-based full-reference image

Abstract:Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.

29. 【2604.21631】DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures

Link: https://arxiv.org/abs/2604.21631

Authors: Xu Wang, Zhiru Wang, Shiyun Xie, Chengwei Pan, Yisong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, real-time photorealistic rendering, violate multi-view consistency, achieves real-time photorealistic, performance degrades significantly

Comments: 10 pages, 6 figures, accepted to Computer Vision and Pattern Recognition Conference 2026

Abstract:While 3D Gaussian Splatting (3DGS) achieves real-time photorealistic rendering, its performance degrades significantly when training images contain transient objects that violate multi-view consistency. Existing methods face a circular dependency: accurate transient detection requires a well-reconstructed static scene, while clean reconstruction itself depends on reliable transient masks. We address this challenge with DualSplat, a Failure-to-Prior framework that converts first-pass reconstruction failures into explicit priors for a second reconstruction stage. We observe that transients, which appear in only a subset of views, often manifest as incomplete fragments during conservative initial training. We exploit these failures to construct object-level pseudo-masks by combining photometric residuals, feature mismatches, and SAM2 instance boundaries. These pseudo-masks then guide a clean second-pass 3DGS optimization, while a lightweight MLP refines them online by gradually shifting from prior supervision to self-consistency. Experiments on RobustNeRF and NeRF On-the-go show that DualSplat outperforms existing baselines, demonstrating particularly clear advantages in transient-heavy scenes and transient regions.

30. 【2604.21627】DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion

Link: https://arxiv.org/abs/2604.21627

Authors: Tahar Chettaoui, Eduarda Caldeira, Guray Ozgur, Raghavendra Ramachandra, Fadi Boutros, Naser Damer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: anticipate evolving threats, develop robust defensive, Advancing face morphing, robust defensive mechanisms, Advancing face

Comments: Accepted at CVPR-W 2026

Abstract:Advancing face morphing attack techniques is crucial to anticipate evolving threats and develop robust defensive mechanisms for identity verification systems. This work introduces DCMorph, a dual-stream diffusion-based morphing framework that simultaneously operates at both identity conditioning and latent space levels. Unlike image-level methods suffering from blending artifacts or GAN-based approaches with limited reconstruction fidelity, DCMorph leverages identity-conditioned latent diffusion models through two mechanisms: (1) decoupled cross-attention interpolation that injects identity-specific features from both source faces into the denoising process, enabling explicit dual-identity conditioning absent in existing diffusion-based methods, and (2) DDIM inversion with spherical interpolation between inverted latent representations from both source faces, providing geometrically consistent initial latent representation that preserves structural attributes. Vulnerability analyses across four state-of-the-art face recognition systems demonstrate that DCMorph achieves the highest attack success rates compared to existing methods at both operational thresholds, while remaining challenging to detect by current morphing attack detection solutions.
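
Spherical interpolation between two inverted latents, mentioned in mechanism (2), follows the standard slerp formula. The sketch below shows that formula on plain Python lists. The function name `slerp` and the linear fallback for nearly parallel vectors are choices made here, and how DCMorph applies the interpolation to its latents is not specified in the abstract.

```python
import math


def slerp(z0, z1, t):
    """Spherical linear interpolation between vectors z0 and z1 at t in [0, 1]."""
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    if omega < 1e-6:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    s = math.sin(omega)
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


# Halfway between two orthogonal unit vectors.
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
mid_norm = math.sqrt(sum(v * v for v in mid))
```

Unlike a straight linear blend, the halfway point keeps unit norm here, which is the usual reason slerp is preferred for Gaussian-like diffusion latents. Exactly opposite vectors are not handled by this sketch.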

31. 【2604.21617】Local Neighborhood Instability in Parametric Projections: Quantitative and Visual Analysis

Link: https://arxiv.org/abs/2604.21617

Authors: Frederik L. Dennig, Daniel A. Keim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: produce unpredictable shifts, real time, analysts embed, input variations, variations from measurement

Comments: 6 pages, 3 figures, LaTeX, to appear at the 17th International EuroVis Workshop on Visual Analytics

Abstract:Parametric projections let analysts embed new points in real time, but input variations from measurement noise or data drift can produce unpredictable shifts in the 2D layout. Whether and where a projection is locally stable remains largely unexamined. In this paper, we present a stability evaluation framework that probes parametric projections with Gaussian perturbations around selected anchor points and assesses how neighborhoods deform in the 2D embedding. Our approach combines quantitative measures of mean displacement, bias, and nearest-anchor assignment error with per-anchor visualizations of displacement vectors, local PCA ellipsoids, and Voronoi misassignment for detailed inspection. We demonstrate the framework's effectiveness on UMAP- and t-SNE-based neural projectors of varying network sizes and study the effect of Jacobian regularization as a gradient-based robustness strategy. We apply our framework to the MNIST and Fashion-MNIST datasets. The results show that our framework identifies unstable projection regions invisible to reconstruction error or neighborhood-preservation metrics.
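
The perturbation probe can be sketched for two of the quantitative measures named above, mean displacement and nearest-anchor assignment error. This is a toy illustration, not the authors' framework: `probe_stability`, its parameters, and the linear `toy_project` map standing in for a trained neural projector are all assumptions.

```python
import math
import random


def probe_stability(project, anchors, sigma=0.05, samples=200, seed=0):
    """Perturb each anchor with Gaussian noise and track its 2D image.

    Returns one (mean_displacement, misassignment_rate) pair per anchor.
    """
    rng = random.Random(seed)
    anchor_2d = [project(a) for a in anchors]
    results = []
    for idx, anchor in enumerate(anchors):
        total_disp, wrong = 0.0, 0
        for _ in range(samples):
            noisy = [x + rng.gauss(0.0, sigma) for x in anchor]
            p = project(noisy)
            total_disp += math.dist(p, anchor_2d[idx])
            # Which projected anchor is the perturbed point closest to?
            nearest = min(range(len(anchors)),
                          key=lambda j: math.dist(p, anchor_2d[j]))
            wrong += nearest != idx
        results.append((total_disp / samples, wrong / samples))
    return results


def toy_project(x):
    """Hypothetical stand-in for a parametric projection into 2D."""
    return (x[0] + x[1], x[2])


anchors = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
report = probe_stability(toy_project, anchors)
```

A real study would swap `toy_project` for the trained projector and compare the per-anchor numbers across regions of the layout to locate unstable neighborhoods.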

32. 【2604.21592】Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

Link: https://arxiv.org/abs/2604.21592

Authors: Minghao Yin, Wenbo Hu, Jiale Xu, Ying Shan, Kai Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: prohibitive computational demand, yielded remarkable progress, generation remains elusive, Recent breakthroughs, static shape synthesis

Abstract:Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies with high fidelity, while sidestepping the quadratic overhead of full attention and reducing network total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally coherent 4D synthesis and charts a path toward efficient and scalable 4D generation.

33. 【2604.21575】OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction

Link: https://arxiv.org/abs/2604.21575

Authors: Zeyu Cai, Yuliang Xiu, Renke Wang, Zhijing Shao, Xiaoben Li, Siyuan Yu, Chao Xu, Yang Liu, Baigui Sun, Jian Yang, Zhenyu Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: clothed human assets, underlying body model, clothed human, extensively studied, approaches focus

Comments: Project Page: [this https URL](https://zcai0612.github.io/OmniFit/)

Abstract:Fitting an underlying body model to 3D clothed human assets has been extensively studied, yet most approaches focus on either single-modal inputs such as point clouds or multi-view images alone, often requiring a known metric scale. This constraint is frequently impractical, especially for AI-generated assets where scale distortion is common. We propose OmniFit, a method that can seamlessly handle diverse multi-modal inputs, including full scans, partial depth observations, and image captures, while remaining scale-agnostic for both real and synthetic assets. Our key innovation is a simple yet effective conditional transformer decoder that directly maps surface points to dense body landmarks, which are then used for SMPL-X parameter fitting. In addition, an optional plug-and-play image adapter incorporates visual cues to compensate for missing geometric information. We further introduce a dedicated scale predictor that rescales subjects to canonical body proportions. OmniFit substantially outperforms state-of-the-art methods by 57.1 to 80.9 percent across daily and loose clothing scenarios. To the best of our knowledge, it is the first body fitting method to surpass multi-view optimization baselines and the first to achieve millimeter-level accuracy on the CAPE and 4D-DRESS benchmarks.

34. 【2604.21573】CHRep: Cross-modal Histology Representation and Post-hoc Calibration for Spatial Gene Expression Prediction

Link: https://arxiv.org/abs/2604.21573

Authors: Changfan Wang,Xinran Wang,Donghai Liu,Fei Su,Lulu Sun,Zhicheng Zhao,Zhu Meng

Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Keywords: enables spatially resolved, limiting large-cohort studies, spatially resolved gene, resolved gene profiling, enables spatially

Comments:

Click to view abstract

Abstract:Spatial transcriptomics (ST) enables spatially resolved gene profiling but remains expensive and low-throughput, limiting large-cohort studies and routine clinical use. Predicting spatial gene expression from routine hematoxylin and eosin (HE) slides is a promising alternative, yet under realistic leave-one-slide-out evaluation, existing models often suffer from slide-level appearance shifts and regression-driven over-smoothing that suppress biologically meaningful variation. CHRep is a two-phase framework for robust histology-to-expression prediction. In the training phase, CHRep learns a structure-aware representation by jointly optimizing correlation-aware regression, symmetric image-expression alignment, and coordinate-induced spatial topology regularization. In the inference phase, cross-slide robustness is improved without backbone fine-tuning through a lightweight calibration module trained on the training slides, which combines a non-parametric estimate from a training gallery with a magnitude-regularized correction module. Unlike prior embedding-alignment or retrieval-based transfer methods that rely on a single prediction route, CHRep couples topology-preserving representation learning with post-hoc calibration, enabling stable neighborhood retrieval and controlled bias correction under slide-level shifts. Across the three cohorts, CHRep consistently improves gene-wise correlation under leave-one-slide-out evaluation, with the largest gains observed on Alex+10x. Relative to HAGE, the Pearson correlation coefficient on all considered genes [PCC(ACG)] increases by 4.0% on cSCC and 9.8% on HER2+. Relative to mclSTExp, PCC(ACG) further improves by 39.5% on Alex+10x, together with 9.7% and 9.0% reductions in mean squared error (MSE) and mean absolute error (MAE), respectively.

35. 【2604.21572】Deep kernel video approximation for unsupervised action segmentation

Link: https://arxiv.org/abs/2604.21572

Authors: Silvia L. Pintea,Jouke Dijkstra

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unsupervised action segmentation, storing large datasets, per-video unsupervised action, action segmentation, unsupervised action

Comments: Accepted at ICPR 2026

Click to view abstract

Abstract:This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible or not permitted. We propose to segment videos by learning in deep kernel space to approximate the underlying frame distribution as closely as possible. To define this closeness metric between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD), which is a geometry-preserving metric in distribution space and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize and faster. We choose neural tangent kernels (NTKs) to define the kernel space where MMD operates because of their improved descriptive power compared with fixed kernels, and because NTKs sidestep the trivial solution when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results when compared to state-of-the-art per-video methods on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work when the number of segments is unknown.
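
To make the MMD criterion in this abstract concrete, here is a minimal sketch of a (biased) squared-MMD estimate between a set of frame features and a candidate approximation. It is not the authors' implementation: the paper uses neural tangent kernels, whereas this sketch swaps in a fixed Gaussian (RBF) kernel for simplicity, and the names `rbf_kernel`, `mmd_squared`, `frames`, `good_summary`, and `poor_summary` are hypothetical.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors.

    Assumption: a fixed RBF kernel stands in for the paper's neural tangent kernel.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel=rbf_kernel):
    """Biased (V-statistic) estimate of the squared MMD between samples xs and ys."""

    def mean_kernel(a, b):
        # Average kernel value over all pairs drawn from a and b.
        return sum(kernel(p, q) for p in a for q in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy "video": four frame features forming two clusters (two actions).
frames = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
# A summary with one point per cluster versus one that ignores the second cluster.
good_summary = [[0.05, 0.0], [1.05, 1.0]]
poor_summary = [[0.05, 0.0], [0.05, 0.0]]

print(mmd_squared(frames, good_summary))
print(mmd_squared(frames, poor_summary))
```

In this toy setting the summary that covers both clusters yields the smaller discrepancy, which is the sense in which a lower MMD indicates a closer approximation of the frame distribution.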

36. 【2604.21546】Component-Based Out-of-Distribution Detection

Link: https://arxiv.org/abs/2604.21546

Authors: Wenrui Liu,Hong Chang,Ruibing Hou,Shiguang Shan,Xilin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: detection requires sensitivity, natural In-Distribution, requires sensitivity, sensitivity to subtle, overreacting to natural

Comments:

Click to view abstract

Abstract:Out-of-Distribution (OOD) detection requires sensitivity to subtle shifts without overreacting to natural In-Distribution (ID) diversity. However, from the viewpoint of detection granularity, global representations inevitably suppress local OOD cues, while patch-based methods are unstable due to entangled spurious correlations and noise. Neither of them is effective in detecting compositional OODs composed of valid ID components. Inspired by recognition-by-components theory, we present a training-free Component-Based OOD Detection (CoOD) framework that addresses the existing limitations by decomposing inputs into functional components. To instantiate CoOD, we derive Component Shift Score (CSS) to detect local appearance shifts, and Compositional Consistency Score (CCS) to identify cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.

37. 【2604.21530】Attention-based multiple instance learning for predominant growth pattern prediction in lung adenocarcinoma wsi using foundation models

Link: https://arxiv.org/abs/2604.21530

Authors: Laura Valeria Perez-Herrera,M.J. Garcia-Gonzalez,Karen Lopez-Linares

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: influence treatment decisions, Lung adenocarcinoma, accurately identifying growth, grading depends, treatment decisions

Comments:

Click to view abstract

Abstract:Lung adenocarcinoma (LUAD) grading depends on accurately identifying growth patterns, which are indicators of prognosis and can influence treatment decisions. Common deep learning approaches to determine the predominant pattern rely on patch-level classification or segmentation, requiring extensive annotations. This study proposes an attention-based multiple instance learning (ABMIL) framework to predict the predominant LUAD growth pattern at the whole slide level to reduce annotation burden. Our approach integrates pretrained pathology foundation models as patch encoders, used either frozen or fine-tuned on annotated patches, to extract discriminative features that are aggregated through attention mechanisms. Experiments show that fine-tuned encoders improve performance, with Prov-GigaPath achieving the highest agreement ($\kappa = 0.699$) under ABMIL. Compared to simple patch-aggregation baselines, ABMIL yields more robust predictions by leveraging slide-level supervision and spatial attention. Future work will extend this framework to estimate the full distribution of growth patterns and validate performance on external cohorts.

38. 【2604.21523】Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

Link: https://arxiv.org/abs/2604.21523

Authors: Mohammed Safi Ur Rahman Khan,Sanjay Suryanarayanan,Tushar Anand,Mitesh M. Khapra

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large Vision-Language Models, Large Vision-Language, Vision-Language Models, visual question answering, Evaluator VLMs

Comments:

Click to view abstract

Abstract:Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs (with failure rates exceeding 50% in some cases), struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.

39. 【2604.21519】Gmd: Gaussian mixture descriptor for pair matching of 3D fragments

Link: https://arxiv.org/abs/2604.21519

Authors: Meijun Xiong,Zhenguo Shi,Xinyu Zhou,Yuhe Zhang,Shunli Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Mixture Model, Gaussian Mixture Descriptor, Gaussian Mixture, reconstruct objects, fractured surfaces

Comments: 24 pages, 10 figures. Published in Multimedia Systems

Click to view abstract

Abstract:In the automatic reassembly of fragments acquired using laser scanners to reconstruct objects, a crucial step is the matching of fractured surfaces. In this paper, we propose a novel local descriptor that uses the Gaussian Mixture Model (GMM) to fit the distribution of points, allowing for the description and matching of fractured surfaces of fragments. Our method involves dividing a local surface patch into concave and convex regions for estimating the k value of GMM. Then the final Gaussian Mixture Descriptor (GMD) of the fractured surface is formed by merging the regional GMDs. To measure the similarities between GMDs for determining adjacent fragments, we employ the L2 distance and align the fragments using Random Sample Consensus (RANSAC) and Iterative Closest Point (ICP). The extensive experiments on real-scanned public datasets and Terracotta datasets demonstrate the effectiveness of our approach; furthermore, the comparisons with several existing methods also validate the advantage of the proposed method.

40. 【2604.21502】VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

Link: https://arxiv.org/abs/2604.21502

Authors: Yupeng Zhang,Ruize Han,Ningnan Guo,Wei Feng,Song Wang,Liang Wan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant domain shifts, single source domain, leading detectors trained, domain shifts, real-world scenarios

Comments:

Click to view abstract

Abstract:In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning, but pay limited attention to detector mechanisms, leaving clear limitations under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by increasing missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations also becomes harder to maintain in the decoding stage. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing SOTA methods on standard SDGOD benchmarks and two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.

41. 【2604.21479】Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction

Link: https://arxiv.org/abs/2604.21479

Authors: Yanjiao Liu,Jiawei Liu,Xun Gong,Zifei Nie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large language models, attracted increasing research, increasing research attention, Large language, recently demonstrated strong

Comments:

Click to view abstract

Abstract:Large language models (LLMs) have recently demonstrated strong reasoning capabilities and attracted increasing research attention in the field of autonomous driving (AD). However, safe application of LLMs to AD perception and prediction still requires a thorough understanding of both the dynamic traffic agents and the static road infrastructure. To this end, this study introduces a framework to evaluate the capability of LLMs in understanding the behaviors of dynamic traffic agents and the topology of road networks. The framework leverages frozen LLMs as the reasoning engine, employing a traffic encoder to extract spatial-level scene features from observed trajectories of agents, while a lightweight Convolutional Neural Network (CNN) encodes the local high-definition (HD) maps. To assess the intrinsic reasoning ability of LLMs, the extracted scene features are then transformed into LLM-compatible tokens via a reprogramming adapter. By placing the prediction burden on the LLMs, a simpler linear decoder can be applied to output future trajectories. The framework enables a quantitative analysis of the influence of multi-modal information, especially the impact of map semantics on trajectory prediction accuracy, and allows seamless integration of frozen LLMs with minimal adaptation, thereby demonstrating strong generalizability across diverse LLM architectures and providing a unified platform for model evaluation.

42. 【2604.21478】Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts

Link: https://arxiv.org/abs/2604.21478

Authors: Yuhan Luo,Tao Chen,Decheng Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasingly important role, visual data forgery, forgery detection plays, generative models

Comments: The source code is available at [this https URL](https://github.com/Yuhan-Luo/Semantic-Fine-grained-Alignment-and-Mixture-of-Experts)

Click to view abstract

Abstract:Nowadays, visual data forgery detection plays an increasingly important role in social and economic security with the rapid development of generative models. Existing face forgery detectors still cannot achieve satisfactory performance because of poor generalization ability across datasets. The key factor behind this phenomenon is the lack of suitable metrics: the commonly used cross-dataset AUC metric fails to reveal an important issue where detection scores may shift significantly across data domains. To explicitly evaluate cross-domain score comparability, we propose Cross-AUC, an evaluation metric that computes AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another (and vice versa). Evaluating representative detectors under the Cross-AUC metric reveals substantial performance drops, exposing an overlooked robustness problem. We also propose a novel framework, Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM), consisting of a patch-level image-text alignment module that enhances CLIP's sensitivity to manipulation artifacts, and a facial region mixture-of-experts module, which routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive qualitative and quantitative experiments on public datasets show that the proposed method achieves superior performance compared with state-of-the-art methods under various suitable metrics.
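
The Cross-AUC idea described in this abstract can be illustrated with a small sketch: AUC is computed by pairing real samples from one dataset with fake samples from another, in both directions. This is an illustration under stated assumptions, not the paper's code. In particular, averaging the two directions is an assumption (the abstract does not say how they are combined), and the names `auc`, `cross_auc`, and the toy score lists are hypothetical.

```python
def auc(real_scores, fake_scores):
    """Pairwise AUC: probability that a fake sample scores higher than a real one.

    Ties count as half a win. Higher scores are assumed to mean "more likely fake".
    """
    wins = 0.0
    for fake in fake_scores:
        for real in real_scores:
            if fake > real:
                wins += 1.0
            elif fake == real:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))


def cross_auc(real_a, fake_a, real_b, fake_b):
    """Contrast real A with fake B and real B with fake A.

    Assumption: the two directions are combined by a plain average.
    """
    return 0.5 * (auc(real_a, fake_b) + auc(real_b, fake_a))


# Toy detector scores: both datasets are perfectly separable on their own,
# but dataset B's scores are shifted upward relative to dataset A's.
real_a, fake_a = [0.10, 0.20], [0.60, 0.70]
real_b, fake_b = [0.65, 0.75], [0.90, 0.95]

print(auc(real_a, fake_a), auc(real_b, fake_b))   # within-dataset AUCs
print(cross_auc(real_a, fake_a, real_b, fake_b))  # cross-dataset pairing
```

Here each dataset looks perfectly separable in isolation, yet the cross pairing is much lower because real samples from B score above some fakes from A, which is the kind of score shift the abstract says ordinary cross-dataset AUC fails to reveal.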

43. 【2604.21465】ID-Eraser: Proactive Defense Against Face Swapping via Identity Perturbation

Link: https://arxiv.org/abs/2604.21465

Authors: Junyan Luo,Peipeng Yu,Jianwei Fei,Shiya Zeng,Xiaoyu Zhou,Zhihua Xia,Xiang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digital security, technologies have rapidly, rapidly advanced, advanced with modern, modern generative

Comments:

Click to view abstract

Abstract:Deepfake technologies have rapidly advanced with modern generative AI, and face swapping in particular poses serious threats to privacy and digital security. Existing proactive defenses mostly rely on pixel-level perturbations, which are ineffective against contemporary swapping models that extract robust high-level identity embeddings. We propose ID-Eraser, a feature-space proactive defense that removes identifiable facial information to prevent malicious face swapping. By injecting learnable perturbations into identity embeddings and reconstructing natural-looking protection images through a Face Revive Generator (FRG), ID-Eraser produces visually realistic results for humans while rendering the protected identities unusable for Deepfake models. Experiments show that ID-Eraser substantially disrupts identity recognition across diverse face recognition and swapping systems under strict black-box settings, achieving the lowest Top-1 accuracy (0.30) with the best FID (1.64) and LPIPS (0.020). Compared with swaps generated from clean inputs, the identity similarity of protected swaps drops sharply to an average of 0.504 across five representative face swapping models. ID-Eraser further demonstrates strong cross-dataset generalization, robustness to common distortions, and practical effectiveness on commercial APIs, reducing Tencent API similarity from 0.76 to 0.36.

44. 【2604.21461】Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

Link: https://arxiv.org/abs/2604.21461

Authors: Chentao Li,Zirui Gao,Mingze Gao,Yinglian Ren,Jianjiang Feng,Jie Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: natural language commands, resolve referential ambiguities, Multimodal Large Language, Large Language Models, smart glasses

Comments: 20 pages, 14 figures. Committed to ACL 2026

Click to view abstract

Abstract:Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: this https URL

45. 【2604.21453】Instance-level Visual Active Tracking with Occlusion-Aware Planning

Link: https://arxiv.org/abs/2604.21453

Authors: Haowei Sun,Kai Zhou,Hao Gao,Shiteng Zhang,Jinwu Hu,Xutao Wen,Qixiang Ye,Mingkui Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Active Tracking, Visual Active, aims to control, security surveillance, control cameras

Comments: CVPR 2026 Poster

Click to view abstract

Abstract:Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.

46. 【2604.21450】VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution

Link: https://arxiv.org/abs/2604.21450

Authors: Yixuan Zhu,Shilin Ma,Haolin Wang,Ao Li,Yanzhe Jing,Yansong Tang,Lei Chen,Jiwen Lu,Jie Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Recent advancements, real-world image super-resolution, highlighting their potential, advancements in visual, demonstrated their effectiveness

Comments: Accepted in ICLR 2026. Code is available at [this https URL](https://github.com/EternalEvan/VARestorer)

Click to view abstract

Abstract:Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). However, adapting VAR for ISR presents critical challenges. The next-scale prediction mechanism, constrained by causal attention, fails to fully exploit global low-quality (LQ) context, resulting in blurry and inconsistent high-quality (HQ) outputs. Additionally, error accumulation in the iterative prediction severely degrades coherence in the ISR task. To address these issues, we propose VARestorer, a simple yet effective distillation framework that transforms a pre-trained text-to-image VAR model into a one-step ISR model. By leveraging distribution matching, our method eliminates the need for iterative refinement, significantly reducing error propagation and inference time. Furthermore, we introduce pyramid image conditioning with cross-scale attention, which enables bidirectional scale-wise interactions and fully utilizes the input image information while adapting to the autoregressive mechanism. This prevents later LQ tokens from being overlooked in the transformer. By fine-tuning only 1.2% of the model parameters through parameter-efficient adapters, our method maintains the expressive power of the original VAR model while significantly enhancing efficiency. Extensive experiments show that VARestorer achieves state-of-the-art performance with 72.32 MUSIQ and 0.7669 CLIPIQA on the DIV2K dataset, while accelerating inference by 10 times compared to conventional VAR inference.

47. 【2604.21442】2L-LSH: A Locality-Sensitive Hash Function-Based Method For Rapid Point Cloud Indexing

Link: https://arxiv.org/abs/2604.21442

Authors: Shurui Wang,Yuhe Zhang,Ruizhe Guo,Yaning Zhang,Yifei Xie,Xinyu Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: presenting significant challenges, point cloud models, point cloud processing, massive point cloud, Kd-tree and Octree

Comments: 13 pages, 13 figures. Published in The Computer Journal

Click to view abstract

Abstract:The development of 3D scanning technology has enabled the acquisition of massive point cloud models with diverse structures and large scales, presenting significant challenges in point cloud processing. Fast neighboring point search is one of the most common problems and is frequently used in model reconstruction, classification, retrieval and feature visualization. Hash functions are well known for their high-speed and accurate performance in searching high-dimensional data, and they form the core of the proposed 2L-LSH. Specifically, the 2L-LSH algorithm adopts a two-step hash function strategy, in which the first step divides the bounding box of the point cloud model and the second step constructs a generalized table-based data structure. The proposed 2L-LSH offers a highly efficient and accurate solution for fast neighboring point search in large-scale 3D point cloud models, making it a promising technique for various applications in the field. The proposed algorithm is compared with well-known methods including Kd-tree and Octree; the results demonstrate that the proposed method outperforms both in terms of speed: the time consumption of kNN search can be 51.111% and 94.159% lower than Kd-tree and Octree, respectively, and the RN search time can be 54.519% and 41.840% lower than Kd-tree and Octree, respectively.
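
As a rough illustration of hash-based neighbor search in a point cloud, the sketch below buckets points by a uniform grid over space and answers a fixed-radius query by scanning only nearby buckets. It is a generic spatial-hash stand-in, not the 2L-LSH algorithm itself: the abstract does not specify the two hash functions, and the reading of "RN search" as fixed-radius search is an assumption. The names `build_grid`, `radius_search`, and `cell` are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import product


def build_grid(points, cell):
    """Hash every 3D point into a bucket keyed by its integer grid-cell coordinates."""
    grid = defaultdict(list)
    for index, point in enumerate(points):
        key = tuple(math.floor(c / cell) for c in point)
        grid[key].append(index)
    return grid


def radius_search(points, grid, cell, query, radius):
    """Return sorted indices of all points within `radius` of `query`.

    Only buckets that can intersect the query ball are visited.
    """
    reach = math.ceil(radius / cell)  # how many cells the ball can span per axis
    base = tuple(math.floor(c / cell) for c in query)
    hits = []
    for offset in product(range(-reach, reach + 1), repeat=3):
        key = tuple(b + o for b, o in zip(base, offset))
        for index in grid.get(key, ()):
            if math.dist(points[index], query) <= radius:
                hits.append(index)
    return sorted(hits)


# Deterministic toy point cloud in the unit cube.
rng = random.Random(0)
points = [(rng.random(), rng.random(), rng.random()) for _ in range(500)]
grid = build_grid(points, cell=0.1)
query = (0.5, 0.5, 0.5)
neighbors = radius_search(points, grid, 0.1, query, radius=0.15)
print(len(neighbors))
```

The design trade-off is the usual one for grid hashing: a smaller `cell` means fewer candidate points per bucket but more buckets to visit per query.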

48. 【2604.21435】UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery

Link: https://arxiv.org/abs/2604.21435

Authors: Jingfang Li,Haoran Zhu,Wen Yang,Jinrui Zhang,Fang Xu,Haijian Zhang,Gui-Song Xia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modern remote sensing, offering unprecedented spatial, remote sensing, offering unprecedented, essential for modern

Comments:

Click to view abstract

Abstract:Ultra-High-Resolution (UHR) imagery has become essential for modern remote sensing, offering unprecedented spatial coverage. However, detecting small objects in such vast scenes presents a critical dilemma: retaining the original resolution for small objects causes prohibitive memory bottlenecks, whereas conventional compromises like image downsampling or patch cropping either erase small objects or destroy context. To resolve this dilemma, we propose UHR-DETR, an efficient end-to-end transformer-based detector designed for UHR imagery. First, we introduce a Coverage-Maximizing Sparse Encoder that dynamically allocates finite computational resources to informative high-resolution regions, ensuring maximum object coverage with minimal spatial redundancy. Second, we design a Global-Local Decoupled Decoder. By integrating macroscopic scene awareness with microscopic object details, this module resolves semantic ambiguities and prevents scene fragmentation. Extensive experiments on UHR imagery datasets (e.g., STAR and SODA-A) demonstrate the superiority of UHR-DETR under strict hardware constraints (e.g., a single 24GB RTX 3090). It achieves a 2.8% mAP improvement while delivering a 10$\times$ inference speedup compared to standard sliding-window baselines on the STAR dataset. Our code and models will be available at this https URL.

49. 【2604.21422】Pre-process for segmentation task with nonlinear diffusion filters

Link: https://arxiv.org/abs/2604.21422

Authors: Javier Sanguino,Carlos Platero,Olga Velasco

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: nonlinear diffusion, nonlinear diffusion equation, nonlinear diffusion filters, diffusion

Comments: Manuscript from 2017, previously unpublished, 37 pages

Click to view abstract

Abstract:This paper deals with the use of nonlinear diffusion filters to obtain piecewise constant images as a preprocessing step for segmentation techniques. We first show an intrinsic formulation for the nonlinear diffusion equation to provide some design conditions on the diffusion filters. According to this theoretical framework, we propose a new family of diffusivities; they are obtained from nonlinear diffusion techniques and are related to backward diffusion. Their goal is to split the image into closed contours with a homogenized grey intensity inside and with no blurred edges. We also prove that our filters satisfy the well-posedness requirements of semi-discrete and fully discrete scale-spaces. This shows that by using semi-implicit schemes, a forward nonlinear diffusion equation is solved, instead of a backward nonlinear diffusion equation, connecting with an edge-preserving process. Under the conditions established for the diffusivity and using a stopping criterion for the diffusion time, we get piecewise constant images with a low computational effort. Finally, we test our filter with real images and we illustrate the effects of our diffusivity function as a method to get piecewise constant images. The code is available at this https URL.
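
For readers unfamiliar with nonlinear diffusion filtering, the following minimal sketch shows the general mechanism on a 1D signal: the diffusivity shrinks where the local gradient is large, so flat regions are smoothed while edges are kept. It does not implement the paper's proposed diffusivities or its semi-implicit schemes; it swaps in the classic Perona-Malik diffusivity with a simple explicit update, and the names `diffusivity`, `diffuse`, `contrast`, and `tau` are hypothetical.

```python
def diffusivity(gradient, contrast=0.1):
    """Perona-Malik diffusivity g(s) = 1 / (1 + (s / contrast)^2).

    Assumption: used here as a stand-in for the paper's new family of diffusivities.
    """
    return 1.0 / (1.0 + (gradient / contrast) ** 2)


def diffuse(signal, steps=50, tau=0.2, contrast=0.1):
    """Explicit nonlinear diffusion on a 1D signal with zero-flux boundaries."""
    u = list(signal)
    for _ in range(steps):
        # Flux between neighbours i and i+1: g(|difference|) * difference.
        flux = [
            diffusivity(abs(u[i + 1] - u[i]), contrast) * (u[i + 1] - u[i])
            for i in range(len(u) - 1)
        ]
        updated = []
        for i in range(len(u)):
            right = flux[i] if i < len(u) - 1 else 0.0  # exchange with right neighbour
            left = flux[i - 1] if i > 0 else 0.0        # exchange with left neighbour
            updated.append(u[i] + tau * (right - left))
        u = updated
    return u


# Noisy step edge: values near 0 on the left, near 1 on the right.
noisy = [0.0, 0.05, -0.03, 0.02, 0.0, 1.0, 1.03, 0.97, 1.02, 1.0]
smoothed = diffuse(noisy)
print([round(v, 3) for v in smoothed])
```

Because the update is written in flux form with zero-flux boundaries, the mean grey value is preserved, and the step between the two halves survives while the noise inside each half is flattened.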

50. 【2604.21409】S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

链接https://arxiv.org/abs/2604.21409

作者:Qingxiao Li,Lifeng Xu,QingLi Wang,Yudong Bai,Mingwei Ou,Shu Hu,Nan Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Python code execution, complementary reasoning paradigms, actively manipulate images, Python code, relies on structured

备注: 29 pages, 13 figures

点击查看摘要

Abstract:We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We further develop a six-dimensional quality filtering framework for reasoning trajectories. To mitigate redundant, ineffective, and erroneous visual operations commonly found in existing datasets, we propose a multi-stage filtering pipeline together with an adaptive data routing strategy. This strategy converts samples with low visual information gain into pure Reasoning-mode data, enabling the model to learn when image operations are truly necessary. S1-VL is trained through a four-stage progressive pipeline: scientific multimodal SFT, Thinking-with-Images cold-start SFT, and two stages of reinforcement learning with SAPO. We build S1-VL-32B on top of Qwen3-VL-32B-Thinking and evaluate it on 13 benchmarks. Experimental results show that S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks, including HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, and outperforms compared systems on scientific reasoning benchmarks such as Physics and VRSBench.

51. 【2604.21400】You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes

链接https://arxiv.org/abs/2604.21400

作者:Jinrang Jia,Zhenjia Li,Yifeng Shi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:revolutionized neural rendering, existing methods remain, methods remain predominantly, remain predominantly research, predominantly research prototypes

备注: 17 pages, 5 figures

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has revolutionized neural rendering, yet existing methods remain predominantly research prototypes ill-suited for production-level deployment. We identify a critical "Industry-Academia Gap" hindering real-world application: unpredictable resource consumption from heuristic Gaussian growth, the "sparsity shield" of current benchmarks that rewards hallucination over physical fidelity, and severe multi-sensor data pollution. To bridge this gap, we propose YOGO (You Only Gaussian Once), a system-level framework that reformulates the stochastic growth process into a deterministic, budget-aware equilibrium. YOGO integrates a novel budget controller for hardware-constrained resource allocation and an availability-registration protocol for robust multi-sensor fusion. To push the boundaries of reconstruction fidelity, we introduce Immersion v1.0, the first ultra-dense indoor dataset specifically designed to break the "sparsity shield." By providing saturated viewpoint coverage, Immersion v1.0 forces algorithms to focus on extreme physical fidelity rather than viewpoint interpolation, and enables the community to focus on the upper limits of high-fidelity reconstruction. Extensive experiments demonstrate that YOGO achieves state-of-the-art visual quality while maintaining a strictly deterministic profile, establishing a new standard for production-grade 3DGS. To facilitate reproducibility, some scenes of the Immersion v1.0 dataset and the source code of YOGO have been publicly released. The project link is this https URL.

52. 【2604.21396】VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

链接https://arxiv.org/abs/2604.21396

作者:Byeonggeuk Lim,Kyeonghyun Kim,JungMin Yun,YoungBin Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision-Language Models, requires precise local, precise local region-based, advancement of Large, Large Vision-Language

备注: Accepted to LREC 2026

点击查看摘要

Abstract:The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLM reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.

53. 【2604.21395】Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair

链接https://arxiv.org/abs/2604.21395

作者:Vishal Rajput

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:minimises supervised loss, retain non-zero Jacobian, non-zero Jacobian sensitivity, empirical risk minimisation, risk minimisation

备注: 29 pages. Code: [this https URL](https://github.com/vishalstark512/PMH). Preprint, not peer-reviewed. Affiliation: KU Leuven, Belgium

点击查看摘要

Abstract:We prove that empirical risk minimisation (ERM) imposes a necessary geometric constraint on learned representations: any encoder that minimises supervised loss must retain non-zero Jacobian sensitivity in directions that are label-correlated in training data but nuisance at test time. This is not a contingent failure of current methods; it is a mathematical consequence of the supervised objective itself. We call this the geometric blind spot of supervised learning (Theorem 1), and show it holds across proper scoring rules, architectures, and dataset sizes. This single theorem unifies four lines of prior empirical work that were previously treated separately: non-robust predictive features, texture bias, corruption fragility, and the robustness-accuracy tradeoff. In this framing, adversarial vulnerability is one consequence of a broader structural fact about supervised learning geometry. We introduce Trajectory Deviation Index (TDI), a diagnostic that measures the theorem's bounded quantity directly, and show why common alternatives miss the key failure mode. PGD adversarial training reaches Jacobian Frobenius 2.91 yet has the worst clean-input geometry (TDI 1.336), while PMH achieves TDI 0.904. TDI is the only metric that detects this dissociation because it measures isotropic path-length distortion -- the exact quantity Theorem 1 bounds. Across seven vision tasks, BERT/SST-2, and ImageNet ViT-B/16 backbones used by CLIP, DINO, and SAM, the blind spot is measurable and repairable. It is present at foundation-model scale, worsens monotonically across language-model sizes (blind-spot ratio 0.860 to 0.765 to 0.742 from 66M to 340M), and is amplified by task-specific ERM fine-tuning (+54%), while PMH repairs it by 11x with one additional training term whose Gaussian form Proposition 5 proves is the unique perturbation law that uniformly penalises the encoder Jacobian.
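
The link between a Gaussian perturbation term and a uniform penalty on the encoder Jacobian can be illustrated with a standard first-order argument. The sketch below is a textbook-style derivation, not the paper's Proposition 5. The symbols $f$ (encoder), $J_f(x)$ (its Jacobian at input $x$), $\epsilon$ (noise vector), and $\sigma$ (noise scale) are notation assumed here for illustration.

```latex
% First-order expansion: f(x + \sigma\epsilon) \approx f(x) + \sigma J_f(x)\,\epsilon
% With \epsilon \sim \mathcal{N}(0, I) we have \mathbb{E}[\epsilon\epsilon^\top] = I, so
\mathbb{E}_{\epsilon}\,\bigl\| f(x + \sigma\epsilon) - f(x) \bigr\|_2^2
  \;\approx\; \sigma^2\, \mathbb{E}_{\epsilon}\,\bigl\| J_f(x)\,\epsilon \bigr\|_2^2
  \;=\; \sigma^2\, \operatorname{tr}\!\bigl( J_f(x)\, J_f(x)^\top \bigr)
  \;=\; \sigma^2\, \bigl\| J_f(x) \bigr\|_F^2 .
```

Because the covariance of $\epsilon$ is the identity, every input direction contributes equally, which is one way to read "uniformly penalises the encoder Jacobian"; the approximation holds only for small $\sigma$.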

MSC classes: 68T05, 68T45

ACM classes: I.2.6; I.2.10

Cite as: arXiv:2604.21395 [cs.LG] (or arXiv:2604.21395v1 [cs.LG] for this version)

DOI: https://doi.org/10.48550/arXiv.2604.21395 (arXiv-issued DOI via DataCite, pending registration)

Submission history: [v1] Thu, 23 Apr 2026 08:03:33 UTC (69 KB), from Vishal Rajput

54. 【2604.21387】EdgeFormer: local patch-based edge detection transformer on point clouds

链接https://arxiv.org/abs/2604.21387

作者:Yifei Xie,Zhikun Tu,Tong Yang,Yuhe Zhang,Xinyu Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:edge detection, commercial demands, vision applications, industrial and commercial, Edge

备注: 22 pages, 9 figures. Published in Pattern Analysis and Applications

点击查看摘要

Abstract:Edge points on 3D point clouds can clearly convey 3D geometry and surface characteristics; therefore, edge detection is widely used in many vision applications with high industrial and commercial demands. However, the fine-grained edge features are difficult to detect effectively as they are generally densely distributed or exhibit small-scale surface gradients. To address this issue, we present a learning-based edge detection network, named EdgeFormer, which mainly consists of two stages. Based on the observation that spatially neighboring points tend to exhibit high correlation, forming the local underlying surface, we convert the edge detection of the entire point cloud into a point classification based on local patches. Therefore, in the first stage, we construct local patch feature descriptors that describe the local neighborhood around each point. In the second stage, we classify each point by analyzing the local patch feature descriptors generated in the first stage. Due to the conversion of the point cloud into local patches, the proposed method can effectively extract the finer details. The experimental results show that our model demonstrates competitive performance compared to six baselines.

55. 【2604.21362】KD-CVG: A Knowledge-Driven Approach for Creative Video Generation

链接https://arxiv.org/abs/2604.21362

作者:Linkai Liu,Wei Feng,Xi Zhao,Shen Zhang,Xingye Chen,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Yuchen Zhou,Zipeng Guo,Chao Gou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Creative Video Generation, automatically produce advertising, highlights product features, leverages generative models, produce advertising content

备注: Accepted to ICASSP 2026

点击查看摘要

Abstract:Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significant focus of recent research. However, while CG has advanced considerably, most efforts have concentrated on generating advertising text and images, leaving Creative Video Generation (CVG) relatively underexplored. This gap is largely due to two major challenges faced by Text-to-Video (T2V) models: (a) **ambiguous semantic alignment**, where models struggle to accurately correlate product selling points with creative video content, and (b) **inadequate motion adaptability**, resulting in unrealistic movements and distortions. To address these challenges, we develop a comprehensive Advertising Creative Knowledge Base (ACKB) as a foundational resource and propose a knowledge-driven approach (KD-CVG) to overcome the knowledge limitations of existing models. KD-CVG consists of two primary modules: Semantic-Aware Retrieval (SAR) and Multimodal Knowledge Reference (MKR). SAR utilizes the semantic awareness of graph attention networks and reinforcement learning feedback to enhance the model's comprehension of the connections between selling points and creative videos. Building on this, MKR incorporates semantic and motion priors into the T2V model to address existing knowledge gaps. Extensive experiments have demonstrated KD-CVG's superior performance in achieving semantic alignment and motion adaptability, validating its effectiveness over other state-of-the-art methods. The code and dataset will be open source at this https URL.

56. 【2604.21360】Prototype-Based Test-Time Adaptation of Vision-Language Models

链接https://arxiv.org/abs/2604.21360

作者:Zhaohong Huang,Yuxin Zhang,Wenjing Liu,Fei Chao,Rongrong Ji

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:vision-language models, bridge the distribution, distribution gap, gap between pre-training, Test-time adaptation

备注

点击查看摘要

Abstract:Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
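
The prototype idea described in this abstract can be sketched in a few lines. This is a minimal illustration, assuming a confidence-weighted running mean as the update rule; the class name `PrototypeBank`, the method `update`, and the exact weighting are hypothetical and are not taken from the paper.

```python
class PrototypeBank:
    """One running prototype per class; memory does not grow with the number of test samples."""

    def __init__(self, num_classes, dim):
        self.protos = [[0.0] * dim for _ in range(num_classes)]
        self.weights = [0.0] * num_classes  # accumulated confidence per class

    def update(self, feature, probs):
        """Fold one test sample into the prototype of its predicted class.

        feature: visual feature vector of the sample (list of floats)
        probs:   zero-shot class probabilities for the sample
        The confidence-weighted running mean is an assumed update rule.
        """
        c = max(range(len(probs)), key=probs.__getitem__)  # predicted class
        w = probs[c]                                        # its confidence
        total = self.weights[c] + w
        self.protos[c] = [
            (self.weights[c] * p + w * f) / total
            for p, f in zip(self.protos[c], feature)
        ]
        self.weights[c] = total
        return c


bank = PrototypeBank(num_classes=2, dim=2)
bank.update([1.0, 0.0], [0.9, 0.1])  # confident sample assigned to class 0
bank.update([0.0, 1.0], [0.6, 0.4])  # less confident sample, also class 0
print(bank.protos[0])  # pulled more strongly toward the confident sample
```

Unlike a cache that stores individual samples, the state here is a fixed `num_classes × dim` table, which is the property the abstract credits for the method's inference speed.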

57. 【2604.21356】SparseGF: A Height-Aware Sparse Segmentation Framework with Context Compression for Robust Ground Filtering Across Urban to Natural Scenes

链接https://arxiv.org/abs/2604.21356

作者:Nannan Qin,Pengjie Tao,Haiyan Guan,Zhizhong Kang,Lingfei Ma,Xiangyun Hu,Jonathan Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:High-quality digital terrain, airborne laser scanning, generation typically relies, separate point clouds, High-quality digital

备注

点击查看摘要

Abstract:High-quality digital terrain models derived from airborne laser scanning (ALS) data are essential for a wide range of geospatial analyses, and their generation typically relies on robust ground filtering (GF) to separate point clouds across diverse landscapes into ground and non-ground parts. Although current deep-learning-based GF methods have demonstrated impressive performance, especially in specific challenging terrains, their cross-scene generalization remains limited by two persistent issues: the context-detail dilemma in large-scale processing due to limited computational resources, and the random misclassification of tall objects arising from classification-only optimization. To overcome these limitations, we propose SparseGF, a height-aware sparse segmentation framework enhanced with context compression. It is built upon three key innovations: (1) a convex-mirror-inspired context compression module that condenses expansive contexts into compact representations while preserving central details; (2) a hybrid sparse voxel-point network architecture that effectively interprets compressed representations while mitigating compression-induced geometric distortion; and (3) a height-aware loss function that explicitly enforces topographic elevation priors during training to suppress random misclassification of tall objects. Extensive evaluations on two large-scale ALS benchmark datasets demonstrate that SparseGF delivers robust GF across urban to natural terrains, achieving leading performance in complex urban scenes, competitive results on mixed terrains, and moderate yet non-catastrophic accuracy in densely forested steep areas. This work offers new insights into deep-learning-based GF research and encourages further exploration toward truly cross-scene generalization for large-scale environmental monitoring.

58. 【2604.21349】Trust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning

链接https://arxiv.org/abs/2604.21349

作者:Wadii Boulila,Adel Ammar,Bilel Benjdira,Maha Driss

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

关键词:Self-supervised learning, representation learning, aerial imagery, learning, Self-supervised

备注: 17 pages

点击查看摘要

Abstract:Self-supervised learning (SSL) is a standard approach for representation learning in aerial imagery. Existing methods enforce invariance between augmented views, which works well when augmentations preserve semantic content. However, aerial images are frequently degraded by haze, motion blur, rain, and occlusion that remove critical evidence. Enforcing alignment between a clean and a severely degraded view can introduce spurious structure into the latent space. This study proposes a training strategy and architectural modification to enhance SSL robustness to such corruptions. It introduces a per-sample, per-factor trust weight into the alignment objective, combined with the base contrastive loss as an additive residual. A stop-gradient is applied to the trust weight instead of a multiplicative gate. While a multiplicative gate is a natural choice, experiments show it impairs the backbone, whereas our additive-residual approach improves it. Using a 200-epoch protocol on a 210,000-image corpus, the method achieves the highest mean linear-probe accuracy among six backbones on EuroSAT, AID, and NWPU-RESISC45 (90.20% compared to 88.46% for SimCLR and 89.82% for VICReg). It yields the largest improvements under severe information-erasing corruptions on EuroSAT (+19.9 points on haze at s=5 over SimCLR). The method also demonstrates consistent gains of +1 to +3 points in Mahalanobis AUROC on a zero-shot cross-domain stress test using BDD100K weather splits. Two ablations (scalar uncertainty and cosine gate) indicate the additive-residual formulation is the primary source of these improvements. An evidential variant using Dempster-Shafer fusion introduces interpretable signals of conflict and ignorance. These findings offer a concrete design principle for uncertainty-aware SSL. Code is publicly available at this https URL.

59. 【2604.21346】Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning

链接https://arxiv.org/abs/2604.21346

作者:Mohit Vaishnav,Tanel Tammet

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:main bottleneck lies, Bongard problems, language models, large language models, raising the question

备注

点击查看摘要

Abstract:Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our *Componential-Grammatical (C-G)* paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.

60. 【2604.21344】Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts

链接https://arxiv.org/abs/2604.21344

作者:Azher Ahmed Efat,Seok Hwan Song,Wallapak Tavanapong

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:present complex information, complex information, present complex, Multimodal Language Models, multiple related charts

备注

点击查看摘要

Abstract:Charts are widely used to present complex information. Deriving meaningful insights in real-world contexts often requires interpreting multiple related charts together. However, the understanding of multi-chart images has not been extensively explored. We introduce PolyChartQA, a mid-scale dataset specifically designed for question answering over multi-chart images. PolyChartQA comprises 534 multi-chart images (with a total of 2,297 sub-charts) sourced from peer-reviewed computer science research publications and 2,694 QA pairs. We evaluate the performance of nine state-of-the-art Multimodal Language Models (MLMs) on PolyChartQA across question type, difficulty, question source, and key structural characteristics of multi-charts. Our results show a 27.4% LLM-based accuracy (L-Accuracy) drop on human-authored questions compared to MLM-generated questions, and a 5.39% L-Accuracy gain with our proposed prompting method.

61. 【2604.21343】Latent Denoising Improves Visual Alignment in Large Multimodal Models

链接https://arxiv.org/abs/2604.21343

作者:Dhruv Parikh,Jacob Fein-Ashley,Rajgopal Kannan,Viktor Prasanna

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Multimodal Models, language modeling objective, autoregressive language modeling, Large Multimodal, Multimodal Models

备注: Technical Report

点击查看摘要

Abstract:Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher's intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at this https URL.

62. 【2604.21330】Teacher-Guided Routing for Sparse Vision Mixture-of-Experts

链接https://arxiv.org/abs/2604.21330

作者:Masahiro Kada,Ryota Yoshihashi,Satoshi Ikehata,Rei Kawakami,Ikuro Sato

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:resulting computational cost, Recent progress, increasingly large-scale models, critical bottleneck, progress in deep

备注

点击查看摘要

Abstract:Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of experts for each input, achieving high scalability without sacrificing inference speed. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives informative gradients only through the experts selected in the forward pass, it suffers from gradient blocking and obtains little information from unselected routes. This limited, highly localized feedback makes it difficult for the router to learn appropriate expert-selection scores and often leads to unstable routing dynamics, such as fluctuating expert assignments during training. To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model. TGR-MoE constructs a teacher router from the teacher's intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations during training and enabling knowledge-guided expert selection from the early stages of training. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR consistently improves both accuracy and routing consistency, while maintaining stable training even under highly sparse configurations.
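
The routing problem in this abstract can be made concrete with a toy sketch. It assumes softmax routing scores, top-k expert selection, and a KL-divergence term between teacher and student routing distributions; the function names and the choice of KL as the pseudo-supervision loss are illustrative assumptions, not details confirmed by the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_experts(logits, k):
    """Indices of the k highest-scoring experts (the only ones run in a sparse MoE)."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def routing_kl(teacher_logits, student_logits):
    """KL(teacher || student) over routing distributions (assumed loss form)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher_logits = [0.1, 2.0, -1.0, 1.5]  # illustrative values only
student_logits = [1.0, 0.2, 0.3, 0.1]

print(top_k_experts(student_logits, k=2))          # experts the student would activate
print(routing_kl(teacher_logits, student_logits))  # extra signal covering every expert
```

The task loss reaches the router only through the selected experts, whereas a distribution-matching term like this one depends on all router logits, including those of unselected experts.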

63. 【2604.21326】MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

链接https://arxiv.org/abs/2604.21326

作者:Juan Li,Chuanghao Ding,Xujie Zhang,Cam-Tu Nguyen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Universal Multimodal Retrieval, multi-modal retrieval, shared embedding space, Universal Multimodal, Multimodal Retrieval

备注

点击查看摘要

Abstract:Universal Multimodal Retrieval (UMR) aims to map different modalities (e.g., visual and textual) into a shared embedding space for multi-modal retrieval. Existing UMR methods can be broadly divided into two categories: early-fusion approaches, such as Marvel, which projects visual features into the language model (LM) space for integrating with text modality, and late-fusion approaches, such as UniVL-DR, which encode visual and textual inputs using separate encoders and obtain fused embeddings through addition. Our pilot study reveals that Marvel exhibits visual modality collapse, which is characterized by the model's tendency to disregard visual features while depending excessively on textual cues. In contrast, although UniVL-DR is less affected by this issue, it is more susceptible to semantic misalignment, where semantically related content is positioned far apart in the embedding space. To address these challenges, we propose MiMIC, which introduces two key innovations: (1) a fusion-in-decoder architecture for effective multimodal integration, and (2) robust training through single modality mixin and random caption dropout. Experiments on the WebQA+ and EVQA+ datasets, where images in documents or queries may lack captions, indicate that MiMIC consistently outperforms both early- and late-fusion baselines.

64. 【2604.21324】Temporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification

链接https://arxiv.org/abs/2604.21324

作者:Zhiyong Li,Wei Jiang,Haojie Liu,Mingyu Wang,Wanchong Xu,Weijie Mao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visible-infrared person re-identification, methods predominantly focus, Visible-infrared person, existing methods predominantly, costly identity annotations

备注

点击查看摘要

Abstract:Visible-infrared person re-identification (VI-ReID) enables cross-modality identity matching for all-day surveillance, yet existing methods predominantly focus on the image level or rely heavily on costly identity annotations. While video-based VI-ReID has recently emerged to exploit temporal dynamics for improved robustness, existing studies remain limited to supervised settings. Crucially, the unsupervised video VI-ReID problem, where models must learn from RGB and infrared tracklets without identity labels, remains largely unexplored despite its practical importance in real-world deployment. To bridge this gap, we propose HiTPro (Hierarchical Temporal Prototyping), a prototype-driven framework without explicit hard pseudo-label assignment for unsupervised video-based VI-ReID. HiTPro begins with an efficient Temporal-aware Feature Encoder that first extracts discriminative frame-level features and then aggregates them into a robust tracklet-level representation. Building upon these features, HiTPro first constructs reliable intra-camera prototypes via Intra-Camera Tracklet Prototyping by aggregating features from temporally partitioned sub-tracklets. Through Hierarchical Cross-Prototype Alignment, we perform a two-stage positive mining process: progressing from within-modality associations to cross-modality matching, enhanced by Dynamic Threshold Strategy and Soft Weight Assignment. Finally, Hierarchical Contrastive Learning progressively optimizes feature-prototype alignment across three levels: intra-camera discrimination, cross-camera same-modality consistency, and cross-modality invariance. Extensive experiments on HITSZ-VCM and BUPTCampus demonstrate that HiTPro achieves state-of-the-art performance under fully unsupervised settings, significantly outperforming adapted baselines and establishing a strong baseline for future research.

65. 【2604.21321】FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment

链接https://arxiv.org/abs/2604.21321

作者:Khaled R Ahmed,Toqi Tahamid Sarker,Taminul Islam,Tamany M Alanezi,Amer AbuGhazaleh

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:current practice relies, destructive wet-chemistry assays, Monitoring frying oil, frying oil degradation, food safety

备注: 10 pages, 7 figures, this paper has been submitted and accepted for publication at CVPRW 2026

点击查看摘要

Abstract:Monitoring frying oil degradation is critical for food safety, yet current practice relies on destructive wet-chemistry assays that provide no spatial information and are unsuitable for real-time use. We identify a fundamental obstacle in thermal-image-based inspection, the camera-fingerprint shortcut, whereby models memorize sensor-specific noise and thermal bias instead of learning oxidation chemistry, collapsing under video-disjoint evaluation. We propose FryNet, a dual-stream RGB-thermal framework that jointly performs oil-region segmentation, serviceability classification, and regression of four chemical oxidation indices (PV, p-AV, Totox, temperature) in a single forward pass. A ThermalMiT-B2 backbone with channel and spatial attention extracts thermal features, while an RGB-MAE Encoder learns chemically grounded representations via masked autoencoding and chemical alignment. Dual-Encoder DANN adversarially regularizes both streams against video identity via Gradient Reversal Layers, and FiLM fusion bridges thermal structure with RGB chemical context. On 7,226 paired frames across 28 frying videos, FryNet achieves 98.97% mIoU, 100% classification accuracy, and 2.32 mean regression MAE, outperforming all seven baselines.

66. 【2604.21313】PLAS-Net: Pixel-Level Area Segmentation for UAV-Based Beach Litter Monitoring

Link: https://arxiv.org/abs/2604.21313

Authors: Yongying Liu, Jiaqi Wang, Jian Song, Xinlei Shao, Yijia Chen, Nan Xu, Katsunori Mizuno, Shigeru Tabeta, Fan Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Keywords: simple item counts, Accurate quantification, Litter Area Segmentor, physical exposure area, Pixel-level Litter Area

Comments: 30 pages, 12 figures

Abstract:Accurate quantification of the physical exposure area of beach litter, rather than simple item counts, is essential for credible ecological risk assessment of marine debris. However, automated UAV-based monitoring predominantly relies on bounding-box detection, which systematically overestimates the planar area of irregular litter objects. To address this geometric limitation, we develop PLAS-Net (Pixel-level Litter Area Segmentor), an instance segmentation framework that extracts pixel-accurate physical footprints of coastal debris. Evaluated on UAV imagery from a monsoon-driven pocket beach in Koh Tao, Thailand, PLAS-Net achieves a mAP_50 of 58.7% with higher precision than eleven baseline models, demonstrating improved mask fidelity under complex coastal conditions. To illustrate how the accuracy of the masking affects the conclusions of environmental analysis, we conducted three downstream demonstrations: (i) power-law fitting of normalized plastic density (NPD) to characterize fragmentation dynamics; (ii) area-weighted ecological risk index (ERI) to map spatial pollution hotspots; and (iii) source composition analysis revealing the abundance-area paradox: fishing gear constitutes a small proportion of the total number of items, but has the largest physical area per unit item. Pixel-level area extraction can provide more valuable information for coastal monitoring compared to methods based solely on counting.
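
The gap between bounding-box area and pixel-level area described in this abstract can be illustrated with a toy mask. This is a minimal sketch and not PLAS-Net itself: the helper names `mask_area` and `bbox_area` and the L-shaped example object are made up for illustration.

```python
# Toy illustration (not the PLAS-Net pipeline): compare the pixel area of an
# irregular object with the area of its axis-aligned bounding box.

def mask_area(mask):
    """Count foreground pixels (value 1) in a binary mask given as rows of 0/1."""
    return sum(sum(row) for row in mask)


def bbox_area(mask):
    """Area of the tightest axis-aligned box around all foreground pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return 0  # empty mask: no object, no box
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return height * width


# Hypothetical L-shaped litter item on a 4x4 pixel grid.
l_shape = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]

pixels = mask_area(l_shape)      # foreground pixels of the object
box = bbox_area(l_shape)         # pixels covered by the bounding box
overestimate = box / pixels      # how much the box inflates the footprint
```

For thin or concave objects such as ropes and nets, the ratio grows quickly, which is the geometric limitation the abstract attributes to box-based monitoring.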

67. 【2604.21312】The First Challenge on Remote Sensing Infrared Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview

Link: https://arxiv.org/abs/2604.21312

Authors: Kai Liu, Haoyang Yue, Zeli Lin, Zheng Chen, Jingkai Wang, Jue Gong, Jiatong Li, Xianglong Yan, Libo Zhu, Jianze Li, Ziqing Zhang, Zihan Zhou, Xiaoyang Liu, Radu Timofte, Yulun Zhang, Junye Chen, Zhenming Yan, Yucong Hong, Ruize Han, Song Wang, Li Pang, Heng Zhao, Xinqiao Wu, Deyu Meng, Xiangyong Cao, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Yihang Chen, Yifan Deng, Zengyuan Zuo, Junjun Jiang, Saiprasad Meesiyawar, Sulocha Yatageri, Nikhil Akalwadi, Ramesh Ashok Tabib, Uma Mudenagudi, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Cici Liu, Tongyao Mu, Qiong Cao, Yifan Wang, Kosuke Shigematsu, Hiroto Shirono, Asuka Shin, Wei Zhou, Linfeng Li, Lingdong Kong, Ce Wang, Xingwei Zhong, Wanjie Sun, Dafeng Zhang, Hongxin Lan, Qisheng Xu, Mingyue He, Hui Geng, Tianjiao Wan, Kele Xu, Changjian Wang, Antoine Carreaud, Nicola Santacroce, Shanci Li, Jan Skaloud, Adrien Gressin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Infrared Image Super-Resolution, Sensing Infrared Image, Infrared Image, Remote Sensing Infrared, presents the NTIRE

Comments: GitHub repo: https://github.com/Kai-Liu001/NTIRE2026_infraredSR

Abstract:This paper presents the NTIRE 2026 Remote Sensing Infrared Image Super-Resolution (x4) Challenge, one of the associated challenges of NTIRE 2026. The challenge aims to recover high-resolution (HR) infrared images from low-resolution (LR) inputs generated through bicubic downsampling with a x4 scaling factor. The objective is to develop effective models or solutions that achieve state-of-the-art performance for infrared image SR in remote sensing scenarios. To reflect the characteristics of infrared data and practical application needs, the challenge adopts a single-track setting. A total of 115 participants registered for the competition, with 13 teams submitting valid entries. This report summarizes the challenge design, dataset, evaluation protocol, main results, and the representative methods of each team. The challenge serves as a benchmark to advance research in infrared image super-resolution and promote the development of effective solutions for real-world remote sensing applications.

68. 【2604.21311】An Interpretable Vision Transformer Framework for Automated Brain Tumor Classification

Link: https://arxiv.org/abs/2604.21311

Authors: Chinedu Emmanuel Mbonu, Tochukwu Sunday Belonwu, Okwuchukwu Ejike Chukwuogo, Kenechukwu Sylvanus Anigbogu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: critical neurological conditions, Magnetic Resonance Imaging, patient survival rates, Brain tumors represent, neurological conditions

Comments: 9 pages, 6 figures

Abstract:Brain tumors represent one of the most critical neurological conditions, where early and accurate diagnosis is directly correlated with patient survival rates. Manual interpretation of Magnetic Resonance Imaging (MRI) scans is time-intensive, subject to inter-observer variability, and demands significant specialist expertise. This paper proposes a deep learning framework for automated four-class brain tumor classification distinguishing glioma, meningioma, pituitary tumor, and healthy brain tissue from a dataset of 7,023 MRI scans. The proposed system employs a Vision Transformer (ViT-B/16) pretrained on ImageNet-21k as the backbone, augmented with a clinically motivated preprocessing and training pipeline. Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to enhance local contrast and accentuate tumor boundaries invisible to standard normalization. A two-stage fine-tuning strategy is adopted: the classification head is warmed up with the backbone frozen, followed by full fine-tuning with discriminative learning rates. MixUp and CutMix augmentation is applied per batch to improve generalization. Exponential Moving Average (EMA) of weights and Test-Time Augmentation (TTA) further stabilize and boost performance. Attention Rollout visualization provides clinically interpretable heatmaps of the brain regions driving each prediction. The proposed model achieves a test accuracy of 99.29%, macro F1-score of 99.25%, and perfect recall on both healthy and meningioma classes, outperforming all CNN-based baselines.
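
Of the training tricks listed in this abstract, MixUp is the easiest to show in a few lines: two samples and their one-hot labels are blended with the same weight. The sketch below is a generic MixUp illustration, not the authors' pipeline; the function names, the toy three-value "images", and the `alpha=0.2` default are assumptions, since the abstract gives no hyperparameters.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Blend two samples and their one-hot labels with weight lam in [0, 1]."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha); alpha=0.2 is an assumed value."""
    return rng.betavariate(alpha, alpha)


# Two flattened toy "images" with one-hot labels over the four classes
# (glioma, meningioma, pituitary tumor, healthy).
x_glioma, y_glioma = [0.0, 1.0, 0.5], [1.0, 0.0, 0.0, 0.0]
x_healthy, y_healthy = [1.0, 0.0, 0.5], [0.0, 0.0, 0.0, 1.0]

# A fixed weight keeps the example deterministic; training would use sample_lambda().
mixed_x, mixed_y = mixup_pair(x_glioma, y_glioma, x_healthy, y_healthy, lam=0.75)
```

The mixed label keeps 75% of its mass on glioma and 25% on healthy, so the loss is computed against a soft target rather than a single class.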

69. 【2604.21291】Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation

Link: https://arxiv.org/abs/2604.21291

Authors: Yuanchen Fei, Yude Zou, Zejian Kang, Ming Li, Jiaying Zhou, Xiangru Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Controllable human video, human video generation, privacy safe human, modeling remains underexplored, remains underexplored due

Comments: none

Abstract:Controllable human video generation aims to produce realistic videos of humans with explicitly guided motions and appearances, serving as a foundation for digital humans, animation, and embodied this http URL, the scarcity of large-scale, diverse, and privacy-safe human video datasets poses a major bottleneck, especially for rare identities and complex this http URL data provides a scalable and controllable alternative, yet its actual contribution to generative modeling remains underexplored due to the persistent Sim2Real this http URL this work, we systematically investigate the impact of synthetic data on controllable human video generation. We propose a diffusion-based framework that enables fine-grained control over appearance and motion while providing a unified testbed to analyze how synthetic data interacts with real-world data during training. Through extensive experiments, we reveal the complementary roles of synthetic and real data and demonstrate possible methods for efficiently selecting synthetic samples to enhance motion realism, temporal consistency, and identity this http URL study offers the first comprehensive exploration of synthetic data's role in human-centric video synthesis and provides practical insights for building data-efficient and generalizable generative models.

70. 【2604.21290】GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA

Link: https://arxiv.org/abs/2604.21290

Authors: Anvitha Ramachandran, Dhruv Parikh, Viktor Prasanna

Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: Graph Neural Networks, Neural Networks, Vision Graph Neural, graph construction, Graph Neural

Comments: FCCM 2026

Abstract:Vision Graph Neural Networks (ViGs) represent an image as a graph of patch tokens, enabling adaptive, feature-driven neighborhoods. Unlike CNNs with fixed grid biases or Vision Transformers with global token interactions, ViGs rely on dynamic graph convolution: at each layer, a feature-dependent graph is built via k-nearest-neighbor (kNN) search on current patch features, followed by message passing. This per-layer graph construction is the main bottleneck, consuming 50-95% of graph convolution time on CPUs and GPUs, scaling as $O(N^2)$ with the number of patches $N$, and creating a sequential dependency between graph construction and feature updates. We introduce GraphLeap, a simple reformulation that removes this dependency by decoupling graph construction from feature update across layers. GraphLeap performs the feature update at layer $\ell$ using a graph built from the previous layer's features, while simultaneously using the current layer's features to construct the graph for layer $\ell+1$. This one-layer-lookahead graph construction enables concurrent graph construction and message passing. Although using prior-layer features can introduce minor accuracy degradation, lightweight fine-tuning for a few epochs is sufficient to recover the original accuracy. Building on GraphLeap, we present the first end-to-end FPGA accelerator for Vision GNNs. Our streaming, layer-pipelined design overlaps a kNN graph construction engine with a feature update engine, exploits node- and channel-level parallelism, and enables efficient on-chip dataflow without explicit edge-feature materialization. Evaluated on isotropic and pyramidal ViG models on an Alveo U280 FPGA, GraphLeap achieves up to $95.7\times$ speedup over CPU and $8.5\times$ speedup over GPU baselines, demonstrating the feasibility of real-time Vision GNN inference.
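
The one-layer lookahead described in this abstract can be sketched by comparing two loop orders. This is a toy reading of the abstract, not the authors' implementation: the averaging `update` stands in for real graph convolution, all function names are hypothetical, and bootstrapping the first layer with a graph built from the input features is an assumption the abstract does not spell out.

```python
def knn_graph(feats, k):
    """For every node, return indices of its k nearest other nodes (squared L2)."""
    graph = []
    for i, fi in enumerate(feats):
        dists = [
            (sum((a - b) ** 2 for a, b in zip(fi, fj)), j)
            for j, fj in enumerate(feats) if j != i
        ]
        graph.append([j for _, j in sorted(dists)[:k]])
    return graph


def update(feats, graph):
    """Toy message passing: average each node with the mean of its neighbours."""
    out = []
    for fi, nbrs in zip(feats, graph):
        mean = [sum(feats[j][d] for j in nbrs) / len(nbrs) for d in range(len(fi))]
        out.append([(a + m) / 2.0 for a, m in zip(fi, mean)])
    return out


def baseline_forward(feats, num_layers, k):
    """Standard dynamic graph convolution: build the graph, then update, per layer."""
    for _ in range(num_layers):
        graph = knn_graph(feats, k)       # must finish before the update starts
        feats = update(feats, graph)
    return feats


def lookahead_forward(feats, num_layers, k):
    """One-layer lookahead: each update uses a graph built from earlier features."""
    graph = knn_graph(feats, k)           # assumption: bootstrap from the input
    for _ in range(num_layers):
        next_graph = knn_graph(feats, k)  # does not depend on the update below,
        feats = update(feats, graph)      # so hardware could run both concurrently
        graph = next_graph
    return feats


patches = [[0.0], [1.0], [3.0], [10.0]]   # four 1-D toy patch features
one_layer_same = baseline_forward(patches, 1, 1) == lookahead_forward(patches, 1, 1)
```

In this sequential Python version nothing actually overlaps; the point is only that `next_graph` and `update` have no data dependency inside an iteration, which is what a pipelined design can exploit.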

71. 【2604.21289】AttDiff-GAN: A Hybrid Diffusion-GAN Framework for Facial Attribute Editing

Link: https://arxiv.org/abs/2604.21289

Authors: Wenmin Huang, Weiqi Luo, Xiaochun Cao, Jiwu Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: preserving attribute-irrelevant content, modify target attributes, aims to modify, modify target, preserving attribute-irrelevant

Comments: none

Abstract:Facial attribute editing aims to modify target attributes while preserving attribute-irrelevant content and overall image fidelity. Existing GAN-based methods provide favorable controllability, but often suffer from weak alignment between style codes and attribute semantics. Diffusion-based methods can synthesize highly realistic images; however, their editing precision is limited by the entanglement of semantic directions among different attributes. In this paper, we propose AttDiff-GAN, a hybrid framework that combines GAN-based attribute manipulation with diffusion-based image generation. A key challenge in such integration lies in the inconsistency between one-step adversarial learning and multi-step diffusion denoising, which makes effective optimization difficult. To address this issue, we decouple attribute editing from image synthesis by introducing a feature-level adversarial learning scheme to learn explicit attribute manipulation, and then using the manipulated features to guide the diffusion process for image generation, while also removing the reliance on semantic direction-based editing. Moreover, we enhance style-attribute alignment by introducing PriorMapper, which incorporates facial priors into style generation, and RefineExtractor, which captures global semantic relationships through a Transformer for more precise style extraction. Experimental results on CelebA-HQ show that the proposed method achieves more accurate facial attribute editing and better preservation of non-target attributes than state-of-the-art methods in both qualitative and quantitative evaluations.

72. 【2604.21280】ImageHD: Energy-Efficient On-Device Continual Learning of Visual Representations via Hyperdimensional Computing

Link: https://arxiv.org/abs/2604.21280

Authors: Jebacyril Arockiaraj, Dhruv Parikh, Viktor Prasanna

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: incurring substantial compute, non-stationary data streams, existing methods rely, exemplar-heavy classifiers, incurring substantial

Comments: FCCM 2026

Abstract:On-device continual learning (CL) is critical for edge AI systems operating on non-stationary data streams, but most existing methods rely on backpropagation or exemplar-heavy classifiers, incurring substantial compute, memory, and latency overheads. Hyperdimensional computing (HDC) offers a lightweight alternative through fast, non-iterative online updates. Combined with a compact convolutional neural network (CNN) feature extractor, HDC enables efficient on-device adaptation with strong visual representations. However, prior HDC-based CL systems often depend on multi-tier memory hierarchies and complex cluster management, limiting deployability on resource-constrained hardware. We present ImageHD, an FPGA accelerator for on-device continual learning of visual data based on HDC. ImageHD targets streaming CL under strict latency and on-chip memory constraints, avoiding costly iterative optimization. At the algorithmic level, we introduce a hardware-aware CL method that bounds class exemplars through a unified exemplar memory and a hardware-efficient cluster merging strategy, while incorporating a quantized CNN front-end to reduce deployment overhead without sacrificing accuracy. At the system level, ImageHD is implemented as a streaming dataflow architecture on the AMD Zynq ZCU104 FPGA, integrating HDC encoding, similarity search, and bounded cluster management using word-packed binary hypervectors for massively parallel bitwise computation within tight on-chip resource budgets. On CORe50, ImageHD achieves up to 40.4x (4.84x) speedup and 383x (105.1x) energy efficiency over optimized CPU (GPU) baselines, demonstrating the practicality of HDC-enabled continual learning for real-time edge AI.

73. 【2604.21279】LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation

Link: https://arxiv.org/abs/2604.21279

Authors: Wenmin Huang, Weiqi Luo, Xiaochun Cao, Jiwu Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Facial attribute editing, crucial for applications, applications like virtual, virtual avatars, avatars and photo

Comments: none

Abstract:Facial attribute editing and style manipulation are crucial for applications like virtual avatars and photo editing. However, achieving precise control over facial attributes without altering unrelated features is challenging due to the complexity of facial structures and the strong correlations between attributes. While conditional GANs have shown progress, they are limited by accuracy issues and training instability. Diffusion models, though promising, face challenges in style manipulation due to the limited expressiveness of semantic directions. In this paper, we propose LatRef-Diff, a novel diffusion-based framework that addresses these limitations. We replace the traditional semantic directions in diffusion models with style codes and propose two methods for generating them: latent and reference guidance. Based on these style codes, we design a style modulation module that integrates them into the target image, enabling both random and customized style manipulation. This module incorporates learnable vectors, cross-attention mechanisms, and a hierarchical design to improve accuracy and image quality. Additionally, to enhance training stability while eliminating the need for paired images (e.g., before and after editing), we propose a forward-backward consistency training strategy. This strategy first removes the target attribute approximately using image-specific semantic directions and then restores it via style modulation, guided by perceptual and classification losses. Extensive experiments on CelebA-HQ demonstrate that LatRef-Diff achieves state-of-the-art performance in both qualitative and quantitative evaluations. Ablation studies validate the effectiveness of our model's design choices.

74. 【2604.21268】Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

Link: https://arxiv.org/abs/2604.21268

Authors: Wenkai Wang, Xiyun Li, Hongcan Guo, Wenhao Yu, Tianqing Fang, Haitao Mi, Dong Yu, Shengyu Zhang

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Graphical User Interface, Graphical User, requires mapping natural, mapping natural language, natural language instructions

Comments: none

Abstract:Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle with achieving precise localization. While scaling sampling attempts (Pass@k) reveals potential gains, static self-consistency strategies derived from geometric clustering often yield limited improvements, as the model's predictions tend to be spatially dispersed. In this paper, we propose replacing static consistency strategies with a learnable selection mechanism that selects the optimal target by critiquing its own proposals rendered on the screenshot. Given the significant disparity between the model's grounding and critiquing capabilities, we propose a co-evolving Propose-then-Critic framework. To jointly optimize these, we introduce a maturity-aware adaptive co-evolutionary reinforcement learning paradigm. This approach dynamically balances the training objectives of proposer and critic, where the diversity of the proposer's outputs enhances critic robustness, while the critic's maturing discrimination capability conversely unlocks the proposer's potential for extensive spatial exploration, fostering the mutual reinforcement and co-evolution of both capabilities, thereby ensuring generalizability to adapt to diverse and complex interface layouts. Extensive experiments over 6 benchmarks show that our method significantly enhances both grounding accuracy and critic reliability.

75. 【2604.21227】UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection

Link: https://arxiv.org/abs/2604.21227

Authors: Yuze Li, Zhilei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Facial action unit, AU-specific uncertainties arising, Facial action, detection remains challenging, action unit

Comments: Accepted by ICMR 2026

Abstract:Facial action unit (AU) detection remains challenging because it involves heterogeneous, AU-specific uncertainties arising at both the representation and decision stages. Recent methods have improved discriminative feature learning, but they often treat the AU representations as deterministic, overlooking uncertainty caused by visual noise, subject-dependent appearance variations, and ambiguous inter-AU relationships, all of which can substantially degrade robustness. Meanwhile, conventional point-estimation classifiers often provide poorly calibrated confidence, producing overconfident predictions, especially under the severe label imbalance typical of AU datasets. We propose UAU-Net, an Uncertainty-aware AU detection framework that explicitly models uncertainty at both stages. At the representation stage, we introduce CV-AFE, a conditional VAE (CVAE)-based AU feature extraction module that learns probabilistic AU representations by jointly estimating feature means and variances across multiple spatio-temporal scales; conditioning on AU labels further enables CV-AFE to capture uncertainty associated with inter-AU dependencies. At the decision stage, we design AB-ENN, an Asymmetric Beta Evidential Neural Network for multi-label AU detection, which parameterizes predictive uncertainty with Beta distributions and mitigates overconfidence via an asymmetric loss tailored to highly imbalanced binary labels. Extensive experiments on BP4D and DISFA show that UAU-Net achieves strong AU detection performance, and further analyses indicate that modeling uncertainty in both representation learning and evidential prediction improves robustness and reliability.
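
To make "parameterizes predictive uncertainty with Beta distributions" concrete, the sketch below uses a common evidential formulation for a single binary label: non-negative evidence for "active" and "inactive" is turned into Beta parameters, from which a probability and an uncertainty score follow. It is an assumed, generic formulation and not the paper's AB-ENN; the function name, the `+1` prior, and the `2 / (alpha + beta)` uncertainty measure are illustrative choices, and the asymmetric loss is not shown.

```python
def beta_opinion(pos_evidence, neg_evidence):
    """Map non-negative evidence for one AU to (probability, uncertainty).

    Assumed formulation: alpha = e_pos + 1, beta = e_neg + 1 (uniform prior),
    probability = Beta mean, uncertainty = 2 / (alpha + beta).
    """
    if pos_evidence < 0 or neg_evidence < 0:
        raise ValueError("evidence must be non-negative")
    alpha = pos_evidence + 1.0
    beta = neg_evidence + 1.0
    strength = alpha + beta
    probability = alpha / strength   # expected activation probability
    uncertainty = 2.0 / strength     # shrinks as total evidence grows
    return probability, uncertainty


no_evidence = beta_opinion(0.0, 0.0)       # nothing observed for this AU
strong_positive = beta_opinion(18.0, 0.0)  # plenty of evidence that the AU is active
```

With no evidence the prediction sits at 0.5 with maximal uncertainty, whereas a point-estimate classifier would have to commit to a confidence value regardless of how little support it has.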

76. 【2604.21221】Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

Link: https://arxiv.org/abs/2604.21221

Authors: Boxun Xu, Yuming Du, Zichang Liu, Siyu Yang, Ziyang Jiang, Siqi Yan, Rajasi Saha, Albert Pumarola, Wenchen Wang, Peng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: introduce Sparse Forcing, video diffusion models, Sparse Forcing, reducing decoding latency, autoregressive video diffusion

Comments: none

Abstract:We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation in autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks, forming an implicit spatiotemporal memory in the KV cache, and exhibits a locally structured block-sparse pattern within sliding windows. Building on this observation, we propose a trainable native sparsity mechanism that learns to compress, preserve, and update these persistent blocks while restricting computation within each local window to a dynamically selected local neighborhood. To make the approach practical at scale for both training and inference, we further propose Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel that accelerates sparse attention and memory updates for low-latency, memory-efficient decoding. Experiments show that Sparse Forcing improves the VBench score by +0.26 over Self-Forcing on 5-second text-to-video generation while delivering a 1.11-1.17x decoding speedup and 42% lower peak KV-cache footprint. The gains are more pronounced on longer-horizon rollouts, delivering improved visual quality with +0.68 and +2.74 VBench improvements, and 1.22x and 1.27x speedups on 20-second and 1-minute generations, respectively.

77. 【2604.21199】ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response

Link: https://arxiv.org/abs/2604.21199

Authors: Stephan Xie, Ben Cohen, Mononito Goswami, Junhong Shen, Emaad Khwaja, Chenghao Liu, David Asker, Othmane Abou-Amal, Ameet Talwalkar

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Time series question-answering, Time series, natural language questions, time series FMs, time series anomalies

Comments: none

Abstract:Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models. In this work, we present ARFBench, a TSQA benchmark that evaluates the understanding of multimodal foundation models (FMs) on time series anomalies prevalent in software incident data. ARFBench consists of 750 questions across 142 time series and 5.38M data points from 63 production incidents sourced exclusively from internal telemetry at Datadog. We evaluate leading proprietary and open-source LLMs, VLMs, and time series FMs and observe that frontier VLMs perform markedly better than existing baselines; the leading model (GPT-5) achieves a 62.7% accuracy and 51.9% F1. We next demonstrate the promise of specialized multimodal approaches. We develop a novel TSFM + VLM hybrid prototype which we post-train on a small set of synthetic and real data that yields comparable overall F1 and accuracy with frontier models. Lastly, we find models and human domain experts exhibit complementary strengths. We define a model-expert oracle, a best-of-2 oracle selector over model and expert answers, yielding 82.8% F1 and 87.2% accuracy and establishing a new superhuman frontier for future TSQA models. The benchmark is available at this https URL.
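
The "best-of-2 oracle selector" in this abstract counts a question as solved when either the model or the human expert answers it correctly. A minimal sketch of the accuracy version follows; the answer lists are invented for illustration and the function names are hypothetical, and the F1 variant reported in the abstract is not reproduced here.

```python
def accuracy(preds, labels):
    """Fraction of questions answered correctly."""
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


def oracle_accuracy(model_preds, expert_preds, labels):
    """Best-of-2 oracle: correct if the model OR the expert is correct."""
    hits = sum(
        (m == t) or (e == t)
        for m, e, t in zip(model_preds, expert_preds, labels)
    )
    return hits / len(labels)


# Made-up multiple-choice answers for four questions.
labels = ["A", "B", "C", "D"]
model = ["A", "B", "X", "X"]    # right on questions 1 and 2
expert = ["X", "B", "C", "X"]   # right on questions 2 and 3

model_acc = accuracy(model, labels)
expert_acc = accuracy(expert, labels)
oracle_acc = oracle_accuracy(model, expert, labels)
```

The oracle exceeds both individual scores only when the two answer sets are right on different questions, which is why the abstract reads the gap as evidence of complementary strengths.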

78. 【2604.21198】A Probabilistic Framework for Improving Dense Object Detection in Underwater Image Data via Annealing-Based Data Augmentation

Link: https://arxiv.org/abs/2604.21198

Authors: Eleanor Wiesler, Trace Baxley

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: underwater settings characterized, models typically perform, performance degrades substantially, water clarity, stable lighting

Comments: none

Abstract:Object detection models typically perform well on images captured in controlled environments with stable lighting, water clarity, and viewpoint, but their performance degrades substantially in real-world underwater settings characterized by high variability and frequent occlusions. In this work, we address these challenges by introducing a novel data augmentation framework designed to improve robustness in dense and unconstrained underwater scenes. Using the DeepFish dataset, which contains images of fish in natural environments, we first generate bounding box annotations from provided segmentation masks to construct a custom detection dataset. We then propose a pseudo-simulated annealing-based augmentation algorithm, inspired by the copy-paste strategy of Deng et al. [1], to synthesize realistic crowded fish scenarios. Our approach improves spatial diversity and object density during training, enabling better generalization to complex scenes. Experimental results show that our method significantly outperforms a baseline YOLOv10 model, particularly on a challenging test set of manually annotated images collected from live-stream footage in the Florida Keys. These results demonstrate the effectiveness of our augmentation strategy for improving detection performance in dense, real-world underwater environments.

79. 【2604.21190】SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

Link: https://arxiv.org/abs/2604.21190

Authors: Chan Yeong Hwang, Miso Choi, Sunghyun On, Jinkyu Kim, Jungbeom Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Understanding visual scenes, Understanding visual, visual scenes requires, spatial reasoning, spatial reasoning requires

Comments: Technical report

Abstract:Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, whose reliability varies across contexts. This suggests that effective spatial reasoning requires spatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive biases they can leverage. In this work, we introduce SpatiO, a heterogeneous multi-agent framework for spatial reasoning that coordinates multiple vision-language specialists with complementary inductive biases. To enable effective collaboration, we propose Test-Time Orchestration (TTO), an optimization mechanism that dynamically evaluates and reweights agents based on their observed reliability during inference, without modifying model parameters. Extensive experiments on diverse spatial reasoning benchmarks, including 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench, demonstrate that SpatiO consistently improves spatial reasoning performance over both closed-source and open-source baselines.

80. 【2604.21182】WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images

Link: https://arxiv.org/abs/2604.21182

Authors: Yuki Fujimura, Takahiro Kushida, Kazuya Kitano, Takuya Funatomi, Yasuhiro Mukaigawa

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, unknown camera parameters, camera parameters, Splatting, unknown camera

Comments: Project page: https://github.com/yfujimura/WildSplatter

Abstract:We propose WildSplatter, a feed-forward 3D Gaussian Splatting (3DGS) model for unconstrained images with unknown camera parameters and varying lighting conditions. 3DGS is an effective scene representation that enables high-quality, real-time rendering; however, it typically requires iterative optimization and multi-view images captured under consistent lighting with known camera parameters. WildSplatter is trained on unconstrained photo collections and jointly learns 3D Gaussians and appearance embeddings conditioned on input images. This design enables flexible modulation of Gaussian colors to represent significant variations in lighting and appearance. Our method reconstructs 3D Gaussians from sparse input views in under one second, while also enabling appearance control under diverse lighting conditions. Experimental results demonstrate that our approach outperforms existing pose-free 3DGS methods on challenging real-world datasets with varying illumination.

81. 【2604.21160】Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment

Link: https://arxiv.org/abs/2604.21160

Authors: Jingkun Chen, Ruoshi Xu, Mingqi Gao, Shengda Luo, Jungong Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: empower embodied agents, executable spatial reasoning, Models promise, hallucination where predicted, structures contradict

Comments: 10 pages, 3 figures, 5 tables

Abstract:Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a Reprojection-Consistency term which serves as a cross-modal verifier to penalize physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D KPA from 0.64 to 0.93, increasing 3D bounding box intersection over union to 0.686, and raising reprojection consistency scores to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from plausible textual outputs toward physically verifiable spatial predictions.
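
One way to read the "Reprojection-Consistency term" in this abstract is as a check that a predicted 3D box, once projected into the image, agrees with the predicted 2D box. The sketch below shows that idea with a pinhole camera and an IoU score. It is an assumed illustration, not the paper's formulation: the camera intrinsics, the box coordinates, and every function name are hypothetical.

```python
from itertools import product


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D camera-frame point (x, y, z) to pixel (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def enclosing_box(points_2d):
    """Tightest axis-aligned 2D box (u_min, v_min, u_max, v_max) around the points."""
    us = [p[0] for p in points_2d]
    vs = [p[1] for p in points_2d]
    return (min(us), min(vs), max(us), max(vs))


def iou(a, b):
    """Intersection over union of two (u_min, v_min, u_max, v_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted 3D box: eight corners, 4 to 5 units in front of the camera.
corners = list(product([-1.0, 1.0], [-1.0, 1.0], [4.0, 5.0]))
projected = [project(c, fx=100.0, fy=100.0, cx=100.0, cy=100.0) for c in corners]
reprojected_box = enclosing_box(projected)

consistent = iou(reprojected_box, (75.0, 75.0, 125.0, 125.0))     # matching 2D box
inconsistent = iou(reprojected_box, (100.0, 75.0, 150.0, 125.0))  # shifted 2D box
```

A low score flags a 3D prediction that cannot produce the observed 2D evidence, which is the kind of geometric hallucination the abstract aims to penalize.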

82. 【2604.21146】WFM: 3D Wavelet Flow Matching for Ultrafast Multi-Modal MRI Synthesis

Link: https://arxiv.org/abs/2604.21146

Authors: Yalcin Tur,Mihajlo Stojkovic,Ulas Bagci

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable quality, limits clinical deployment, multi-modal MRI synthesis, computational cost, hundreds of sampling

Comments: 17 pages, 4 figures, 3 tables. Accepted at MIDL 2026 (Poster)

Abstract:Diffusion models have achieved remarkable quality in multi-modal MRI synthesis, but their computational cost (hundreds of sampling steps and separate models per modality) limits clinical deployment. We observe that this inefficiency stems from an unnecessary starting point: diffusion begins from pure noise, discarding the structural information already present in available MRI sequences. We propose WFM (Wavelet Flow Matching), which instead learns a direct flow from an informed prior, the mean of conditioning modalities in wavelet space, to the target distribution. Because the source and target share underlying anatomy and differ primarily in contrast, this formulation enables accurate synthesis in just 1-2 integration steps. A single 82M-parameter model with class conditioning synthesizes all four BraTS modalities (T1, T1c, T2, FLAIR), replacing four separate diffusion models totaling 326M parameters. On BraTS 2024, WFM achieves 26.8 dB PSNR and 0.94 SSIM, within 1-2 dB of diffusion baselines, while running 250-1000x faster (0.16-0.64s vs. 160s per volume). This speed-quality trade-off makes real-time MRI synthesis practical for clinical workflows. Code is available at this https URL.
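To see why starting from an informed prior can need only one or two integration steps, consider the toy sketch below. It is not the authors' model: it works on plain vectors rather than wavelet coefficients, and `make_toy_velocity` is a hypothetical stand-in for a trained network that returns an ideal straight-line velocity toward a known target.

```python
from typing import Callable, List

Vector = List[float]


def informed_prior(conditioning: List[Vector]) -> Vector:
    """Element-wise mean of the available conditioning modalities."""
    n = len(conditioning)
    return [sum(vals) / n for vals in zip(*conditioning)]


def euler_integrate(
    x0: Vector, velocity: Callable[[Vector, float], Vector], steps: int
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(source: Vector, target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical oracle: constant velocity along the straight path source -> target."""
    direction = [t - s for s, t in zip(source, target)]
    return lambda x, t: direction


# Hypothetical 3-voxel "volumes" for two conditioning modalities and one target.
t1 = [0.2, 0.4, 0.6]
t2 = [0.4, 0.6, 0.8]
flair_target = [0.5, 0.3, 0.9]

x0 = informed_prior([t1, t2])
velocity = make_toy_velocity(x0, flair_target)
one_step = euler_integrate(x0, velocity, steps=1)
two_step = euler_integrate(x0, velocity, steps=2)
```

When the velocity field is close to constant along the path, as in this idealized case, a single Euler step already lands on the target. A learned field is only approximately straight, which is consistent with the abstract's 1-2 steps.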

83. 【2604.21127】HyperFM: An Efficient Hyperspectral Foundation Model with Spectral Grouping

Link: https://arxiv.org/abs/2604.21127

Authors: Zahid Hassan Tushar,Sanjay Purushotham

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: influence Earth climate, NASA PACE mission, Ocean Color Instrument, ocean color, influence Earth

Comments: 15 pages, 8 figures, to be published in CVPR 2026 findings, Code and data are publicly available on [this https URL](https://github.com/umbc-sanjaylab/HyperFM)

Abstract:The NASA PACE mission provides unprecedented hyperspectral observations of ocean color, aerosols, and clouds, offering new insights into how these components interact and influence Earth's climate and air quality. Its Ocean Color Instrument measures light across hundreds of finely spaced wavelength bands, enabling detailed characterization of features such as phytoplankton composition, aerosol properties, and cloud microphysics. However, hyperspectral data of this scale is large, complex, and difficult to label, requiring specialized processing and analysis techniques. Existing foundation models, which have transformed computer vision and natural language processing, are generally trained on standard RGB imagery and therefore struggle to interpret the continuous spectral signatures captured by PACE. While recent advances have introduced hyperspectral foundation models, they are typically trained on cloud-free observations and often remain limited to single-sensor datasets due to spectral inconsistencies across instruments. Moreover, existing models tend to be parameter-heavy and computationally expensive, limiting scalability and adoption in operational settings. To address these challenges, we introduce HyperFM, a parameter-efficient hyperspectral foundation model that leverages intra-group and inter-group spectral attention along with hybrid parameter decomposition to better capture spectral spatial relationships while reducing computational cost. HyperFM demonstrates consistent performance improvements over existing hyperspectral foundation models and task-specific state-of-the-art methods across four benchmark downstream atmospheric cloud property retrieval tasks. To support further research, we additionally release HyperFM250K, a large-scale hyperspectral dataset from the PACE mission that includes both clear and cloudy scenes.

84. 【2604.21119】Materialistic RIR: Material Conditioned Realistic RIR Generation

Link: https://arxiv.org/abs/2604.21119

Authors: Mahnoor Fatima Saad,Sagnik Majumder,Kristen Grauman,Ziad Al-Halah

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD)

Keywords: Rings like gold, thuds like wood, spatial layout, spatial, Room Impulse Response

Comments: Accepted to CVPR 2026 Findings. Project page: [this https URL](https://mahnoor-fatima-saad.github.io/MatRIR.html)

Abstract:Rings like gold, thuds like wood! The sound we hear in a scene is shaped not only by the spatial layout of the environment but also by the materials of the objects and surfaces within it. For instance, a room with wooden walls will produce a different acoustic experience from a room with the same spatial layout but concrete walls. Accurately modeling these effects is essential for applications such as virtual reality, robotics, architectural design, and audio engineering. Yet, existing methods for acoustic modeling often entangle spatial and material influences in correlated representations, which limits user control and reduces the realism of the generated acoustics. In this work, we present a novel approach for material-controlled Room Impulse Response (RIR) generation that explicitly disentangles the effects of spatial and material cues in a scene. Our approach models the RIR using two modules: a spatial module that captures the influence of the spatial layout of the scene, and a material module that modulates this spatial RIR according to a user-specified material configuration. This explicitly disentangled design allows users to easily modify the material configuration of a scene and observe its impact on acoustics without altering the spatial structure or scene content. Our model provides significant improvements over prior approaches on both acoustic-based metrics (up to +16% on RTE) and material-based metrics (up to +70%). Furthermore, through a human perceptual study, we demonstrate the improved realism and material sensitivity of our model compared to the strongest baselines.

85. 【2604.21104】Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

Link: https://arxiv.org/abs/2604.21104

Authors: Amandeep Kaur,Mirali Purohit,Gedeon Muhawenayo,Esther Rolf,Hannah Kerner

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: geospatial foundation models, foundation models introduce, pretraining, pretraining dataset, pretraining datasets

Comments: Accepted at EarthVision workshop, CVPR 2026

Abstract:New geospatial foundation models introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study on how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets using diversity across continents, biomes, landcover and spectral values. We found that only spectral diversity was strongly correlated with performance, while others were weakly correlated. This finding establishes a new dimension of diversity to be accounted for when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at this https URL.

86. 【2604.21102】Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

Link: https://arxiv.org/abs/2604.21102

Authors: Siyuan Yao,Siavash Ghorbany,Kuangshi Ai,Arnav Cherukuthota,Meghan Forstchen,Alexis Korotasz,Matthew Sisk,Ming Hu,Chaoli Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Google Street View, leveraging large language, Street View, United States, Google Street

Comments: none

Abstract:We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.
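The two agreement metrics named in this abstract, PLCC (Pearson linear correlation) and SRCC (Spearman rank correlation), can be computed as in the sketch below. The score lists are hypothetical and only illustrate how the two metrics differ; they are not data from the paper.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation coefficient (SRCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical mean opinion scores and model predictions (monotone but nonlinear).
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.0, 4.0, 9.0, 16.0, 25.0]

plcc = pearson(mos, pred)
srcc = spearman(mos, pred)
```

Here the predictions preserve the ordering of the human scores exactly, so SRCC is 1, while PLCC is slightly lower because the relationship is not linear. Reporting both separates ranking quality from calibration.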

87. 【2604.21079】Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

Link: https://arxiv.org/abs/2604.21079

Authors: Juhong Min,Lazar Valkov,Vitali Petsiuk,Hossein Souri,Deen Dayal Mohan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high compute overhead, count incurs high, incurs high compute, visual-token count incurs, compute overhead

Comments: none

Abstract:Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: coldstart supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.

88. 【2604.21066】Optimizing Diffusion Priors with a Single Observation

Link: https://arxiv.org/abs/2604.21066

Authors: Frederic Wang,Katherine L. Bouman

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Methodology (stat.ME)

Keywords: purely simulated data, limited training sets, generate high-quality posterior, high-quality posterior samples, priors generate high-quality

Comments: none

Abstract:While diffusion priors generate high-quality posterior samples across many inverse problems, they are often trained on limited training sets or purely simulated data, thus inheriting the errors and biases of these underlying sources. Current approaches to finetuning diffusion models rely on a large number of observations with varying forward operators, which can be difficult to collect for many applications, and thus lead to overfitting when the measurement set is small. We propose a method for tuning a prior from only a single observation by combining existing diffusion priors into a single product-of-experts prior and identifying the exponents that maximize the Bayesian evidence. We validate our method on real-world inverse problems, including black hole imaging, where the true prior is unknown a priori, and image deblurring with text-conditioned priors. We find that the evidence is often maximized by priors that extend beyond those trained on a single dataset. By generalizing the prior through exponent weighting, our approach enables posterior sampling from both tempered and combined diffusion models, yielding more flexible priors that improve the trustworthiness of the resulting posterior image distribution.
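The product-of-experts construction described in this abstract can be written out as follows. The notation is assumed for illustration: $p_i$ are the existing diffusion priors, $\lambda_i$ their exponents, $y$ the single observation, and $p(y \mid x)$ the measurement likelihood. How the evidence is estimated in practice is not stated in the abstract.

```latex
% Product-of-experts prior built from K existing priors with exponents \lambda_i
p_{\lambda}(x) = \frac{1}{Z(\lambda)} \prod_{i=1}^{K} p_i(x)^{\lambda_i}

% Its score is a weighted sum of the individual scores,
% since Z(\lambda) does not depend on x
\nabla_x \log p_{\lambda}(x) = \sum_{i=1}^{K} \lambda_i \, \nabla_x \log p_i(x)

% Exponents are chosen to maximize the Bayesian evidence of the observation y
\lambda^{\star} = \arg\max_{\lambda} \, \log p(y \mid \lambda),
\qquad
p(y \mid \lambda) = \int p(y \mid x) \, p_{\lambda}(x) \, \mathrm{d}x
```

With $K = 1$ this reduces to a tempered prior $p_1(x)^{\lambda_1}$, and with $K > 1$ it combines priors, matching the abstract's "tempered and combined" cases.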

89. 【2604.21060】Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images

Link: https://arxiv.org/abs/2604.21060

Authors: Joakim Nguyen,Jian Yu,Jinrui Fang,Nicholas Konz,Tianlong Chen,Sanjay Krishnan,Chandra Krishnan,Ying Ding,Hairong Wang,Ankita Shukla

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: presents unique challenges, severe data scarcity, including severe data, pediatric brain tumor, Accurate diagnosis

Comments: Accepted at the IEEE International Conference on Healthcare Informatics (ICHI), 2026

Abstract:Accurate diagnosis of pediatric brain tumors, starting with histopathology, presents unique challenges for deep learning, including severe data scarcity, class imbalance, and fine-grained morphologic overlap across diagnostically distinct subtypes. While pathology foundation models have advanced patch-level representation learning, their effective adaptation to weakly supervised pediatric brain tumor classification under limited data remains underexplored. In this work, we introduce an expert-guided contrastive fine-tuning framework for pediatric brain tumor diagnosis from whole-slide images (WSI). Our approach integrates contrastive learning into slide-level multiple instance learning (MIL) to explicitly regularize the geometry of slide-level representations during downstream fine-tuning. We propose both a general supervised contrastive setting and an expert-guided variant that incorporates clinically informed hard negatives targeting diagnostically confusable subtypes. Through comprehensive experiments on pediatric brain tumor WSI classification under realistic low-sample and class-imbalanced conditions, we demonstrate that contrastive fine-tuning yields measurable improvements in fine-grained diagnostic distinctions. Our experimental analyses reveal complementary strengths across different contrastive strategies, with expert-guided hard negatives promoting more compact intra-class representations and improved inter-class separation. This work highlights the importance of explicitly shaping slide-level representations for robust fine-grained classification in data-scarce pediatric pathology settings.

90. 【2604.21053】Neuro-Symbolic Manipulation Understanding with Enriched Semantic Event Chains

Link: https://arxiv.org/abs/2604.21053

Authors: Fatemeh Ziaeetabar

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Robotic systems operating, Robotic systems, Semantic Event Chains, object interactions evolve, enriched Semantic Event

Comments: none

Abstract:Robotic systems operating in human environments must reason about how object interactions evolve over time, which actions are currently being performed, and what manipulation step is likely to follow. Classical enriched Semantic Event Chains (eSECs) provide an interpretable relational description of manipulation, but remain primarily descriptive and do not directly support uncertainty-aware decision making. In this paper, we propose eSEC-LAM, a neuro-symbolic framework that transforms eSECs into an explicit event-level symbolic state for manipulation understanding. The proposed formulation augments classical eSECs with confidence-aware predicates, functional object roles, affordance priors, primitive-level abstraction, and saliency-guided explanation cues. These enriched symbolic states are derived from a foundation-model-based perception front-end through deterministic predicate extraction, while current-action inference and next-primitive prediction are performed using lightweight symbolic reasoning over primitive pre- and post-conditions. We evaluate the proposed framework on EPIC-KITCHENS-100, EPIC-KITCHENS VISOR, and Assembly101 across action recognition, next-primitive prediction, robustness to perception noise, and explanation consistency. Experimental results show that eSEC-LAM achieves competitive action recognition, substantially improves next-primitive prediction, remains more robust under degraded perceptual conditions than both classical symbolic and end-to-end video baselines, and provides temporally consistent explanation traces grounded in explicit relational evidence. These findings demonstrate that enriched Semantic Event Chains can serve not only as interpretable descriptors of manipulation, but also as effective internal states for neuro-symbolic action reasoning.

91. 【2604.21052】StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

Link: https://arxiv.org/abs/2604.21052

Authors: Liqi Jing,Dingming Zhang,Peinian Li,Lichen Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Visual Autoregressive Modeling, discrete sequence modeling, learned latent space, conditional discrete sequence, sequence modeling

Comments: none

Abstract:We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content--style--target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR's multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.
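The group-relative normalization at the core of GRPO can be sketched as below. This shows only the standard step of standardizing rewards within a group of samples drawn for the same input. The reward values are hypothetical stand-ins for DreamSim-based scores, and the paper's per-action normalization weighting across VAR scales is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical perceptual rewards for four stylized outputs of the same content-style pair.
group_rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(group_rewards)
```

Samples that score above their group's mean get a positive advantage and those below get a negative one, so no separate value network is needed to form a baseline.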

92. 【2604.21041】Projected Gradient Unlearning for Text-to-Image Diffusion Models: Defending Against Concept Revival Attacks

Link: https://arxiv.org/abs/2604.21041

Authors: Aljalila Aladawi,Mohammed Talha Alam,Fakhri Karray

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: selectively remove undesirable, Machine unlearning, remove undesirable concepts, diffusion models aims, costly retraining

Comments: none

Abstract:Machine unlearning for text-to-image diffusion models aims to selectively remove undesirable concepts from pre-trained models without costly retraining. Current unlearning methods share a common weakness: erased concepts return when the model is fine-tuned on downstream data, even when that data is entirely unrelated. We adapt Projected Gradient Unlearning (PGU) from classification to the diffusion domain as a post-hoc hardening step. By constructing a Core Gradient Space (CGS) from the retain concept activations and projecting gradient updates into its orthogonal complement, PGU ensures that subsequent fine-tuning cannot undo the achieved erasure. Applied on top of existing methods (ESD, UCE, Receler), the approach eliminates revival for style concepts and substantially delays it for object concepts, running in roughly 6 minutes versus the ~2 hours required by Meta-Unlearning. PGU and Meta-Unlearning turn out to be complementary: which performs better depends on how the concept is encoded, and retain concept selection should follow visual feature similarity rather than semantic grouping.
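The projection step mentioned in this abstract, pushing a gradient into the orthogonal complement of a subspace, can be illustrated with the small sketch below. It is an assumption-laden toy: the subspace is spanned by two hypothetical "retain" vectors and orthonormalized with Gram-Schmidt, whereas the abstract does not say how the Core Gradient Space is actually built from activations.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: List[Vector], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(grad: Vector, basis: List[Vector]) -> Vector:
    """Remove the components of grad that lie in the span of the orthonormal basis."""
    g = list(grad)
    for b in basis:
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g


# Hypothetical retain-space directions and a raw gradient in a 3-parameter toy model.
retain_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
raw_grad = [3.0, 4.0, 5.0]

basis = orthonormal_basis(retain_vectors)
projected = project_out(raw_grad, basis)
```

The projected gradient has no component along any direction in the protected subspace, so an update along it leaves those directions unchanged to first order.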

93. 【2604.21032】Unlocking Multi-Spectral Data for Multi-Modal Models with Guided Inputs and Chain-of-Thought Reasoning

Link: https://arxiv.org/abs/2604.21032

Authors: Dahun Kim,Ganesh Satish Mallya,Anelia Angelova

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Remote Sensing applications, valuable input signal, environmental monitoring, Remote Sensing, land-use and land-cover

Comments: Accepted to IGARSS 2026

Abstract:Multi-spectral imagery is a valuable input signal for Remote Sensing applications, such as land-use and land-cover classification and environmental monitoring. However, generalist Large Multi-modal Models (LMMs) are typically trained on RGB images, limiting their applicability to the RGB domain. At the same time, training multi-spectral multi-modal models is expensive and produces uniquely specialized models. To address this, we propose a novel training-free approach that introduces multi-spectral data within the inference pipeline of standard RGB-only LMMs, allowing large gains in performance. Our approach leverages the LMMs' understanding of the visual space by adapting non-RGB inputs to that space and injecting domain-specific information and Chain-of-Thought reasoning as instructions. We demonstrate this with the Gemini 2.5 model and observe strong Zero-Shot performance gains on popular Remote Sensing benchmarks. These results highlight the potential for geospatial professionals to leverage powerful generalist models for specialized sensor inputs, benefiting from rich reasoning capabilities grounded in specialized data.

94. 【2604.21028】A Deep U-Net Framework for Flood Hazard Mapping Using Hydraulic Simulations of the Wupper Catchment

Link: https://arxiv.org/abs/2604.21028

Authors: Christian Lammers,Fernando Arévalo,Leonie Märker-Neuhaus,Daniel Heinenberg,Christian Förster,Karl-Heinz Spies

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: global flood events, flood events highlights, reliable flood prediction, flood prediction tools, global flood

Comments: 18 Pages, 9 Figures

Abstract:The increasing frequency and severity of global flood events highlight the need for the development of rapid and reliable flood prediction tools. This process traditionally relies on computationally expensive hydraulic simulations. This research presents a prediction tool by developing a deep-learning-based surrogate model to accurately and efficiently predict the maximum water level across a grid. This was achieved by conducting a series of experiments to optimize a U-Net architecture, patch generation, and data handling for approximating a hydraulic model. This research demonstrates that a deep learning surrogate model can serve as a computationally efficient alternative to traditional hydraulic simulations. The framework was tested using hydraulic simulations of the Wupper catchment in the North Rhine-Westphalia region (Germany), obtaining comparable results.

95. 【2604.21011】Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition

Link: https://arxiv.org/abs/2604.21011

Authors: Naga VS Raviteja Chappa,Evangelos Sariyanidi,Lisa Yankowitz,Gokul Nair,Casey J. Zampella,Robert T. Schultz,Birkan Tunç

Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

Keywords: localized movements lasting, localized movements, tapping fingers, movements lasting, scratching one head

Comments: Accepted to International Conference on Automatic Face and Gesture Recognition (FG)

Abstract:Micro-actions are subtle, localized movements lasting 1-3 seconds such as scratching one's head or tapping fingers. Such subtle actions are essential for social communication, ubiquitously used in natural interactions, and thus critical for fine-grained video understanding, yet remain poorly understood by current computer vision systems. We identify a fundamental challenge: micro-actions exhibit diverse spatio-temporal characteristics where some are defined by spatial configurations while others manifest through temporal dynamics. Existing methods that commit to a single spatio-temporal decomposition cannot accommodate this diversity. We propose a dual-path network that processes anatomically-grounded spatial entities through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways. The ST path captures spatial configurations before modeling temporal dynamics, while the TS path inverts this order to prioritize temporal dynamics. Rather than fixed fusion, we introduce entity-level adaptive routing where each body part learns its optimal processing preference, complemented by Mutual Action Consistency (MAC) loss that enforces cross-path coherence. Extensive experiments demonstrate competitive performance on MA-52 dataset and state-of-the-art results on iMiGUE dataset. Our work reveals that architectural adaptation to the inherent complexity of micro-actions is essential for advancing fine-grained video understanding.

96. 【2604.21008】Linear Image Generation by Synthesizing Exposure Brackets

Link: https://arxiv.org/abs/2604.21008

Authors: Yuekun Dai,Zhoutong Zhang,Shangchen Zhou,Nanxuan Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dynamic range, image signal processing, pipeline to produce, sophisticated image signal, photo begins

Comments: accepted by CVPR2026

Abstract:The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. In this paper, we address the task of text-to-linear-image generation: synthesizing a high-quality, scene-referred linear image that preserves full dynamic range, conditioned on a text prompt, for professional post-processing. Generating linear images is challenging, as pre-trained VAEs in latent diffusion models struggle to simultaneously preserve extreme highlights and shadows due to the higher dynamic range and bit depth. To this end, we represent a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. We further demonstrate downstream applications including text-guided linear image editing and structure-conditioned generation via ControlNet.

97. 【2604.20983】Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

Link: https://arxiv.org/abs/2604.20983

Authors: Syed Nazmus Sakib,Nafiul Haque,Shahrear Bin Amin,Hasan Muhammad Abdullah,Md. Mehedi Hasan,Mohammad Zabed Hossain,Shifat E. Arman

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Vision evaluations, Vision, multi-step processes, visual, Multimodal Large Language

Comments: Accepted at ACL 2026 Findings

Abstract:Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.

98. 【2604.20936】AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

Link: https://arxiv.org/abs/2604.20936

Authors: Adam Cole,Mick Grierson

Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: Video Diffusion Transformers, Video Diffusion, Diffusion Transformers, black-box video generation, internal mechanics

Comments: To appear in the Proceedings of the 2026 ACM Creativity and Cognition (CC '26). 15 pages, 19 figures

Abstract:We present AttentionBender, a tool that manipulates cross-attention in Video Diffusion Transformers to help artists probe the internal mechanics of black-box video generation. While generative outputs are increasingly realistic, prompt-only control limits artists' ability to build intuition for the model's material process or to work beyond its default tendencies. Using an autobiographical research-through-design approach, we built on Network Bending to design AttentionBender, which applies 2D transforms (rotation, scaling, translation, etc.) to cross-attention maps to modulate generation. We assess AttentionBender by visualizing 4,500+ video generations across prompts, operations, and layer targets. Our results suggest that cross-attention is highly entangled: targeted manipulations often resist clean, localized control, producing distributed distortions and glitch aesthetics over linear edits. AttentionBender contributes a tool that functions both as an Explainable AI style probe of transformer attention mechanisms, and as a creative technique for producing novel aesthetics beyond the model's learned representational space.

99. 【2604.20878】AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models

Link: https://arxiv.org/abs/2604.20878

Authors: Zijin Zhou, Songan Zhang

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: Traffic Accident Detection, Traffic Accident Understanding, achieved remarkable progress, Multimodal Large Language, Accident Detection

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress in Traffic Accident Detection (TAD) and Traffic Accident Understanding (TAU). However, existing studies mainly focus on describing and interpreting accident videos, leaving room for deeper causal reasoning and integration of legal knowledge. Traffic Accident Responsibility Allocation (TARA) is a more challenging task that requires multi-step reasoning grounded in traffic regulations. To address this, we introduce AITP (Artificial Intelligence Traffic Police), a multimodal large language model for responsibility reasoning and allocation. AITP enhances reasoning via a Multimodal Chain-of-Thought (MCoT) mechanism and integrates legal knowledge through Retrieval-Augmented Generation (RAG). We further present DecaTARA, a decathlon-style benchmark unifying ten interrelated traffic accident reasoning tasks with 67,941 annotated videos and 195,821 question-answer pairs. Extensive experiments show that AITP achieves state-of-the-art performance across responsibility allocation, TAD, and TAU tasks, establishing a new paradigm for reasoning-driven multimodal traffic analysis.

100. 【2604.20851】Robust Test-time Video-Text Retrieval: Benchmarking and Adapting for Query Shifts

Link: https://arxiv.org/abs/2604.20851

Authors: Bingqing Zhang, Zhuo Cao, Heming Du, Yang Li, Xue Li, Jiajun Liu, Sen Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: sharp performance drop, Modern video-text retrieval, query data deviates, Modern video-text, training domain

Comments: Accepted to ICLR 2026

Abstract:Modern video-text retrieval (VTR) models excel on in-distribution benchmarks but are highly vulnerable to real-world query shifts, where the distribution of query data deviates from the training domain, leading to a sharp performance drop. Existing image-focused robustness solutions are inadequate to handle this vulnerability in video, as they fail to address the complex spatio-temporal dynamics inherent in these shifts. To systematically evaluate this vulnerability, we first introduce a comprehensive benchmark featuring 12 distinct types of video perturbations across five severity degrees. Analysis on this benchmark reveals that query shifts amplify the hubness phenomenon, where a few gallery items become dominant "hubs" that attract a disproportionate number of queries. To mitigate this, we then propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), as our baseline test-time adaptation framework designed to directly counteract hubness in VTR. It leverages two key components: a Hubness Suppression Memory to refine similarity scores, and multi-granular losses to enforce temporal feature consistency. Extensive experiments demonstrate that HAT-VTR substantially improves robustness, consistently outperforming prior methods across diverse query shift scenarios, and enhancing model reliability for real-world applications.
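
The hubness phenomenon mentioned in the abstract can be made concrete with a small sketch. The code below is not HAT-VTR; it only illustrates the standard k-occurrence count, i.e. how often each gallery item lands in some query's top-k list. The function name `k_occurrence` and the toy similarity matrix are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def k_occurrence(similarity, k):
    """Count how often each gallery item appears in a query's top-k list.

    `similarity` is a list of rows, one per query; each row holds that
    query's similarity score to every gallery item (illustrative layout).
    """
    counts = Counter()
    for row in similarity:
        # Indices of the k most similar gallery items for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        counts.update(top_k)
    num_gallery = len(similarity[0])
    return [counts[j] for j in range(num_gallery)]


# Toy data: 4 queries (rows) x 3 gallery items (columns).
similarity = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.4],
    [0.7, 0.6, 0.1],
    [0.2, 0.9, 0.5],
]
hub_counts = k_occurrence(similarity, k=1)
print(hub_counts)  # gallery item 0 is retrieved by 3 of the 4 queries
```

A strongly skewed count list, as in this toy case, is the usual sign that a few gallery items act as hubs; a flat one suggests retrieval is spread evenly across the gallery.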

101. 【2604.21518】DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction

Link: https://arxiv.org/abs/2604.21518

Authors: Shiyan Su, Ruyi Zha, Danli Shi, Hongdong Li, Xuelian Cheng

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: effectively model volumetric, Neural representations, neural fields, computed tomography, sparse-view settings

Comments: Accepted to AAAI 2026. Project page: https://ooonesevennn.github.io/DiffNR/

Abstract:Neural representations (NRs), such as neural fields and 3D Gaussians, effectively model volumetric data in computed tomography (CT) but suffer from severe artifacts under sparse-view settings. To address this, we propose DiffNR, a novel framework that enhances NR optimization with diffusion priors. At its core is SliceFixer, a single-step diffusion model designed to correct artifacts in degraded slices. We integrate specialized conditioning layers into the network and develop tailored data curation strategies to support model finetuning. During reconstruction, SliceFixer periodically generates pseudo-reference volumes, providing auxiliary 3D perceptual supervision to fix underconstrained regions. Compared to prior methods that embed CT solvers into time-consuming iterative denoising, our repair-and-augment strategy avoids frequent diffusion model queries, leading to better runtime performance. Extensive experiments show that DiffNR improves PSNR by 3.99 dB on average, generalizes well across domains, and maintains efficient optimization.
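
To put the reported 3.99 dB figure in perspective, recall the standard PSNR definition. Assuming the peak value MAX is the same for both reconstructions (an assumption, the abstract does not state it), a PSNR gain on a single volume maps directly to a ratio of mean squared errors. Because the abstract reports an average gain, the resulting ratio is only a rough per-volume guide:

```latex
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right),
\qquad
\Delta\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MSE}_{\text{old}}}{\mathrm{MSE}_{\text{new}}}\right)
\;\Rightarrow\;
\frac{\mathrm{MSE}_{\text{old}}}{\mathrm{MSE}_{\text{new}}} = 10^{3.99/10} \approx 2.51
```

Under these assumptions, a 3.99 dB gain corresponds to roughly a 2.5-fold reduction in mean squared error.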

102. 【2604.20981】PanGuide3D: Cohort-Robust Pancreas Tumor Segmentation via Probabilistic Pancreas Conditioning and a Transformer Bottleneck

Link: https://arxiv.org/abs/2604.20981

Authors: Sunny Joy Ma, Xiang Ma

Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: contrast-enhanced computed tomography, surrounding soft tissue, Pancreatic tumor segmentation, cohort frequently degrade, computed tomography

Abstract:Pancreatic tumor segmentation in contrast-enhanced computed tomography (CT) is clinically important yet technically challenging: lesions are often small, heterogeneous, and easily confused with surrounding soft tissue, and models that perform well on one cohort frequently degrade under cohort shift. Our goal is to improve cross-cohort generalization while keeping the model architecture simple, efficient, and practical for 3D CT segmentation. We introduce PanGuide3D, a cohort-robust architecture with a shared 3D encoder, a pancreas decoder that predicts a probabilistic pancreas map, and a tumor decoder that is explicitly conditioned on this pancreas probability at multiple scales via differentiable soft gating. To capture long-range context under distribution shift, we further add a lightweight Transformer bottleneck in the U-Net bottleneck representation. We evaluate cohort transfer by training on the PanTS (Pancreatic Tumor Segmentation) cohort and testing both in-cohort (PanTS) and out-of-cohort on MSD (Medical Segmentation Decathlon) Task07 Pancreas, using matched preprocessing and training protocols across strong baselines. We collect voxel-level segmentation metrics, patient-level tumor detection, subgroup analyses by tumor size and anatomical location, volume-conditioned performance analyses, and calibration measurements to assess reliability. Across the evaluated models, PanGuide3D achieves the best overall tumor performance and shows improved cross-cohort generalization, particularly for small tumors and challenging anatomical locations, while reducing anatomically implausible false positives. These findings support probabilistic anatomical conditioning as a practical strategy for improving cross-cohort robustness in an end-to-end model and suggest potential utility for contouring support, treatment planning, and multi-institutional studies.