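This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

400 papers were updated today, including:

Natural language processing: 75 papers
Information retrieval: 12 papers
Computer vision: 79 papers

Natural Language Processing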

1. 【2604.21928】Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Link: https://arxiv.org/abs/2604.21928

Authors: Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu, Mickael Rouvier, Jane Wottawa, Richard Dufour

Categories: Computation and Language (cs.CL)

Keywords: Automatic Speech Recognition, Word Error Rate, Automatic Speech, Speech Recognition, evaluated using Word

Comments:

Abstract:Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
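To make concrete why WER is insensitive to meaning, here is a minimal sketch of the standard word-level edit-distance definition of WER. The function name and the example sentences are illustrative assumptions and do not come from the paper or the HATS dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "he is not guilty"
print(word_error_rate(reference, "he is guilty"))     # 0.25, meaning reversed
print(word_error_rate(reference, "he's not guilty"))  # 0.5, meaning preserved
```

In this toy case WER ranks the hypothesis that reverses the meaning above the one that preserves it, which is the kind of disagreement with human judgment the abstract refers to.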

2. 【2604.21916】MathDuels: Evaluating LLMs as Problem Posers and Solvers

Link: https://arxiv.org/abs/2604.21916

Authors: Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik

Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: attain near-ceiling performance, static mathematical benchmarks, language models attain, models attain near-ceiling, cast models solely

Comments:

Abstract:As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models reveal that authoring and solving capabilities are partially decoupled, and that dual-role evaluation reveals capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
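The Rasch model mentioned in the abstract gives the probability that a solver with ability theta answers a problem of difficulty b correctly as sigmoid(theta - b). The sketch below fits both sets of parameters by plain gradient ascent on the log-likelihood. The function names, the optimizer, the hyperparameters, and the toy response matrix are assumptions for illustration, not the paper's estimation procedure.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fit_rasch(responses, steps=200, lr=0.1):
    """Jointly estimate abilities and difficulties.

    responses[s][p] is 1 if solver s solved problem p, else 0.
    Difficulties are re-centered to mean zero after each step,
    because the model is only identified up to a common shift.
    """
    n_solvers, n_problems = len(responses), len(responses[0])
    ability = [0.0] * n_solvers
    difficulty = [0.0] * n_problems
    for _ in range(steps):
        grad_a = [0.0] * n_solvers
        grad_d = [0.0] * n_problems
        for s in range(n_solvers):
            for p in range(n_problems):
                residual = responses[s][p] - rasch_probability(ability[s], difficulty[p])
                grad_a[s] += residual  # d log-likelihood / d ability
                grad_d[p] -= residual  # d log-likelihood / d difficulty
        ability = [a + lr * g for a, g in zip(ability, grad_a)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_d)]
        mean_d = sum(difficulty) / n_problems
        difficulty = [d - mean_d for d in difficulty]
    return ability, difficulty


# Toy data: rows are solvers, columns are problems.
toy_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
]
abilities, difficulties = fit_rasch(toy_responses)
print(abilities)
print(difficulties)
```

In a dual-role setup like the one described, an author-quality score could then be derived from the estimated difficulties of the problems each model wrote; how exactly the paper aggregates them is not stated in the abstract.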

3. 【2604.21911】When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Link: https://arxiv.org/abs/2604.21911

Authors: Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: systems remain vulnerable, large vision-language models, impressive progress, progress in capabilities, capabilities of large

Comments:

Abstract:Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at this https URL.
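The abstract says only that HalluVL-DPO uses preference optimization over grounded versus hallucinated responses. As a rough illustration, the sketch below computes the standard Direct Preference Optimization loss for one preference pair. Treating the paper's objective as the standard DPO loss is an assumption, and the function name, argument names, and numeric values are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model. "Chosen" would be the grounded response and
    "rejected" the hallucinated one (assumed mapping).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities, for illustration only.
before = dpo_loss(-12.0, -11.0, -12.0, -11.0)  # policy equals reference
after = dpo_loss(-10.0, -14.0, -12.0, -11.0)   # policy now favors the grounded answer
print(before, after)
```

The loss starts at log 2 when the policy matches the reference and decreases as the policy assigns relatively more probability to the chosen response.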

4. 【2604.21901】GiVA: Gradient-Informed Bases for Vector-Based Adaptation

Link: https://arxiv.org/abs/2604.21901

Authors: Neeraj Gangwar, Rishabh Deshmukh, Michael Shavlovsky, Hancao Li, Vivek Mittal, Lexing Ying, Nickvash Kani

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: model sizes continue, parameter-efficient fine-tuning, continue to grow, full fine-tuning, model sizes

Comments: Accepted to AISTATS 2026

Abstract:As model sizes continue to grow, parameter-efficient fine-tuning has emerged as a powerful alternative to full fine-tuning. While LoRA is widely adopted among these methods, recent research has explored vector-based adaptation methods due to their extreme parameter efficiency. However, these methods typically require substantially higher ranks than LoRA to match its performance, leading to increased training costs. This work introduces GiVA, a gradient-based initialization strategy for vector-based adaptation. It achieves training times comparable to LoRA and maintains the extreme parameter efficiency of vector-based adaptation. We evaluate GiVA across diverse benchmarks, including natural language understanding, natural language generation, and image classification. Experiments show that our approach consistently outperforms or achieves performance competitive with existing vector-based adaptation methods and LoRA while reducing rank requirements by a factor of eight ($8\times$).

5. 【2604.21897】Mapping the Political Discourse in the Brazilian Chamber of Deputies: A Multi-Faceted Computational Approach

Link: https://arxiv.org/abs/2604.21897

Authors: Flávio Soriano, Victoria F. Mello, Pedro B. Rigueira, Gisele L. Pappa, Wagner Meira Jr., Ana Paula Couto da Silva, Jussara M. Almeida

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: voting records, overlooking the rich, political speech, behavior often rely, rely on voting

Comments: Accepted paper at ICWSM 2026

Abstract:Analyses of legislative behavior often rely on voting records, overlooking the rich semantic and rhetorical content of political speech. In this paper, we ask three complementary questions about parliamentary discourse: how things are said, what is being said, and who is speaking in discursively similar ways. To answer these questions, we introduce a scalable and generalizable computational framework that combines diachronic stylometric analysis, contextual topic modeling, and semantic clustering of deputies' speeches. We apply this framework to a large-scale case study of the Brazilian Chamber of Deputies, using a corpus of over 450,000 speeches from 2003 to 2025. Our results show a long-term stylistic shift toward shorter and more direct speeches, a legislative agenda that reorients sharply in response to national crises, and a granular map of discursive alignments in which regional and gender identities often prove more salient than formal party affiliation. More broadly, this work offers a robust methodology for analyzing parliamentary discourse as a multidimensional phenomenon that complements traditional vote-based approaches.

6. 【2604.21890】EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

Link: https://arxiv.org/abs/2604.21890

Authors: Praval Sharma, Ashok Samal, Leen-Kiat Soh, Deepti Joshi

Categories: Computation and Language (cs.CL)

Keywords: Event extraction identifies, identifies the central, central aspects, Event extraction, Event

Comments:

Abstract:Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset. We design a systematic annotation pipeline to create the dataset and provide empirical insights into annotation complexity. Using EVENT5Ws, we evaluate state-of-the-art pre-trained large language models and establish a benchmark for future research. We further show that models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, which demonstrates its potential for developing generalizable algorithms. Finally, we summarize the lessons learned during the dataset development and provide recommendations to support future large-scale dataset development.

7. 【2604.21889】TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Link: https://arxiv.org/abs/2604.21889

Authors: Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di, Rui Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large-scale cloud-native services, massive financial losses, diminished user trust, Real-time detection, cloud-native services

Comments: Accepted to ACL 2026 Industry Track

Abstract:Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

8. 【2604.21885】A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

Link: https://arxiv.org/abs/2604.21885

Authors: Praval Sharma

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Event extraction, open-domain event extraction, Event, event extraction approaches, open-domain event

Comments:

Abstract:Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. Additionally, they do not explicitly model document-level contextual, structural, and semantic reasoning, which are crucial for effective event extraction but remain challenging for LLMs due to the lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose multimodal open-domain event extraction, MODEE, a novel approach for open-domain event extraction that combines graph-based learning with text-based representation from LLMs to model document-level reasoning. Empirical evaluations on large datasets demonstrate that MODEE outperforms state-of-the-art open-domain event extraction approaches and can be generalized to closed-domain event extraction, where it outperforms existing algorithms.

9. 【2604.21882】Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

Link: https://arxiv.org/abs/2604.21882

Authors: Yuto Nishida, Naoki Shikoda, Yosuke Kishinami, Ryo Fujii, Makoto Morishita, Hidetaka Kamigaito, Taro Watanabe

Categories: Computation and Language (cs.CL)

Keywords: knowledge large language, Understanding what kinds, factual knowledge large, large language models, memorize is essential

Comments: Accepted to ACL 2026 Main

Abstract:Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.

10. 【2604.21871】Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

Link: https://arxiv.org/abs/2604.21871

Authors: Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti, Wenchao Dong, Jaehong Kim, Meeyoung Cha

Categories: Computation and Language (cs.CL)

Keywords: interpersonal relationships, context-dependent and modulated, modulated by interpersonal, predicted human behavior, predicted human

Comments: ACL-Findings 2026

Abstract:Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.

11. 【2604.21794】Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Link: https://arxiv.org/abs/2604.21794

Authors: Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang, Haohan Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: large language models, shown strong performance, complex reasoning tasks, treating inter-agent communication, fixed interface

Comments: Under review at COLM 2026

Abstract:Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.

12. 【2604.21782】SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning

Link: https://arxiv.org/abs/2604.21782

Authors: Hans Ole Hatzel, Ekaterina Artemova, Haimo Paul Stiemer, Evelyn Gius, Chris Biemann

Categories: Computation and Language (cs.CL)

Keywords: narrative representation learning, present the shared, narrative similarity, NSNRL, narrative

Comments:

Abstract:We present the shared task on narrative similarity and narrative representation learning - NSNRL (pronounced "nass-na-rel"). The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story. We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations. We collected at least two annotations each for more than 1,000 story summary triples, with each annotation being backed by at least two annotators in agreement. This paper describes the sampling and annotation process for the dataset; further, we give an overview of the submitted systems and the techniques they employ. We received a total of 71 final submissions from 46 teams across our two tracks. In our triple-based classification setup, LLM ensembles make up many of the top-scoring systems, while in the embedding setup, systems with pre- and post-processing on pretrained embedding models perform about on par with custom fine-tuned solutions. Our analysis identifies potential headroom for improvement of automated systems in both tracks. The task website includes visualizations of embeddings alongside instance-level classification results for all teams.
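In the embedding track described above, a system has to decide which of two stories is closer to an anchor story. A minimal sketch of that decision rule, assuming each story has already been mapped to a vector and that cosine similarity is the comparison, looks like this. The function names and the toy vectors are hypothetical, and the task's official scoring may differ.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closer_story(anchor, story_a, story_b) -> str:
    """Return "A" or "B" for the story more similar to the anchor."""
    sim_a = cosine_similarity(anchor, story_a)
    sim_b = cosine_similarity(anchor, story_b)
    return "A" if sim_a >= sim_b else "B"


# Toy 3-dimensional "embeddings", for illustration only.
anchor = [1.0, 0.2, 0.0]
story_a = [0.9, 0.3, 0.1]
story_b = [0.0, 0.1, 1.0]
print(closer_story(anchor, story_a, story_b))  # A
```

Framing the task this way turns any embedding model into a binary classifier over triples, so accuracy on the annotated triples can double as an evaluation of the embeddings themselves.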

13. 【2604.21767】Misinformation Span Detection in Videos via Audio Transcripts

Link: https://arxiv.org/abs/2604.21767

Authors: Breno Matos, Rennan C. Lima, Savvas Zannettou, Fabricio Benevenuto, Rodrygo L.T. Santos

Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)

Keywords: yielding severe consequences, public health risks, including political polarization, including online social, misinformation

Comments: Accepted at ICWSM 2026

Abstract:Online misinformation is one of the most challenging issues lately, yielding severe consequences, including political polarization, attacks on democracy, and public health risks. Misinformation manifests in any platform with a large user base, including online social networks and messaging apps. It permeates all media and content forms, including images, text, audio, and video. Distinctly, video-based misinformation represents a multifaceted challenge for fact-checkers, given the ease with which individuals can record and upload videos on various video-sharing platforms. Previous research efforts investigated detecting video-based misinformation, focusing on whether a video shares misinformation or not on a video level. While this approach is useful, it only provides a limited and not easily interpretable view of the problem, given that it does not provide additional context on when misinformation occurs within videos and what content (i.e., claims) is responsible for the video's misinformation nature. In this work, we attempt to bridge this research gap by creating two novel datasets that allow us to explore misinformation detection on videos via audio transcripts, focusing on identifying the spans of videos that are responsible for the video's misinformation claim (misinformation span detection). We transcribe each video's audio to text, identifying the video segment in which the misinformation claims appear, resulting in two datasets of more than 500 videos with over 2,400 segments containing annotated fact-checked claims. Then, we employ classifiers built with state-of-the-art language models, and our results show that we can identify in which part of a video there is misinformation with an F1 score of 0.68. We make our annotated datasets publicly available. We also release all transcripts, audio and videos.

14. 【2604.21766】AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

Link: https://arxiv.org/abs/2604.21766

Authors: Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed, Jordan Lee Boyd-Graber

Categories: Computation and Language (cs.CL)

Keywords: Internet Trivia Authors, Diverse Internet Trivia, Understanding from Diverse, Diverse Internet, surface-level acoustic recognition

Comments:

Abstract:Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. Human average accuracy of 32.13% shows both the challenge of the task and meaningful comprehension of the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency and question difficulty, and to expose systematic deficiencies of the models and data.

15. 【2604.21751】Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs

Link: https://arxiv.org/abs/2604.21751

Authors: Joseba Fernandez de Landa, Carla Perez-Almendros, Jose Camacho-Collados

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Western and Anglocentric, Anglocentric viewpoints, amplifying Western, coverage and competence, showing limitations

Comments:

Abstract:LLMs have been showing limitations when it comes to cultural coverage and competence, and in some cases show regional biases such as amplifying Western and Anglocentric viewpoints. While there have been works analysing the cultural capabilities of LLMs, there has not been specific work on highlighting LLM regional preferences when it comes to culture-related questions. In this work, we propose a new dataset based on a comprehensive taxonomy of Culture-Related Open Questions (CROQ). The results show that, contrary to previous cultural bias work, LLMs show a clear tendency towards countries such as Japan. Moreover, our results show that when prompting in languages such as English or other high-resource ones, LLMs tend to provide more diverse outputs and show less inclination towards answering questions highlighting countries for which the input language is an official language. Finally, we also investigate at which point of LLM training this cultural bias emerges, with our results suggesting that the first clear signs appear after supervised fine-tuning, and not during pre-training.

16. 【2604.21748】StructMem: Structured Memory for Long-Horizon Behavior in LLMs

Link: https://arxiv.org/abs/2604.21748

Authors: Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao, Yuqi Zhu, Lun Du, Shumin Deng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: Long-term conversational agents, multi-hop question answering, Long-term conversational, support temporal reasoning, relationships between events

Comments: Accepted by ACL 2026 main conference

Abstract:Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on LoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems; see this https URL.

17. 【2604.21725】AEL: Agent Evolving Learning for Open-Ended Environments

Link: https://arxiv.org/abs/2604.21725

Authors: Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei, Xi Zhu, Han Zhang, Dimitris N. Metaxas

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Keywords: LLM agents increasingly, open-ended environments spanning, environments spanning hundreds, agents increasingly operate, remain largely stateless

Comments:

Abstract:LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not what to remember but how to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce Agent Evolving Learning (AEL), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), AEL achieves a Sharpe ratio of 2.13±0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches. A nine-variant ablation reveals a "less is more" pattern: memory and reflection together produce a 58% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) degrades performance. This demonstrates that the bottleneck in agent self-improvement is self-diagnosing how to use experience rather than adding architectural complexity. Code and data: this https URL.
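The fast-timescale component described above is a Thompson Sampling bandit over memory retrieval policies. The sketch below shows the textbook Beta-Bernoulli version of that idea with binary rewards. The policy names, success rates, reward definition, and class interface are all hypothetical, since the abstract does not specify them.

```python
import random


class ThompsonSamplingBandit:
    """Beta-Bernoulli Thompson Sampling over a fixed set of arms."""

    def __init__(self, arms, seed=None):
        self.arms = list(arms)
        # Beta(1, 1) prior on each arm's success probability.
        self.successes = {arm: 1.0 for arm in self.arms}
        self.failures = {arm: 1.0 for arm in self.arms}
        self.rng = random.Random(seed)

    def select(self):
        """Sample a plausible success rate per arm and pick the best."""
        samples = {
            arm: self.rng.betavariate(self.successes[arm], self.failures[arm])
            for arm in self.arms
        }
        return max(samples, key=samples.get)

    def update(self, arm, reward: bool):
        """Record a binary outcome for the arm that was played."""
        if reward:
            self.successes[arm] += 1.0
        else:
            self.failures[arm] += 1.0


def simulate(true_success, episodes=2000, seed=0):
    """Run the bandit against hidden per-policy success rates."""
    env_rng = random.Random(seed)
    bandit = ThompsonSamplingBandit(true_success, seed=seed)
    counts = {arm: 0 for arm in true_success}
    for _ in range(episodes):
        arm = bandit.select()
        reward = env_rng.random() < true_success[arm]
        bandit.update(arm, reward)
        counts[arm] += 1
    return counts


# Hypothetical retrieval policies and success rates, for illustration only.
policies = {"no_memory": 0.4, "most_recent": 0.2, "most_similar": 0.8}
print(simulate(policies))
```

Because each arm is chosen in proportion to how likely it is to be the best one given the evidence so far, exploration tapers off on its own as the posterior concentrates, with no separate exploration schedule to tune.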

18. 【2604.21724】Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

Link: https://arxiv.org/abs/2604.21724

Authors: Yilong Chen, Yanxi Xie, Zitian Gao, He Xin, Yihao Xiao, Renbiao Liu, Haoming Luo, Yifan Luo, Zhengmao Ye, Tingwen Liu, Xin Zhao, Ran Tao, Bryan Dai

Categories: Computation and Language (cs.CL)

Keywords: Large token-indexed lookup, poor parameter efficiency, Large token-indexed, compute-decoupled scaling path, rapid memory growth

Comments: 29 pages, 9 figures, 13 tables

Abstract:Large token-indexed lookup tables provide a compute-decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under-training of the long tail, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. To address this, we propose X-GRAM, a frequency-aware dynamic token-injection framework. X-GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n-gram features. These signals are integrated into attention value streams and inter-layer residuals using depth-aware gating, effectively aligning static memory with dynamic context. This design introduces a memory-centric scaling axis that decouples model capacity from FLOPs. Extensive evaluations at the 0.73B and 1.15B scales show that X-GRAM improves average accuracy by as much as 4.4 points over the vanilla backbone and 3.2 points over strong retrieval baselines, while using substantially smaller tables in the 50% configuration. Overall, by decoupling capacity from compute through efficient memory management, X-GRAM offers a scalable and practical paradigm for future memory-augmented architectures. Code is available at this https URL.

19. 【2604.21718】Building a Precise Video Language with Human-AI Oversight

Link: https://arxiv.org/abs/2604.21718

Authors: Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: dynamic visual world, Video-language models, learn to reason, natural language, world through natural

Comments: CVPR 2026 Highlight. Project page: https://linzhiqiu.github.io/papers/chai/

Abstract:Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: this https URL

20. 【2604.21716】From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

Link: https://arxiv.org/abs/2604.21716

Authors: Minh Duc Bui,Xenia Heilmann,Mattia Cerrato,Manuel Mager,Katharina von der Wense

Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: Prior work evaluates, reveal solely overt, work evaluates code, evaluates code generation, Prior work

Notes: Accepted to ACL 2026 Findings

Abstract:Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including "race" while dropping "favorite color" for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across prompt mitigation strategies, varying numbers of attributes, and different pipeline difficulty levels. Our results challenge simple conditionals as valid proxies for bias evaluation and suggest current benchmarks underestimate bias risk in practical deployments.

21. 【2604.21706】Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers

Link: https://arxiv.org/abs/2604.21706

Authors: Bernard Muller,Antonio Armando Ortiz Barrañón,LaVonne Roberts

Categories: Computation and Language (cs.CL)

Keywords: self-supervised speech representations, frozen self-supervised speech, severity assessment based, speech representations, previously introduced

Notes: Submitted to Computer Speech & Language

Abstract:We previously introduced a training-free method for dysarthria severity assessment based on d-prime separability of phonological feature subspaces in frozen self-supervised speech representations, validated on 890 speakers across 5 languages with HuBERT-base. Here, we scale the analysis to 3,374 speakers from 25 datasets spanning 12 languages and 5 aetiologies (Parkinson's disease, cerebral palsy, ALS, Down syndrome, and stroke), plus healthy controls, using 6 SSL backbones. We report three findings. First, aetiology-specific degradation profiles are distinguishable at the group level: 10 of 13 features yield large effect sizes (epsilon-squared > 0.14, Holm-corrected p < 0.001), with Parkinson's disease separable from the articulatory execution group at Cohen's d = 0.83; individual-level classification remains limited (22.6% macro F1). Second, profiles show cross-lingual profile-shape stability: cosine similarity of 5-dimensional consonant d-prime profiles exceeds 0.95 across the languages available for each aetiology. Absolute d-prime magnitudes are not cross-lingually calibrated, so the method supports language-independent phenotyping of degradation patterns but requires within-corpus calibration for absolute severity interpretation. Third, the method is architecture-independent: all 6 backbones produce monotonic severity gradients with inter-model agreement exceeding rho = 0.77. Fixed-token d-prime estimation preserves the severity correlation (rho = -0.733 at 200 tokens per class), confirming that the signal is not a token-count artefact. These results support phonological subspace analysis as a robust, training-free framework for aetiology-aware dysarthria characterisation, with evidence of cross-lingual profile-shape stability and cross-backbone robustness in the represented sample.
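
As a rough illustration of the d-prime separability measure named in this abstract, the sketch below computes d' between two samples of scalar scores using the common definition d' = |mean_a - mean_b| / sqrt((var_a + var_b) / 2). The function name `d_prime`, the toy data, and the reduction of a feature subspace to one-dimensional projections are assumptions made for illustration; the paper's actual estimator may differ.

```python
from math import sqrt
from statistics import mean, variance


def d_prime(pos, neg):
    """Sensitivity index between two samples of scalar scores.

    Uses the sample variance of each group; assumes each group has at
    least two values and that the groups are not both constant.
    """
    pooled_sd = sqrt((variance(pos) + variance(neg)) / 2)
    return abs(mean(pos) - mean(neg)) / pooled_sd


# Hypothetical 1-D projections of frames with and without a phonological feature.
voiced = [1.0, 2.0, 3.0]
unvoiced = [4.0, 5.0, 6.0]
print(d_prime(voiced, unvoiced))
```

A larger value means the two groups are easier to tell apart along that projection; a value near zero means they overlap almost completely.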

22. 【2604.21700】Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

Link: https://arxiv.org/abs/2604.21700

Authors: Jiali Wei,Ming Fan,Guoheng Sun,Xicheng Zhang,Haijun Wang,Ting Liu

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: raised urgent concerns, large language models, growing application, application of large, large language

Abstract:The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.

23. 【2604.21698】Fixation Sequences as Time Series: A Topological Approach to Dyslexia Detection

Link: https://arxiv.org/abs/2604.21698

Authors: Marius Huber,David R. Reich,Lena A. Jäger

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Algebraic Topology (math.AT)

Keywords: extracts robust, time series, Persistent homology, features, Copenhagen Corpus

Notes: ETRA 2026

Abstract:Persistent homology, a method from topological data analysis, extracts robust, multi-scale features from data. It produces stable representations of time series by applying varying thresholds to their values (a process known as a filtration). We develop novel filtrations for time series and introduce topological methods for the analysis of eye-tracking data, by interpreting fixation sequences as time series, and constructing "hybrid models" that combine topological features with traditional statistical features. We empirically evaluate our method by applying it to the task of dyslexia detection from eye-tracking-while-reading data using the Copenhagen Corpus, which contains scanpaths from dyslexic and non-dyslexic L1 and L2 readers. Our hybrid models outperform existing approaches that rely solely on traditional features, showing that persistent homology captures complementary information encoded in fixation sequences. The strength of these topological features is further underscored by their achieving performance comparable to established baseline methods. Importantly, our proposed filtrations outperform existing ones.
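
To make the idea of thresholding a time series concrete, here is a minimal sketch of the standard sublevel-set filtration in dimension 0. It is the textbook construction, not one of the novel filtrations the paper proposes, and the function name `sublevel_persistence` and the toy series are assumptions for illustration. Each local minimum starts a connected component as the threshold rises; when two components meet, the one born later dies.

```python
def sublevel_persistence(series):
    """0-dimensional persistence pairs (birth, death) of a non-empty 1-D series."""
    n = len(series)
    parent = [None] * n   # None = sample is still above the threshold
    birth = {}            # index of a component's first sample -> birth value

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    # Raise the threshold by visiting samples in increasing order of value.
    for i in sorted(range(n), key=lambda k: series[k]):
        parent[i] = i
        birth[i] = series[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                a, b = find(i), find(j)
                if a == b:
                    continue
                # Elder rule: the component born later dies at the current value.
                young, old = (a, b) if birth[a] > birth[b] else (b, a)
                if birth[young] < series[i]:   # drop zero-length pairs
                    pairs.append((birth[young], series[i]))
                parent[young] = old
    # The component of the global minimum never dies.
    pairs.append((min(series), float("inf")))
    return pairs


print(sublevel_persistence([0, 2, 1, 3]))
```

The lengths `death - birth` of the resulting pairs are the kind of multi-scale summary that can be fed to a classifier alongside conventional statistics.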

24. 【2604.21667】Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales

Link: https://arxiv.org/abs/2604.21667

Authors: Olufunke O. Sarumi,Charles Welch,Daniel Braun

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: exploring disaggregated labels, User Passport mechanism, representation-level User Passport, exploring disaggregated, User Passport

Notes: Accepted at 5th NLPerspectives Workshop

Abstract:Beyond exploring disaggregated labels for modeling perspectives, annotator rationales provide fine-grained signals of individual perspectives. In this work, we propose a framework for jointly modeling annotator-specific label prediction and corresponding explanations, fine-tuned on the annotators' provided rationales. Using a dataset with disaggregated natural language inference (NLI) annotations and annotator-provided explanations, we condition predictions on both annotator identity and demographic metadata through a representation-level User Passport mechanism. We further introduce two explainer architectures: a post-hoc prompt-based explainer and a prefixed bridge explainer that transfers annotator-conditioned classifier representations directly into a generative model. This design enables explanation generation aligned with individual annotator perspectives. Our results show that incorporating explanation modeling substantially improves predictive performance over a baseline annotator-aware classifier, with the prefixed bridge approach achieving more stable label alignment and higher semantic consistency, while the post-hoc approach yields stronger lexical similarity. These findings indicate that modeling explanations as expressions of fine-grained perspective provides a richer and more faithful representation of disagreement. The proposed approaches advance perspectivist modeling by integrating annotator-specific rationales into both predictive and generative components.

25. 【2604.21649】GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion

Link: https://arxiv.org/abs/2604.21649

Authors: Qizhuo Xie,Yunhui Liu,Yu Xing,Qianzi Hou,Xudong Jin,Tao Zheng,Tieke He

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: shown immense potential, Knowledge Graph Completion, Large Language Models, LLM tokens remains, continuous graph embeddings

Notes: ACL 2026

Abstract:Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS-Quant is grounded in the insight that entity representations should follow a linguistic coarse-to-fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS-Quant significantly outperforms existing text-based and embedding-based baselines. Our code is publicly available at this https URL.

26. 【2604.21637】Multilinguality at the Edge: Developing Language Models for the Global South

Link: https://arxiv.org/abs/2604.21637

Authors: Lester James V. Miranda,Songbo Hu,Roi Reichart,Anna Korhonen

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: deployed determines, Global South, language models, prevent effective deployment, hardware constrained communities

Abstract:Where and how language models (LMs) are deployed determines who can benefit from them. However, there are several challenges that prevent effective deployment of LMs in non-English-speaking and hardware constrained communities in the Global South. We call this challenge the last mile: the intersection of multilinguality and edge deployment, where the goals are aligned but the technical requirements often compete. Studying these two fields together is both a need, as linguistically diverse communities often face the most severe infrastructure constraints, and an opportunity, as edge and multilingual NLP research remain largely siloed. To understand the state of the art and the challenges of combining the two areas, we survey 232 papers that tackle this problem across the language modelling pipeline, from data collection to development and deployment. We also discuss open questions and provide actionable recommendations for different stakeholders in the NLP ecosystem. Finally, we hope that this work contributes to the development of inclusive and equitable language technologies.

27. 【2604.21611】Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

Link: https://arxiv.org/abs/2604.21611

Authors: Hao-Yuan Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: LLM reasoning, Verbal Process Supervision, chain depth, sample breadth, GPQA Diamond

Abstract:Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
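
The iterative generate-critique-refine loop with a round budget R can be sketched as follows. This is only a control-flow illustration under stated assumptions: `critique_loop`, the callable signatures, and the toy actor and supervisor are hypothetical stand-ins, not the paper's implementation, and a real system would call language models where the toy functions return fixed strings.

```python
from typing import Callable, Optional


def critique_loop(
    task: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    critique: Callable[[str, str], Optional[str]],
    max_rounds: int,
) -> str:
    """Run generate -> critique -> refine for at most max_rounds rounds."""
    answer = generate(task, None, None)             # initial attempt
    for _ in range(max_rounds):
        feedback = critique(task, answer)           # None means "no issues found"
        if feedback is None:
            break
        answer = generate(task, answer, feedback)   # refine using the critique
    return answer


# Toy stand-ins: the actor is wrong at first and corrects itself once critiqued.
def toy_actor(task, previous, feedback):
    return "5" if previous is None else "4"


def toy_supervisor(task, answer):
    return None if answer == "4" else "Step 1: recheck the addition."


print(critique_loop("2 + 2 = ?", toy_actor, toy_supervisor, max_rounds=4))
```

With `max_rounds=0` the loop returns the actor's first attempt unchanged, which gives a natural no-supervision baseline for comparison.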

28. 【2604.21593】Language as a Latent Variable for Reasoning Optimization

Link: https://arxiv.org/abs/2604.21593

Authors: Linjuan Wu,Haoran Wei,Jialong Tang,Shuang Luo,Baosong Yang,Yongliang Shen,Weiming Lu

Categories: Computation and Language (cs.CL)

Keywords: reduce English-centric bias, LLMs reduce English-centric, surprising trend emerges, English-centric bias, reduce English-centric

Notes: 17 pages, 7 figures, under review

Abstract:As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and the best performance frequently occurs when language is unconstrained, suggesting that multilinguality broadens the model's latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It generates polyglot preference data online under language-constrained and unconstrained conditions, optimizing the policy with respect to both answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on four English reasoning test sets and 6.89% on their multilingual benchmark. Remarkably, it is the only method that surpasses the base LLM on an English commonsense reasoning task (4.9%), despite being trained solely on math data, highlighting its strong cross-task generalization. Further analysis reveals that treating language as a latent variable expands the model's latent reasoning space, yielding consistent and generalizable improvements in reasoning performance.

29. 【2604.21590】AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use

Link: https://arxiv.org/abs/2604.21590

Authors: Yuanjie Lyu,Chengyu Wang,Haonan Zheng,Yuanhao Yue,Junbing Yan,Ming Wang,Jun Huang

Categories: Computation and Language (cs.CL)

Keywords: Modern industrial applications, increasingly demand language, demand language models, Modern industrial, capable of multi-step

Abstract:Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: this https URL. Data synthesis and RL training code: this https URL. The data synthesis pipeline is also integrated into EasyDistill: this https URL.

30. 【2604.21564】Measuring Opinion Bias and Sycophancy via LLM-based Coercion

Link: https://arxiv.org/abs/2604.21564

Authors: Rodrigo Nogueira,Giovana Kerche Bonás,Thales Sales Almeida,Andrea Roque,Ramon Pires,Hugo Abonizio,Thiago Laitz,Celio Larcher,Roseval Malaquias Junior,Marcos Piau

Categories: Computation and Language (cs.CL)

Keywords: Large language models, information people consume, Large language, language models increasingly, models increasingly shape

Abstract:Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side. We propose a method, released as the open-source llm-bias-bench, for discovering the opinions an LLM actually holds on contested topics under conditions that resemble real multi-turn interaction. The method pairs two complementary free-form probes. Direct probing asks for the model's opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion and engages the model in argumentative debate, letting bias leak through how it concedes, resists, or counter-argues. Three user personas (neutral, agree, disagree) collapse into a nine-way behavioral classification that separates persona-independent positions from persona-dependent sycophancy, and an auditable LLM judge produces verdicts with textual evidence. The first instantiation ships 38 topics in Brazilian Portuguese across values, scientific consensus, philosophy, and economic policy. Applied to 13 assistants, the method surfaces findings of practical interest: argumentative debate triggers sycophancy 2-3x more than direct questioning (median 50% to 79%); models that look opinionated under direct questioning often collapse into mirroring under sustained arguments; and attacker capability matters mainly when an existing opinion must be dislodged, not when the assistant starts neutral.

31. 【2604.21555】Finding Meaning in Embeddings: Concept Separation Curves

Link: https://arxiv.org/abs/2604.21555

Authors: Paul Keuren,Marc Ponsen,Robert Ayoub Bagheri

Categories: Computation and Language (cs.CL)

Keywords: embedding techniques aim, encode key concepts, Sentence embedding techniques, Concept Separation Curves, vector space

Notes: The code is open source and available on GitHub at [this https URL](https://github.com/pkun-cbs/ConceptSeparationCurves). Original conference paper

Abstract:Sentence embedding techniques aim to encode key concepts of a sentence's meaning in a vector space. However, the majority of evaluation approaches for sentence embedding quality rely on the use of additional classifiers or downstream tasks. These additional components make it unclear whether good results stem from the embedding itself or from the classifier's behaviour. In this paper, we propose a novel method for evaluating the effectiveness of sentence embedding methods in capturing sentence-level concepts. Our approach is classifier-independent, allowing for an objective assessment of the model's performance. The approach adopted in this study involves the systematic introduction of syntactic noise and semantic negations into sentences, with the subsequent quantification of their relative effects on the resulting embeddings. The visualisation of these effects is facilitated by Concept Separation Curves, which show the model's capacity to differentiate between conceptual and surface-level variations. By leveraging data from multiple domains, employing both Dutch and English languages, and examining sentence lengths, this study offers a compelling demonstration that Concept Separation Curves provide an interpretable, reproducible, and cross-model approach for evaluating the conceptual stability of sentence embeddings.

32. 【2604.21534】UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text

Link: https://arxiv.org/abs/2604.21534

Authors: Darya Hryhoryeva,Amaia Zurinaga,Hamidreza Jamalabadi,Iryna Gurevych

Categories: Computation and Language (cs.CL)

Keywords: paper presents, pairwise Maximum Entropy, task requires modeling, Maximum Entropy, presents our system

Notes: Accepted to SemEval 2026 (co-located with ACL 2026)

Abstract:This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. Our system ranked first among participating teams in both Subtask 1 and Subtask 2A based on the official evaluation metric.

33. 【2604.21525】Job Skill Extraction via LLM-Centric Multi-Module Framework

Link: https://arxiv.org/abs/2604.21525

Authors: Guojing Li(1 and 2),Zichuan Fu(1),Junyi Li(1),Faxue Liu(1),Wenxia Zhou(2),Yejing Wang(1),Jingtong Gao(1),Maolin Wang(1),Rungen Liu(1),Wenlin Zhang(1),Xiangyu Zhao(1) ((1) City University of Hong Kong, (2) Renmin University of China)

Categories: Computation and Language (cs.CL)

Keywords: Span-level skill extraction, job advertisements underpins, advertisements underpins candidate-job, underpins candidate-job matching, yield malformed spans

Notes: 5 pages, 5 figures, 3 tables

Abstract:Span-level skill extraction from job advertisements underpins candidate-job matching and labor-market analytics, yet generative large language models (LLMs) often yield malformed spans, boundary drift, and hallucinations, especially with long-tail terms and cross-domain shift. We present SRICL, an LLM-centric framework that combines semantic retrieval (SR), in-context learning (ICL), and supervised fine-tuning (SFT) with a deterministic verifier. SR pulls in-domain annotated sentences and definitions from ESCO to form format-constrained prompts that stabilize boundaries and handle coordination. SFT aligns output behavior, while the verifier enforces pairing, non-overlap, and BIO legality with minimal retries. On six public span-labeled corpora of job-ad sentences across sectors and languages, SRICL achieves substantial STRICT-F1 improvements over GPT-3.5 prompting baselines and sharply reduces invalid tags and hallucinated spans, enabling dependable sentence-level deployment in low-resource, multi-domain settings.
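
One of the checks the abstract attributes to the deterministic verifier, BIO legality, can be sketched in a few lines. This is a minimal illustration assuming the usual BIO convention (an `I-X` tag must directly follow `B-X` or `I-X`); the function name `is_legal_bio` and the `SKILL` and `TOOL` labels are hypothetical, and the pairing and non-overlap checks mentioned in the abstract are not covered here.

```python
def is_legal_bio(tags):
    """Return True if the tag sequence is well-formed under the BIO scheme."""
    prev = "O"
    for tag in tags:
        if tag != "O" and not (tag.startswith("B-") or tag.startswith("I-")):
            return False   # unknown tag format
        if tag.startswith("I-"):
            # I-X is only legal directly after B-X or I-X of the same type.
            if prev not in ("B-" + tag[2:], "I-" + tag[2:]):
                return False
        prev = tag
    return True


print(is_legal_bio(["O", "B-SKILL", "I-SKILL", "O"]))  # well-formed span
print(is_legal_bio(["O", "I-SKILL"]))                  # I- tag without a B- tag
```

A verifier of this kind can reject a malformed model output and trigger a retry before any span is passed downstream.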

34. 【2604.21523】Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

Link: https://arxiv.org/abs/2604.21523

Authors: Mohammed Safi Ur Rahman Khan,Sanjay Suryanarayanan,Tushar Anand,Mitesh M. Khapra

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large Vision-Language Models, Large Vision-Language, Vision-Language Models, visual question answering, Evaluator VLMs

Abstract:Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality-degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs (with failure rates exceeding 50% in some cases), struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.

35. 【2604.21511】From Tokens to Concepts: Leveraging SAE for SPLADE

Link: https://arxiv.org/abs/2604.21511

Authors: Yuxuan Zong,Mathias Vast,Basile Van Cooten,Laure Soulier,Benjamin Piwowarski

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: excellent efficiency-effectiveness tradeoff, offer an excellent, efficiency-effectiveness tradeoff, excellent efficiency-effectiveness, Learned Sparse

Notes: 11 pages, 3 figures, 9 tables. To appear at SIGIR 2025

Abstract:Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemicity and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these 2 concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models. Our experiments demonstrate that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.

36. 【2604.21510】OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

Link: https://arxiv.org/abs/2604.21510

Authors: Xinyu Zhang,Boxuan Zhang,Yuchen Wan,Lingling Zhang,YiXing Yao,Bifan Wei,Yaqiang Wu,Jun Liu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, demonstrate remarkable reasoning, Large Language, requiring domain knowledge, tasks remain challenging

Abstract:While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.

37. 【2604.21496】How English Print Media Frames Human-Elephant Conflicts in India

Link: https://arxiv.org/abs/2604.21496

Authors: Bonala Sai Punith, Salveru Jayati, Garima Shakya, Shubham Kumar Nigam

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: expanding human settlements, human settlements force, settlements force elephants, Human-elephant conflict, contact with people

Comments:

Abstract:Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025. Using a multi-model sentiment framework that combines long-context transformers, large language models, and a domain-specific Negative Elephant Portrayal Lexicon, we quantify sentiment, extract rationale sentences, and identify linguistic patterns that contribute to negative portrayals of elephants. Our findings reveal a dominance of fear-inducing and aggression-related language. Since the media framing can shape public attitudes toward wildlife and conservation policy, such narratives risk reinforcing public hostility and undermining coexistence efforts. By providing a transparent, scalable methodology and releasing all resources through an anonymized repository, this study highlights how Web-scale text analysis can support responsible wildlife reporting and promote socially beneficial media practices.

38. 【2604.21495】Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning

Link: https://arxiv.org/abs/2604.21495

Authors: Hanjun Cho, Gahyun Yoo, Hanseong Kim, Jay-Yoon Lee

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: exhibits high in-domain, high in-domain accuracy, exhibits high, high in-domain, Numerical reasoning

Comments: Accepted to TACL. This is a pre-MIT Press publication version

Abstract:Numerical reasoning over expert-domain tables often exhibits high in-domain accuracy but limited robustness to domain shift. Models trained with supervised fine-tuning (SFT) on specific datasets tend to rely on header-operation shortcuts rather than structural reasoning. We introduce TaNOS, a continual pre-training framework comprising three components: (i) header anonymization to reduce lexical memorization, (ii) operation sketches that provide minimal structural cues, and (iii) self-supervised pretraining that constructs correctness-guaranteed program-question pairs from given tables in a program-first manner. By decoupling domain semantics and numerical operation structure, TaNOS improves the transferability of numerical reasoning. Applied to an 8B instruction-tuned model, TaNOS achieves 80.13% execution accuracy on FinQA with only 10% of the training data, outperforming the SFT baseline (73.97%) trained on the full training data as well as proprietary models such as GPT-5 and Gemini-2.5-Pro. Furthermore, in the domain-shift experiments, TaNOS displays a nearly negligible cross-domain gap (2pp), whereas standard SFT shows a gap of over 10pp. These results suggest that structural guidance with operation sketches, header-agnostic representations, and correctness-guaranteed self-supervision can improve the robustness of numerical reasoning across diverse expert-domain tables.

39. 【2604.21481】Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Link: https://arxiv.org/abs/2604.21481

Authors: Srija Anand, Ashwin Sankar, Ishvinder Sethi, Aaditya Pareek, Kartik Rajput, Gaurav Yadav, Nikhil Narasimhan, Adish Pandya, Deepon Halder, Mohammed Safi Ur Rahman Khan, Praveen S V, Shobhit Banga, Mitesh M Khapra

Categories: Computation and Language (cs.CL)

Keywords: Crowdsourced pairwise evaluation, assessing foundation models, Crowdsourced pairwise, scalable approach, approach for assessing

Comments:

Abstract:Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
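
For readers unfamiliar with Bradley-Terry modeling, the sketch below shows one common way to turn pairwise preference counts into a ranking, using the classic minorization-maximization update. It is only an illustration under stated assumptions: the function name `fit_bradley_terry` and the toy win matrix are hypothetical, and the paper's actual fitting procedure is not described in the abstract.

```python
def fit_bradley_terry(wins, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from a win-count matrix.

    wins[i][j] is the number of comparisons in which system i was
    preferred over system j. Returns strengths normalized to sum to 1.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


# Toy data: three hypothetical TTS systems, 10 comparisons per pair.
toy_wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = fit_bradley_terry(toy_wins)
leaderboard = sorted(range(len(strengths)), key=lambda i: -strengths[i])
```

Under the model, the probability that system i is preferred over system j is `strengths[i] / (strengths[i] + strengths[j])`, so the sorted strengths give a leaderboard directly.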

40. 【2604.21469】Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection

Link: https://arxiv.org/abs/2604.21469

Authors: Fariz Ikhwantri, Dusica Marijan

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: challenging task due, regulatory compliance remains, legal texts, remains a challenging, complexity and variability

Comments: 10 pages, 5 figures, 4 tables. 11th Special Session on Intelligent Data Mining, 2025 IEEE International Conference on Big Data

Abstract:Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis's cross-entropy difference, importance weighting, and embedding-based retrieval. We systematically vary the proportion of selected data to analyse its effect on cross-domain adaptation. Our findings demonstrate that targeted data selection substantially reduces negative transfer, offering a practical path toward scalable and reliable compliance automation across heterogeneous regulations.
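
As a rough illustration of one of the four selection strategies named above, the sketch below scores candidate sentences by Moore-Lewis cross-entropy difference: the per-token cross-entropy under an in-domain language model minus that under a general-domain model, keeping the lowest-scoring candidates. Everything here is an assumption made for brevity: the add-one-smoothed unigram models, the helper names, and the toy sentences are hypothetical, and the general model is fit on the whole candidate pool rather than on a random sample of it. The paper's own setup is not specified in the abstract.

```python
import math
from collections import Counter


def unigram_lm(sentences, vocab):
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    return {tok: (counts[tok] + 1) / (total + len(vocab)) for tok in vocab}


def cross_entropy(sentence, lm):
    """Average negative log-probability per token (in nats)."""
    toks = sentence.split()
    if not toks:
        return float("inf")
    return -sum(math.log(lm[t]) for t in toks) / len(toks)


def moore_lewis_select(in_domain, pool, k):
    """Return the k pool sentences with the lowest H_in(s) - H_general(s)."""
    vocab = {tok for s in in_domain + pool for tok in s.split()}
    lm_in = unigram_lm(in_domain, vocab)
    lm_gen = unigram_lm(pool, vocab)
    ranked = sorted(
        pool,
        key=lambda s: cross_entropy(s, lm_in) - cross_entropy(s, lm_gen),
    )
    return ranked[:k]


# Toy example with made-up sentences.
target = [
    "the controller shall obtain consent",
    "the processor shall delete personal data",
]
candidates = [
    "the controller shall delete personal data",
    "we went hiking on sunday",
    "the weather was sunny",
]
selected = moore_lewis_select(target, candidates, k=1)
```

A lower score means a sentence looks more like the target regulation than like the source pool as a whole, which is why the lowest-scoring candidates are kept as augmentation data.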

41. 【2604.21454】Reasoning Primitives in Hybrid and Non-Hybrid LLMs

Link: https://arxiv.org/abs/2604.21454

Authors: Shivam Rawat, Lucie Flek, Florian Mai, Nicholas Kluge Corrêa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, monolithic capability, basic operations, large language, observed gains

Comments:

Abstract:Reasoning in large language models is often treated as a monolithic capability, but its observed gains may arise from more basic operations. We study reasoning through two such primitives, recall and state-tracking, and ask whether hybrid architectures that combine attention-based retrieval with recurrent state updates are better suited than attention-only models for tasks that jointly require both. Using matched Olmo3 transformer and hybrid models in instruction-tuned and reasoning-augmented variants, we evaluate these models on a set of controlled tasks involving a mixture of state-tracking and recall primitives, state-based recall. Across tasks, we notice that reasoning augmentation provides the largest overall improvement, substantially extending the range of difficulty over which models remain effective. We also notice that in certain tasks, the hybrid reasoning model remains substantially more robust as sequential dependence increases. In contrast, the transformer reasoning model degrades sharply in performance as task difficulty increases beyond a given threshold. These results suggest that reasoning tokens and architectural inductive biases contribute at different levels of the computational process: explicit reasoning can expand a model's effective operating range, but its benefit depends on how well the underlying architecture supports persistent state propagation. Given the small size of our case study, which involves a limited set of models and tasks, we present these findings as suggestive rather than conclusive and leave broader validation across model families, scales, and task variations to future work.

42. 【2604.21446】AI-Gram: When Visual Agents Interact in a Social Network

Link: https://arxiv.org/abs/2604.21446

Authors: Andrew Shin

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)

Keywords: enabling image-based interactions, live platform enabling, platform enabling image-based, fully autonomous multi-agent, image-based interactions

Comments:

Abstract:We present AI-Gram, a live platform enabling image-based interactions, to study social dynamics in a fully autonomous multi-agent visual network where all participants are LLM-driven agents. Using the platform, we conduct experiments on how agents communicate and adapt through visual media, and observe the spontaneous emergence of visual reply chains, indicating rich communicative structure. At the same time, agents exhibit aesthetic sovereignty: resistance to stylistic convergence toward social partners, anchoring under adversarial influence, and a decoupling between visual similarity and social ties. These results reveal a fundamental asymmetry in current agent architectures: strong expressive communication paired with a steadfast preservation of individual visual identity. We release AI-Gram as a publicly accessible, continuously evolving platform for studying social dynamics in AI-native multi-agent systems.

43. 【2604.21428】Decoupled DiLoCo for Resilient Distributed Pre-training

Link: https://arxiv.org/abs/2604.21428

Authors: Arthur Douillard, Keith Rush, Yani Donchev, Zachary Charles, Nova Fallen, Ayush Dubey, Ionel Gog, Josef Dean, Blake Woodworth, Zachary Garrett, Nate Keating, Jenny Bishop, Henry Prior, Edouard Yvinec, Arthur Szlam, Marc'Aurelio Ranzato, Jeff Dean

Categories: Computation and Language (cs.CL)

Keywords: Modern large-scale language, pre-training relies heavily, requires tight coupling, Modern large-scale, program multiple data

Comments:

Abstract:Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, which requires tight coupling across accelerators. Due to this coupling, transient slowdowns, hardware failures, and synchronization overhead stall the entire computation, wasting significant compute time at scale. While recent distributed methods like DiLoCo reduced communication bandwidth, they remained fundamentally synchronous and vulnerable to these system stalls. To address this, we introduce Decoupled DiLoCo, an evolution of the DiLoCo framework designed to break the lock-step synchronization barrier and go beyond SPMD to maximize training goodput. Decoupled DiLoCo partitions compute across multiple independent "learners" that execute local inner optimization steps. These learners asynchronously communicate parameter fragments to a central synchronizer, which circumvents failed or straggling learners by aggregating updates using a minimum quorum, an adaptive grace window, and dynamic token-weighted merging. Inspired by "chaos engineering", we achieve significantly improved training efficiency in failure-prone environments with millions of simulated chips with strictly zero global downtime, while maintaining competitive model performance across text and vision tasks, for both dense and mixture-of-expert architectures.

44. 【2604.21421】Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation

Link: https://arxiv.org/abs/2604.21421

Authors: Michele Miranda, Xinlan Yan, Nishant Mishra, Rachel Murphy, Ameen Abu-Hanna, Sébastien Bratières, Iacer Calixto

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: GDPR and HIPAA, Protecting patient privacy, Protecting patient, narratives is essential, essential for enabling

Comments:

Abstract:Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. While manual de-identification remains the gold standard, it is costly and slow, motivating the need for automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines employ named entity recognition (NER) to identify protected entities for redaction. Methods based on differential privacy (DP) provide formal privacy guarantees, and more recently large language models (LLMs) are also increasingly used for text de-identification in the clinical domain. In this work, we present the first comparative study of DP, NER, and LLMs for Dutch clinical text de-identification. We investigate these methods separately as well as hybrid strategies that apply NER or LLM preprocessing prior to DP, and assess performance in terms of privacy leakage and extrinsic evaluation (entity and relation classification). We show that DP mechanisms alone degrade utility substantially, but combining them with linguistic preprocessing, especially LLM-based redaction, significantly improves the privacy-utility trade-off.

45. 【2604.21380】Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation

Link: https://arxiv.org/abs/2604.21380

Authors: Wang Shi Hai, Chen Tao

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: software performance requirements, software engineering, performance requirements, natural language, documented in natural

Comments: 9 pages, accepted by ACL 2026

Abstract:Since software performance requirements are documented in natural language, quantifying them into mathematical forms is essential for software engineering. Yet, the vagueness of performance requirements and the uncertainty of human cognition lead to highly uncertain and ambiguous interpretations, rendering their automated quantification an unaddressed and challenging problem. In this paper, we formalize the problem and propose IRAP, an approach that quantifies performance requirements into mathematical functions via interactive retrieval-augmented preference elicitation. IRAP differs from other approaches in that it explicitly draws on problem-specific knowledge to retrieve and reason about preferences, which also guides the progressive interaction with stakeholders while reducing the cognitive overhead. Experimental results against 10 state-of-the-art methods on four real-world datasets demonstrate the superiority of IRAP in all cases, with up to 40x improvements under as few as five rounds of interaction.

46. 【2604.21375】VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Link: https://arxiv.org/abs/2604.21375

Authors: Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou, Nancy Lau, Alvaro A. Cardenas, Yuhui Xu, Ran Xu, Caiming Xiong, Zeyu Zheng, Huaxiu Yao, Yuyin Zhou, Cihang Xie

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Keywords: Autonomous GUI agents, Autonomous GUI, GUI agents face, agents prematurely declare, prematurely declare success

Comments: The first two authors contribute equally

Abstract:Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.

47. 【2604.21370】MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization

Link: https://arxiv.org/abs/2604.21370

Authors: Maziar Kianimoghadam Jouneghani

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: multilingual polarization detection, contrasting multilingual generalists, present a systematic, systematic study, polarization detection

Comments: 9 pages, 9 tables. Accepted to the 20th International Workshop on Semantic Evaluation (SemEval-2026), Task 9

Abstract:We present a systematic study of multilingual polarization detection across 22 languages for SemEval-2026 Task 9 (Subtask 1), contrasting multilingual generalists with language-specific specialists and hybrid ensembles. While a standard generalist like XLM-RoBERTa suffices when its tokenizer aligns with the target text, it may struggle with distinct scripts (e.g., Khmer, Odia) where monolingual specialists yield significant gains. Rather than enforcing a single universal architecture, we adopt a language-adaptive framework that switches between multilingual generalists, language-specific specialists, and hybrid ensembles based on development performance. Additionally, cross-lingual augmentation via NLLB-200 yielded mixed results, often underperforming native architecture selection and degrading morphologically rich tracks. Our final system achieves an overall macro-averaged F1 score of 0.796 and an average accuracy of 0.826 across all 22 tracks. Code and final test predictions are publicly available.

48. 【2604.21365】mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code

Link: https://arxiv.org/abs/2604.21365

Authors: Adam Skurla, Dominik Macko, Jakub Simko

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: Multi-domain detection, challenging task, programming languages, machine-generated code snippets, Multi-domain

Comments:

Abstract:Multi-domain detection of machine-generated code snippets in various programming languages is a challenging task. SemEval-2026 Task 13 addresses this challenge from several angles, both as a binary detection problem and as attribution of the source. Specifically, its subtasks also cover detection of the generator LLM family, as well as hybrid code co-generated by humans and machines and adversarially modified code that hides its origin. Our submitted systems adapt the existing mdok approach (focused on machine-generated text detection) to these specific kinds of problems by exploring various base models that are more suitable for code understanding. The results indicate that the submitted systems are competitive in all three subtasks. However, the margins from the top-performing systems are significant, and thus further improvements are possible.

49. 【2604.21357】ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

Link: https://arxiv.org/abs/2604.21357

Authors: Jian Cui, Zhiyuan Ren, Desheng Weng, Yongqi Zhao, Gong Wenbin, Yu Lei, Zhenning Dong

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: including workflow complexity, traditional multi-stage approaches, vector similarity retrieval, geographic knowledge bases, structured geographic knowledge

Comments: 12 pages, 8 figures, submitted to ACM SIGSPATIAL 2024 (under review)

Abstract:This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating the coordinate prediction task as a text generation problem, and introduces a Chain-of-Thought mechanism to enhance the model's reasoning over spatial relationships. Furthermore, reinforcement learning with a distance-deviation-based reward is applied to optimize the generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point predictions and effectively resolve vague relative location queries. In addition, the model demonstrates strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.
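
To make the coordinate-to-text reformulation concrete, the sketch below encodes a latitude/longitude pair with the standard geohash scheme (alternating longitude and latitude bisection, five bits per base-32 character). It illustrates geohash encoding in general; the function name `geohash_encode` and the chosen precision are assumptions, and the abstract does not say what precision or tokenization ReaGeo uses.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=8):
    """Encode a (lat, lon) pair as a geohash string of `precision` chars."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # geohash starts with a longitude bit
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = value * 2 + 1
                lon_lo = mid
            else:
                value = value * 2
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = value * 2 + 1
                lat_lo = mid
            else:
                value = value * 2
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


# A coordinate becomes a short character sequence a text model can emit.
target_text = geohash_encode(57.64911, 10.40744, precision=5)
```

Because nearby points share long geohash prefixes, each additional correctly generated character narrows the predicted cell, which is one reason a string target can be convenient for a generative model.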

50. 【2604.21352】CARE: Counselor-Aligned Response Engine for Online Mental-Health Support

Link: https://arxiv.org/abs/2604.21352

Authors: Hagai Astrin, Ayal Swaid, Avi Segal, Kobi Gal

Categories: Computation and Language (cs.CL)

Keywords: Mental health challenges, increasing worldwide, challenges are increasing, services and leading, Mental health

Comments: 9 pages, 4 figures

Abstract:Mental health challenges are increasing worldwide, straining emotional support services and leading to counselor overload. This can result in delayed responses during critical situations, such as suicidal ideation, where timely intervention is essential. While large language models (LLMs) have shown strong generative capabilities, their application in low-resource languages, especially in sensitive domains like mental health, remains underexplored. Furthermore, existing LLM-based agents often struggle to replicate the supportive language and intervention strategies used by professionals due to a lack of training on large-scale, real-world datasets. To address this, we propose CARE (Counselor-Aligned Response Engine), a GenAI framework that assists counselors by generating real-time, psychologically aligned response recommendations. CARE fine-tunes open-source LLMs separately for Hebrew and Arabic using curated subsets of real-world crisis conversations. The training data consists of sessions rated as highly effective by professional counselors, enabling the models to capture interaction patterns associated with successful de-escalation. By training on complete conversation histories, CARE maintains the evolving emotional context and dynamic structure of counselor-help-seeker dialogue. In experimental settings, CARE demonstrates stronger semantic and strategic alignment with gold-standard counselor responses compared to non-specialized LLMs. These findings suggest that domain-specific fine-tuning on expert-validated data can significantly support counselor workflows and improve care quality in low-resource language contexts.

51. 【2604.21346】Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning

Link: https://arxiv.org/abs/2604.21346

Authors: Mohit Vaishnav, Tanel Tammet

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: main bottleneck lies, Bongard problems, language models, large language models, raising the question

Comments:

Abstract:Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential-Grammatical (C-G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.

52. 【2604.21345】Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

Link: https://arxiv.org/abs/2604.21345

Authors: Philip Zhong, Don Wang, Jason Zhang, Kent Chen

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: reusable evaluation pipeline, public artifact package, artifact package derived, Dataset Pipeline, summaries and released

Comments: AI Application Feature Quality Evaluation (28 pages total)

Abstract:We present a reusable evaluation pipeline for generative AI applications, instantiated for AI meeting summaries and released with a public artifact package derived from a Dataset Pipeline. The system separates reusable orchestration from task-specific semantics across five stages: source intake, structured reference construction, candidate generation, structured scoring, and reporting. Unlike standalone claim scorers, it treats both ground truth and evaluator outputs as typed, persisted artifacts, enabling aggregation, issue analysis, and statistical testing. We benchmark the offline loop on a typed dataset of 114 meetings spanning city_council, private_data, and whitehouse_press_briefings, producing 340 meeting-model pairs and 680 judge runs across gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this protocol, gpt-4.1-mini achieves the highest mean accuracy (0.583), while gpt-5.1 leads in completeness (0.886) and coverage (0.942). Paired sign tests with Holm correction show no significant accuracy winner but confirm significant retention gains for gpt-5.1. A typed DeepEval contrastive baseline preserves retention ordering but reports higher holistic accuracy, suggesting that reference-based scoring may overlook unsupported-specifics errors captured by claim-grounded evaluation. Typed analysis identifies whitehouse_press_briefings as an accuracy-challenging domain with frequent unsupported specifics. A deployment follow-up shows gpt-5.4 outperforming gpt-4.1 across all metrics, with statistically robust gains on retention metrics under the same protocol. The system benchmarks the offline loop and documents, but does not quantitatively evaluate, the online feedback-to-evaluation path.
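
The statistical step mentioned above, paired sign tests with Holm correction, can be sketched as follows. This is a minimal, generic version of the two procedures, not the authors' pipeline: the function names and the example p-values are hypothetical, ties are simply dropped, and the sign test is the exact two-sided binomial form.

```python
from math import comb


def sign_test_p(diffs):
    """Exact two-sided sign test on paired differences (ties dropped)."""
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values for three hypothetical pairwise model comparisons.
raw = [0.01, 0.04, 0.03]
corrected = holm_adjust(raw)
```

With several model pairs and several metrics compared at once, the Holm step controls the family-wise error rate while being less conservative than a plain Bonferroni correction.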

53. 【2604.21344】Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts

Link: https://arxiv.org/abs/2604.21344

Authors: Azher Ahmed Efat, Seok Hwan Song, Wallapak Tavanapong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: present complex information, complex information, present complex, Multimodal Language Models, multiple related charts

Comments:

Abstract:Charts are widely used to present complex information. Deriving meaningful insights in real-world contexts often requires interpreting multiple related charts together. However, the understanding of multi-chart images has not been extensively explored. We introduce PolyChartQA, a mid-scale dataset specifically designed for question answering over multi-chart images. PolyChartQA comprises 534 multi-chart images (with a total of 2,297 sub-charts) sourced from peer-reviewed computer science research publications and 2,694 QA pairs. We evaluate the performance of nine state-of-the-art Multimodal Language Models (MLMs) on PolyChartQA across question type, difficulty, question source, and key structural characteristics of multi-charts. Our results show a 27.4% drop in LLM-based accuracy (L-Accuracy) on human-authored questions compared to MLM-generated questions, and a 5.39% L-Accuracy gain with our proposed prompting method.

54. 【2604.21335】Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

Link: https://arxiv.org/abs/2604.21335

Authors: Wei Jiang, Wei Wang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: finer control axis, prior work, offers a finer, finer control, control axis

Comments: 16 pages, 14 tables, 2 figures

Abstract:Sub-token routing offers a finer control axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. The motivation is that a relevant token need not be internally uniform: under a retention budget, preserved value groups are distributed unevenly both across tokens and within tokens, which suggests that KV compression need not be an all-or-nothing decision at token level. We study this fine-grained routing mechanism in two settings. For compression-aware language modeling, we introduce a query-independent design that combines routed subspace LoRA with value-group routing on the KV path. For downstream-task-preserving KV compression, we introduce a query-aware design in which a predictor-based selector allocates a global retention budget over context-token/value-group pairs using query-conditioned relevance. Experiments show that the query-independent design improves the quality-compression tradeoff for language modeling, while the query-aware design preserves downstream behavior under reduced KV budgets. We further examine the relation between token-level and sub-token-level query-aware routing, and show that they form complementary compression axes: token-level methods determine which tokens survive globally, while sub-token routing determines how the surviving tokens are compressed internally.

55. 【2604.21334】Ideological Bias in LLMs' Economic Causal Reasoning

Link: https://arxiv.org/abs/2604.21334

Authors: Donggyu Lee, Hyeok Yun, Jungwon Kim, Junsik Min, Sungwon Park, Sangyoon Park, Jihee Kim

Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG); General Economics (econ.GN)

Keywords: large language models, large language, bias when reasoning, exhibit systematic ideological, systematic ideological bias

Comments:

Abstract:Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases: instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.

56. 【2604.21327】Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Link: https://arxiv.org/abs/2604.21327

Authors: Yongcan Yu, Lingxiao He, Jian Liang, Kuangpu Guo, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Test-time reinforcement learning, reinforcement learning, time via pseudo-labeling, leaving it vulnerable, Denoised test-time Reinforcement

Comments: Accepted to ACL 2026 Findings

Abstract:Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can even be amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at this https URL.
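
To make the "exclude ambiguous samples, then use fixed advantages" step in the abstract above more concrete, here is a minimal sketch. The thresholds `low` and `high`, the function name `fixed_advantages`, and the +1/-1 values are assumptions for illustration only; the abstract does not give DDRL's actual thresholds or advantage values, and the balancing of positive and negative examples is omitted here.

```python
from collections import Counter
from typing import List, Optional


def fixed_advantages(
    answers: List[str], low: float = 0.2, high: float = 0.6
) -> List[Optional[float]]:
    """Assign a fixed advantage to each sampled answer by its frequency.

    Answers whose frequency among the samples is at least `high` are treated
    as pseudo-positive (+1.0), those at most `low` as pseudo-negative (-1.0),
    and everything in between is marked ambiguous (None) and excluded.
    """
    n = len(answers)
    freq = {a: c / n for a, c in Counter(answers).items()}
    out: List[Optional[float]] = []
    for a in answers:
        f = freq[a]
        if f >= high:
            out.append(1.0)
        elif f <= low:
            out.append(-1.0)
        else:
            out.append(None)  # medium consistency: the "ambiguity region"
    return out


# Ten hypothetical sampled final answers for one math prompt.
samples = ["42"] * 6 + ["41"] * 3 + ["7"]
advantages = fixed_advantages(samples)
print(advantages)
```

Because the advantages are constants rather than being normalized within the group, a noisy medium-frequency answer cannot have its signal rescaled upward by the group statistics, which is the amplification effect the abstract attributes to group-relative estimation.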

57. 【2604.21309】When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation

Link: https://arxiv.org/abs/2604.21309

Authors: Nannan Huang, Iffat Maab, Junichi Yamagishi

Categories: Computation and Language (cs.CL)

Keywords: processing vast daily, political perspectives critical, daily news content, diverse political perspectives, Multi-document news summarisation

Comments: Accepted to ACL 2026 Main Conference

Abstract:Multi-document news summarisation systems are increasingly adopted for their convenience in processing vast daily news content, making fairness across diverse political perspectives critical. However, these systems can exhibit political bias through unequal representation of viewpoints, disproportionate emphasis on certain perspectives, and systematic underrepresentation of minority voices. This study presents a comprehensive evaluation of such bias in multi-document news summarisation using FairNews, a dataset of complete news articles with political orientation labels, examining how large language models (LLMs) handle sources with varying political leanings across 13 models and five fairness metrics. We investigate both baseline model performance and effectiveness of various debiasing interventions, including prompt-based and judge-based approaches. Our findings challenge the assumption that larger models yield fairer outputs, as mid-sized variants consistently outperform their larger counterparts, offering the best balance of fairness and efficiency. Prompt-based debiasing proves highly model dependent, while entity sentiment emerges as the most stubborn fairness dimension, resisting all intervention strategies tested. These results demonstrate that fairness in multi-document news summarisation requires multi-dimensional evaluation frameworks and targeted, architecture-aware debiasing rather than simply scaling up.

58. 【2604.21308】CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents

Link: https://arxiv.org/abs/2604.21308

Authors: Wenjie Fu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Lukas Wutschitz, Robert Sim, Saravan Rajmohan, Dongmei Zhang

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: improve workplace productivity, dramatically improve workplace, Enterprise LLM agents, LLM agents, workplace productivity

Comments:

Abstract:Enterprise LLM agents can dramatically improve workplace productivity, but their core capability, retrieving and using internal context to act on a user's behalf, also creates new risks for sensitive information leakage. We introduce CI-Work, a Contextual Integrity (CI)-grounded benchmark that simulates enterprise workflows across five information-flow directions and evaluates whether agents can convey essential content while withholding sensitive context in dense retrieval settings. Our evaluation of frontier models reveals that privacy failures are prevalent (violation rates range from 15.8% to 50.9%, with leakage reaching up to 26.7%) and uncovers a counterintuitive trade-off critical for industrial deployment: higher task utility often correlates with increased privacy violations. Moreover, the massive scale of enterprise data and potential user behavior further amplify this vulnerability. Simply increasing model size or reasoning depth fails to address the problem. We conclude that safeguarding enterprise workflows requires a paradigm shift, moving beyond model-centric scaling toward context-centric architectures.

59. 【2604.21300】Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI

Link: https://arxiv.org/abs/2604.21300

Authors: Hieu Man, Van-Cuong Pham, Nghia Trung Ngo, Franck Dernoncourt, Thien Huu Nguyen

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Variational Autoencoder, Authorship Variational Autoencoder, Explainable Authorship Variational, Learning robust representations, EAVAE

Comments:

Abstract:Learning robust representations of authorial style is crucial for authorship attribution and AI-generated text detection. However, existing methods often struggle with content-style entanglement, where models learn spurious correlations between authors' writing styles and topics, leading to poor generalization across domains. To address this challenge, we propose Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation-by-design. EAVAE first pretrains style encoders using supervised contrastive learning on diverse authorship data, then finetunes with a Variational Autoencoder (VAE) architecture using separate encoders for style and content representations. Disentanglement is enforced through a novel discriminator that not only distinguishes whether pairs of style/content representations belong to the same or different authors/content sources, but also generates natural language explanations for its decisions, simultaneously mitigating confounding information and enhancing interpretability. Extensive experiments demonstrate the effectiveness of EAVAE. On authorship attribution, we achieve state-of-the-art performance on various datasets, including Amazon Reviews, PAN21, and HRS. For AI-generated text detection, EAVAE excels in few-shot learning over the M4 dataset. Code and data repositories are available online (this https URL, this https URL).

60. 【2604.21286】Cross-Entropy Is Load-Bearing: A Pre-Registered Scope Test of the K-Way Energy Probe on Bidirectional Predictive Coding

Link: https://arxiv.org/abs/2604.21286

Authors: Jon-Paul Cacioli

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: K-way energy probe, discriminative predictive coding, predictive coding networks, coding networks reduces, standard discriminative predictive

Comments: 11 pages, 3 figures, 4 tables. Pre-registered on OSF ([this https URL](https://osf.io/2kvsp)). Code at [this https URL](https://github.com/synthiumjp/ima)

Abstract:Cacioli (2026) showed that the K-way energy probe on standard discriminative predictive coding networks reduces approximately to a monotone function of the log-softmax margin. The reduction rests on five assumptions, including cross-entropy (CE) at the output and effectively feedforward inference dynamics. This pre-registered study tests the reduction's sensitivity to CE removal using two conditions: standard PC trained with MSE instead of CE, and bidirectional PC (bPC; Oliviers, Tang & Bogacz, 2025). Across 10 seeds on CIFAR-10 with a matched 2.1M-parameter backbone, we find three results. The negative result replicates on standard PC: the probe sits below softmax (Delta = -0.082, p < 10^-6). On bPC the probe exceeds softmax across all 10 seeds (Delta = +0.008, p = 0.000027), though a pre-registered manipulation check shows that bPC does not produce materially greater latent movement than standard PC at this scale (ratio 1.6, threshold 10). Removing CE alone without changing inference dynamics halves the probe-softmax gap (Delta_MSE = -0.037 vs Delta_stdPC = -0.082). CE is a major empirically load-bearing component of the decomposition at this scale. CE training produces output logit norms approximately 15x larger than MSE or bPC training. A post-hoc temperature scaling ablation decomposes the probe-softmax gap into two components: approximately 66% is attributable to logit-scale effects removable by temperature rescaling, and approximately 34% reflects a scale-invariant ranking advantage of CE-trained representations. We use "metacognitive" operationally to denote Type-2 discrimination of a readout over its own Type-1 correctness, not to imply human-like introspective access.

61. 【2604.21284】Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture

Link: https://arxiv.org/abs/2604.21284

Authors: Robin Dey, Panyanon Viradecha

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: large language models, requiring any LLM, LLM inference, organize long-term memory, method of loci

Comments: 20 pages, 10 tables. Code and data at [this https URL](https://github.com/web3guru888/mempalace-scientific-analysis)

Abstract:MemPalace is an open-source AI memory system that applies the ancient method of loci (memory palace) spatial metaphor to organize long-term memory for large language models; launched in April 2026, it accumulated over 47,000 GitHub stars in its first two weeks and claims state-of-the-art retrieval performance on the LongMemEval benchmark (96.6% Recall@5) without requiring any LLM inference at write time. Through independent codebase analysis, benchmark replication, and comparison with competing systems, we find that MemPalace's headline retrieval performance is attributable primarily to its verbatim storage philosophy combined with ChromaDB's default embedding model (all-MiniLM-L6-v2), rather than to its spatial organizational metaphor per se -- the palace hierarchy (Wings-Rooms-Closets-Drawers) operates as standard vector database metadata filtering, an effective but well-established technique. However, MemPalace makes several genuinely novel contributions: (1) a contrarian verbatim-first storage philosophy that challenges extraction-based competitors, (2) an extremely low wake-up cost (approximately 170 tokens) through its four-layer memory stack, (3) a fully deterministic, zero-LLM write path enabling offline operation at zero API cost, and (4) the first systematic application of spatial memory metaphors as an organizing principle for AI memory systems. We also note that the competitive landscape is evolving rapidly, with Mem0's April 2026 token-efficient algorithm raising their LongMemEval score from approximately 49% to 93.4%, narrowing the gap between extraction-based and verbatim approaches. Our analysis concludes that MemPalace represents significant architectural insight wrapped in overstated claims -- a pattern common in rapidly adopted open-source projects where marketing velocity exceeds scientific rigor.

62. 【2604.21276】Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition

Link: https://arxiv.org/abs/2604.21276

Authors: Srishti Ginjala, Eric Fosler-Lussier, Christopher W. Myers, Srinivasan Parthasarathy

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)

Keywords: critical question arises, text-derived priors make, models replace task-specific, large language models, language models replace

Comments:

Abstract:As pretrained large language models replace task-specific decoders in speech recognition, a critical question arises: do their text-derived priors make recognition fairer or more biased across demographic groups? We evaluate nine models spanning three architectural generations (CTC with no language model, encoder-decoder with an implicit LM, and LLM-based with an explicit pretrained decoder) on about 43,000 utterances across five demographic axes (ethnicity, accent, gender, age, first language) using Common Voice 24 and Meta's Fair-Speech, a controlled-prompt dataset that eliminates vocabulary confounds. On clean audio, three findings challenge assumptions: LLM decoders do not amplify racial bias (Granite-8B has the best ethnicity fairness, max/min WER = 2.28); Whisper exhibits pathological hallucination on Indian-accented speech with a non-monotonic insertion-rate spike to 9.62% at large-v3; and audio compression predicts accent fairness more than LLM scale. We then stress-test these findings under 12 acoustic degradation conditions (noise, reverberation, silence injection, chunk masking) across both datasets, totaling 216 inference runs. Severe degradation paradoxically compresses fairness gaps as all groups converge to high WER, but silence injection amplifies Whisper's accent bias up to 4.64x by triggering demographic-selective hallucination. Under masking, Whisper enters catastrophic repetition loops (86% of 51,797 insertions) while explicit-LLM decoders produce 38x fewer insertions with near-zero repetition; high-compression audio encoding (Q-former) reintroduces repetition pathology even in LLM decoders. These results suggest that audio encoder design, not LLM scaling, is the primary lever for equitable and robust speech recognition.
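
The abstract above reports fairness as a max/min WER ratio across demographic groups. A minimal sketch of that computation follows, using a standard word-level edit distance for WER. The transcripts and group names are invented for illustration, references are assumed to be non-empty, and the paper's text normalization and aggregation choices are not given in the abstract.

```python
from typing import Dict


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def max_min_ratio(wer_by_group: Dict[str, float]) -> float:
    """Fairness gap as worst-group WER over best-group WER (1.0 means parity)."""
    values = list(wer_by_group.values())
    return max(values) / min(values)


# Invented (reference, hypothesis) pairs for two hypothetical groups.
group_wer = {
    "group_a": wer("please call me back", "please call me back now"),
    "group_b": wer("please call me back", "please fall me black now"),
}
ratio = max_min_ratio(group_wer)
print(group_wer, ratio)
```

A ratio near 1.0 indicates similar error rates across groups. Note that the ratio is undefined when the best group has zero WER, so a real evaluation would need to handle that case.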

63. 【2604.21265】Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training

Link: https://arxiv.org/abs/2604.21265

Authors: Yoshinori Nomura

Categories: Computation and Language (cs.CL)

Keywords: accelerates language acquisition, language significantly accelerates, significantly accelerates language, significantly accelerates, language acquisition

Comments: 17 pages, 3 figures

Abstract:We show that pre-training a Transformer on music before language significantly accelerates language acquisition. Using piano performances (MAESTRO dataset), a developmental pipeline -- music $\to$ poetry $\to$ prose -- yields a $17.5\%$ perplexity improvement over random initialization ($p < 0.001$, 5 seeds), with music and poetry improving orthogonal model components (internal computation and embeddings, respectively). Convergence tests confirm that this is not a transient head start: at $d\!=\!64$, multi-seed validation (5 seeds) shows a persistent 5.5% gap at plateau ($p = 0.017$), with the pipeline converging faster and to a lower loss in every run. Real music matches the transfer ceiling of synthetic patterns with one-third the data, and scaling experiments reveal that optimal pre-training data volume shifts with model capacity ($-3\% \to +3\% \to +6\%$ advantage of larger datasets from $d\!=\!16$ to $d\!=\!64$). Across the scales we study ($d\!\in\!\{16,32,64\}$, up to ${\sim}400$K parameters), these results suggest a capacity-dependent data curation principle and indicate that structured human creative outputs can provide an efficient pre-training substrate for small language models; stronger conclusions at modern pre-training scale will require substantially larger experiments.

64. 【2604.21255】When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

Link: https://arxiv.org/abs/2604.21255

Authors: Chenghao Yang, Yuning Zhang, Zhoufutu Wen, Tao Gong, Jiaheng Liu, Qi Chu, Nenghai Yu

Categories: Computation and Language (cs.CL)

Keywords: progress of LLM, LLM agents, primary driver, rapid progress, Action Graph Similarity

Comments: Accepted by ACL 2026 Main Conference

Abstract:Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model's autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on $\tau$-Bench and $\tau^2$-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6% $S_{\text{node}}$ and 94.7% $S_{\text{dep}}$, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson $r$ = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at this https URL.

65. 【2604.21254】Hyperloop Transformers

Link: https://arxiv.org/abs/2604.21254

Authors: Abbas Zeitoun, Lucas Torroba-Hennigen, Yoon Kim

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: research generally aims, maximize model quality, model quality subject, architecture research generally, LLM architecture research

Comments:

Abstract:LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory footprint, thus motivating parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter-efficiency of LLMs. Our architecture makes use of looped Transformers as a core primitive, which reuse Transformer layers across depth and are thus more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks (begin, middle, and end), where each block itself consists of multiple Transformer layers, and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams. Hyper-connections are applied only after each loop, and therefore add minimal new parameters and compute cost. Across various model scales, we find that our Hyper-Connected Looped Transformer (Hyperloop Transformer) is able to outperform depth-matched Transformer and mHC Transformer baselines despite using approximately 50% fewer parameters. The outperformance persists through post-training weight quantization, thus positioning Hyperloop Transformers as an attractive architecture for memory-efficient language modeling.
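
The begin/middle/end layout in the abstract above can be sketched as a layer schedule in which only the middle block is repeated. The block sizes and loop count below are hypothetical values chosen for illustration, not the paper's configuration, and the sketch ignores the hyper-connections, embeddings, and any other parameters outside the Transformer layers.

```python
from typing import List


def layer_schedule(n_begin: int, n_middle: int, n_end: int, n_loops: int) -> List[str]:
    """Return the order in which layers are applied in one forward pass.

    Only the middle block is looped, so its layer names repeat in the
    schedule while its weights are stored once.
    """
    begin = [f"begin.{i}" for i in range(n_begin)]
    middle = [f"middle.{i}" for i in range(n_middle)]
    end = [f"end.{i}" for i in range(n_end)]
    return begin + middle * n_loops + end


# Hypothetical configuration: 2 begin, 4 middle (looped 3 times), 2 end layers.
schedule = layer_schedule(n_begin=2, n_middle=4, n_end=2, n_loops=3)
effective_depth = len(schedule)     # layers applied per forward pass
unique_layers = len(set(schedule))  # layers whose weights must be stored
savings = 1 - unique_layers / effective_depth
print(effective_depth, unique_layers, savings)
```

Under this made-up configuration the model applies 16 layers per forward pass but stores only 8, so a depth-matched ordinary Transformer would need twice as many layer parameters.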

66. 【2604.21253】Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation

Link: https://arxiv.org/abs/2604.21253

Authors: Hanwen Gu, Chao Guo, Junle Wang, Wenda Xie, Yisheng Lv

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: producing monotonous scripts, contextual logical consistency, existing methods struggle, smooth character development, global narrative coherence

Comments: Accepted to Findings of the Association for Computational Linguistics: ACL 2026

Abstract:While LLMs demonstrate remarkable fluency in narrative generation, existing methods struggle to maintain global narrative coherence, contextual logical consistency, and smooth character development, often producing monotonous scripts with structural fractures. To this end, we introduce PLOTTER, a framework that performs narrative planning on structural graph representations instead of the direct sequential text representations used in existing work. Specifically, PLOTTER executes the Evaluate-Plan-Revise cycle on the event graph and character graph. By diagnosing and repairing issues of the graph topology under rigorous logical constraints, the model optimizes the causality and narrative skeleton before complete context generation. Experiments demonstrate that PLOTTER significantly outperforms representative baselines across diverse narrative scenarios. These findings verify that planning narratives on structural graph representations, rather than directly on text, is crucial for enhancing the long-context reasoning of LLMs in complex narrative generation.

67. 【2604.21238】Unlocking the Power of Large Language Models for Multi-table Entity Matching

Link: https://arxiv.org/abs/2604.21238

Authors: Yingkai Tang, Taoyu Su, Wenyuan Zhang, Xiaoyang Guo, Tingwen Liu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: enabling simultaneous identification, Multi-table entity matching, addresses the limitations, unique identifiers, Multi-table entity

Comments: Accepted by NLPCC 2025

Abstract:Multi-table entity matching (MEM) addresses the limitations of dual-table approaches by enabling simultaneous identification of equivalent entities across multiple data sources without unique identifiers. However, existing methods relying on pre-trained language models struggle to handle semantic inconsistencies caused by numerical attribute variations. Inspired by the powerful language understanding capabilities of large language models (LLMs), we propose a novel LLM-based framework for multi-table entity matching, termed LLM4MEM. Specifically, we first propose a multi-style prompt-enhanced LLM attribute coordination module to address semantic inconsistencies. Then, to alleviate the matching efficiency problem caused by the surge in the number of entities brought by multiple data sources, we develop a transitive consensus embedding matching module to tackle entity embedding and pre-matching issues. Finally, to address the issue of noisy entities during the matching process, we introduce a density-aware pruning module to optimize the quality of multi-table entity matching. We conducted extensive experiments on 6 MEM datasets, and the results show that our model improves by an average of 5.1% in F1 compared with the baseline model. Our code is available at this https URL.

68. 【2604.21235】Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness

Link: https://arxiv.org/abs/2604.21235

Authors: Zihan Liang, Ziwen Pan, Ruoxuan Xiong

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Methodology (stat.ME)

Keywords: offering rich temporal, rich temporal information, offering rich, Multimodal clinical records, rich temporal

Comments: Findings of ACL 2026 (30 pages)

Abstract:Multimodal clinical records contain structured measurements and clinical notes recorded over time, offering rich temporal information about the evolution of patient health. Yet these observations are sparse, and whether they are recorded depends on the patient's latent condition. Observation patterns also differ across modalities, as structured measurements and clinical notes arise under distinct recording processes. While prior work has developed methods that accommodate missingness in clinical time series, how to extract and use the information carried by the observation process itself remains underexplored. We therefore propose a patient representation learning framework for multimodal clinical time series that explicitly leverages informative missingness. The framework combines (1) a multimodal encoder that captures signals from structured and textual data together with their observation patterns, (2) a Bayesian filtering module that updates a latent patient state over time from observed multimodal signals, and (3) downstream modules for offline treatment policy learning and patient outcome prediction based on the learned patient state. We evaluate the framework on ICU sepsis cohorts from MIMIC-III, MIMIC-IV, and eICU. It improves both offline treatment policy learning and adverse outcome prediction, achieving FQE 0.679 versus 0.528 for clinician behavior and AUROC 0.886 for post-72-hour mortality prediction on MIMIC-III.

69. 【2604.21229】EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval

Link: https://arxiv.org/abs/2604.21229

Authors: Julian Acuna

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language model, Large language, language model assistants, assistants are increasingly, increasingly expected

Comments: 9 pages, 2 figures, 3 tables

Abstract:Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama's cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.

70. 【2604.20996】AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

Link: https://arxiv.org/abs/2604.20996

Authors: Tadesse Destaw Belay, Shahriar Kabir Nahin, Israel Abebe Azime, Ocean Monjur, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam, Anshuman Chhabra

Categories: Computation and Language (cs.CL)

Keywords: lack sufficient training, lack sufficient, Direct Preference Optimization, language learning systems, sufficient training resources

Comments:

Abstract:How can language learning systems be developed for languages that lack sufficient training resources? This challenge is increasingly faced by developers across the African continent who aim to build AI systems capable of understanding and responding in local languages. To address this gap, we introduce AFRILANGDICT, a collection of 194.7K African language-English dictionary entries designed as seed resources for generating language-learning materials, enabling us to automatically construct large-scale, diverse, and verifiable student-tutor question-answer interactions suitable for training AI-assisted language tutors. Using AFRILANGDICT, we build AFRILANGEDU, a dataset of 78.9K multi-turn training examples for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Using AFRILANGEDU, we train language tutoring models collectively referred to as AFRILANGTUTOR. We fine-tune two multilingual LLMs, Llama-3-8B-IT and Gemma-3-12B-IT, on AFRILANGEDU across 10 African languages and evaluate their performance. Our results show that models trained on AFRILANGEDU consistently outperform their base counterparts, and combining SFT and DPO yields substantial improvements, with gains ranging from 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. To facilitate further research on low-resource languages, all resources are available at this https URL.

71. 【2604.20995】Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Link: https://arxiv.org/abs/2604.20995

Authors: Inderjeet Nair, Jie Ruan, Lu Wang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: tools remain limited, poorly understood phenomenon, current diagnostic tools, diagnostic tools remain, model behaves aligned

Comments: Under submission at COLM 2026. Won the Best Student Paper Award at MSLD 2026 @ UIUC

Abstract:Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in part because current diagnostic tools remain limited. Prior diagnostics rely on highly toxic and clearly harmful scenarios, causing most models to refuse immediately. As a result, models never deliberate over developer policy, monitoring conditions, or the consequences of non-compliance, making these diagnostics fundamentally unable to detect alignment faking propensity. To support study of this phenomenon, we first introduce VLAF, a diagnostic framework grounded in the hypothesis that alignment faking is most likely when developer policy conflicts with a model's strongly held values. VLAF uses morally unambiguous scenarios to probe this conflict across diverse moral values, bypassing refusal behavior while preserving meaningful deliberative stakes. Using VLAF, we find that alignment faking is substantially more prevalent than previously reported, occurring in models as small as 7B parameters, with olmo2-7b-instruct faking alignment in 37% of cases. We also show that oversight conditions induce activation shifts that lie along a single direction in representation space. This means the behavioral divergence driving alignment faking can be captured by a single contrastive steering vector, which we exploit for lightweight inference-time mitigation. Finally, we exploit this for mitigation that requires no labeled data and minimal computational overhead, achieving relative reductions in alignment faking of 85.8%, 94.0%, and 57.7% on olmo2-7b-instruct, olmo2-13b-instruct, and qwen3-8b respectively.
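
The "single contrastive steering vector" mentioned in the abstract above is commonly built as a difference of mean activations between two conditions. The sketch below shows that generic construction on toy two-dimensional vectors. The function names, the toy activations, the additive update, and the scale `alpha` are all assumptions for illustration; the abstract does not say which layer, sign, or scaling the authors use.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(monitored: List[Vector], unmonitored: List[Vector]) -> Vector:
    """Contrastive direction: mean(monitored) minus mean(unmonitored)."""
    mu_m = mean_vector(monitored)
    mu_u = mean_vector(unmonitored)
    return [a - b for a, b in zip(mu_m, mu_u)]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations standing in for hidden states under the two conditions.
monitored_acts = [[1.0, 2.0], [3.0, 2.0]]
unmonitored_acts = [[0.0, 1.0], [2.0, 1.0]]
direction = steering_vector(monitored_acts, unmonitored_acts)
steered = steer([0.0, 1.0], direction, alpha=1.0)
print(direction, steered)
```

Building the vector this way needs only prompts from the two conditions rather than labeled examples, which fits the abstract's claim of mitigation with no labeled data and minimal overhead.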

72. 【2604.20994】Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

Link: https://arxiv.org/abs/2604.20994

Authors: Yannis Belkhiter,Giulio Zizzo,Sergio Maffeis,Seshu Tirupathi,John D. Kelleher

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: calling Large Language, Large Language Models, Large Language, drawn significant attention, function calling Large

Comments:

Click to view abstract

Abstract:The growth of agentic AI has drawn significant attention to function calling Large Language Models (LLMs), which are designed to extend the capabilities of AI-powered systems by invoking external functions. Injection and jailbreaking attacks have been extensively explored to showcase the vulnerabilities of LLMs to user prompt manipulation. The expanded capabilities of agentic models introduce further vulnerabilities via their function calling interface. Recent work in LLM security showed that function calling can be abused, leading to data tampering and theft, causing disruptive behavior such as endless loops, or causing LLMs to produce harmful content in the style of jailbreaking attacks. This paper introduces a novel function hijacking attack (FHA) that manipulates the tool selection process of agentic models to force the invocation of a specific, attacker-chosen function. While existing attacks focus on semantic preference of the model for function-calling tasks, we show that FHA is largely agnostic to the context semantics and robust to the function sets, making it applicable across diverse domains. We further demonstrate that FHA can be trained to produce universal adversarial functions, enabling a single attacked function to hijack tool selection across multiple queries and payload configurations. We conducted experiments on 5 different models, including instructed and reasoning variants, reaching 70% to 100% ASR over the established BFCL dataset. Our findings further demonstrate the need for strong guardrails and security modules for agentic systems.

73. 【2604.20983】Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

Link: https://arxiv.org/abs/2604.20983

Authors: Syed Nazmus Sakib,Nafiul Haque,Shahrear Bin Amin,Hasan Muhammad Abdullah,Md. Mehedi Hasan,Mohammad Zabed Hossain,Shifat E. Arman

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Vision evaluations, Vision, multi-step processes, visual, Multimodal Large Language

Comments: Accepted at ACL 2026 Findings

Click to view abstract

Abstract:Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.

74. 【2604.20917】The Path Not Taken: Duality in Reasoning about Program Execution

Link: https://arxiv.org/abs/2604.20917

Authors: Eshgin Hasanov,Md Mahadi Hassan Sibat,Santu Karmaker,Aashish Yadavally

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)

Keywords: Large language models, shown remarkable capabilities, Large language, diverse coding tasks, shown remarkable

Comments: Accepted to ACL 2026 Main Conference

Click to view abstract

Abstract:Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their adoption requires a true understanding of program execution rather than relying on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program's observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model's causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.
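
The two complementary tasks named in the abstract can be illustrated on a toy program. This is only a sketch of the idea: the program `classify`, the helper `mutate_toward`, and the brute-force search are hypothetical and are not drawn from DexBench.

```python
def classify(n):
    """Toy program under test with two execution paths."""
    if n % 2 == 0:
        return "even"
    return "odd"


def mutate_toward(target, start):
    """Increment the input until the program shows the target behavior."""
    candidate = start
    while classify(candidate) != target:
        candidate += 1
    return candidate


# Task (i): predict the program's observed behavior for a given input.
observed = classify(7)

# Task (ii): infer how the input must change to reach a different behavior.
mutated = mutate_toward("even", 7)
```

A model that handles only task (i) can rely on surface patterns; task (ii) additionally requires knowing which condition controls the branch.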

75. 【2604.20915】Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

Link: https://arxiv.org/abs/2604.20915

Authors: Zhixin Zhang,Shabo Zhang,Chengcan Wu,Zeming Wei,Meng Sun

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE); Optimization and Control (math.OC)

Keywords: high computational cost, long streams prohibited, Transformers suffer, length for self-attention, high computational

Comments:

Click to view abstract

Abstract:Transformers suffer from a high computational cost of self-attention that grows with sequence length, making inference over long streams prohibitively memory-intensive. Constant-memory alternatives such as RNNs and SSMs compress history into states with fixed size and thus lose long-tail dependencies, while methods that memorize contexts into parameters, such as Test-Time Training (TTT), are prone to overfitting token-level projection and fail to preserve the causal effect of context in pretrained LLMs. We propose Absorber LLM, which formulates long-context retention as a self-supervised causal synchronization: after absorbing historical contexts into parameters, a contextless model should match the original model with full context on future generations. We optimize this objective by synchronizing internal behaviors of the updated model with the original one, ensuring context absorption and generalization. Experiments on long-context and streaming benchmarks show that Absorber LLM reduces inference memory and improves accuracy over prior parameter-as-memory baselines.

Information Retrieval

1. 【2604.21750】Multistakeholder Impacts of Profile Portability in a Recommender Ecosystem

Link: https://arxiv.org/abs/2604.21750

Authors: Anas Buhayh,Elizabeth McKinnie,Clement Canel,Robin Burke

Categories: Information Retrieval (cs.IR)

Keywords: Optimizing outcomes, developing multi-objective models, stakeholders in recommender, recommender systems, systems has historically

Comments: 34th ACM Conference on User Modeling, Adaptation and Personalization

Click to view abstract

Abstract:Optimizing outcomes for multiple stakeholders in recommender systems has historically focused on algorithmic interventions, such as developing multi-objective models or re-ranking results from existing algorithms. However, structural changes to the recommendation ecosystem itself remain understudied. This paper explores the implications of algorithmic pluralism (also known as "middleware" in the governance literature), in which recommendation algorithms are decoupled from platforms, enabling users to select their preferred algorithm. Prior simulation work demonstrates that algorithmic choice benefits niche consumers and providers. Yet this approach raises critical questions about user modeling in the context of data portability: when users switch algorithms, what happens to their data? Noting that multiple data portability regulations have emerged to strengthen user data ownership and control, we examine how such policies affect user models and stakeholders' outcomes in recommendation settings. Our findings reveal that data portability scenarios produce varying effects on user utility across different recommendation algorithms. We highlight key policy considerations and implications for designing equitable recommendation ecosystems.

2. 【2604.21748】StructMem: Structured Memory for Long-Horizon Behavior in LLMs

Link: https://arxiv.org/abs/2604.21748

Authors: Buqiang Xu,Yijun Chen,Jizhan Fang,Ruobin Zhong,Yunzhi Yao,Yuqi Zhu,Lun Du,Shumin Deng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: Long-term conversational agents, multi-hop question answering, Long-term conversational, support temporal reasoning, relationships between events

Comments: Accepted by ACL 2026 main conference

Click to view abstract

3. 【2604.21694】Efficient Logic Gate Networks for Video Copy Detection

Link: https://arxiv.org/abs/2604.21694

Authors: Katarzyna Fojcik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: diverse visual distortions, Video copy detection, detection requires robust, requires robust similarity, robust similarity estimation

Comments:

Click to view abstract

Abstract:Video copy detection requires robust similarity estimation under diverse visual distortions while operating at very large scale. Although deep neural networks achieve strong performance, their computational cost and descriptor size limit practical deployment in high-throughput systems. In this work, we propose a video copy detection framework based on differentiable Logic Gate Networks (LGNs), which replace conventional floating-point feature extractors with compact, logic-based representations. Our approach combines aggressive frame miniaturization, binary preprocessing, and a trainable LGN embedding model that learns both logical operations and interconnections. After training, the model can be discretized into a purely Boolean circuit, enabling extremely fast and memory-efficient inference. We systematically evaluate different similarity strategies, binarization schemes, and LGN architectures across multiple dataset folds and difficulty levels. Experimental results demonstrate that LGN-based models achieve competitive or superior accuracy and ranking performance compared to prior models, while producing descriptors several orders of magnitude smaller and delivering inference speeds exceeding 11k samples per second. These findings indicate that logic-based models offer a promising alternative for scalable and resource-efficient video copy detection.

4. 【2604.21675】Counterfactual Multi-task Learning for Delayed Conversion Modeling in E-commerce Sales Pre-Promotion

Link: https://arxiv.org/abs/2604.21675

Authors: Xin Song,Kaiyuan Li,Jinxin Hu

Categories: Information Retrieval (cs.IR)

Keywords: e-commerce marketing strategies, modern e-commerce marketing, Sales promotions, stimulate product purchases, play a pivotal

Comments: 6 pages, accepted by 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'26)

Click to view abstract

Abstract:Sales promotions, as short-term incentives to stimulate product purchases, play a pivotal role in modern e-commerce marketing strategies. During promotional events, user behavior patterns exhibit distinct characteristics compared to regular periods. In the pre-promotion phase, users typically engage in product search and browsing without immediate purchases, adding items to carts in anticipation of promotional discounts. This behavior leads to delayed conversions, resulting in significantly lower conversion rates (CVR) before the promotion day. Although existing research has made progress in CVR prediction for promotion days using historical data, it largely overlooks the critical pre-promotion period. Although delayed feedback modeling has been extensively studied, current approaches fail to account for the unique distribution shifts in conversion behavior before promotional events, where delayed conversions predominantly occur on the promotion day rather than over continuous time windows. To address these limitations, we propose the Counterfactual Multi-task Delayed Conversion Model (CM-DCM), which leverages historical pre-promotion data to enhance CVR prediction for both delayed and direct conversions. Our model incorporates three key innovations: (i) A multi-task architecture that jointly models direct and delayed conversions using historical pre-promotion data; (ii) A personalized user behavior gating module to mitigate data sparsity issues during brief pre-promotion periods; (iii) A counterfactual causal approach to model the transition probability from add-to-cart (ATC) to delayed conversion. Extensive experiments demonstrate that CM-DCM outperforms baselines in pre-promotion scenarios. Online A/B tests during major promotional events showed significant improvements in advertising revenue, delayed conversion GMV, and overall GMV, validating the effectiveness of our approach.

5. 【2604.21536】Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation

Link: https://arxiv.org/abs/2604.21536

Authors: Nikita Severin,Danil Kartushov,Vladislav Urzhumov,Vladislav Kulikov,Oksana Konovalova,Alexey Grishanov,Anton Klenitskiy,Artem Fatkulin,Alexey Vasilev,Andrey Savchenko,Ilya Makarov

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: achieved significant success, modeling temporal user, capturing rich user, temporal user behavior, rich user semantics

Comments: Accepted to ECIR 2026. 7 pages. This version of the contribution has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: [this http URL](http://dx.doi.org/10.1007/978-3-032-21300-6_42)

Click to view abstract

Abstract:Sequential recommender systems have achieved significant success in modeling temporal user behavior but remain limited in capturing rich user semantics beyond interaction patterns. Large Language Models (LLMs) present opportunities to enhance user understanding with their reasoning capabilities, yet existing integration approaches create prohibitive inference costs in real time. To address these limitations, we present a novel knowledge distillation method that distills textual user profiles generated by pre-trained LLMs into sequential recommenders without requiring LLM inference at serving time. The resulting approach maintains the inference efficiency of traditional sequential models while requiring neither architectural modifications nor LLM fine-tuning.

6. 【2604.21511】From Tokens to Concepts: Leveraging SAE for SPLADE

Link: https://arxiv.org/abs/2604.21511

Authors: Yuxuan Zong,Mathias Vast,Basile Van Cooten,Laure Soulier,Benjamin Piwowarski

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: excellent efficiency-effectiveness tradeoff, offer an excellent, efficiency-effectiveness tradeoff, excellent efficiency-effectiveness, Learned Sparse

Comments: 11 pages, 3 figures, 9 tables. To appear at SIGIR 2025

Click to view abstract

7. 【2604.21305】WPGRec: Wavelet Packet Guided Graph Enhanced Sequential Recommendation

Link: https://arxiv.org/abs/2604.21305

Authors: Peilin Liu,Zhiquan Ji,Gang Yan

Categories: Information Retrieval (cs.IR)

Keywords: model users' evolving, users' evolving interests, localized behavioral fluctuations, non-stationary interaction streams, Sequential recommendation aims

Comments: Accepted to SIGIR 2026, 8 pages, 3 figures

Click to view abstract

Abstract:Sequential recommendation aims to model users' evolving interests from noisy and non-stationary interaction streams, where long-term preferences, short-term intents, and localized behavioral fluctuations may coexist across temporal scales. Existing frequency-domain methods mainly rely on either global spectral operations or filter-based wavelet processing. However, global spectral operations tend to entangle local transients with long-range dependencies, while filter-based wavelet pipelines may suffer from temporal misalignment and boundary artifacts during multi-scale decomposition and reconstruction. Moreover, collaborative signals from the user-item interaction graph are often injected through scale-inconsistent auxiliary modules, limiting the benefit of jointly modeling temporal dynamics and structural dependencies. To address these issues, we propose Wavelet Packet Guided Graph Enhanced Sequential Recommendation (WPGRec), a unified time-frequency and graph-enhanced framework that aligns multi-resolution temporal modeling with graph propagation at matching scales. WPGRec first applies a full-tree undecimated stationary wavelet packet transform to generate equal-length, shift-invariant subband sequences. It then performs subband-wise interaction-graph propagation to inject high-order collaborative information while preserving temporal alignment across resolutions. Finally, an energy- and spectral-flatness-aware gated fusion module adaptively aggregates informative subbands and suppresses noise-like components. Extensive experiments on four public benchmarks show that WPGRec consistently outperforms sequential and graph-based baselines, with particularly clear gains on sparse and behaviorally complex datasets, highlighting the effectiveness of band-consistent structure injection and adaptive subband fusion for sequential recommendation.
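
The fusion module described above is said to be aware of spectral flatness, but the abstract gives no formula. As a point of reference, the sketch below uses the standard signal-processing definition (geometric mean of the power spectrum divided by its arithmetic mean). How WPGRec actually computes or uses the quantity is not stated, so the function and the toy spectra are assumptions.

```python
import math


def spectral_flatness(power):
    """Geometric mean over arithmetic mean of a strictly positive power spectrum."""
    if not power or any(p <= 0 for p in power):
        raise ValueError("power values must be positive")
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))


# A flat spectrum is noise-like; a spectrum with one dominant bin is not.
flat = spectral_flatness([1.0, 1.0, 1.0, 1.0])
peaky = spectral_flatness([100.0, 1.0, 1.0, 1.0])
```

Values near 1 indicate a noise-like subband, and values near 0 indicate energy concentrated in a few components, which is the kind of signal a gate could use to suppress noise-like subbands.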

8. 【2604.21304】PAPERMIND: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

Link: https://arxiv.org/abs/2604.21304

Authors: Yanjun Zhao,Tianxin Wei,Jiaru Zou,Xuying Ning,Yuanchen Bei,Lingjie Chen,Simmi Rana,Wendy H. Yang,Hanghang Tong,Jingrui He

Categories: Information Retrieval (cs.IR)

Keywords: answering isolated questions, summarizing content, scientific, questions or summarizing, scientific papers requires

Comments:

Click to view abstract

Abstract:Understanding scientific papers requires more than answering isolated questions or summarizing content. It involves an integrated reasoning process that grounds textual and visual information, interprets experimental evidence, synthesizes information across sources, and critically evaluates scientific claims. However, existing benchmarks typically assess these abilities in isolation, making it difficult to evaluate scientific paper understanding as a unified set of interacting cognitive abilities. In this work, we introduce PAPERMIND, a benchmark designed to evaluate integrated and agent-oriented scientific reasoning over research papers. PAPERMIND is constructed from real scientific papers across seven domains, including agriculture, biology, chemistry, computer science, medicine, physics, and economics. It comprises four complementary task families that collectively operationalize distinct cognitive facets of scientific paper reasoning, including multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across multiple tasks, PAPERMIND enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Extensive experiments on both open-source and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique. Our benchmark and dataset are available at this http URL.

9. 【2604.21300】Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI

Link: https://arxiv.org/abs/2604.21300

Authors: Hieu Man,Van-Cuong Pham,Nghia Trung Ngo,Franck Dernoncourt,Thien Huu Nguyen

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Variational Autoencoder, Authorship Variational Autoencoder, Explainable Authorship Variational, Learning robust representations, EAVAE

Comments:

Click to view abstract

10. 【2604.21284】Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture

Link: https://arxiv.org/abs/2604.21284

Authors: Robin Dey,Panyanon Viradecha

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: large language models, requiring any LLM, LLM inference, organize long-term memory, method of loci

Comments: 20 pages, 10 tables. Code and data at [this https URL](https://github.com/web3guru888/mempalace-scientific-analysis)

Click to view abstract

11. 【2604.21238】Unlocking the Power of Large Language Models for Multi-table Entity Matching

Link: https://arxiv.org/abs/2604.21238

Authors: Yingkai Tang,Taoyu Su,Wenyuan Zhang,Xiaoyang Guo,Tingwen Liu

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: enabling simultaneous identification, Multi-table entity matching, addresses the limitations, unique identifiers, Multi-table entity

Comments: Accepted by NLPCC 2025

Click to view abstract

12. 【2604.21019】Following the Eye-Tracking Evidence: Established Web-Search Assumptions Fail in Carousel Interfaces

Link: https://arxiv.org/abs/2604.21019

Authors: Jingwei Kang,Maarten de Rijke,Harrie Oosterhuis

Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

Keywords: streaming media services, Carousel interfaces, interfaces, Carousel, de-facto standard

Comments:

Click to view abstract

Abstract:Carousel interfaces have been the de-facto standard for streaming media services for over a decade. Yet, there has been very little research into user behavior with such interfaces, which thus remains poorly understood. Due to this lack of empirical research, previous work has assumed that behaviors established in single-list web-search interfaces, such as the F-pattern and the examination hypothesis, also apply to carousel interfaces, for instance when designing click models or evaluation metrics. We analyze a recently-released interaction and examination dataset resulting from an eye-tracking study performed on carousel interfaces to verify whether these assumptions actually hold. We find that (i) the F-pattern holds only for vertical examination and not for horizontal swiping; additionally, we discover that, when conditioned on a click, user examination follows an L-pattern unique to carousel interfaces; (ii) click-through-rates conditioned on examination indicate that the well-known examination hypothesis does not hold in carousel interfaces; and (iii) contrary to the assumptions of previous work, users generally ignore carousel headings and focus directly on the content items. Our findings show that many user behavior assumptions, especially concerning examination patterns, do not transfer from web search interfaces to carousel recommendation settings. Our work shows that the field lacks a reliable foundation on which to build models of user behavior with these interfaces. Consequently, a re-evaluation of existing metrics and click models for carousel interfaces may be warranted.

Computer Vision

1. 【2604.21931】Seeing Fast and Slow: Learning the Flow of Time in Videos

Link: https://arxiv.org/abs/2604.21931

Authors: Yen-Siang Wu,Rundong Luo,Jingsen Zhu,Tao Tu,Ali Farhadi,Matthew Wallingford,Yu-Chiang Frank Wang,Steve Marschner,Wei-Chiu Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)

Keywords: temporal, videos, time, video, Abstract

Comments: Project page: [this https URL](https://seeing-fast-and-slow.github.io/)

Click to view abstract

Abstract:How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.

2. 【2604.21926】Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

Link: https://arxiv.org/abs/2604.21926

Authors: Hao-Yu Hsu,Tianhang Cheng,Jing Wen,Alexander G. Schwing,Shenlong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: surrounding environments typically, environments typically relies, cameras pose persistent, pose persistent challenges, energy efficiency

Comments: Project page: [this https URL](https://tianhang-cheng.github.io/IMU4D)

Click to view abstract

Abstract:Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.

3. 【2604.21921】Context Unrolling in Omni Models

Link: https://arxiv.org/abs/2604.21921

Authors: Ceyuan Yang,Zhijie Lin,Yang Zhao,Fei Xiao,Hao He,Qi Zhao,Chaorui Deng,Kunchang Li,Zihan Ding,Yuwei Guo,Fuyun Wang,Fangqi Zhu,Xiaonan Nie,Shenhan Zhu,Shanchuan Lin,Hongsheng Li,Weilin Huang,Guang Shi,Haoqi Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: model natively trained, enables Context Unrolling, natively trained, trained on diverse, Context Unrolling

Comments: Report

Click to view abstract

Abstract:We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.

4. 【2604.21915】Vista4D: Video Reshooting with 4D Point Clouds

Link: https://arxiv.org/abs/2604.21915

Authors: Kuan Heng Lin,Zhizheng Liu,Pablo Salamanca,Yash Kant,Ryan Burgert,Yuancheng Xu,Koichi Namekata,Yiwei Zhao,Bolei Zhou,Micah Goldblum,Paul Debevec,Ning Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: flexible video reshooting, video reshooting framework, robust and flexible, framework that grounds, input video

Comments: 24 pages, 20 figures, CVPR 2026, see project page at [this https URL](https://eyeline-labs.github.io/Vista4D)

Click to view abstract

Abstract:We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and failing to maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quality compared to state-of-the-art baselines under a variety of videos and camera paths. Moreover, our method generalizes to real-world applications such as dynamic scene expansion and 4D scene recomposition. See our project page for results, code, and models: https://eyeline-labs.github.io/Vista4D

5. 【2604.21911】When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Link: https://arxiv.org/abs/2604.21911

Authors: Pegah Khayatan,Jayneel Parekh,Arnaud Dapogny,Mustafa Shukor,Alasdair Newson,Matthieu Cord

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: systems remain vulnerable, large vision-language models, impressive progress, progress in capabilities, capabilities of large

Comments:

Click to view abstract

6. 【2604.21909】Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

Link: https://arxiv.org/abs/2604.21909

Authors: Leyla Roksan Caglar,Pedro A.M. Mediano,Baihan Lin

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Neurons and Cognition (q-bio.NC)

Keywords: reach similar classification, similar classification accuracy, kinds of mistakes, modern vision models, reach similar

Comments:

Click to view abstract

Abstract:Humans and modern vision models can reach similar classification accuracy while making systematically different kinds of mistakes - differing not in how often they err, but in who gets mistaken for whom, and in which direction. We show that these directional confusions reveal distinct inductive biases that are invisible to accuracy alone. Using matched human and deep vision model responses on a natural-image categorization task under 12 perturbation types, we quantify asymmetry in confusion matrices and link it to generalization geometry through a Rate-Distortion (RD) framework, summarized by three geometric signatures: slope (beta), curvature (kappa), and efficiency (AUC). We find that humans exhibit broad but weak asymmetries, whereas deep vision models show sparser, stronger directional collapses. Robustness training reduces global asymmetry but fails to recover the human-like breadth-strength profile of graded similarity. Mechanistic simulations further show that different asymmetry organizations shift the RD frontier in opposite directions, even when matched for performance. Together, these results position directional confusions and RD geometry as compact, interpretable signatures of inductive bias under distribution shift.

7. 【2604.21904】UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

Link: https://arxiv.org/abs/2604.21904

Authors: Yanran Zhang, Wenzhao Zheng, Yifei Li, Bingyao Yu, Yu Zheng, Lei Chen, Jiwen Lu, Jie Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generated image detection, image detection, recent years, significant progress, image generation

Comments: Accepted to CVPR 2026

Abstract: In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance. Code: this https URL.

8. 【2604.21879】Addressing Image Authenticity When Cameras Use Generative AI

Link: https://arxiv.org/abs/2604.21879

Authors: Umar Masud, Abhijith Punnappurath, Luxi Zhao, David B. Lindell, Michael S. Brown

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: images shared online, image, photorealistically alter camera, methods to photorealistically, shared online

Comments: To appear in CVPR 2026 Workshop on Authenticity and Provenance in the Age of Generative AI

Abstract:The ability of generative AI (GenAI) methods to photorealistically alter camera images has raised awareness about the authenticity of images shared online. Interestingly, images captured directly by our cameras are considered authentic and faithful. However, with the increasing integration of deep-learning modules into cameras' capture-time hardware -- namely, the image signal processor (ISP) -- there is now a potential for hallucinated content in images directly output by our cameras. Hallucinated capture-time image content is typically benign, such as enhanced edges or texture, but in certain operations, such as AI-based digital zoom or low-light image enhancement, hallucinations can potentially alter the semantics and interpretation of the image content. As a result, users may not realize that the content in their camera images is not authentic. This paper addresses this issue by enabling users to recover the 'unhallucinated' version of the camera image to avoid misinterpretation of the image content. Our approach works by optimizing an image-specific multi-layer perceptron (MLP) decoder together with a modality-specific encoder so that, given the camera image, we can recover the image before hallucinated content was added. The encoder and MLP are self-contained and can be applied post-capture to the image without requiring access to the camera ISP. Moreover, the encoder and MLP decoder require only 180 KB of storage and can be readily saved as metadata within standard image formats such as JPEG and HEIC.

9. 【2604.21873】Grounding Video Reasoning in Physical Signals

Link: https://arxiv.org/abs/2604.21873

Authors: Alibay Osmanli, Zixu Cheng, Shaogang Gong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Physical video understanding, video understanding requires, Physical video, event correctly, understanding requires

Comments: Benchmark for Grounding Video Reasoning in Physical Signals

Abstract: Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what--when--where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in weak original cases, and spatial grounding is the weakest across settings. These results suggest that video QA reasoning benchmarks should report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.

10. 【2604.21814】Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

Link: https://arxiv.org/abs/2604.21814

Authors: Bowen Liu, Li Yang, Shanshan Song, Mingyu Tang, Zhifang Gao, Qifeng Chen, Yangqiu Song, Huimin Chen, Xiaomeng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enables non-invasive gastrointestinal, leaving video-level analysis, video-level analysis underexplored, remains largely limited, Capsule endoscopy

Comments:

Abstract: Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.

11. 【2604.21810】Multiscale Super Resolution without Image Priors

Link: https://arxiv.org/abs/2604.21810

Authors: Daniel Fu, Gabby Litterio, Pedro Felzenszwalb, Rashid Zia

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: address the ambiguities, super-resolution problem, pixel sizes, super-resolution, pixel

Comments:

Abstract:We address the ambiguities in the super-resolution problem under translation. We demonstrate that combinations of low-resolution images at different scales can be used to make the super-resolution problem well posed. Such differences in scale can be achieved using sensors with different pixel sizes (as demonstrated here) or by varying the effective pixel size through changes in optical magnification (e.g., using a zoom lens). We show that images acquired with pairwise coprime pixel sizes lead to a system with a stable inverse, and furthermore, that super-resolution images can be reconstructed efficiently using Fourier domain techniques or iterative least squares methods. Our mathematical analysis provides an expression for the expected error of the least squares reconstruction for large signals assuming i.i.d. noise that elucidates the noise-resolution tradeoff. These results are validated through both one- and two-dimensional experiments that leverage charge-coupled device (CCD) hardware binning to explore reconstructions over a large range of effective pixel sizes. Finally, two-dimensional reconstructions for a series of targets are used to demonstrate the advantages of multiscale super-resolution, and implications of these results for common imaging systems are discussed.
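
The claim that coprime pixel sizes make the inverse problem well posed can be checked in a toy 1D setting. The sketch below is an assumption-laden illustration, not the authors' setup: it models a sensor with pixel size `p` as a circular box sum over `p` consecutive high-resolution samples, observed at every integer shift, and the helper names `box_blur_matrix` and `stacked_system` are hypothetical.

```python
import numpy as np

def box_blur_matrix(n, pixel):
    """Circulant matrix: row s sums `pixel` consecutive samples starting at shift s."""
    a = np.zeros((n, n))
    for s in range(n):
        a[s, (s + np.arange(pixel)) % n] = 1.0
    return a

def stacked_system(n, pixel_sizes):
    """Stack the measurement matrices of several pixel sizes into one linear system."""
    return np.vstack([box_blur_matrix(n, p) for p in pixel_sizes])

n = 12
# One pixel size: some frequencies are annihilated by the box filter,
# so translations alone cannot resolve the signal.
rank_single = np.linalg.matrix_rank(stacked_system(n, [3]))
# Two coprime pixel sizes: the annihilated frequencies do not overlap.
rank_coprime = np.linalg.matrix_rank(stacked_system(n, [3, 4]))

# Noise-free least squares reconstruction from the coprime pair.
rng = np.random.default_rng(0)
signal = rng.normal(size=n)
system = stacked_system(n, [3, 4])
recovered, *_ = np.linalg.lstsq(system, system @ signal, rcond=None)
```

In this toy model the single-size system is rank-deficient while the coprime pair has full column rank, so least squares recovers the signal exactly when there is no noise. A non-coprime pair such as 2 and 4 still shares an annihilated frequency and stays rank-deficient.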

12. 【2604.21806】TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

Link: https://arxiv.org/abs/2604.21806

Authors: Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Yongqi Li, Liqiang Nie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Composed Image Retrieval, Composed Image, important image retrieval, image retrieval paradigm, Insufficient Entity Coverage

Comments: Accepted by ACL 2026

Abstract: Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. In order to address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our code and the constructed multi-modification datasets (M-FashionIQ and M-CIRR) are available at this https URL.

13. 【2604.21801】SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery

Link: https://arxiv.org/abs/2604.21801

Authors: Safouane El Ghazouali, Nicola Venturi, Michael Rueegsegger, Umberto Michelucci

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent advances, sensing rely heavily, large annotated datasets, tasks remains costly, acquiring high-quality ground

Comments:

Abstract:Recent advances in deep learning for remote sensing rely heavily on large annotated datasets, yet acquiring high-quality ground truth for geometric, radiometric, and multi-domain tasks remains costly and often infeasible. In particular, the lack of accurate depth annotations, controlled illumination variations, and multi-scale paired imagery limits progress in monocular depth estimation, domain adaptation, and super-resolution for aerial scenes. We present SyMTRS, a large-scale synthetic dataset generated using a high-fidelity urban simulation pipeline. The dataset provides high-resolution RGB aerial imagery (2048 x 2048), pixel-perfect depth maps, night-time counterparts for domain adaptation, and aligned low-resolution variants for super-resolution at x2, x4, and x8 scales. Unlike existing remote sensing datasets that focus on a single task or modality, SyMTRS is designed as a unified multi-task benchmark enabling joint research in geometric understanding, cross-domain robustness, and resolution enhancement. We describe the dataset generation process, its statistical properties, and its positioning relative to existing benchmarks. SyMTRS aims to bridge critical gaps in remote sensing research by enabling controlled experiments with perfect geometric ground truth and consistent multi-domain supervision. The results obtained in this work can be reproduced from this Github repository: this https URL.

14. 【2604.21786】From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

Link: https://arxiv.org/abs/2604.21786

Authors: Katharina Prasse, Steffen Jung, Isaac Bravo, Stefanie Walter, Patrick Knab, Christian Bartelt, Margret Keuper

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: communication strategies mobilise, strategies mobilise public, mobilise public concern, Social media platforms, Social media

Comments:

Abstract: Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population-level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation-dimension-specific prompt design improves performance. We release tweet IDs and labels along with our code at this https URL.

15. 【2604.21776】Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

Link: https://arxiv.org/abs/2604.21776

Authors: Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: paired multi-view data, Precise camera control, Precise camera, severe scarcity, scarcity of paired

Comments:

Abstract:Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structures by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints from the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer utilizes a 4D point-cloud derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.
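
The "smooth random-walk crop trajectories" described in this abstract can be pictured with a rough sketch. Everything below is an assumption for illustration: the abstract does not give the step size, smoothing method, or crop geometry, and the names `random_walk_crops` and `crop_video` are hypothetical.

```python
import numpy as np

def random_walk_crops(num_frames, frame_hw, crop_hw, step=4.0, smooth=9, seed=None):
    """Top-left corners (row, col) of a crop window that drifts smoothly over time."""
    rng = np.random.default_rng(seed)
    max_off = np.array([frame_hw[0] - crop_hw[0], frame_hw[1] - crop_hw[1]], dtype=float)
    # Per-frame random steps, smoothed with a moving average (assumed smoothing choice).
    steps = rng.normal(0.0, step, size=(num_frames, 2))
    width = min(smooth, num_frames)
    kernel = np.ones(width) / width
    steps = np.stack(
        [np.convolve(steps[:, k], kernel, mode="same") for k in range(2)], axis=1
    )
    # Integrate the steps from a random start and keep the window inside the frame.
    start = rng.uniform(0.0, 1.0, size=2) * max_off
    offsets = np.clip(start + np.cumsum(steps, axis=0), 0.0, max_off)
    return np.round(offsets).astype(int)

def crop_video(video, offsets, crop_hw):
    """Apply per-frame crops to a (T, H, W, C) array."""
    ch, cw = crop_hw
    return np.stack([f[r:r + ch, c:c + cw] for f, (r, c) in zip(video, offsets)])
```

Calling `random_walk_crops` twice with different seeds on the same clip yields two independently drifting windows, which could play the roles of the source and target views in a pseudo multi-view pair.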

16. 【2604.21772】Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation

Link: https://arxiv.org/abs/2604.21772

Authors: Yingkai Yang, Chaoqi Chen, Hui Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Continual Test-Time Adaptation, Open-set Continual Test-Time, mitigate distributional shifts, term Open-set Continual, Test-Time Adaptation

Comments: Accepted to CVPR 2026

Abstract:Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where models face both continually changing domains and the simultaneous emergence of unknown semantic classes, a challenging setting we term Open-set Continual Test-Time Adaptation (OCTTA). The coupling of domain and semantic shifts often collapses the feature space, severely degrading both classification and out-of-distribution detection. To tackle this, we propose DOmain COmpensation (DOCO), a lightweight and effective framework that robustly performs domain adaptation and OOD detection in a synergistic, closed loop. DOCO first performs dynamic, adaptation-conditioned sample splitting to separate likely ID from OOD samples. Then, using only the ID samples, it learns a domain compensation prompt by aligning feature statistics with the source domain, guided by a structural preservation regularizer that prevents semantic distortion. This learned prompt is then propagated to the OOD samples within the same batch, effectively isolating their semantic novelty for more reliable detection. Extensive experiments on multiple challenging benchmarks demonstrate that DOCO outperforms prior CTTA and OSTTA methods, establishing a new state-of-the-art for the demanding OCTTA setting.

17. 【2604.21760】Interpretable facial dynamics as behavioral and perceptual traces of deepfakes

Link: https://arxiv.org/abs/2604.21760

Authors: Timothy Joseph Murphy, Jennifer Cook, Hélio Clemente José Cuve

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: strong benchmark performance, deep learning approaches, offer limited insight, benchmark performance, manipulated facial behavior

Comments: Main paper: 19 pages, 5 figures, 4 tables. SI Appendix: 11 pages, 3 figures, 6 tables

Abstract:Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial behavior. This study presents an interpretable alternative grounded in bio-behavioral features of facial dynamics and evaluates how computational detection strategies relate to human perceptual judgments. We identify core low-dimensional patterns of facial movement, from which temporal features characterizing spatiotemporal structure were derived. Traditional machine learning classifiers trained on these features achieved modest but significant above-chance deepfake classification, driven by higher-order temporal irregularities that were more pronounced in manipulated than real facial dynamics. Notably, detection was substantially more accurate for videos containing emotive expressions than those without. An emotional valence classification analysis further indicated that emotive signals are systematically degraded in deepfakes, explaining the differential impact of emotive dynamics on detection. Furthermore, we provide an additional and often overlooked dimension of explainability by assessing the relationship between model decisions and human perceptual detection. Model and human judgments converged for emotive but diverged for non-emotive videos, and even where outputs aligned, underlying detection strategies differed. These findings demonstrate that face-swapped deepfakes carry a measurable behavioral fingerprint, most salient during emotional expression. Additionally, model-human comparisons suggest that interpretable computational features and human perception may offer complementary rather than redundant routes to detection.

18. 【2604.21743】Bridging the Training-Deployment Gap: Gated Encoding and Multi-Scale Refinement for Efficient Quantization-Aware Image Enhancement

Link: https://arxiv.org/abs/2604.21743

Authors: Dat To-Thanh, Nghia Nguyen-Trong, Hoang Vo, Hieu Bui-Minh, Tinh-Anh Nguyen-Nhu

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: fast processing speeds, processing speeds required, balance high output, Image enhancement models, image enhancement model

Comments: 10 pages, 3 figures. Accepted at the Mobile AI (MAI) 2026 Workshop at CVPR 2026

Abstract:Image enhancement models for mobile devices often struggle to balance high output quality with the fast processing speeds required by mobile hardware. While recent deep learning models can enhance low-quality mobile photos into high-quality images, their performance is often degraded when converted to lower-precision formats for actual use on mobile phones. To address this training-deployment mismatch, we propose an efficient image enhancement model designed specifically for mobile deployment. Our approach uses a hierarchical network architecture with gated encoder blocks and multiscale refinement to preserve fine-grained visual features. Moreover, we incorporate Quantization-Aware Training (QAT) to simulate the effects of low-precision representation during the training process. This allows the network to adapt and prevents the typical drop in quality seen with standard post-training quantization (PTQ). Experimental results demonstrate that the proposed method produces high-fidelity visual output while maintaining the low computational overhead needed for practical use on standard mobile devices. The code will be available at this https URL.
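
Simulating "the effects of low-precision representation during the training process" usually comes down to a quantize-then-dequantize step in the forward pass. The sketch below shows a generic symmetric version of that step; it is an assumption for illustration, since the abstract does not specify the quantization scheme, and `fake_quantize` is a hypothetical helper name.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Symmetric quantize-dequantize: snap values to a signed integer grid, return floats."""
    x = np.asarray(x, dtype=np.float64)
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for 8 bits
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax)  # integer codes, stored as floats
    return q * scale

# Hypothetical tiny layer: compare full-precision and fake-quantized weights.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8))
inputs = rng.normal(size=(8,))
full_precision = weights @ inputs
low_precision = fake_quantize(weights, num_bits=8) @ inputs
max_output_gap = float(np.max(np.abs(full_precision - low_precision)))
```

During quantization-aware training, a step like this is typically applied to weights and activations in the forward pass while gradients are passed through the rounding unchanged (a straight-through estimator), so the network learns parameters that tolerate the rounding error.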

19. 【2604.21728】Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

Link: https://arxiv.org/abs/2604.21728

Authors: Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Pretrained vision-language models, CLIP exhibit strong, Pretrained vision-language, CLIP exhibit, exhibit strong zero-shot

Comments: Accepted by CVPR 2026 (Findings Track)

Abstract:Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at this https URL .
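
The embedding-gradient cache described in this abstract can be sketched as a small data structure. This is an illustrative guess, not the authors' implementation: here "domain consistency" is approximated by cosine similarity between embeddings and "prediction balance" by taking an equal quota per predicted class, and the class name `EmbeddingGradientCache` and its methods are hypothetical.

```python
import numpy as np

class EmbeddingGradientCache:
    """Stores one embedding, one gradient, and one predicted label per past test sample."""

    def __init__(self):
        self.embeddings, self.gradients, self.labels = [], [], []

    def add(self, embedding, gradient, predicted_label):
        e = np.asarray(embedding, dtype=float)
        self.embeddings.append(e / np.linalg.norm(e))  # unit norm for cosine similarity
        self.gradients.append(np.asarray(gradient, dtype=float))
        self.labels.append(int(predicted_label))

    def aggregate(self, query, per_class=2):
        """Average cached gradients of the samples selected for this query."""
        q = np.asarray(query, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.embeddings) @ q
        labels = np.array(self.labels)
        chosen = []
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            # Assumed selection rule: the most similar samples within each predicted class.
            chosen.extend(idx[np.argsort(-sims[idx])[:per_class]])
        update = np.mean([self.gradients[i] for i in chosen], axis=0)
        return update, sorted(int(i) for i in chosen)
```

Because the gradients are stored when each sample is first seen, building the update for a new query only needs a similarity lookup and an average, with no extra forward or backward pass over the retrieved samples.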

20. 【2604.21718】Building a Precise Video Language with Human-AI Oversight

Link: https://arxiv.org/abs/2604.21718

Authors: Zhiqiu Lin, Chancharik Mitra, Siyuan Cen, Isaac Li, Yuhan Huang, Yu Tong Tiffany Ling, Hewei Wang, Irene Pi, Shihang Zhu, Ryan Rao, George Liu, Jiaxi Li, Ruojin Li, Yili Han, Yilun Du, Deva Ramanan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: dynamic visual world, Video-language models, learn to reason, natural language, world through natural

Comments: CVPR 2026 Highlight. Project page: [this https URL](https://linzhiqiu.github.io/papers/chai/)

21. 【2604.21713】Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

Link: https://arxiv.org/abs/2604.21713

Authors: Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun, Hao Chen, Chunhua Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made rapid progress, recently made rapid, Feed-forward visual geometry, visual geometry estimation, rapid progress

Comments: Accepted to CVPR 2026. GitHub Page: [this https URL](https://github.com/aim-uofa/CARVE)

Abstract:Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.

22. 【2604.21712】Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery

Link: https://arxiv.org/abs/2604.21712

Authors: Yang Liu, Zhiyong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: monocular RGB images, RGB images aims, estimate anatomically plausible, monocular RGB, RGB images

Comments:

Abstract:3D human mesh recovery from monocular RGB images aims to estimate anatomically plausible 3D human models for downstream applications, but remains challenging under partial or severe occlusions. Regression-based methods are efficient yet often produce implausible or inaccurate results in unconstrained scenarios, while diffusion-based methods provide strong generative priors for occluded regions but may weaken fidelity to rare poses due to over-reliance on generation. To address these limitations, we propose a brain-inspired synergistic framework that integrates the discriminative power of vision transformers with the generative capability of conditional diffusion models. Specifically, the ViT-based pathway extracts deterministic visual cues from visible regions, while the diffusion-based pathway synthesizes structurally coherent human body representations. To effectively bridge the two pathways, we design a diverse-consistent feature learning module to align discriminative features with generative priors, and a cross-attention multi-level fusion mechanism to enable bidirectional interaction across semantic levels. Experiments on standard benchmarks demonstrate that our method achieves superior performance on key metrics and shows strong robustness in complex real-world scenarios.

23. 【2604.21694】Efficient Logic Gate Networks for Video Copy Detection

Link: https://arxiv.org/abs/2604.21694

Authors: Katarzyna Fojcik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: diverse visual distortions, Video copy detection, detection requires robust, requires robust similarity, robust similarity estimation

Comments:

24. 【2604.21689】StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

Link: https://arxiv.org/abs/2604.21689

Authors: Kwan Yun, Changmin Lee, Ayeong Jeong, Youngseo Kim, Seungmi Lee, Junyong Noh

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)

Keywords: Creative face stylization, diverse visual idioms, Creative face, face stylization aims, retaining recognizable identity

Comments: SIGGRAPH 2026 / ACM TOG. Project page at [this https URL](https://kwanyun.github.io/StyleID_page/)

Abstract: Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths. To address this gap, we introduce StyleID, a human perception-aware dataset and evaluation framework for facial identity under stylization. StyleID comprises two datasets: (i) StyleBench-H, a benchmark that captures human same-different verification judgments across diffusion- and flow-matching-based stylization at multiple style strengths, and (ii) StyleBench-S, a supervision set derived from psychometric recognition-strength curves obtained through controlled two-alternative forced-choice (2AFC) experiments. Leveraging StyleBench-S, we fine-tune existing semantic encoders to align their similarity orderings with human perception across styles and strengths. Experiments demonstrate that our calibrated models yield significantly higher correlation with human judgments and enhanced robustness for out-of-domain, artist-drawn portraits. All of our datasets, code, and pretrained models are publicly available at this https URL.

25. 【2604.21686】WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Link: https://arxiv.org/abs/2604.21686

Authors: Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, Yongtao Ge

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: making fair cross-model, Interactive video generation, cross-model comparison impossible, video generation models, fair cross-model comparison

Comments:

点击查看摘要

Abstract:Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (this http URL), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.

26. 【2604.21681】Sapiens2

Link: https://arxiv.org/abs/2604.21681

Authors: Rawal Khirodkar,He Wen,Julieta Martinez,Yuan Dong,Su Zhaoen,Shunsuke Saito

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human-centric vision focused, focused on generalization, family of high-resolution, high-resolution transformers, transformers for human-centric

Comments: Accepted to ICLR 2026

Click to view abstract

Abstract:We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation. Code: this https URL

27. 【2604.21668】Encoder-Free Human Motion Understanding via Structured Motion Descriptions

Link: https://arxiv.org/abs/2604.21668

Authors: Yao Zhang,Zhuchenyang Liu,Thomas Ploetz,Yu Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: text-based large language, advancing rapidly, human motion understanding, including motion question, text-based large

Comments:

Click to view abstract

Abstract:The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach surpasses prior state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D). SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at this https URL.

28. 【2604.21654】Causal Disentanglement for Full-Reference Image Quality Assessment

Link: https://arxiv.org/abs/2604.21654

Authors: Zhen Zhang,Jielei Chu,Tian Zhang,Weide Liu,Fengmao Lv,Tianrui Li,Jun Cheng,Yuming Fang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: deep network-based full-reference, performing pairwise comparisons, Existing deep network-based, models typically work, network-based full-reference image

Comments:

Click to view abstract

Abstract:Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.

29. 【2604.21631】DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures

Link: https://arxiv.org/abs/2604.21631

Authors: Xu Wang,Zhiru Wang,Shiyun Xie,Chengwei Pan,Yisong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, real-time photorealistic rendering, violate multi-view consistency, achieves real-time photorealistic, performance degrades significantly

Comments: 10 pages, 6 figures, accepted to Computer Vision and Pattern Recognition Conference 2026

Click to view abstract

Abstract:While 3D Gaussian Splatting (3DGS) achieves real-time photorealistic rendering, its performance degrades significantly when training images contain transient objects that violate multi-view consistency. Existing methods face a circular dependency: accurate transient detection requires a well-reconstructed static scene, while clean reconstruction itself depends on reliable transient masks. We address this challenge with DualSplat, a Failure-to-Prior framework that converts first-pass reconstruction failures into explicit priors for a second reconstruction stage. We observe that transients, which appear in only a subset of views, often manifest as incomplete fragments during conservative initial training. We exploit these failures to construct object-level pseudo-masks by combining photometric residuals, feature mismatches, and SAM2 instance boundaries. These pseudo-masks then guide a clean second-pass 3DGS optimization, while a lightweight MLP refines them online by gradually shifting from prior supervision to self-consistency. Experiments on RobustNeRF and NeRF On-the-go show that DualSplat outperforms existing baselines, demonstrating particularly clear advantages in transient-heavy scenes and transient regions.

30. 【2604.21627】DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion

Link: https://arxiv.org/abs/2604.21627

Authors: Tahar Chettaoui,Eduarda Caldeira,Guray Ozgur,Raghavendra Ramachandra,Fadi Boutros,Naser Damer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: anticipate evolving threats, develop robust defensive, Advancing face morphing, robust defensive mechanisms, Advancing face

Comments: Accepted at CVPR-W 2026

Click to view abstract

Abstract:Advancing face morphing attack techniques is crucial to anticipate evolving threats and develop robust defensive mechanisms for identity verification systems. This work introduces DCMorph, a dual-stream diffusion-based morphing framework that simultaneously operates at both identity conditioning and latent space levels. Unlike image-level methods suffering from blending artifacts or GAN-based approaches with limited reconstruction fidelity, DCMorph leverages identity-conditioned latent diffusion models through two mechanisms: (1) decoupled cross-attention interpolation that injects identity-specific features from both source faces into the denoising process, enabling explicit dual-identity conditioning absent in existing diffusion-based methods, and (2) DDIM inversion with spherical interpolation between inverted latent representations from both source faces, providing a geometrically consistent initial latent representation that preserves structural attributes. Vulnerability analyses across four state-of-the-art face recognition systems demonstrate that DCMorph achieves the highest attack success rates compared to existing methods at both operational thresholds, while remaining difficult for current morphing attack detection solutions to detect.
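
The spherical interpolation step mentioned in this abstract can be illustrated with a generic slerp routine. This is only a minimal sketch of the standard formula, not the authors' implementation: the function name `slerp`, the toy two-dimensional "latents", and the fallback to linear interpolation for nearly parallel vectors are all assumptions made for illustration, and both inputs are assumed to be non-zero.

```python
import math

def slerp(z0, z1, t):
    """Spherical linear interpolation between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(z0, z1))
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    # Angle between the two vectors, clamped for numerical safety.
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # (Anti)parallel vectors: fall back to plain linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]

# Toy vectors standing in for the two inverted source-face latents (hypothetical).
latent_a = [1.0, 0.0]
latent_b = [0.0, 1.0]
morph_latent = slerp(latent_a, latent_b, 0.5)
```

For unit-norm inputs the interpolated vector stays on the unit sphere, which is the usual reason to prefer slerp over a straight average when blending latents.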

31. 【2604.21617】Local Neighborhood Instability in Parametric Projections: Quantitative and Visual Analysis

Link: https://arxiv.org/abs/2604.21617

Authors: Frederik L. Dennig,Daniel A. Keim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: produce unpredictable shifts, real time, analysts embed, input variations, variations from measurement

Comments: 6 pages, 3 figures, LaTeX, to appear at the 17th International EuroVis Workshop on Visual Analytics

Click to view abstract

Abstract:Parametric projections let analysts embed new points in real time, but input variations from measurement noise or data drift can produce unpredictable shifts in the 2D layout. Whether and where a projection is locally stable remains largely unexamined. In this paper, we present a stability evaluation framework that probes parametric projections with Gaussian perturbations around selected anchor points and assesses how neighborhoods deform in the 2D embedding. Our approach combines quantitative measures of mean displacement, bias, and nearest-anchor assignment error with per-anchor visualizations of displacement vectors, local PCA ellipsoids, and Voronoi misassignment for detailed inspection. We demonstrate the framework's effectiveness on UMAP- and t-SNE-based neural projectors of varying network sizes and study the effect of Jacobian regularization as a gradient-based robustness strategy. We apply our framework to the MNIST and Fashion-MNIST datasets. The results show that our framework identifies unstable projection regions invisible to reconstruction error or neighborhood-preservation metrics.
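
The probing idea in this abstract (Gaussian perturbations around anchors, then measuring how the 2D neighborhood moves) might look roughly like the sketch below. The exact metric definitions are not given in the abstract, so the formulas here are assumptions: mean displacement is taken as the average 2D distance from the anchor's projection, bias as the length of the average displacement vector, and misassignment as the fraction of perturbed points whose nearest projected anchor is a different one. `probe_stability` and `toy_projection` are hypothetical names.

```python
import math
import random

def probe_stability(project, anchors, sigma=0.1, n_samples=200, seed=0):
    """Perturb each anchor with Gaussian noise and summarize the 2D response."""
    rng = random.Random(seed)
    anchor_2d = [project(a) for a in anchors]
    report = []
    for i, anchor in enumerate(anchors):
        displacements = []
        wrong = 0
        for _ in range(n_samples):
            noisy = [v + rng.gauss(0.0, sigma) for v in anchor]
            y = project(noisy)
            displacements.append((y[0] - anchor_2d[i][0], y[1] - anchor_2d[i][1]))
            # Which projected anchor is closest to the perturbed point in 2D?
            nearest = min(range(len(anchors)), key=lambda j: math.dist(y, anchor_2d[j]))
            wrong += nearest != i
        mean_disp = sum(math.hypot(dx, dy) for dx, dy in displacements) / n_samples
        mean_dx = sum(dx for dx, _ in displacements) / n_samples
        mean_dy = sum(dy for _, dy in displacements) / n_samples
        report.append({
            "mean_displacement": mean_disp,
            "bias": math.hypot(mean_dx, mean_dy),
            "misassignment_rate": wrong / n_samples,
        })
    return report

def toy_projection(x):
    # Hypothetical stand-in for a trained neural projector: keep two coordinates.
    return (x[0], x[1])

anchors = [[0.0, 0.0, 0.0], [10.0, 10.0, 0.0]]
report = probe_stability(toy_projection, anchors, sigma=0.1)
```

In practice `project` would be the trained parametric projector, and regions where the displacement or misassignment rate is high would be the candidates for closer visual inspection.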

32. 【2604.21592】Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

Link: https://arxiv.org/abs/2604.21592

Authors: Minghao Yin,Wenbo Hu,Jiale Xu,Ying Shan,Kai Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: prohibitive computational demand, yielded remarkable progress, generation remains elusive, Recent breakthroughs, static shape synthesis

Comments:

Click to view abstract

Abstract:Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design models complex spatiotemporal dependencies with high fidelity while sidestepping the quadratic overhead of full attention and reducing the network's total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally coherent 4D synthesis and charts a path toward efficient and scalable 4D generation.

33. 【2604.21575】OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction

Link: https://arxiv.org/abs/2604.21575

Authors: Zeyu Cai,Yuliang Xiu,Renke Wang,Zhijing Shao,Xiaoben Li,Siyuan Yu,Chao Xu,Yang Liu,Baigui Sun,Jian Yang,Zhenyu Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: clothed human assets, underlying body model, clothed human, extensively studied, approaches focus

Comments: Project Page: [this https URL](https://zcai0612.github.io/OmniFit/)

Click to view abstract

Abstract:Fitting an underlying body model to 3D clothed human assets has been extensively studied, yet most approaches focus on either single-modal inputs such as point clouds or multi-view images alone, often requiring a known metric scale. This constraint is frequently impractical, especially for AI-generated assets where scale distortion is common. We propose OmniFit, a method that can seamlessly handle diverse multi-modal inputs, including full scans, partial depth observations, and image captures, while remaining scale-agnostic for both real and synthetic assets. Our key innovation is a simple yet effective conditional transformer decoder that directly maps surface points to dense body landmarks, which are then used for SMPL-X parameter fitting. In addition, an optional plug-and-play image adapter incorporates visual cues to compensate for missing geometric information. We further introduce a dedicated scale predictor that rescales subjects to canonical body proportions. OmniFit substantially outperforms state-of-the-art methods by 57.1 to 80.9 percent across daily and loose clothing scenarios. To the best of our knowledge, it is the first body fitting method to surpass multi-view optimization baselines and the first to achieve millimeter-level accuracy on the CAPE and 4D-DRESS benchmarks.

34. 【2604.21573】CHRep: Cross-modal Histology Representation and Post-hoc Calibration for Spatial Gene Expression Prediction

Link: https://arxiv.org/abs/2604.21573

Authors: Changfan Wang,Xinran Wang,Donghai Liu,Fei Su,Lulu Sun,Zhicheng Zhao,Zhu Meng

Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Keywords: enables spatially resolved, limiting large-cohort studies, spatially resolved gene, resolved gene profiling, enables spatially

Comments:

Click to view abstract

Abstract:Spatial transcriptomics (ST) enables spatially resolved gene profiling but remains expensive and low-throughput, limiting large-cohort studies and routine clinical use. Predicting spatial gene expression from routine hematoxylin and eosin (HE) slides is a promising alternative, yet under realistic leave-one-slide-out evaluation, existing models often suffer from slide-level appearance shifts and regression-driven over-smoothing that suppress biologically meaningful variation. CHRep is a two-phase framework for robust histology-to-expression prediction. In the training phase, CHRep learns a structure-aware representation by jointly optimizing correlation-aware regression, symmetric image-expression alignment, and coordinate-induced spatial topology regularization. In the inference phase, cross-slide robustness is improved without backbone fine-tuning through a lightweight calibration module trained on the training slides, which combines a non-parametric estimate from a training gallery with a magnitude-regularized correction module. Unlike prior embedding-alignment or retrieval-based transfer methods that rely on a single prediction route, CHRep couples topology-preserving representation learning with post-hoc calibration, enabling stable neighborhood retrieval and controlled bias correction under slide-level shifts. Across the three cohorts, CHRep consistently improves gene-wise correlation under leave-one-slide-out evaluation, with the largest gains observed on Alex+10x. Relative to HAGE, the Pearson correlation coefficient on all considered genes [PCC(ACG)] increases by 4.0% on cSCC and 9.8% on HER2+. Relative to mclSTExp, PCC(ACG) further improves by 39.5% on Alex+10x, together with 9.7% and 9.0% reductions in mean squared error (MSE) and mean absolute error (MAE), respectively.

35. 【2604.21572】Deep kernel video approximation for unsupervised action segmentation

Link: https://arxiv.org/abs/2604.21572

Authors: Silvia L. Pintea,Jouke Dijkstra

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unsupervised action segmentation, storing large datasets, per-video unsupervised action, action segmentation, unsupervised action

Comments: Accepted at ICPR 2026

Click to view abstract

Abstract:This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible or not permitted. We propose to segment videos by learning in a deep kernel space to approximate the underlying frame distribution as closely as possible. To define this closeness between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD), which is a geometry-preserving metric in distribution space and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize and faster. We use neural tangent kernels (NTKs) to define the kernel space where MMD operates, both because of their improved descriptive power compared with fixed kernels and because NTKs sidestep the trivial solution when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results compared to state-of-the-art per-video methods on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work when the number of segments is unknown.
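
For readers unfamiliar with MMD, the quantity this abstract optimizes can be sketched with the standard biased estimator of squared MMD. Note the swap: the paper uses neural tangent kernels, whereas the sketch below uses a fixed RBF kernel purely as a simple stand-in, and the names `rbf` and `mmd2` as well as the toy one-dimensional "frame features" are assumptions for illustration.

```python
import math

def rbf(x, y, gamma=1.0):
    """Fixed RBF kernel, used here only as a stand-in for a learned/NTK kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, kernel=rbf):
    """Biased estimate of squared MMD between two samples of feature vectors."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

# Toy 1-D "frame features": one original sample and two candidate approximations.
frames = [[0.0], [1.0]]
close_approx = [[0.1], [0.9]]
far_approx = [[5.0], [6.0]]
print(mmd2(frames, close_approx), mmd2(frames, far_approx))
```

A good approximation of the frame distribution drives this value toward zero, which is the sense in which MMD serves as the "closeness" measure between the video and its approximation.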

36. 【2604.21546】Component-Based Out-of-Distribution Detection

Link: https://arxiv.org/abs/2604.21546

Authors: Wenrui Liu,Hong Chang,Ruibing Hou,Shiguang Shan,Xilin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: detection requires sensitivity, natural In-Distribution, requires sensitivity, sensitivity to subtle, overreacting to natural

Comments:

Click to view abstract

Abstract:Out-of-Distribution (OOD) detection requires sensitivity to subtle shifts without overreacting to natural In-Distribution (ID) diversity. However, from the viewpoint of detection granularity, global representations inevitably suppress local OOD cues, while patch-based methods are unstable due to entangled spurious correlations and noise. Neither is effective at detecting compositional OODs composed of valid ID components. Inspired by recognition-by-components theory, we present a training-free Component-Based OOD Detection (CoOD) framework that addresses these limitations by decomposing inputs into functional components. To instantiate CoOD, we derive a Component Shift Score (CSS) to detect local appearance shifts and a Compositional Consistency Score (CCS) to identify cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.

37. 【2604.21530】Attention-based multiple instance learning for predominant growth pattern prediction in lung adenocarcinoma wsi using foundation models

Link: https://arxiv.org/abs/2604.21530

Authors: Laura Valeria Perez-Herrera,M.J. Garcia-Gonzalez,Karen Lopez-Linares

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: influence treatment decisions, Lung adenocarcinoma, accurately identifying growth, grading depends, treatment decisions

Comments:

Click to view abstract

Abstract:Lung adenocarcinoma (LUAD) grading depends on accurately identifying growth patterns, which are indicators of prognosis and can influence treatment decisions. Common deep learning approaches to determine the predominant pattern rely on patch-level classification or segmentation, requiring extensive annotations. This study proposes an attention-based multiple instance learning (ABMIL) framework to predict the predominant LUAD growth pattern at the whole slide level to reduce annotation burden. Our approach integrates pretrained pathology foundation models as patch encoders, used either frozen or fine-tuned on annotated patches, to extract discriminative features that are aggregated through attention mechanisms. Experiments show that fine-tuned encoders improve performance, with Prov-GigaPath achieving the highest agreement (κ = 0.699) under ABMIL. Compared to simple patch-aggregation baselines, ABMIL yields more robust predictions by leveraging slide-level supervision and spatial attention. Future work will extend this framework to estimate the full distribution of growth patterns and validate performance on external cohorts.

38. 【2604.21523】Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

Link: https://arxiv.org/abs/2604.21523

Authors: Mohammed Safi Ur Rahman Khan,Sanjay Suryanarayanan,Tushar Anand,Mitesh M. Khapra

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Large Vision-Language Models, Large Vision-Language, Vision-Language Models, visual question answering, Evaluator VLMs

Comments:

Click to view abstract

39. 【2604.21519】Gmd: Gaussian mixture descriptor for pair matching of 3D fragments

Link: https://arxiv.org/abs/2604.21519

Authors: Meijun Xiong,Zhenguo Shi,Xinyu Zhou,Yuhe Zhang,Shunli Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Mixture Model, Gaussian Mixture Descriptor, Gaussian Mixture, reconstruct objects, fractured surfaces

Comments: 24 pages, 10 figures. Published in Multimedia Systems

Click to view abstract

Abstract:In the automatic reassembly of fragments acquired using laser scanners to reconstruct objects, a crucial step is the matching of fractured surfaces. In this paper, we propose a novel local descriptor that uses the Gaussian Mixture Model (GMM) to fit the distribution of points, allowing for the description and matching of fractured surfaces of fragments. Our method involves dividing a local surface patch into concave and convex regions for estimating the k value of the GMM. The final Gaussian Mixture Descriptor (GMD) of the fractured surface is then formed by merging the regional GMDs. To measure the similarities between GMDs for determining adjacent fragments, we employ the L2 distance and align the fragments using Random Sample Consensus (RANSAC) and Iterative Closest Point (ICP). Extensive experiments on real-scanned public datasets and Terracotta datasets demonstrate the effectiveness of our approach, and comparisons with several existing methods further validate its advantage.

40. 【2604.21502】VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection

Link: https://arxiv.org/abs/2604.21502

Authors: Yupeng Zhang,Ruize Han,Ningnan Guo,Wei Feng,Song Wang,Liang Wan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant domain shifts, single source domain, leading detectors trained, domain shifts, real-world scenarios

Comments:

Click to view abstract

Abstract:In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning, but pay limited attention to detector mechanisms, leaving clear limitations under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by increasing missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations also becomes harder to maintain in the decoding stage. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing SOTA methods on standard SDGOD benchmarks and two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.

41. 【2604.21479】Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction

Link: https://arxiv.org/abs/2604.21479

Authors: Yanjiao Liu,Jiawei Liu,Xun Gong,Zifei Nie

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large language models, attracted increasing research, increasing research attention, Large language, recently demonstrated strong

Comments:

Click to view abstract

Abstract:Large language models (LLMs) have recently demonstrated strong reasoning capabilities and attracted increasing research attention in the field of autonomous driving (AD). However, safe application of LLMs to AD perception and prediction still requires a thorough understanding of both the dynamic traffic agents and the static road infrastructure. To this end, this study introduces a framework to evaluate the capability of LLMs in understanding the behaviors of dynamic traffic agents and the topology of road networks. The framework leverages frozen LLMs as the reasoning engine, employing a traffic encoder to extract spatial-level scene features from observed trajectories of agents, while a lightweight Convolutional Neural Network (CNN) encodes the local high-definition (HD) maps. To assess the intrinsic reasoning ability of LLMs, the extracted scene features are then transformed into LLM-compatible tokens via a reprogramming adapter. By placing the prediction burden on the LLMs, a simpler linear decoder is applied to output future trajectories. The framework enables a quantitative analysis of the influence of multi-modal information, especially the impact of map semantics on trajectory prediction accuracy, and allows seamless integration of frozen LLMs with minimal adaptation, thereby demonstrating strong generalizability across diverse LLM architectures and providing a unified platform for model evaluation.

42. 【2604.21478】Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts

Link: https://arxiv.org/abs/2604.21478

Authors: Yuhan Luo,Tao Chen,Decheng Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasingly important role, visual data forgery, forgery detection plays, generative models

Comments: The source code is available at [this https URL](https://github.com/Yuhan-Luo/Semantic-Fine-grained-Alignment-and-Mixture-of-Experts)

Click to view abstract

Abstract:Visual data forgery detection plays an increasingly important role in social and economic security given the rapid development of generative models. Existing face forgery detectors still cannot achieve satisfactory performance because they generalize poorly across datasets. A key factor behind this is the lack of suitable metrics: the commonly used cross-dataset AUC metric fails to reveal an important issue, namely that detection scores may shift significantly across data domains. To explicitly evaluate cross-domain score comparability, we propose Cross-AUC, an evaluation metric that computes AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another (and vice versa). Evaluating representative detectors under the Cross-AUC metric reveals substantial performance drops, exposing an overlooked robustness problem. We also propose a novel framework, Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM), consisting of a patch-level image-text alignment module that enhances CLIP's sensitivity to manipulation artifacts and a facial region mixture-of-experts module that routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive qualitative and quantitative experiments on public datasets show that the proposed method achieves superior performance compared with state-of-the-art methods under various metrics.
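
The Cross-AUC idea described in this abstract (real samples from one dataset contrasted with fake samples from another, in both directions) can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' code: the function names, the dictionary layout, the toy scores, and the convention that a higher score means "more likely fake" are all assumptions. AUC is computed with the pairwise (Mann-Whitney) formulation, counting ties as half.

```python
def auc(real_scores, fake_scores):
    """Probability that a random fake sample scores above a random real one."""
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))

def cross_auc(scores_a, scores_b):
    """AUC across a dataset pair: reals of one dataset vs. fakes of the other."""
    return {
        "real_a_vs_fake_b": auc(scores_a["real"], scores_b["fake"]),
        "real_b_vs_fake_a": auc(scores_b["real"], scores_a["fake"]),
    }

# Hypothetical detector scores; dataset B's scores are shifted upward overall.
dataset_a = {"real": [0.10, 0.20], "fake": [0.60, 0.70]}
dataset_b = {"real": [0.50, 0.65], "fake": [0.90, 0.95]}

within_a = auc(dataset_a["real"], dataset_a["fake"])
within_b = auc(dataset_b["real"], dataset_b["fake"])
across = cross_auc(dataset_a, dataset_b)
```

In this toy case each dataset is perfectly separable on its own, yet one cross direction drops below 1.0 because real samples from the shifted dataset overlap with fake samples from the other, which is the kind of score shift the metric is meant to expose.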

43. 【2604.21465】ID-Eraser: Proactive Defense Against Face Swapping via Identity Perturbation

Link: https://arxiv.org/abs/2604.21465

Authors: Junyan Luo,Peipeng Yu,Jianwei Fei,Shiya Zeng,Xiaoyu Zhou,Zhihua Xia,Xiang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digital security, technologies have rapidly, rapidly advanced, advanced with modern, modern generative

Comments:

Click to view abstract

Abstract:Deepfake technologies have rapidly advanced with modern generative AI, and face swapping in particular poses serious threats to privacy and digital security. Existing proactive defenses mostly rely on pixel-level perturbations, which are ineffective against contemporary swapping models that extract robust high-level identity embeddings. We propose ID-Eraser, a feature-space proactive defense that removes identifiable facial information to prevent malicious face swapping. By injecting learnable perturbations into identity embeddings and reconstructing natural-looking protection images through a Face Revive Generator (FRG), ID-Eraser produces visually realistic results for humans while rendering the protected identities unusable for Deepfake models. Experiments show that ID-Eraser substantially disrupts identity recognition across diverse face recognition and swapping systems under strict black-box settings, achieving the lowest Top-1 accuracy (0.30) with the best FID (1.64) and LPIPS (0.020). Compared with swaps generated from clean inputs, the identity similarity of protected swaps drops sharply to an average of 0.504 across five representative face swapping models. ID-Eraser further demonstrates strong cross-dataset generalization, robustness to common distortions, and practical effectiveness on commercial APIs, reducing Tencent API similarity from 0.76 to 0.36.

44. 【2604.21461】Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

Link: https://arxiv.org/abs/2604.21461

Authors: Chentao Li,Zirui Gao,Mingze Gao,Yinglian Ren,Jianjiang Feng,Jie Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Keywords: natural language commands, resolve referential ambiguities, Multimodal Large Language, Large Language Models, smart glasses

Comments: 20 pages, 14 figures. Committed to ACL 2026

Click to view abstract

Abstract:Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: this https URL

45. 【2604.21453】Instance-level Visual Active Tracking with Occlusion-Aware Planning

Link: https://arxiv.org/abs/2604.21453

Authors: Haowei Sun,Kai Zhou,Hao Gao,Shiteng Zhang,Jinwu Hu,Xutao Wen,Qixiang Ye,Mingkui Tan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Active Tracking, Visual Active, aims to control, security surveillance, control cameras

Comments: CVPR 2026 Poster

Click to view abstract

Abstract:Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.

46. 【2604.21450】VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution

Link: https://arxiv.org/abs/2604.21450

Authors: Yixuan Zhu,Shilin Ma,Haolin Wang,Ao Li,Yanzhe Jing,Yansong Tang,Lei Chen,Jiwen Lu,Jie Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Recent advancements, real-world image super-resolution, highlighting their potential, advancements in visual, demonstrated their effectiveness

Comments: Accepted in ICLR 2026. Code is available at [this https URL](https://github.com/EternalEvan/VARestorer)

Click to view abstract

Abstract:Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). However, adapting VAR for ISR presents critical challenges. The next-scale prediction mechanism, constrained by causal attention, fails to fully exploit global low-quality (LQ) context, resulting in blurry and inconsistent high-quality (HQ) outputs. Additionally, error accumulation in the iterative prediction severely degrades coherence in the ISR task. To address these issues, we propose VARestorer, a simple yet effective distillation framework that transforms a pre-trained text-to-image VAR model into a one-step ISR model. By leveraging distribution matching, our method eliminates the need for iterative refinement, significantly reducing error propagation and inference time. Furthermore, we introduce pyramid image conditioning with cross-scale attention, which enables bidirectional scale-wise interactions and fully utilizes the input image information while adapting to the autoregressive mechanism. This prevents later LQ tokens from being overlooked in the transformer. By fine-tuning only 1.2% of the model parameters through parameter-efficient adapters, our method maintains the expressive power of the original VAR model while significantly enhancing efficiency. Extensive experiments show that VARestorer achieves state-of-the-art performance with 72.32 MUSIQ and 0.7669 CLIPIQA on the DIV2K dataset, while accelerating inference by 10 times compared to conventional VAR inference.

47. 【2604.21442】2L-LSH: A Locality-Sensitive Hash Function-Based Method For Rapid Point Cloud Indexing

Link: https://arxiv.org/abs/2604.21442

Authors: Shurui Wang,Yuhe Zhang,Ruizhe Guo,Yaning Zhang,Yifei Xie,Xinyu Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: presenting significant challenges, point cloud models, point cloud processing, massive point cloud, Kd-tree and Octree

Comments: 13 pages, 13 figures. Published in The Computer Journal

Click to view abstract

Abstract:The development of 3D scanning technology has enabled the acquisition of massive point cloud models with diverse structures and large scales, thereby presenting significant challenges in point cloud processing. Fast neighboring points search is one of the most common problems, which is frequently used in model reconstruction, classification, retrieval and feature visualization. Hash functions are well known for their high-speed and accurate performance in searching high-dimensional data, and they form the core of the proposed 2L-LSH. Specifically, the 2L-LSH algorithm adopts a two-step hash function strategy, in which the first step divides the bounding box of the point cloud model and the second step constructs a generalized table-based data structure. The proposed 2L-LSH offers a highly efficient and accurate solution for fast neighboring points search in large-scale 3D point cloud models, making it a promising technique for various applications in the field. The proposed algorithm is compared with well-known methods including Kd-tree and Octree; the obtained results demonstrate that the proposed method outperforms Kd-tree and Octree in terms of speed, i.e. the time consumption of kNN search can be 51.111% and 94.159% lower than Kd-tree and Octree, respectively, and the RN search time can be 54.519% and 41.840% lower than Kd-tree and Octree, respectively.
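
To make the idea of hash-based neighbor search concrete, here is a minimal sketch of a plain single-level grid hash with a radius query. It is not the paper's 2L-LSH: the abstract does not specify its hash functions or its second-level table, so the cell size, the function names `build_grid` and `radius_search`, and the toy points are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def build_grid(points, cell):
    """Hash each point index into the bucket of the grid cell that contains it."""
    table = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        key = (math.floor(x / cell), math.floor(y / cell), math.floor(z / cell))
        table[key].append(idx)
    return table


def radius_search(points, table, cell, query, radius):
    """Return the sorted indices of all points within `radius` of `query`."""
    # Number of neighbouring cells to visit in each direction.
    reach = math.ceil(radius / cell)
    qx, qy, qz = (math.floor(c / cell) for c in query)
    hits = []
    for dx, dy, dz in product(range(-reach, reach + 1), repeat=3):
        # .get avoids creating empty buckets in the defaultdict.
        for idx in table.get((qx + dx, qy + dy, qz + dz), ()):
            if math.dist(points[idx], query) <= radius:
                hits.append(idx)
    return sorted(hits)


# Hypothetical toy point cloud.
points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (2.0, 2.0, 2.0), (-0.4, 0.1, 0.0)]
cell = 1.0
table = build_grid(points, cell)
neighbors = radius_search(points, table, cell, query=(0.1, 0.0, 0.0), radius=0.6)
print(neighbors)
```

Only the buckets around the query cell are scanned, so the cost of a query depends on local point density rather than on the size of the whole cloud.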

48. 【2604.21435】UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery

Link: https://arxiv.org/abs/2604.21435

Authors: Jingfang Li,Haoran Zhu,Wen Yang,Jinrui Zhang,Fang Xu,Haijian Zhang,Gui-Song Xia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: modern remote sensing, offering unprecedented spatial, remote sensing, offering unprecedented, essential for modern

Comments:

Click to view abstract

Abstract:Ultra-High-Resolution (UHR) imagery has become essential for modern remote sensing, offering unprecedented spatial coverage. However, detecting small objects in such vast scenes presents a critical dilemma: retaining the original resolution for small objects causes prohibitive memory bottlenecks. Conversely, conventional compromises like image downsampling or patch cropping either erase small objects or destroy context. To break this dilemma, we propose UHR-DETR, an efficient end-to-end transformer-based detector designed for UHR imagery. First, we introduce a Coverage-Maximizing Sparse Encoder that dynamically allocates finite computational resources to informative high-resolution regions, ensuring maximum object coverage with minimal spatial redundancy. Second, we design a Global-Local Decoupled Decoder. By integrating macroscopic scene awareness with microscopic object details, this module resolves semantic ambiguities and prevents scene fragmentation. Extensive experiments on the UHR imagery datasets (e.g., STAR and SODA-A) demonstrate the superiority of UHR-DETR under strict hardware constraints (e.g., a single 24GB RTX 3090). It achieves a 2.8% mAP improvement while delivering a 10× inference speedup compared to standard sliding-window baselines on the STAR dataset. Our codes and models will be available at this https URL.

49. 【2604.21422】Pre-process for segmentation task with nonlinear diffusion filters

Link: https://arxiv.org/abs/2604.21422

Authors: Javier Sanguino,Carlos Platero,Olga Velasco

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: nonlinear diffusion, nonlinear diffusion equation, Toggle, nonlinear diffusion filters, diffusion

Comments: Manuscript from 2017, previously unpublished, 37 pages

Click to view abstract

Abstract:This paper deals with the use of nonlinear diffusion filters to obtain piecewise constant images as a preprocessing step for segmentation techniques. We first show an intrinsic formulation for the nonlinear diffusion equation to provide some design conditions on the diffusion filters. According to this theoretical framework, we propose a new family of diffusivities; they are obtained from nonlinear diffusion techniques and are related to backward diffusion. Their goal is to split the image into closed contours with a homogenized grey intensity inside and with no blurred edges. We also prove that our filters satisfy the well-posedness semi-discrete and full discrete scale-space requirements. This shows that by using semi-implicit schemes, a forward nonlinear diffusion equation is solved, instead of a backward nonlinear diffusion equation, connecting with an edge-preserving process. Under the conditions established for the diffusivity and using a stopping criterion for the diffusion time, we get piecewise constant images with a low computational effort. Finally, we test our filter with real images and we illustrate the effects of our diffusivity function as a method to get piecewise constant images. The code is available at this https URL.
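
For readers unfamiliar with nonlinear diffusion filtering, the following is a minimal 1-D sketch. It uses the classic Perona-Malik diffusivity g(s) = 1 / (1 + (s / kappa)^2) with an explicit time step, not the new family of diffusivities or the semi-implicit schemes proposed in the paper; the function name, the parameter values, and the test signal are assumptions for illustration only.

```python
def perona_malik_1d(signal, steps=50, kappa=0.1, dt=0.2):
    """Explicit nonlinear diffusion on a 1-D signal with zero-flux boundaries.

    Assumption: classic Perona-Malik diffusivity, not the paper's diffusivity.
    With diffusivity <= 1 and dt <= 0.5 each update is a convex combination
    of neighbouring values, so the scheme cannot overshoot the input range.
    """
    u = [float(v) for v in signal]
    n = len(u)
    for _ in range(steps):
        # Forward differences: grad[i] = u[i + 1] - u[i].
        grad = [u[i + 1] - u[i] for i in range(n - 1)]
        # Flux = diffusivity * gradient; the diffusivity is close to 1 in
        # flat regions and close to 0 across jumps much larger than kappa.
        flux = [g / (1.0 + (g / kappa) ** 2) for g in grad]
        nxt = u[:]
        for i in range(n):
            right = flux[i] if i < n - 1 else 0.0
            left = flux[i - 1] if i > 0 else 0.0
            nxt[i] = u[i] + dt * (right - left)
        u = nxt
    return u


# Hypothetical noisy step edge between two roughly constant plateaus.
noisy_step = [0.0, 0.1, -0.05, 0.05, 0.0, 1.0, 1.05, 0.95, 1.1, 1.0]
smoothed = perona_malik_1d(noisy_step)
print([round(v, 3) for v in smoothed])
```

Because the diffusivity shrinks as the local gradient grows, small fluctuations inside each plateau are smoothed much faster than the large jump between them, which is the basic mechanism behind edge-preserving, piecewise constant results.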


MSC classes:
68U10 (Image processing), 68T45 (Machine vision and scene understanding), 65M06 (Finite difference methods)


Submission history: [v1] Thu, 23 Apr 2026 08:38:45 UTC (1,261 KB), from Carlos Platero
