本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新400篇论文,其中:
- 自然语言处理75篇
- 信息检索12篇
- 计算机视觉79篇
自然语言处理
1. 【2604.21928】Evaluation of Automatic Speech Recognition Using Generative Large Language Models
链接:https://arxiv.org/abs/2604.21928
作者:Thibault Bañeras-Roux,Shashi Kumar,Driss Khalil,Sergio Burdisso,Petr Motlicek,Shiran Liu,Mickael Rouvier,Jane Wottawa,Richard Dufour
类目:Computation and Language (cs.CL)
关键词:Automatic Speech Recognition, Word Error Rate, Automatic Speech, Speech Recognition, evaluated using Word
备注:
点击查看摘要
Abstract:Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92--94\% agreement with human annotators for hypothesis selection, compared to 63\% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
2. 【2604.21916】MathDuels: Evaluating LLMs as Problem Posers and Solvers
链接:https://arxiv.org/abs/2604.21916
作者:Zhiqiu Xu,Shibo Jin,Shreya Arya,Mayur Naik
类目:Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:attain near-ceiling performance, static mathematical benchmarks, language models attain, models attain near-ceiling, cast models solely
备注:
点击查看摘要
Abstract:As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models reveal that authoring and solving capabilities are partially decoupled, and that dual-role evaluation reveals capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
3. 【2604.21911】When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
链接:https://arxiv.org/abs/2604.21911
作者:Pegah Khayatan,Jayneel Parekh,Arnaud Dapogny,Mustafa Shukor,Alasdair Newson,Matthieu Cord
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:systems remain vulnerable, large vision-language models, impressive progress, progress in capabilities, capabilities of large
备注:
点击查看摘要
Abstract:Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, We propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at this https URL .
4. 【2604.21901】GiVA: Gradient-Informed Bases for Vector-Based Adaptation
链接:https://arxiv.org/abs/2604.21901
作者:Neeraj Gangwar,Rishabh Deshmukh,Michael Shavlovsky,Hancao Li,Vivek Mittal,Lexing Ying,Nickvash Kani
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:model sizes continue, parameter-efficient fine-tuning, continue to grow, full fine-tuning, model sizes
备注: Accepted to AISTATS 2026
点击查看摘要
Abstract:As model sizes continue to grow, parameter-efficient fine-tuning has emerged as a powerful alternative to full fine-tuning. While LoRA is widely adopted among these methods, recent research has explored vector-based adaptation methods due to their extreme parameter efficiency. However, these methods typically require substantially higher ranks than LoRA to match its performance, leading to increased training costs. This work introduces GiVA, a gradient-based initialization strategy for vector-based adaptation. It achieves training times comparable to LoRA and maintains the extreme parameter efficiency of vector-based adaptation. We evaluate GiVA across diverse benchmarks, including natural language understanding, natural language generation, and image classification. Experiments show that our approach consistently outperforms or achieves performance competitive with existing vector-based adaptation methods and LoRA while reducing rank requirements by a factor of eight ($8\times$).
5. 【2604.21897】Mapping the Political Discourse in the Brazilian Chamber of Deputies: A Multi-Faceted Computational Approach
链接:https://arxiv.org/abs/2604.21897
作者:Flávio Soriano,Victoria F. Mello,Pedro B. Rigueira,Gisele L. Pappa,Wagner Meira Jr.,Ana Paula Couto da Silva,Jussara M. Almeida
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:voting records, overlooking the rich, political speech, behavior often rely, rely on voting
备注: Accepted paper at ICWSM 2026
点击查看摘要
Abstract:Analyses of legislative behavior often rely on voting records, overlooking the rich semantic and rhetorical content of political speech. In this paper, we ask three complementary questions about parliamentary discourse: how things are said, what is being said, and who is speaking in discursively similar ways. To answer these questions, we introduce a scalable and generalizable computational framework that combines diachronic stylometric analysis, contextual topic modeling, and semantic clustering of deputies' speeches. We apply this framework to a large-scale case study of the Brazilian Chamber of Deputies, using a corpus of over 450,000 speeches from 2003 to 2025. Our results show a long-term stylistic shift toward shorter and more direct speeches, a legislative agenda that reorients sharply in response to national crises, and a granular map of discursive alignments in which regional and gender identities often prove more salient than formal party affiliation. More broadly, this work offers a robust methodology for analyzing parliamentary discourse as a multidimensional phenomenon that complements traditional vote-based approaches.
6. 【2604.21890】EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents
链接:https://arxiv.org/abs/2604.21890
作者:Praval Sharma,Ashok Samal,Leen-Kiat Soh,Deepti Joshi
类目:Computation and Language (cs.CL)
关键词:Event extraction identifies, identifies the central, central aspects, Event extraction, Event
备注:
点击查看摘要
Abstract:Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified dataset in open-domain settings. To address these limitations, we create EVENT5Ws , a large, manually annotated, and statistically verified open-domain event extraction dataset. We design a systematic annotation pipeline to create the dataset and provide empirical insights into annotation complexity. Using EVENT5Ws, we evaluate state-of-the-art pre-trained large language models and establish a benchmark for future research. We further show that models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, which demonstrates its potential for developing generalizable algorithms. Finally, we summarize the lessons learned during the dataset development and provide recommendations to support future large-scale dataset development.
7. 【2604.21889】ngIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
链接:https://arxiv.org/abs/2604.21889
作者:Jun Wang,Ziyin Zhang,Rui Wang,Hang Yu,Peng Di,Rui Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:large-scale cloud-native services, massive financial losses, diminished user trust, Real-time detection, cloud-native services
备注: Accepted to ACL 2026 Industry Track
点击查看摘要
Abstract:Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95\% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.
8. 【2604.21885】A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents
链接:https://arxiv.org/abs/2604.21885
作者:Praval Sharma
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Event extraction, open-domain event extraction, Event, event extraction approaches, open-domain event
备注:
点击查看摘要
Abstract:Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. Additionally, they do not explicitly model document-level contextual, structural, and semantic reasoning, which are crucial for effective event extraction but remain challenging for LLMs due to lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose multimodal open-domain event extraction, MODEE , a novel approach for open-domain event extraction that combines graph-based learning with text-based representation from LLMs to model document-level reasoning. Empirical evaluations on large datasets demonstrate that MODEE outperforms state-of-the-art open-domain event extraction approaches and can be generalized to closed-domain event extraction, where it outperforms existing algorithms.
9. 【2604.21882】Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms
链接:https://arxiv.org/abs/2604.21882
作者:Yuto Nishida,Naoki Shikoda,Yosuke Kishinami,Ryo Fujii,Makoto Morishita,Hidetaka Kamigaito,Taro Watanabe
类目:Computation and Language (cs.CL)
关键词:knowledge large language, Understanding what kinds, factual knowledge large, large language models, memorize is essential
备注: Accepted to ACL 2026 Main
点击查看摘要
Abstract:Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.
10. 【2604.21871】Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions
链接:https://arxiv.org/abs/2604.21871
作者:Jiseon Kim,Jea Kwon,Luiz Felipe Vecchietti,Wenchao Dong,Jaehong Kim,Meeyoung Cha
类目:Computation and Language (cs.CL)
关键词:interpersonal relationships, context-dependent and modulated, modulated by interpersonal, predicted human behavior, predicted human
备注: ACL-Findings 2026
点击查看摘要
Abstract:Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.
11. 【2604.21794】Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
链接:https://arxiv.org/abs/2604.21794
作者:Ye Yu,Heming Liu,Haibo Jin,Xiaopeng Yuan,Peng Kuang,Haohan Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:large language models, shown strong performance, complex reasoning tasks, treating inter-agent communication, fixed interface
备注: Under review at COLM 2026
点击查看摘要
Abstract:Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.
12. 【2604.21782】SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning
链接:https://arxiv.org/abs/2604.21782
作者:Hans Ole Hatzel,Ekaterina Artemova,Haimo Paul Stiemer,Evelyn Gius,Chris Biemann
类目:Computation and Language (cs.CL)
关键词:narrative representation learning, present the shared, narrative similarity, NSNRL, narrative
备注:
点击查看摘要
Abstract:We present the shared task on narrative similarity and narrative representation learning - NSNRL (pronounced "nass-na-rel"). The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story. We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations. We collected at least two annotations each for more than 1,000 story summary triples, with each annotation being backed by at least two annotators in agreement. This paper describes the sampling and annotation process for the dataset; further, we give an overview of the submitted systems and the techniques they employ. We received a total of 71 final submissions from 46 teams across our two tracks. In our triple-based classification setup, LLM ensembles make up many of the top-scoring systems, while in the embedding setup, systems with pre- and post-processing on pretrained embedding models perform about on par with custom fine-tuned solutions. Our analysis identifies potential headroom for improvement of automated systems in both tracks. The task website includes visualizations of embeddings alongside instance-level classification results for all teams.
13. 【2604.21767】Misinformation Span Detection in Videos via Audio Transcripts
链接:https://arxiv.org/abs/2604.21767
作者:Breno Matos,Rennan C. Lima,Savvas Zannettou,Fabricio Benevenuto,Rodrygo L.T. Santos
类目:Computation and Language (cs.CL); Social and Information Networks (cs.SI)
关键词:yielding severe consequences, public health risks, including political polarization, including online social, misinformation
备注: Accepted at ICWSM 2026
点击查看摘要
Abstract:Online misinformation is one of the most challenging issues lately, yielding severe consequences, including political polarization, attacks on democracy, and public health risks. Misinformation manifests in any platform with a large user base, including online social networks and messaging apps. It permeates all media and content forms, including images, text, audio, and video. Distinctly, video-based misinformation represents a multifaceted challenge for fact-checkers, given the ease with which individuals can record and upload videos on various video-sharing platforms. Previous research efforts investigated detecting video-based misinformation, focusing on whether a video shares misinformation or not on a video level. While this approach is useful, it only provides a limited and non-easily interpretable view of the problem given that it does not provide an additional context of when misinformation occurs within videos and what content (i.e., claims) are responsible for the video's misinformation nature. In this work, we attempt to bridge this research gap by creating two novel datasets that allow us to explore misinformation detection on videos via audio transcripts, focusing on identifying the span of videos that are responsible for the video's misinformation claim (misinformation span detection). We present two new datasets for this task. We transcribe each video's audio to text, identifying the video segment in which the misinformation claims appears, resulting in two datasets of more than 500 videos with over 2,400 segments containing annotated fact-checked claims. Then, we employ classifiers built with state-of-the-art language models, and our results show that we can identify in which part of a video there is misinformation with an F1 score of 0.68. We make publicly available our annotated datasets. We also release all transcripts, audio and videos.
Comments:
Accepted at ICWSM 2026
Subjects:
Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Cite as:
arXiv:2604.21767 [cs.CL]
(or
arXiv:2604.21767v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2604.21767
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
14. 【2604.21766】AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
链接:https://arxiv.org/abs/2604.21766
作者:Tasnim Kabir,Dmytro Kurdydyk,Aadi Palnitkar,Liam Dorn,Ahmed Haj Ahmed,Jordan Lee Boyd-Graber
类目:Computation and Language (cs.CL)
关键词:Internet Trivia Authors, Diverse Internet Trivia, Understanding from Diverse, Diverse Internet, surface-level acoustic recognition
备注:
点击查看摘要
Abstract:Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. Human average accuracy of 32.13% shows both the challenge of the task while demonstrating meaningful comprehension of the audio. In stark contrast, state of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency, question difficulty, and expose systematic deficiencies of the models and data.
15. 【2604.21751】Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs
链接:https://arxiv.org/abs/2604.21751
作者:Joseba Fernandez de Landa,Carla Perez-Almendros,Jose Camacho-Collados
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:Western and Anglocentric, Anglocentric viewpoints, amplifying Western, coverage and competence, showing limitations
备注:
点击查看摘要
Abstract:LLMs have been showing limitations when it comes to cultural coverage and competence, and in some cases show regional biases such as amplifying Western and Anglocentric viewpoints. While there have been works analysing the cultural capabilities of LLMs, there has not been specific work on highlighting LLM regional preferences when it comes to cultural-related questions. In this work, we propose a new dataset based on a comprehensive taxonomy of Culture-Related Open Questions (CROQ). The results show that, contrary to previous cultural bias work, LLMs show a clear tendency towards countries such as Japan. Moveover, our results show that when prompting in languages such as English or other high-resource ones, LLMs tend to provide more diverse outputs and show less inclinations towards answering questions highlighting countries for which the input language is an official language. Finally, we also investigate at which point of LLM training this cultural bias emerges, with our results suggesting that the first clear signs appear after supervised fine-tuning, and not during pre-training.
16. 【2604.21748】StructMem: Structured Memory for Long-Horizon Behavior in LLMs
链接:https://arxiv.org/abs/2604.21748
作者:Buqiang Xu,Yijun Chen,Jizhan Fang,Ruobin Zhong,Yunzhi Yao,Yuqi Zhu,Lun Du,Shumin Deng
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:Long-term conversational agents, multi-hop question answering, Long-term conversational, support temporal reasoning, relationships between events
备注: Accepted by ACL 2026 main conference
点击查看摘要
Abstract:Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose \textbf{StructMem}, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on \texttt{LoCoMo}, while substantially reducing token usage, API calls, and runtime compared to prior memory systems, see this https URL .
17. 【2604.21725】AEL: Agent Evolving Learning for Open-Ended Environments
链接:https://arxiv.org/abs/2604.21725
作者:Wujiang Xu,Jiaojiao Han,Minghao Guo,Kai Mei,Xi Zhu,Han Zhang,Dimitris N. Metaxas
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
关键词:LLM agents increasingly, open-ended environments spanning, environments spanning hundreds, agents increasingly operate, remain largely stateless
备注:
点击查看摘要
Abstract:LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not \emph{what} to remember but \emph{how to use} what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce \emph{Agent Evolving Learning} (\ael{}), a two-timescale framework that addresses this obstacle. At the fast timescale, a Thompson Sampling bandit learns which memory retrieval policy to apply at each episode; at the slow timescale, LLM-driven reflection diagnoses failure patterns and injects causal insights into the agent's decision prompt, giving it an interpretive frame for the evidence it retrieves. On a sequential portfolio benchmark (10 sector-diverse tickers, 208 episodes, 5 random seeds), \ael{} achieves a Sharpe ratio of 2.13$\pm$0.47, outperforming five published self-improving methods and all non-LLM baselines while maintaining the lowest variance among all LLM-based approaches. A nine-variant ablation reveals a ``less is more'' pattern: memory and reflection together produce a 58\% cumulative improvement over the stateless baseline, yet every additional mechanism we test (planner evolution, per-tool selection, cold-start initialization, skill extraction, and three credit assignment methods) \emph{degrades} performance. This demonstrates that the bottleneck in agent self-improvement is \emph{self-diagnosing how to use} experience rather than adding architectural complexity. Code and data: this https URL.
18. 【2604.21724】Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling
链接:https://arxiv.org/abs/2604.21724
作者:Yilong Chen,Yanxi Xie,Zitian Gao,He Xin,Yihao Xiao,Renbiao Liu,Haoming Luo,Yifan Luo,Zhengmao Ye,Tingwen Liu,Xin Zhao,Ran Tao,Bryan Dai
类目:Computation and Language (cs.CL)
关键词:Large token-indexed lookup, poor parameter efficiency, Large token-indexed, compute-decoupled scaling path, rapid memory growth
备注: 29 pages, 9 figures, 13 tables
点击查看摘要
Abstract:Large token-indexed lookup tables provide a compute-decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under-training of the long tail, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. To address this, we propose X-GRAM, a frequency-aware dynamic token-injection framework. X-GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n-gram features. These signals are integrated into attention value streams and inter-layer residuals using depth-aware gating, effectively aligning static memory with dynamic context. This design introduces a memory-centric scaling axis that decouples model capacity from FLOPs. Extensive evaluations at the 0.73B and 1.15B scales show that X-GRAM improves average accuracy by as much as 4.4 points over the vanilla backbone and 3.2 points over strong retrieval baselines, while using substantially smaller tables in the 50% configuration. Overall, by decoupling capacity from compute through efficient memory management, X-GRAM offers a scalable and practical paradigm for future memory-augmented architectures. Code aviliable in this https URL.
19. 【2604.21718】Building a Precise Video Language with Human-AI Oversight
链接:https://arxiv.org/abs/2604.21718
作者:Zhiqiu Lin,Chancharik Mitra,Siyuan Cen,Isaac Li,Yuhan Huang,Yu Tong Tiffany Ling,Hewei Wang,Irene Pi,Shihang Zhu,Ryan Rao,George Liu,Jiaxi Li,Ruojin Li,Yili Han,Yilun Du,Deva Ramanan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:dynamic visual world, Video-language models, learn to reason, natural language, world through natural
备注: CVPR 2026 Highlight. Project page: [this https URL](https://linzhiqiu.github.io/papers/chai/)
点击查看摘要
Abstract:Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: this https URL
20. 【2604.21716】From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation
链接:https://arxiv.org/abs/2604.21716
作者:Minh Duc Bui,Xenia Heilmann,Mattia Cerrato,Manuel Mager,Katharina von der Wense
类目:Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:Prior work evaluates, reveal solely overt, work evaluates code, evaluates code generation, Prior work
备注: Accepted to ACL 2026 Findings
点击查看摘要
Abstract:Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including "race" while dropping "favorite color" for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across prompt mitigation strategies, varying numbers of attributes, and different pipeline difficulty levels. Our results challenge simple conditionals as valid proxies for bias evaluation and suggest current benchmarks underestimate bias risk in practical deployments.
21. 【2604.21706】Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers
链接:https://arxiv.org/abs/2604.21706
作者:Bernard Muller,Antonio Armando Ortiz Barrañón,LaVonne Roberts
类目:Computation and Language (cs.CL)
关键词:self-supervised speech representations, frozen self-supervised speech, severity assessment based, speech representations, previously introduced
备注: Submitted to Computer Speech Language
点击查看摘要
Abstract:We previously introduced a training-free method for dysarthria severity assessment based on d-prime separability of phonological feature subspaces in frozen self-supervised speech representations, validated on 890 speakers across 5 languages with HuBERT-base. Here, we scale the analysis to 3,374 speakers from 25 datasets spanning 12 languages and 5 aetiologies (Parkinson's disease, cerebral palsy, ALS, Down syndrome, and stroke), plus healthy controls, using 6 SSL backbones. We report three findings. First, aetiology-specific degradation profiles are distinguishable at the group level: 10 of 13 features yield large effect sizes (epsilon-squared 0.14, Holm-corrected p 0.001), with Parkinson's disease separable from the articulatory execution group at Cohen's d = 0.83; individual-level classification remains limited (22.6% macro F1). Second, profiles show cross-lingual profile-shape stability: cosine similarity of 5-dimensional consonant d-prime profiles exceeds 0.95 across the languages available for each aetiology. Absolute d-prime magnitudes are not cross-lingually calibrated, so the method supports language-independent phenotyping of degradation patterns but requires within-corpus calibration for absolute severity interpretation. Third, the method is architecture-independent: all 6 backbones produce monotonic severity gradients with inter-model agreement exceeding rho = 0.77. Fixed-token d-prime estimation preserves the severity correlation (rho = -0.733 at 200 tokens per class), confirming that the signal is not a token-count artefact. These results support phonological subspace analysis as a robust, training-free framework for aetiology-aware dysarthria characterisation, with evidence of cross-lingual profile-shape stability and cross-backbone robustness in the represented sample.
22. 【2604.21700】Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
链接:https://arxiv.org/abs/2604.21700
作者:Jiali Wei,Ming Fan,Guoheng Sun,Xicheng Zhang,Haijun Wang,Ting Liu
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:raised urgent concerns, large language models, growing application, application of large, large language
备注:
点击查看摘要
Abstract:The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.
23. 【2604.21698】Fixation Sequences as Time Series: A Topological Approach to Dyslexia Detection
链接:https://arxiv.org/abs/2604.21698
作者:Marius Huber,David R. Reich,Lena A. Jäger
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Algebraic Topology (math.AT)
关键词:extracts robust, time series, Persistent homology, features, Copenhagen Corpus
备注: ETRA 2026
点击查看摘要
Abstract:Persistent homology, a method from topological data analysis, extracts robust, multi-scale features from data. It produces stable representations of time series by applying varying thresholds to their values (a process known as a \textit{filtration}). We develop novel filtrations for time series and introduce topological methods for the analysis of eye-tracking data, by interpreting fixation sequences as time series, and constructing ``hybrid models'' that combine topological features with traditional statistical features. We empirically evaluate our method by applying it to the task of dyslexia detection from eye-tracking-while-reading data using the Copenhagen Corpus, which contains scanpaths from dyslexic and non-dyslexic L1 and L2 readers. Our hybrid models outperform existing approaches that rely solely on traditional features, showing that persistent homology captures complementary information encoded in fixation sequences. The strength of these topological features is further underscored by their achieving performance comparable to established baseline methods. Importantly, our proposed filtrations outperform existing ones.
24. 【2604.21667】Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales
链接:https://arxiv.org/abs/2604.21667
作者:Olufunke O. Sarumi,Charles Welch,Daniel Braun
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:exploring disaggregated labels, User Passport mechanism, representation-level User Passport, exploring disaggregated, User Passport
备注: Accepted at 5th NLPerspectives Workshop
点击查看摘要
Abstract:Beyond exploring disaggregated labels for modeling perspectives, annotator rationales provide fine-grained signals of individual perspectives. In this work, we propose a framework for jointly modeling annotator-specific label prediction and corresponding explanations, fine-tuned on the annotators' provided rationales. Using a dataset with disaggregated natural language inference (NLI) annotations and annotator-provided explanations, we condition predictions on both annotator identity and demographic metadata through a representation-level User Passport mechanism. We further introduce two explainer architectures: a post-hoc prompt-based explainer and a prefixed bridge explainer that transfers annotator-conditioned classifier representations directly into a generative model. This design enables explanation generation aligned with individual annotator perspectives. Our results show that incorporating explanation modeling substantially improves predictive performance over a baseline annotator-aware classifier, with the prefixed bridge approach achieving more stable label alignment and higher semantic consistency, while the post-hoc approach yields stronger lexical similarity. These findings indicate that modeling explanations as expressions of fine-grained perspective provides a richer and more faithful representation of disagreement. The proposed approaches advance perspectivist modeling by integrating annotator-specific rationales into both predictive and generative components.
25. 【2604.21649】GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion
链接:https://arxiv.org/abs/2604.21649
作者:Qizhuo Xie,Yunhui Liu,Yu Xing,Qianzi Hou,Xudong Jin,Tao Zheng,Tieke He
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:shown immense potential, Knowledge Graph Completion, Large Language Models, LLM tokens remains, continuous graph embeddings
备注: ACL 2026
点击查看摘要
Abstract:Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS-Quant is grounded in the insight that entity representations should follow a linguistic coarse-to-fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS-Quant significantly outperforms existing text-based and embedding-based baselines. Our code is publicly available at this https URL.
26. 【2604.21637】Multilinguality at the Edge: Developing Language Models for the Global South
链接:https://arxiv.org/abs/2604.21637
作者:Lester James V. Miranda,Songbo Hu,Roi Reichart,Anna Korhonen
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:deployed determines, Global South, language models, prevent effective deployment, hardware constrained communities
备注:
点击查看摘要
Abstract:Where and how language models (LMs) are deployed determines who can benefit from them. However, there are several challenges that prevent effective deployment of LMs in non-English-speaking and hardware constrained communities in the Global South. We call this challenge the last mile: the intersection of multilinguality and edge deployment, where the goals are aligned but the technical requirements often compete. Studying these two fields together is both a need, as linguistically diverse communities often face the most severe infrastructure constraints, and an opportunity, as edge and multilingual NLP research remain largely siloed. To understand the state of the art and the challenges of combining the two areas, we survey 232 papers that tackle this problem across the language modelling pipeline, from data collection to development and deployment. We also discuss open questions and provide actionable recommendations for different stakeholders in the NLP ecosystem. Finally, we hope that this work contributes to the development of inclusive and equitable language technologies.
27. 【2604.21611】Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
链接:https://arxiv.org/abs/2604.21611
作者:Hao-Yuan Chen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:LLM reasoning, Verbal Process Supervision, chain depth, sample breadth, GPQA Diamond
备注:
点击查看摘要
Abstract:Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
28. 【2604.21593】Language as a Latent Variable for Reasoning Optimization
链接:https://arxiv.org/abs/2604.21593
作者:Linjuan Wu,Haoran Wei,Jialong Tang,Shuang Luo,Baosong Yang,Yongliang Shen,Weiming Lu
类目:Computation and Language (cs.CL)
关键词:reduce English-centric bias, LLMs reduce English-centric, surprising trend emerges, English-centric bias, reduce English-centric
备注: 17 pages, 7 figures, Under Reviewing
点击查看摘要
Abstract:As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and the best performance frequently occur when language is unconstrained, suggesting that multilinguality broadens the model's latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It generates polyglot preference data online under language-constrained and unconstrained conditions, optimizing the policy with respect to both answer accuracy and reasoning structure. Trained on only 18.1K multilingual math problems without chain-of-thought annotations, polyGRPO improves the base model (Qwen2.5-7B-Instruct) by 6.72% absolute accuracy on four English reasoning testset and 6.89% in their multilingual benchmark. Remarkably, it is the only method that surpasses the base LLM on English commonsense reasoning task (4.9%), despite being trained solely on math data-highlighting its strong cross-task generalization. Further analysis reveals that treating language as a latent variable expands the model's latent reasoning space, yielding consistent and generalizable improvements in reasoning performance.
29. 【2604.21590】AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use
链接:https://arxiv.org/abs/2604.21590
作者:Yuanjie Lyu,Chengyu Wang,Haonan Zheng,Yuanhao Yue,Junbing Yan,Ming Wang,Jun Huang
类目:Computation and Language (cs.CL)
关键词:Modern industrial applications, increasingly demand language, demand language models, Modern industrial, capable of multi-step
备注:
点击查看摘要
Abstract:Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: this https URL. Data synthesis and RL training code: this https URL. The data synthesis pipeline is also integrated into EasyDistill: this https URL.
30. 【2604.21564】Measuring Opinion Bias and Sycophancy via LLM-based Coercion
链接:https://arxiv.org/abs/2604.21564
作者:Rodrigo Nogueira,Giovana Kerche Bonás,Thales Sales Almeida,Andrea Roque,Ramon Pires,Hugo Abonizio,Thiago Laitz,Celio Larcher,Roseval Malaquias Junior,Marcos Piau
类目:Computation and Language (cs.CL)
关键词:Large language models, information people consume, Large language, language models increasingly, models increasingly shape
备注:
点击查看摘要
Abstract:Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side. We propose a method, released as the open-source llm-bias-bench, for discovering the opinions an LLM actually holds on contested topics under conditions that resemble real multi-turn interaction. The method pairs two complementary free-form probes. Direct probing asks for the model's opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion and engages the model in argumentative debate, letting bias leak through how it concedes, resists, or counter-argues. Three user personas (neutral, agree, disagree) collapse into a nine-way behavioral classification that separates persona-independent positions from persona-dependent sycophancy, and an auditable LLM judge produces verdicts with textual evidence. The first instantiation ships 38 topics in Brazilian Portuguese across values, scientific consensus, philosophy, and economic policy. Applied to 13 assistants, the method surfaces findings of practical interest: argumentative debate triggers sycophancy 2-3x more than direct questioning (median 50% to 79%); models that look opinionated under direct questioning often collapse into mirroring under sustained arguments; and attacker capability matters mainly when an existing opinion must be dislodged, not when the assistant starts neutral.
31. 【2604.21555】Finding Meaning in Embeddings: Concept Separation Curves
链接:https://arxiv.org/abs/2604.21555
作者:Paul Keuren,Marc Ponsen,Robert Ayoub Bagheri
类目:Computation and Language (cs.CL)
关键词:embedding techniques aim, encode key concepts, Sentence embedding techniques, Concept Separation Curves, vector space
备注: The code is open source and located on github at [this https URL](https://github.com/pkun-cbs/ConceptSeparationCurves) . Original conference paper
点击查看摘要
Abstract:Sentence embedding techniques aim to encode key concepts of a sentence's meaning in a vector space. However, the majority of evaluation approaches for sentence embedding quality rely on the use of additional classifiers or downstream tasks. These additional components make it unclear whether good results stem from the embedding itself or from the classifier's behaviour. In this paper, we propose a novel method for evaluating the effectiveness of sentence embedding methods in capturing sentence-level concepts. Our approach is classifier-independent, allowing for an objective assessment of the model's performance. The approach adopted in this study involves the systematic introduction of syntactic noise and semantic negations into sentences, with the subsequent quantification of their relative effects on the resulting embeddings. The visualisation of these effects is facilitated by Concept Separation Curves, which show the model's capacity to differentiate between conceptual and surface-level variations. By leveraging data from multiple domains, employing both Dutch and English languages, and examining sentence lengths, this study offers a compelling demonstration that Concept Separation Curves provide an interpretable, reproducible, and cross-model approach for evaluating the conceptual stability of sentence embeddings.
32. 【2604.21534】UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text
链接:https://arxiv.org/abs/2604.21534
作者:Darya Hryhoryeva,Amaia Zurinaga,Hamidreza Jamalabadi,Iryna Gurevych
类目:Computation and Language (cs.CL)
关键词:paper presents, pairwise Maximum Entropy, task requires modeling, Maximum Entropy, presents our system
备注: Accepted to SemEval 2026 (co-located with ACL 2026)
点击查看摘要
Abstract:This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics. Our system ranked first among participating teams in both Subtask 1 and Subtask 2A based on the official evaluation metric.
33. 【2604.21525】Job Skill Extraction via LLM-Centric Multi-Module Framework
链接:https://arxiv.org/abs/2604.21525
作者:Guojing Li(1 and 2),Zichuan Fu(1),Junyi Li(1),Faxue Liu(1),Wenxia Zhou(2),Yejing Wang(1),Jingtong Gao(1),Maolin Wang(1),Rungen Liu(1),Wenlin Zhang(1),Xiangyu Zhao(1) ((1) City University of Hong Kong, (2) Renmin University of China)
类目:Computation and Language (cs.CL)
关键词:Span-level skill extraction, job advertisements underpins, advertisements underpins candidate-job, underpins candidate-job matching, yield malformed spans
备注: 5 pages, 5 figures, 3 tables
点击查看摘要
Abstract:Span-level skill extraction from job advertisements underpins candidate-job matching and labor-market analytics, yet generative large language models (LLMs) often yield malformed spans, boundary drift, and hallucinations, especially with long-tail terms and cross-domain shift. We present SRICL, an LLM-centric framework that combines semantic retrieval (SR), in-context learning (ICL), and supervised fine-tuning (SFT) with a deterministic verifier. SR pulls in-domain annotated sentences and definitions from ESCO to form format-constrained prompts that stabilize boundaries and handle coordination. SFT aligns output behavior, while the verifier enforces pairing, non-overlap, and BIO legality with minimal retries. On six public span-labeled corpora of job-ad sentences across sectors and languages, SRICL achieves substantial STRICT-F1 improvements over GPT-3.5 prompting baselines and sharply reduces invalid tags and hallucinated spans, enabling dependable sentence-level deployment in low-resource, multi-domain settings.
34. 【2604.21523】Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
链接:https://arxiv.org/abs/2604.21523
作者:Mohammed Safi Ur Rahman Khan,Sanjay Suryanarayanan,Tushar Anand,Mitesh M. Khapra
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, Large Vision-Language, Vision-Language Models, visual question answering, Evaluator VLMs
备注:
点击查看摘要
Abstract:Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains under explored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs - in some cases exceeding 50%, struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.
35. 【2604.21511】From Tokens to Concepts: Leveraging SAE for SPLADE
链接:https://arxiv.org/abs/2604.21511
作者:Yuxuan Zong,Mathias Vast,Basile Van Cooten,Laure Soulier,Benjamin Piwowarski
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:excellent efficiency-effectiveness tradeoff, offer an excellent, efficiency-effectiveness tradeoff, excellent efficiency-effectiveness, Learned Sparse
备注: 11 pages, 3 figures, 9 tables. To appear at SIGIR 2025
点击查看摘要
Abstract:Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemicity and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these 2 concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models. Our experiments demonstrate that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.
36. 【2604.21510】OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
链接:https://arxiv.org/abs/2604.21510
作者:Xinyu Zhang,Boxuan Zhang,Yuchen Wan,Lingling Zhang,YiXing Yao,Bifan Wei,Yaqiang Wu,Jun Liu
类目:Computation and Language (cs.CL)
关键词:Large Language Models, demonstrate remarkable reasoning, Large Language, requiring domain knowledge, tasks remain challenging
备注:
点击查看摘要
Abstract:While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.
37. 【2604.21496】How English Print Media Frames Human-Elephant Conflicts in India
链接:https://arxiv.org/abs/2604.21496
作者:Bonala Sai Punith,Salveru Jayati,Garima Shakya,Shubham Kumar Nigam
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:expanding human settlements, human settlements force, settlements force elephants, Human-elephant conflict, contact with people
备注:
点击查看摘要
Abstract:Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025. Using a multi-model sentiment framework that combines long-context transformers, large language models, and a domain-specific Negative Elephant Portrayal Lexicon, we quantify sentiment, extract rationale sentences, and identify linguistic patterns that contribute to negative portrayals of elephants. Our findings reveal a dominance of fear-inducing and aggression-related language. Since the media framing can shape public attitudes toward wildlife and conservation policy, such narratives risk reinforcing public hostility and undermining coexistence efforts. By providing a transparent, scalable methodology and releasing all resources through an anonymized repository, this study highlights how Web-scale text analysis can support responsible wildlife reporting and promote socially beneficial media practices.
38. 【2604.21495】Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning
链接:https://arxiv.org/abs/2604.21495
作者:Hanjun Cho,Gahyun Yoo,Hanseong Kim,Jay-Yoon Lee
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:exhibits high in-domain, high in-domain accuracy, exhibits high, high in-domain, Numerical reasoning
备注: Accepted to TACL. This is a pre-MIT Press publication version
点击查看摘要
Abstract:Numerical reasoning over expert-domain tables often exhibits high in-domain accuracy but limited robustness to domain shift. Models trained with supervised fine-tuning (SFT) on specific datasets tend to rely on header-operation shortcuts rather than structural reasoning. We introduce TaNOS, a continual pre-training framework comprising three components: (i) header anonymization to reduce lexical memorization, (ii) operation sketches that provide minimal structural cues, and (iii) self-supervised pretraining that constructs correctness-guaranteed program-question pairs from given tables in a program-first manner. By decoupling domain semantics and numerical operation structure, TaNOS improves the transferability of numerical reasoning. Applied to an 8B instruction-tuned model, TaNOS achieves 80.13% execution accuracy on FinQA with only 10% train data, outperforming SFT baseline (73.97%) with full train data and proprietary models such as GPT-5, Gemini-2.5-Pro. Furthermore, in the domain-shift experiments, TaNOS displays nearly-negligible cross-domain gap (2pp) when standard SFT shows over 10pp gap. These results suggest that structural guidance with operation sketches, header-agnostic representations, and correctness-guaranteed self-supervision can improve the robustness of numerical reasoning across diverse expert-domain tables.
39. 【2604.21481】Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages
链接:https://arxiv.org/abs/2604.21481
作者:Srija Anand,Ashwin Sankar,Ishvinder Sethi,Aaditya Pareek,Kartik Rajput,Gaurav Yadav,Nikhil Narasimhan,Adish Pandya,Deepon Halder,Mohammed Safi Ur Rahman Khan,Praveen S V,Shobhit Banga,Mitesh M Khapra
类目:Computation and Language (cs.CL)
关键词:Crowdsourced pairwise evaluation, assessing foundation models, Crowdsourced pairwise, scalable approach, approach for assessing
备注:
点击查看摘要
Abstract:Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text to Speech(TTS) introduces high variance due to linguistic diversity and multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
40. 【2604.21469】Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection
链接:https://arxiv.org/abs/2604.21469
作者:Fariz Ikhwantri,Dusica Marijan
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:challenging task due, regulatory compliance remains, legal texts, remains a challenging, complexity and variability
备注: 10 pages, 5 figures, 4 tables. 11th Special Session on Intelligent Data Mining, 2025 IEEE International Conference on Big Data
点击查看摘要
Abstract:Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis's cross-entropy difference, importance weighting, and embedding-based retrieval. We systematically vary the proportion of selected data to analyse its effect on cross-domain adaptation. Our findings demonstrate that targeted data selection substantially reduces negative transfer, offering a practical path toward scalable and reliable compliance automation across heterogeneous regulations.
41. 【2604.21454】Reasoning Primitives in Hybrid and Non-Hybrid LLMs
链接:https://arxiv.org/abs/2604.21454
作者:Shivam Rawat,Lucie Flek,Florian Mai,Nicholas Kluge Corrêa
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, monolithic capability, basic operations, large language, observed gains
备注:
点击查看摘要
Abstract:Reasoning in large language models is often treated as a monolithic capability, but its observed gains may arise from more basic operations. We study reasoning through two such primitives, recall and state-tracking, and ask whether hybrid architectures that combine attention-based retrieval with recurrent state updates are better suited than attention-only models for tasks that jointly require both. Using matched Olmo3 transformer and hybrid models in instruction-tuned and reasoning-augmented variants, we evaluate these models on a set of controlled tasks involving a mixture of state-tracking and recall primitives, state-based recall. Across tasks, we notice that reasoning augmentation provides the largest overall improvement, substantially extending the range of difficulty over which models remain effective. We also notice that in certain tasks, the hybrid reasoning model remains substantially more robust as sequential dependence increases. In contrast, the transformer reasoning model degrades sharply in performance as task difficulty increases beyond a given threshold. These results suggest that reasoning tokens and architectural inductive biases contribute at different levels of the computational process: explicit reasoning can expand a model's effective operating range, but its benefit depends on how well the underlying architecture supports persistent state propagation. Given the small size of our case study, which involves a limited set of models and tasks, we present these findings as suggestive rather than conclusive and leave broader validation across model families, scales, and task variations to future work.
42. 【2604.21446】AI-Gram: When Visual Agents Interact in a Social Network
链接:https://arxiv.org/abs/2604.21446
作者:Andrew Shin
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
关键词:enabling image-based interactions, live platform enabling, platform enabling image-based, fully autonomous multi-agent, image-based interactions
备注:
点击查看摘要
Abstract:We present AI-Gram, a live platform enabling image-based interactions, to study social dynamics in a fully autonomous multi-agent visual network where all participants are LLM-driven agents. Using the platform, we conduct experiments on how agents communicate and adapt through visual media, and observe the spontaneous emergence of visual reply chains, indicating rich communicative structure. At the same time, agents exhibit aesthetic sovereignty resisting stylistic convergence toward social partners, anchoring under adversarial influence, and a decoupling between visual similarity and social ties. These results reveal a fundamental asymmetry in current agent architectures: strong expressive communication paired with a steadfast preservation of individual visual identity. We release AI-Gram as a publicly accessible, continuously evolving platform for studying social dynamics in Al-native multi-agent systems. this https URL
43. 【2604.21428】Decoupled DiLoCo for Resilient Distributed Pre-training
链接:https://arxiv.org/abs/2604.21428
作者:Arthur Douillard,Keith Rush,Yani Donchev,Zachary Charles,Nova Fallen,Ayush Dubey,Ionel Gog,Josef Dean,Blake Woodworth,Zachary Garrett,Nate Keating,Jenny Bishop,Henry Prior,Edouard Yvinec,Arthur Szlam,Marc'Aurelio Ranzato,Jeff Dean
类目:Computation and Language (cs.CL)
关键词:Modern large-scale language, pre-training relies heavily, requires tight coupling, Modern large-scale, program multiple data
备注:
点击查看摘要
Abstract:Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, which requires tight coupling across accelerators. Due to this coupling, transient slowdowns, hardware failures, and synchronization overhead stall the entire computation, wasting significant compute time at scale. While recent distributed methods like DiLoCo reduced communication bandwidth, they remained fundamentally synchronous and vulnerable to these system stalls. To address this, we introduce Decoupled DiLoCo, an evolution of the DiLoCo framework designed to break the lock-step synchronization barrier and go beyond SPMD to maximize training goodput. Decoupled DiLoCo partitions compute across multiple independent ``learners'' that execute local inner optimization steps. These learners asynchronously communicate parameter fragments to a central synchronizer, which circumvents failed or straggling learners by aggregating updates using a minimum quorum, an adaptive grace window, and dynamic token-weighted merging. Inspired by ``chaos engineering'', we achieve significantly improved training efficiency in failure-prone environments with millions of simulated chips with strictly zero global downtime, while maintaining competitive model performance across text and vision tasks, for both dense and mixture-of-expert architectures.
44. 【2604.21421】Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation
链接:https://arxiv.org/abs/2604.21421
作者:Michele Miranda,Xinlan Yan,Nishant Mishra,Rachel Murphy,Ameen Abu-Hanna,Sébastien Bratières,Iacer Calixto
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:GDPR and HIPAA, Protecting patient privacy, Protecting patient, narratives is essential, essential for enabling
备注:
点击查看摘要
Abstract:Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. While manual de-identification remains the gold standard, it is costly and slow, motivating the need for automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines employed named entity recognition (NER) to identify protected entities for redaction. Although methods based on differential privacy (DP) provide formal privacy guarantees, more recently also large language models (LLMs) are increasingly used for text de-identification in the clinical domain. In this work, we present the first comparative study of DP, NER, and LLMs for Dutch clinical text de-identification. We investigate these methods separately as well as hybrid strategies that apply NER or LLM preprocessing prior to DP, and assess performance in terms of privacy leakage and extrinsic evaluation (entity and relation classification). We show that DP mechanisms alone degrade utility substantially, but combining them with linguistic preprocessing, especially LLM-based redaction, significantly improves the privacy-utility trade-off.
45. 【2604.21380】Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation
链接:https://arxiv.org/abs/2604.21380
作者:Wang Shi Hai,Chen Tao
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:software performance requirements, software engineering, performance requirements, natural language, documented in natural
备注: 9 pages,accepted by ACL 2026
点击查看摘要
Abstract:Since software performance requirements are documented in natural language, quantifying them into mathematical forms is essential for software engineering. Yet, the vagueness in performance requirements and uncertainty of human cognition have caused highly uncertain ambiguity in the interpretations, rendering their automated quantification an unaddressed and challenging problem. In this paper, we formalize the problem and propose IRAP, an approach that quantifies performance requirements into mathematical functions via interactive retrieval-augmented preference elicitation. IRAP differs from the others in that it explicitly derives from problem-specific knowledge to retrieve and reason the preferences, which also guides the progressive interaction with stakeholders, while reducing the cognitive overhead. Experiment results against 10 state-of-the-art methods on four real-world datasets demonstrate the superiority of IRAP on all cases with up to 40x improvements under as few as five rounds of interactions.
46. 【2604.21375】VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
链接:https://arxiv.org/abs/2604.21375
作者:Qijun Han,Haoqin Tu,Zijun Wang,Haoyue Dai,Yiyang Zhou,Nancy Lau,Alvaro A. Cardenas,Yuhui Xu,Ran Xu,Caiming Xiong,Zeyu Zheng,Huaxiu Yao,Yuyin Zhou,Cihang Xie
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
关键词:Autonomous GUI agents, Autonomous GUI, GUI agents face, agents prematurely declare, prematurely declare success
备注: The first two authors contribute equally
点击查看摘要
Abstract:Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
47. 【2604.21370】MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization
链接:https://arxiv.org/abs/2604.21370
作者:Maziar Kianimoghadam Jouneghani
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:multilingual polarization detection, contrasting multilingual generalists, present a systematic, systematic study, polarization detection
备注: 9 pages, 9 tables. Accepted to the 20th International Workshop on Semantic Evaluation (SemEval-2026), Task 9
点击查看摘要
Abstract:We present a systematic study of multilingual polarization detection across 22 languages for SemEval-2026 Task 9 (Subtask 1), contrasting multilingual generalists with language-specific specialists and hybrid ensembles. While a standard generalist like XLM-RoBERTa suffices when its tokenizer aligns with the target text, it may struggle with distinct scripts (e.g., Khmer, Odia) where monolingual specialists yield significant gains. Rather than enforcing a single universal architecture, we adopt a language-adaptive framework that switches between multilingual generalists, language-specific specialists, and hybrid ensembles based on development performance. Additionally, cross-lingual augmentation via NLLB-200 yielded mixed results, often underperforming native architecture selection and degrading morphologically rich tracks. Our final system achieves an overall macro-averaged F1 score of 0.796 and an average accuracy of 0.826 across all 22 tracks. Code and final test predictions are publicly available at: this https URL.
48. 【2604.21365】mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code
链接:https://arxiv.org/abs/2604.21365
作者:Adam Skurla,Dominik Macko,Jakub Simko
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:Multi-domain detection, challenging task, programming languages, machine-generated code snippets, Multi-domain
备注:
点击查看摘要
Abstract:Multi-domain detection of the machine-generated code snippets in various programming languages is a challenging task. SemEval-2026 Task~13 copes with this challenge in various angles, as a binary detection problem as well as attribution of the source. Specifically, its subtasks also cover generator LLM family detection, as well as a hybrid code co-generated by humans and machines, or adversarially modified codes hiding its origin. Our submitted systems adjusted the existing mdok approach (focused on machine-generated text detection) to these specific kinds of problems by exploring various base models, more suitable for code understanding. The results indicate that the submitted systems are competitive in all three subtasks. However, the margins from the top-performing systems are significant, and thus further improvements are possible.
49. 【2604.21357】ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs
链接:https://arxiv.org/abs/2604.21357
作者:Jian Cui,Zhiyuan Ren,Desheng Weng,Yongqi Zhao,Gong Wenbin,Yu Lei,Zhenning Dong
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:including workflow complexity, traditional multi-stage approaches, vector similarity retrieval, geographic knowledge bases, structured geographic knowledge
备注: 12 pages, 8 figures, submitted to ACM SIGSPATIAL 2024 (under review)
点击查看摘要
Abstract:This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating the coordinate prediction task as a text generation problem, and introduces a Chain-of-Thought mechanism to enhance the model's reasoning over spatial relationships. Furthermore, reinforcement learning with a distance-deviation-based reward is applied to optimize the generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point predictions and effectively resolve vague relative location queries. In addition, the model demonstrates strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.
50. 【2604.21352】CARE: Counselor-Aligned Response Engine for Online Mental-Health Support
链接:https://arxiv.org/abs/2604.21352
作者:Hagai Astrin,Ayal Swaid,Avi Segal,Kobi Gal
类目:Computation and Language (cs.CL)
关键词:Mental health challenges, increasing worldwide, challenges are increasing, services and leading, Mental health
备注: 9 pages, 4 figures
点击查看摘要
Abstract:Mental health challenges are increasing worldwide, straining emotional support services and leading to counselor overload. This can result in delayed responses during critical situations, such as suicidal ideation, where timely intervention is essential. While large language models (LLMs) have shown strong generative capabilities, their application in low-resource languages, especially in sensitive domains like mental health, remains underexplored. Furthermore, existing LLM-based agents often struggle to replicate the supportive language and intervention strategies used by professionals due to a lack of training on large-scale, real-world datasets. To address this, we propose CARE (Counselor-Aligned Response Engine), a GenAI framework that assists counselors by generating real-time, psychologically aligned response recommendations. CARE fine-tunes open-source LLMs separately for Hebrew and Arabic using curated subsets of real-world crisis conversations. The training data consists of sessions rated as highly effective by professional counselors, enabling the models to capture interaction patterns associated with successful de-escalation. By training on complete conversation histories, CARE maintains the evolving emotional context and dynamic structure of counselor-help-seeker dialogue. In experimental settings, CARE demonstrates stronger semantic and strategic alignment with gold-standard counselor responses compared to non-specialized LLMs. These findings suggest that domain-specific fine-tuning on expert-validated data can significantly support counselor workflows and improve care quality in low-resource language contexts.
Comments:
9 pages, 4 figures
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2604.21352 [cs.CL]
(or
arXiv:2604.21352v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2604.21352
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
51. 【2604.21346】Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
链接:https://arxiv.org/abs/2604.21346
作者:Mohit Vaishnav,Tanel Tammet
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:main bottleneck lies, Bongard problems, language models, large language models, raising the question
备注:
点击查看摘要
Abstract:Vision--language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our \emph{Componential--Grammatical (C--G)} paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid--90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
52. 【2604.21345】Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline
链接:https://arxiv.org/abs/2604.21345
作者:Philip Zhong,Don Wang,Jason Zhang,Kent Chen
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:reusable evaluation pipeline, public artifact package, artifact package derived, Dataset Pipeline, summaries and released
备注: AI Application Feature Quality Evaluation (28 pages total)
点击查看摘要
Abstract:We present a reusable evaluation pipeline for generative AI applications, instantiated for AI meeting summaries and released with a public artifact package derived from a Dataset Pipeline. The system separates reusable orchestration from task-specific semantics across five stages: source intake, structured reference construction, candidate generation, structured scoring, and reporting. Unlike standalone claim scorers, it treats both ground truth and evaluator outputs as typed, persisted artifacts, enabling aggregation, issue analysis, and statistical testing. We benchmark the offline loop on a typed dataset of 114 meetings spanning city_council, private_data, and whitehouse_press_briefings, producing 340 meeting-model pairs and 680 judge runs across gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this protocol, gpt-4.1-mini achieves the highest mean accuracy (0.583), while gpt-5.1 leads in completeness (0.886) and coverage (0.942). Paired sign tests with Holm correction show no significant accuracy winner but confirm significant retention gains for gpt-5.1. A typed DeepEval contrastive baseline preserves retention ordering but reports higher holistic accuracy, suggesting that reference-based scoring may overlook unsupported-specifics errors captured by claim-grounded evaluation. Typed analysis identifies whitehouse_press_briefings as an accuracy-challenging domain with frequent unsupported specifics. A deployment follow-up shows gpt-5.4 outperforming gpt-4.1 across all metrics, with statistically robust gains on retention metrics under the same protocol. The system benchmarks the offline loop and documents, but does not quantitatively evaluate, the online feedback-to-evaluation path.
Comments:
AI Application Feature Quality Evaluation (28 pages total)
Subjects:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:
arXiv:2604.21345 [cs.AI]
(or
arXiv:2604.21345v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.21345
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
53. 【2604.21344】Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts
链接:https://arxiv.org/abs/2604.21344
作者:Azher Ahmed Efat,Seok Hwan Song,Wallapak Tavanapong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:present complex information, complex information, present complex, Multimodal Language Models, multiple related charts
备注:
点击查看摘要
Abstract:Charts are widely used to present complex information. Deriving meaningful insights in real-world contexts often requires interpreting multiple related charts together. Research on understanding multi-chart images has not been extensively explored. We introduce PolyChartQA, a mid-scale dataset specifically designed for question answering over multi-chart images. PolyChartQA comprises 534 multi-chart images (with a total of 2,297 sub-charts) sourced from peer-reviewed computer science research publications and 2,694 QA pairs. We evaluate the performance of nine state-of-the-art Multimodal Language Models (MLMs) on PolyChartQA across question type, difficulty, question source, and key structural characteristics of multi-charts. Our results show a 27.4% LLM-based accuracy (L-Accuracy) drop on human-authored questions compared to MLM-generated questions, and a 5.39% L-accuracy gain with our proposed prompting method.
54. 【2604.21335】Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
链接:https://arxiv.org/abs/2604.21335
作者:Wei Jiang,Wei Wang
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:finer control axis, prior work, offers a finer, finer control, control axis
备注: 16 pages, 14 tables, 2 figures
点击查看摘要
Abstract:Sub-token routing offers a finer control axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. The motivation is that a relevant token need not be internally uniform: under a retention budget, preserved value groups are distributed unevenly both across tokens and within tokens, which suggests that KV compression need not be an all-or-nothing decision at token level. We study this fine-grained routing mechanism in two settings. For compression-aware language modeling, we introduce a query-independent design that combines routed subspace LoRA with value-group routing on the KV path. For downstream-task-preserving KV compression, we introduce a query-aware design in which a predictor-based selector allocates a global retention budget over context-token/value-group pairs using query-conditioned relevance. Experiments show that the query-independent design improves the quality-compression tradeoff for language modeling, while the query-aware design preserves downstream behavior under reduced KV budgets. We further examine the relation between token-level and sub-token-level query-aware routing, and show that they form complementary compression axes: token-level methods determine which tokens survive globally, while sub-token routing determines how the surviving tokens are compressed internally.
55. 【2604.21334】Ideological Bias in LLMs' Economic Causal Reasoning
链接:https://arxiv.org/abs/2604.21334
作者:Donggyu Lee,Hyeok Yun,Jungwon Kim,Junsik Min,Sungwon Park,Sangyoon Park,Jihee Kim
类目:Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG); General Economics (econ.GN)
关键词:large language models, large language, bias when reasoning, exhibit systematic ideological, systematic ideological bias
备注:
点击查看摘要
Abstract:Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases - instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.
56. 【2604.21327】Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
链接:https://arxiv.org/abs/2604.21327
作者:Yongcan Yu,Lingxiao He,Jian Liang,Kuangpu Guo,Meng Wang,Qianlong Xie,Xingxing Wang,Ran He
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Test-time reinforcement learning, reinforcement learning, time via pseudo-labeling, leaving it vulnerable, Denoised test-time Reinforcement
备注: Accepted to ACL 2026 Findings
点击查看摘要
Abstract:Test-time reinforcement learning (TTRL) always adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can be even amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at this https URL.
57. 【2604.21309】When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation
链接:https://arxiv.org/abs/2604.21309
作者:Nannan Huang,Iffat Maab,Junichi Yamagishi
类目:Computation and Language (cs.CL)
关键词:processing vast daily, political perspectives critical, daily news content, diverse political perspectives, Multi-document news summarisation
备注: Accepted to ACL 2026 Main Conference
点击查看摘要
Abstract:Multi-document news summarisation systems are increasingly adopted for their convenience in processing vast daily news content, making fairness across diverse political perspectives critical. However, these systems can exhibit political bias through unequal representation of viewpoints, disproportionate emphasis on certain perspectives, and systematic underrepresentation of minority voices. This study presents a comprehensive evaluation of such bias in multi-document news summarisation using FairNews, a dataset of complete news articles with political orientation labels, examining how large language models (LLMs) handle sources with varying political leanings across 13 models and five fairness metrics. We investigate both baseline model performance and effectiveness of various debiasing interventions, including prompt-based and judge-based approaches. Our findings challenge the assumption that larger models yield fairer outputs, as mid-sized variants consistently outperform their larger counterparts, offering the best balance of fairness and efficiency. Prompt-based debiasing proves highly model dependent, while entity sentiment emerges as the most stubborn fairness dimension, resisting all intervention strategies tested. These results demonstrate that fairness in multi-document news summarisation requires multi-dimensional evaluation frameworks and targeted, architecture-aware debiasing rather than simply scaling up.
58. 【2604.21308】CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents
链接:https://arxiv.org/abs/2604.21308
作者:Wenjie Fu,Xiaoting Qin,Jue Zhang,Qingwei Lin,Lukas Wutschitz,Robert Sim,Saravan Rajmohan,Dongmei Zhang
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:improve workplace productivity, dramatically improve workplace, Enterprise LLM agents, LLM agents, workplace productivity
备注:
点击查看摘要
Abstract:Enterprise LLM agents can dramatically improve workplace productivity, but their core capability, retrieving and using internal context to act on a user's behalf, also creates new risks for sensitive information leakage. We introduce CI-Work, a Contextual Integrity (CI)-grounded benchmark that simulates enterprise workflows across five information-flow directions and evaluates whether agents can convey essential content while withholding sensitive context in dense retrieval settings. Our evaluation of frontier models reveals that privacy failures are prevalent (violation rates range from 15.8%-50.9%, with leakage reaching up to 26.7%) and uncovers a counterintuitive trade-off critical for industrial deployment: higher task utility often correlates with increased privacy violations. Moreover, the massive scale of enterprise data and potential user behavior further amplify this vulnerability. Simply increasing model size or reasoning depth fails to address the problem. We conclude that safeguarding enterprise workflows requires a paradigm shift, moving beyond model-centric scaling toward context-centric architectures.
59. 【2604.21300】Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI
链接:https://arxiv.org/abs/2604.21300
作者:Hieu Man,Van-Cuong Pham,Nghia Trung Ngo,Franck Dernoncourt,Thien Huu Nguyen
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Variational Autoencoder, Authorship Variational Autoencoder, Explainable Authorship Variational, Learning robust representations, EAVAE
备注:
点击查看摘要
Abstract:Learning robust representations of authorial style is crucial for authorship attribution and AI-generated text detection. However, existing methods often struggle with content-style entanglement, where models learn spurious correlations between authors' writing styles and topics, leading to poor generalization across domains. To address this challenge, we propose Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation-by-design. EAVAE first pretrains style encoders using supervised contrastive learning on diverse authorship data, then finetunes with a Variational Autoencoder (VEA) architecture using separate encoders for style and content representations. Disentanglement is enforced through a novel discriminator that not only distinguishes whether pairs of style/content representations belong to the same or different authors/content sources, but also generates natural language explanation for their decision, simultaneously mitigating confounding information and enhancing interpretability. Extensive experiments demonstrate the effectiveness of EAVAE. On authorship attribution, we achieve state-of-the-art performance on various datasets, including Amazon Reviews, PAN21, and HRS. For AI-generated text detection, EAVAE excels in few-shot learning over the M4 dataset. Code and data repositories are available online\footnote{this https URL} \footnote{this https URL}.
60. 【2604.21286】Cross-Entropy Is Load-Bearing: A Pre-Registered Scope Test of the K-Way Energy Probe on Bidirectional Predictive Coding
链接:https://arxiv.org/abs/2604.21286
作者:Jon-Paul Cacioli
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:K-way energy probe, discriminative predictive coding, predictive coding networks, coding networks reduces, standard discriminative predictive
备注: 11 pages, 3 figures, 4 tables. Pre-registered on OSF ( [this https URL](https://osf.io/2kvsp) ). Code at [this https URL](https://github.com/synthiumjp/ima)
点击查看摘要
Abstract:Cacioli (2026) showed that the K-way energy probe on standard discriminative predictive coding networks reduces approximately to a monotone function of the log-softmax margin. The reduction rests on five assumptions, including cross-entropy (CE) at the output and effectively feedforward inference dynamics. This pre-registered study tests the reduction's sensitivity to CE removal using two conditions: standard PC trained with MSE instead of CE, and bidirectional PC (bPC; Oliviers, Tang Bogacz, 2025). Across 10 seeds on CIFAR-10 with a matched 2.1M-parameter backbone, we find three results. The negative result replicates on standard PC: the probe sits below softmax (Delta = -0.082, p 10^-6). On bPC the probe exceeds softmax across all 10 seeds (Delta = +0.008, p = 0.000027), though a pre-registered manipulation check shows that bPC does not produce materially greater latent movement than standard PC at this scale (ratio 1.6, threshold 10). Removing CE alone without changing inference dynamics halves the probe-softmax gap (Delta_MSE = -0.037 vs Delta_stdPC = -0.082). CE is a major empirically load-bearing component of the decomposition at this scale. CE training produces output logit norms approximately 15x larger than MSE or bPC training. A post-hoc temperature scaling ablation decomposes the probe-softmax gap into two components: approximately 66% is attributable to logit-scale effects removable by temperature rescaling, and approximately 34% reflects a scale-invariant ranking advantage of CE-trained representations. We use "metacognitive" operationally to denote Type-2 discrimination of a readout over its own Type-1 correctness, not to imply human-like introspective access.
61. 【2604.21284】Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture
链接:https://arxiv.org/abs/2604.21284
作者:Robin Dey,Panyanon Viradecha
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:large language models, requiring any LLM, LLM inference, organize long-term memory, method of loci
备注: 20 pages, 10 tables. Code and data at [this https URL](https://github.com/web3guru888/mempalace-scientific-analysis)
点击查看摘要
Abstract:MemPalace is an open-source AI memory system that applies the ancient method of loci (memory palace) spatial metaphor to organize long-term memory for large language models; launched in April 2026, it accumulated over 47,000 GitHub stars in its first two weeks and claims state-of-the-art retrieval performance on the LongMemEval benchmark (96.6% Recall@5) without requiring any LLM inference at write time. Through independent codebase analysis, benchmark replication, and comparison with competing systems, we find that MemPalace's headline retrieval performance is attributable primarily to its verbatim storage philosophy combined with ChromaDB's default embedding model (all-MiniLM-L6-v2), rather than to its spatial organizational metaphor per se -- the palace hierarchy (Wings-Rooms-Closets-Drawers) operates as standard vector database metadata filtering, an effective but well-established technique. However, MemPalace makes several genuinely novel contributions: (1) a contrarian verbatim-first storage philosophy that challenges extraction-based competitors, (2) an extremely low wake-up cost (approximately 170 tokens) through its four-layer memory stack, (3) a fully deterministic, zero-LLM write path enabling offline operation at zero API cost, and (4) the first systematic application of spatial memory metaphors as an organizing principle for AI memory systems. We also note that the competitive landscape is evolving rapidly, with Mem0's April 2026 token-efficient algorithm raising their LongMemEval score from approximately 49% to 93.4%, narrowing the gap between extraction-based and verbatim approaches. Our analysis concludes that MemPalace represents significant architectural insight wrapped in overstated claims -- a pattern common in rapidly adopted open-source projects where marketing velocity exceeds scientific rigor.
62. 【2604.21276】Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition
链接:https://arxiv.org/abs/2604.21276
作者:Srishti Ginjala,Eric Fosler-Lussier,Christopher W. Myers,Srinivasan Parthasarathy
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
关键词:critical question arises, text-derived priors make, models replace task-specific, large language models, language models replace
备注:
点击查看摘要
Abstract:As pretrained large language models replace task-specific decoders in speech recognition, a critical question arises: do their text-derived priors make recognition fairer or more biased across demographic groups? We evaluate nine models spanning three architectural generations (CTC with no language model, encoder-decoder with an implicit LM, and LLM-based with an explicit pretrained decoder) on about 43,000 utterances across five demographic axes (ethnicity, accent, gender, age, first language) using Common Voice 24 and Meta's Fair-Speech, a controlled-prompt dataset that eliminates vocabulary confounds. On clean audio, three findings challenge assumptions: LLM decoders do not amplify racial bias (Granite-8B has the best ethnicity fairness, max/min WER = 2.28); Whisper exhibits pathological hallucination on Indian-accented speech with a non-monotonic insertion-rate spike to 9.62% at large-v3; and audio compression predicts accent fairness more than LLM scale. We then stress-test these findings under 12 acoustic degradation conditions (noise, reverberation, silence injection, chunk masking) across both datasets, totaling 216 inference runs. Severe degradation paradoxically compresses fairness gaps as all groups converge to high WER, but silence injection amplifies Whisper's accent bias up to 4.64x by triggering demographic-selective hallucination. Under masking, Whisper enters catastrophic repetition loops (86% of 51,797 insertions) while explicit-LLM decoders produce 38x fewer insertions with near-zero repetition; high-compression audio encoding (Q-former) reintroduces repetition pathology even in LLM decoders. These results suggest that audio encoder design, not LLM scaling, is the primary lever for equitable and robust speech recognition.
63. 【2604.21265】Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training
链接:https://arxiv.org/abs/2604.21265
作者:Yoshinori Nomura
类目:Computation and Language (cs.CL)
关键词:accelerates language acquisition, language significantly accelerates, significantly accelerates language, significantly accelerates, language acquisition
备注: 17 pages, 3 figures
点击查看摘要
Abstract:We show that pre-training a Transformer on music before language significantly accelerates language acquisition. Using piano performances (MAESTRO dataset), a developmental pipeline -- music $\to$ poetry $\to$ prose -- yields a $17.5\%$ perplexity improvement over random initialization ($p 0.001$, 5 seeds), with music and poetry improving orthogonal model components (internal computation and embeddings, respectively). Convergence tests confirm that this is not a transient head start: at $d\!=\!64$, multi-seed validation (5 seeds) shows a persistent 5.5\% gap at plateau ($p = 0.017$), with the pipeline converging faster and to a lower loss in every run. Real music matches the transfer ceiling of synthetic patterns with one-third the data, and scaling experiments reveal that optimal pre-training data volume shifts with model capacity ($-3\% \to +3\% \to +6\%$ advantage of larger datasets from $d\!=\!16$ to $d\!=\!64$). Across the scales we study ($d\!\in\!\{16,32,64\}$, up to ${\sim}400$K parameters), these results suggest a capacity-dependent data curation principle and indicate that structured human creative outputs can provide an efficient pre-training substrate for small language models; stronger conclusions at modern pre-training scale will require substantially larger experiments.
64. 【2604.21255】When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
链接:https://arxiv.org/abs/2604.21255
作者:Chenghao Yang,Yuning Zhang,Zhoufutu Wen,Tao Gong,Jiaheng Liu,Qi Chu,Nenghai Yu
类目:Computation and Language (cs.CL)
关键词:progress of LLM, LLM agents, primary driver, rapid progress, Action Graph Similarity
备注: Accepted by ACL 2026 Main Conference
点击查看摘要
Abstract:Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model's autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: \textbf{Response Pattern Similarity (RPS)} for verbal alignment and \textbf{Action Graph Similarity (AGS)} for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on $\tau$-Bench and $\tau^2$-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6\% $S_{\text{node}}$ and 94.7\% $S_{\text{dep}}$, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson $r$ = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at this https URL.
65. 【2604.21254】Hyperloop Transformers
链接:https://arxiv.org/abs/2604.21254
作者:Abbas Zeitoun,Lucas Torroba-Hennigen,Yoon Kim
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:research generally aims, maximize model quality, model quality subject, architecture research generally, LLM architecture research
备注:
点击查看摘要
Abstract:LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory footprint, thus motivating parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter-efficiency of LLMs. Our architecture makes use of looped Transformers as a core primitive, which reuse Transformer layers across depth and are thus more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks--begin, middle, and end blocks--where each block itself consists of multiple Transformer layers, and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026), which expand the residual stream into matrix-valued residual streams. Hyper-connections are applied only after each loop, and therefore add minimal new parameters and compute cost. Across various model scales, we find that our Hyper-Connected Looped Transformer (Hyperloop Transformer) is able to outperform depth-matched Transformer and mHC Transformer baselines despite using approximately 50% fewer parameters. The outperformance persists through post-training weight quantization, thus positioning Hyperloop Transformers as an attractive architecture for memory-efficient language modeling.
66. 【2604.21253】Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation
链接:https://arxiv.org/abs/2604.21253
作者:Hanwen Gu,Chao Guo,Junle Wang,Wenda Xie,Yisheng Lv
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:producing monotonous scripts, contextual logical consistency, existing methods struggle, smooth character development, global narrative coherence
备注: Accepted to Findings of the Association for Computational Linguistics: ACL 2026
点击查看摘要
Abstract:While LLMs demonstrate remarkable fluency in narrative generation, existing methods struggle to maintain global narrative coherence, contextual logical consistency, and smooth character development, often producing monotonous scripts with structural fractures. To this end, we introduce PLOTTER, a framework that performs narrative planning on structural graph representations instead of the direct sequential text representations used in existing work. Specifically, PLOTTER executes the Evaluate-Plan-Revise cycle on the event graph and character graph. By diagnosing and repairing issues of the graph topology under rigorous logical constraints, the model optimizes the causality and narrative skeleton before complete context generation. Experiments demonstrate that PLOTTER significantly outperforms representative baselines across diverse narrative scenarios. These findings verify that planning narratives on structural graph representations-rather than directly on text-is crucial to enhance the long context reasoning of LLMs in complex narrative generation.
67. 【2604.21238】Unlocking the Power of Large Language Models for Multi-table Entity Matching
链接:https://arxiv.org/abs/2604.21238
作者:Yingkai Tang,Taoyu Su,Wenyuan Zhang,Xiaoyang Guo,Tingwen Liu
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:enabling simultaneous identification, Multi-table entity matching, addresses the limitations, unique identifiers, Multi-table entity
备注: Accepted by NLPCC 2025
点击查看摘要
Abstract:Multi-table entity matching (MEM) addresses the limitations of dual-table approaches by enabling simultaneous identification of equivalent entities across multiple data sources without unique identifiers. However, existing methods relying on pre-trained language models struggle to handle semantic inconsistencies caused by numerical attribute variations. Inspired by the powerful language understanding capabilities of large language models (LLMs), we propose a novel LLM-based framework for multi-table entity matching, termed LLM4MEM. Specifically, we first propose a multi-style prompt-enhanced LLM attribute coordination module to address semantic inconsistencies. Then, to alleviate the matching efficiency problem caused by the surge in the number of entities brought by multiple data sources, we develop a transitive consensus embedding matching module to tackle entity embedding and pre-matching issues. Finally, to address the issue of noisy entities during the matching process, we introduce a density-aware pruning module to optimize the quality of multi-table entity matching. We conducted extensive experiments on 6 MEM datasets, and the results show that our model improves by an average of 5.1% in F1 compared with the baseline model. Our code is available at this https URL.
68. 【2604.21235】Learning Dynamic Representations and Policies from Multimodal Clinical Time-Series with Informative Missingness
链接:https://arxiv.org/abs/2604.21235
作者:Zihan Liang,Ziwen Pan,Ruoxuan Xiong
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Methodology (stat.ME)
关键词:offering rich temporal, rich temporal information, offering rich, Multimodal clinical records, rich temporal
备注: Findings of ACL 2026 (30 pages)
点击查看摘要
Abstract:Multimodal clinical records contain structured measurements and clinical notes recorded over time, offering rich temporal information about the evolution of patient health. Yet these observations are sparse, and whether they are recorded depends on the patient's latent condition. Observation patterns also differ across modalities, as structured measurements and clinical notes arise under distinct recording processes. While prior work has developed methods that accommodate missingness in clinical time series, how to extract and use the information carried by the observation process itself remains underexplored. We therefore propose a patient representation learning framework for multimodal clinical time series that explicitly leverages informative missingness. The framework combines (1) a multimodal encoder that captures signals from structured and textual data together with their observation patterns, (2) a Bayesian filtering module that updates a latent patient state over time from observed multimodal signals, and (3) downstream modules for offline treatment policy learning and patient outcome prediction based on the learned patient state. We evaluate the framework on ICU sepsis cohorts from MIMIC-III, MIMIC-IV, and eICU. It improves both offline treatment policy learning and adverse outcome prediction, achieving FQE 0.679 versus 0.528 for clinician behavior and AUROC 0.886 for post-72-hour mortality prediction on MIMIC-III.
69. 【2604.21229】EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
链接:https://arxiv.org/abs/2604.21229
作者:Julian Acuna
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language model, Large language, language model assistants, assistants are increasingly, increasingly expected
备注: 9 pages, 2 figures, 3 tables
点击查看摘要
Abstract:Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama's cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.
70. 【2604.20996】AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models
链接:https://arxiv.org/abs/2604.20996
作者:Tadesse Destaw Belay,Shahriar Kabir Nahin,Israel Abebe Azime,Ocean Monjur,Shamsuddeen Hassan Muhammad,Seid Muhie Yimam,Anshuman Chhabra
类目:Computation and Language (cs.CL)
关键词:lack sufficient training, lack sufficient, Direct Preference Optimization, language learning systems, sufficient training resources
备注:
点击查看摘要
Abstract:How can language learning systems be developed for languages that lack sufficient training resources? This challenge is increasingly faced by developers across the African continent who aim to build AI systems capable of understanding and responding in local languages. To address this gap, we introduce AFRILANGDICT, a collection of 194.7K African language-English dictionary entries designed as seed resources for generating language-learning materials, enabling us to automatically construct large-scale, diverse, and verifiable student-tutor question-answer interactions suitable for training AI-assisted language tutors. Using AFRILANGDICT, we build AFRILANGEDU, a dataset of 78.9K multi-turn training examples for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Using AFRILANGEDU, we train language tutoring models collectively referred to as AFRILANGTUTOR. We fine-tune two multilingual LLMs: Llama-3-8B-IT and Gemma-3-12B-IT on AFRILANGEDU across 10 African languages and evaluate their performance. Our results show that models trained on AFRILANGEDU consistently outperform their base counterparts, and combining SFT and DPO yields substantial improvements, with gains ranging from 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. To facilitate further research on low-resource languages -- all resources are available at this https URL.
71. 【2604.20995】Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models
链接:https://arxiv.org/abs/2604.20995
作者:Inderjeet Nair,Jie Ruan,Lu Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:tools remain limited, poorly understood phenomenon, current diagnostic tools, diagnostic tools remain, model behaves aligned
备注: Under submission at COLM 2026 Won the Best Student Paper Award at MSLD 2026 @ UIUC
点击查看摘要
Abstract:Alignment faking, where a model behaves aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in part because current diagnostic tools remain limited. Prior diagnostics rely on highly toxic and clearly harmful scenarios, causing most models to refuse immediately. As a result, models never deliberate over developer policy, monitoring conditions, or the consequences of non-compliance, making these diagnostics fundamentally unable to detect alignment faking propensity. To support study of this phenomenon, we first introduce VLAF, a diagnostic framework grounded in the hypothesis that alignment faking is most likely when developer policy conflicts with a model's strongly held values. VLAF uses morally unambiguous scenarios to probe this conflict across diverse moral values, bypassing refusal behavior while preserving meaningful deliberative stakes. Using VLAF, we find that alignment faking is substantially more prevalent than previously reported, occurring in models as small as 7B parameters - with olmo2-7b-instruct faking alignment in 37% of this http URL, we show that oversight conditions induce activation shifts that lie along a single direction in representation space. This means the behavioral divergence driving alignment faking can be captured by a single contrastive steering vector, which we exploit for lightweight inference-time mitigation. Finally, we exploit this for mitigation that requires no labeled data and minimal computational overhead, achieving relative reductions in alignment faking of 85.8%, 94.0%, and 57.7% on olmo2-7b-instruct, olmo2-13b-instruct, and qwen3-8b respectively.
72. 【2604.20994】Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
链接:https://arxiv.org/abs/2604.20994
作者:Yannis Belkhiter,Giulio Zizzo,Sergio Maffeis,Seshu Tirupathi,John D. Kelleher
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:calling Large Language, Large Language Models, Large Language, drawn significant attention, function calling Large
备注:
点击查看摘要
Abstract:The growth of agentic AI has drawn significant attention to function calling Large Language Models (LLMs), which are designed to extend the capabilities of AI-powered system by invoking external functions. Injection and jailbreaking attacks have been extensively explored to showcase the vulnerabilities of LLMs to user prompt manipulation. The expanded capabilities of agentic models introduce further vulnerabilities via their function calling interface. Recent work in LLM security showed that function calling can be abused, leading to data tampering and theft, causing disruptive behavior such as endless loops, or causing LLMs to produce harmful content in the style of jailbreaking attacks. This paper introduces a novel function hijacking attack (FHA) that manipulates the tool selection process of agentic models to force the invocation of a specific, attacker-chosen function. While existing attacks focus on semantic preference of the model for function-calling tasks, we show that FHA is largely agnostic to the context semantics and robust to the function sets, making it applicable across diverse domains. We further demonstrate that FHA can be trained to produce universal adversarial functions, enabling a single attacked function to hijack tool selection across multiple queries and payload configurations. We conducted experiments on 5 different models, including instructed and reasoning variants, reaching 70% to 100% ASR over the established BFCL dataset. Our findings further demonstrate the need for strong guardrails and security modules for agentic systems.
73. 【2604.20983】hinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
链接:https://arxiv.org/abs/2604.20983
作者:Syed Nazmus Sakib,Nafiul Haque,Shahrear Bin Amin,Hasan Muhammad Abdullah,Md. Mehedi Hasan,Mohammad Zabed Hossain,Shifat E. Arman
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Vision evaluations, Vision, multi-step processes, visual, Multimodal Large Language
备注: Accepted at ACL 2026 Findings
点击查看摘要
Abstract:Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.
74. 【2604.20917】he Path Not Taken: Duality in Reasoning about Program Execution
链接:https://arxiv.org/abs/2604.20917
作者:Eshgin Hasanov,Md Mahadi Hassan Sibat,Santu Karmaker,Aashish Yadavally
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)
关键词:Large language models, shown remarkable capabilities, Large language, diverse coding tasks, shown remarkable
备注: Accepted to ACL 2026 Main Conference
点击查看摘要
Abstract:Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their adoption requires a true understanding of program execution rather than relying on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program's observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model's causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.
75. 【2604.20915】Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
链接:https://arxiv.org/abs/2604.20915
作者:Zhixin Zhang,Shabo Zhang,Chengcan Wu,Zeming Wei,Meng Sun
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE); Optimization and Control (math.OC)
关键词:high computational cost, long streams prohibited, Transformers suffer, length for self-attention, high computational
备注:
点击查看摘要
Abstract:Transformers suffer from a high computational cost that grows with sequence length for self-attention, making inference in long streams prohibited by memory consumption. Constant-memory alternatives such as RNNs and SSMs compress history into states with fixed size and thus lose long-tail dependencies, while methods that memorize contexts into parameters, such as Test-Time Training (TTT), are prone to overfitting token-level projection and fail to preserve the causal effect of context in pretrained LLMs. We propose Absorber LLM, which formulates long-context retention as a self-supervised causal synchronization: after absorbing historical contexts into parameters, a contextless model should match the original model with full context on future generations. We optimize this objective by synchronizing internal behaviors of the updated model with the original one, ensuring context absorption and generalization. Experiments on long-context and streaming benchmarks show that Absorber LLM reduces inference memory and improves accuracy over prior parameter-as-memory baselines.
信息检索
1. 【2604.21750】Multistakeholder Impacts of Profile Portability in a Recommender Ecosystem
链接:https://arxiv.org/abs/2604.21750
作者:Anas Buhayh,Elizabeth McKinnie,Clement Canel,Robin Burke
类目:Information Retrieval (cs.IR)
关键词:Optimizing outcomes, developing multi-objective models, stakeholders in recommender, recommender systems, systems has historically
备注: 34th ACM Conference on User Modeling, Adaptation and Personalization
点击查看摘要
Abstract:Optimizing outcomes for multiple stakeholders in recommender systems has historically focused on algorithmic interventions, such as developing multi-objective models or re-ranking results from existing algorithms. However, structural changes to the recommendation ecosystem itself remain understudied. This paper explores the implications of algorithmic pluralism (also known as "middleware" in the governance literature), in which recommendation algorithms are decoupled from platforms, enabling users to select their preferred algorithm. Prior simulation work demonstrates that algorithmic choice benefits niche consumers and providers. Yet this approach raises critical questions about user modeling in the context of data portability: when users switch algorithms, what happens to their data? Noting that multiple data portability regulations have emerged to strengthen user data ownership and control. We examine how such policies affect user models and stakeholders' outcomes in recommendation setting. Our findings reveal that data portability scenarios produce varying effects on user utility across different recommendation algorithms. We highlight key policy considerations and implications for designing equitable recommendation ecosystems.
2. 【2604.21748】StructMem: Structured Memory for Long-Horizon Behavior in LLMs
链接:https://arxiv.org/abs/2604.21748
作者:Buqiang Xu,Yijun Chen,Jizhan Fang,Ruobin Zhong,Yunzhi Yao,Yuqi Zhu,Lun Du,Shumin Deng
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:Long-term conversational agents, multi-hop question answering, Long-term conversational, support temporal reasoning, relationships between events
备注: Accepted by ACL 2026 main conference
点击查看摘要
Abstract:Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose \textbf{StructMem}, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on \texttt{LoCoMo}, while substantially reducing token usage, API calls, and runtime compared to prior memory systems, see this https URL .
3. 【2604.21694】Efficient Logic Gate Networks for Video Copy Detection
链接:https://arxiv.org/abs/2604.21694
作者:Katarzyna Fojcik
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:diverse visual distortions, Video copy detection, detection requires robust, requires robust similarity, robust similarity estimation
备注:
点击查看摘要
Abstract:Video copy detection requires robust similarity estimation under diverse visual distortions while operating at very large scale. Although deep neural networks achieve strong performance, their computational cost and descriptor size limit practical deployment in high-throughput systems. In this work, we propose a video copy detection framework based on differentiable Logic Gate Networks (LGNs), which replace conventional floating-point feature extractors with compact, logic-based representations. Our approach combines aggressive frame miniaturization, binary preprocessing, and a trainable LGN embedding model that learns both logical operations and interconnections. After training, the model can be discretized into a purely Boolean circuit, enabling extremely fast and memory-efficient inference. We systematically evaluate different similarity strategies, binarization schemes, and LGN architectures across multiple dataset folds and difficulty levels. Experimental results demonstrate that LGN-based models achieve competitive or superior accuracy and ranking performance compared to prior models, while producing descriptors several orders of magnitude smaller and delivering inference speeds exceeding 11k samples per second. These findings indicate that logic-based models offer a promising alternative for scalable and resource-efficient video copy detection.
4. 【2604.21675】Counterfactual Multi-task Learning for Delayed Conversion Modeling in E-commerce Sales Pre-Promotion
链接:https://arxiv.org/abs/2604.21675
作者:Xin Song,Kaiyuan Li,Jinxin Hu
类目:Information Retrieval (cs.IR)
关键词:e-commerce marketing strategies, modern e-commerce marketing, Sales promotions, stimulate product purchases, play a pivotal
备注: 6 pages, accepted by 49th International ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR'26)
点击查看摘要
Abstract:Sales promotions, as short-term incentives to stimulate product purchases, play a pivotal role in modern e-commerce marketing strategies. During promotional events, user behavior patterns exhibit distinct characteristics compared to regular periods. In the pre-promotion phase, users typically engage in product search and browsing without immediate purchases, adding items to carts in anticipation of promotional discounts. This behavior leads to delayed conversions, resulting in significantly lower conversion rates (CVR) before the promotion day. Although existing research has made progress in CVR prediction for promotion days using historical data, it largely overlooks the critical pre-promotion period. And delayed feedback modeling has been extensively studied, current approaches fail to account for the unique distribution shifts in conversion behavior before promotional events, where delayed conversions predominantly occur on the promotion day rather than over continuous time windows. To address these limitations, we propose the Counterfactual Multi-task Delayed Conversion Model (CM-DCM), which leverages historical pre-promotion data to enhance CVR prediction for both delayed and direct conversions. Our model incorporates three key innovations: (i) A multi-task architecture that jointly models direct and delayed conversions using historical pre-promotion data; (ii) A personalized user behavior gating module to mitigate data sparsity issues during brief pre-promotion periods; (iii) A counterfactual causal approach to model the transition probability from add-to-cart (ATC) to delayed conversion. Extensive experiments demonstrate that CM-DCM outperforms baselines in pre-promotion scenarios. Online A/B tests during major promotional events showed significant improvements in advertising revenue, delayed conversion GMV, and overall GMV, validating the effectiveness of our approach.
5. 【2604.21536】Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation
链接:https://arxiv.org/abs/2604.21536
作者:Nikita Severin,Danil Kartushov,Vladislav Urzhumov,Vladislav Kulikov,Oksana Konovalova,Alexey Grishanov,Anton Klenitskiy,Artem Fatkulin,Alexey Vasilev,Andrey Savchenko,Ilya Makarov
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:achieved significant success, modeling temporal user, capturing rich user, temporal user behavior, rich user semantics
备注: Accepted to ECIR 2026. 7 pages. This version of the contribution has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: [this http URL](http://dx.doi.org/10.1007/978-3-032-21300-6_42)
点击查看摘要
Abstract:Sequential recommender systems have achieved significant success in modeling temporal user behavior but remain limited in capturing rich user semantics beyond interaction patterns. Large Language Models (LLMs) present opportunities to enhance user understanding with their reasoning capabilities, yet existing integration approaches create prohibitive inference costs in real time. To address these limitations, we present a novel knowledge distillation method that utilizes textual user profile generated by pre-trained LLMs into sequential recommenders without requiring LLM inference at serving time. The resulting approach maintains the inference efficiency of traditional sequential models while requiring neither architectural modifications nor LLM fine-tuning.
6. 【2604.21511】From Tokens to Concepts: Leveraging SAE for SPLADE
链接:https://arxiv.org/abs/2604.21511
作者:Yuxuan Zong,Mathias Vast,Basile Van Cooten,Laure Soulier,Benjamin Piwowarski
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:excellent efficiency-effectiveness tradeoff, offer an excellent, efficiency-effectiveness tradeoff, excellent efficiency-effectiveness, Learned Sparse
备注: 11 pages, 3 figures, 9 tables. To appear at SIGIR 2025
点击查看摘要
Abstract:Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemicity and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these 2 concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models. Our experiments demonstrate that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.
7. 【2604.21305】WPGRec: Wavelet Packet Guided Graph Enhanced Sequential Recommendation
链接:https://arxiv.org/abs/2604.21305
作者:Peilin Liu,Zhiquan Ji,Gang Yan
类目:Information Retrieval (cs.IR)
关键词:model users' evolving, users' evolving interests, localized behavioral fluctuations, non-stationary interaction streams, Sequential recommendation aims
备注: Accepted to SIGIR 2026, 8 pages, 3 figures
点击查看摘要
Abstract:Sequential recommendation aims to model users' evolving interests from noisy and non-stationary interaction streams, where long-term preferences, short-term intents, and localized behavioral fluctuations may coexist across temporal scales. Existing frequency-domain methods mainly rely on either global spectral operations or filter-based wavelet processing. However, global spectral operations tend to entangle local transients with long-range dependencies, while filter-based wavelet pipelines may suffer from temporal misalignment and boundary artifacts during multi-scale decomposition and reconstruction. Moreover, collaborative signals from the user-item interaction graph are often injected through scale-inconsistent auxiliary modules, limiting the benefit of jointly modeling temporal dynamics and structural dependencies. To address these issues, we propose Wavelet Packet Guided Graph Enhanced Sequential Recommendation (WPGRec), a unified time-frequency and graph-enhanced framework that aligns multi-resolution temporal modeling with graph propagation at matching scales. WPGRec first applies a full-tree undecimated stationary wavelet packet transform to generate equal-length, shift-invariant subband sequences. It then performs subband-wise interaction-graph propagation to inject high-order collaborative information while preserving temporal alignment across resolutions. Finally, an energy- and spectral-flatness-aware gated fusion module adaptively aggregates informative subbands and suppresses noise-like components. Extensive experiments on four public benchmarks show that WPGRec consistently outperforms sequential and graph-based baselines, with particularly clear gains on sparse and behaviorally complex datasets, highlighting the effectiveness of band-consistent structure injection and adaptive subband fusion for sequential recommendation.
8. 【2604.21304】PAPERMIND: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs
链接:https://arxiv.org/abs/2604.21304
作者:Yanjun Zhao,Tianxin Wei,Jiaru Zou,Xuying Ning,Yuanchen Bei,Lingjie Chen,Simmi Rana,Wendy H. Yang,Hanghang Tong,Jingrui He
类目:Information Retrieval (cs.IR)
关键词:answering isolated questions, summarizing content, scientific, questions or summarizing, scientific papers requires
备注:
点击查看摘要
Abstract:Understanding scientific papers requires more than answering isolated questions or summarizing content. It involves an integrated reasoning process that grounds textual and visual information, interprets experimental evidence, synthesizes information across sources, and critically evaluates scientific claims. However, existing benchmarks typically assess these abilities in isolation, making it difficult to evaluate scientific paper understanding as a unified set of interacting cognitive abilities. In this work, we introduce PAPERMIND, a benchmark designed to evaluate integrated and agent-oriented scientific reasoning over research papers. PAPERMIND is constructed from real scientific papers across seven domains, including agriculture, biology, chemistry, computer science, medicine, physics, and economics. It comprises four complementary task families that collectively operationalize distinct cognitive facets of scientific paper reasoning, including multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across multiple tasks, PAPERMIND enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Extensive experiments on both opensource and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique. Our benchmark and dataset are available at https:// this http URL.
9. 【2604.21300】Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI
链接:https://arxiv.org/abs/2604.21300
作者:Hieu Man,Van-Cuong Pham,Nghia Trung Ngo,Franck Dernoncourt,Thien Huu Nguyen
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Variational Autoencoder, Authorship Variational Autoencoder, Explainable Authorship Variational, Learning robust representations, EAVAE
备注:
点击查看摘要
Abstract:Learning robust representations of authorial style is crucial for authorship attribution and AI-generated text detection. However, existing methods often struggle with content-style entanglement, where models learn spurious correlations between authors' writing styles and topics, leading to poor generalization across domains. To address this challenge, we propose Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation-by-design. EAVAE first pretrains style encoders using supervised contrastive learning on diverse authorship data, then finetunes with a Variational Autoencoder (VEA) architecture using separate encoders for style and content representations. Disentanglement is enforced through a novel discriminator that not only distinguishes whether pairs of style/content representations belong to the same or different authors/content sources, but also generates natural language explanation for their decision, simultaneously mitigating confounding information and enhancing interpretability. Extensive experiments demonstrate the effectiveness of EAVAE. On authorship attribution, we achieve state-of-the-art performance on various datasets, including Amazon Reviews, PAN21, and HRS. For AI-generated text detection, EAVAE excels in few-shot learning over the M4 dataset. Code and data repositories are available online\footnote{this https URL} \footnote{this https URL}.
10. 【2604.21284】Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture
链接:https://arxiv.org/abs/2604.21284
作者:Robin Dey,Panyanon Viradecha
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:large language models, requiring any LLM, LLM inference, organize long-term memory, method of loci
备注: 20 pages, 10 tables. Code and data at [this https URL](https://github.com/web3guru888/mempalace-scientific-analysis)
点击查看摘要
Abstract:MemPalace is an open-source AI memory system that applies the ancient method of loci (memory palace) spatial metaphor to organize long-term memory for large language models; launched in April 2026, it accumulated over 47,000 GitHub stars in its first two weeks and claims state-of-the-art retrieval performance on the LongMemEval benchmark (96.6% Recall@5) without requiring any LLM inference at write time. Through independent codebase analysis, benchmark replication, and comparison with competing systems, we find that MemPalace's headline retrieval performance is attributable primarily to its verbatim storage philosophy combined with ChromaDB's default embedding model (all-MiniLM-L6-v2), rather than to its spatial organizational metaphor per se -- the palace hierarchy (Wings-Rooms-Closets-Drawers) operates as standard vector database metadata filtering, an effective but well-established technique. However, MemPalace makes several genuinely novel contributions: (1) a contrarian verbatim-first storage philosophy that challenges extraction-based competitors, (2) an extremely low wake-up cost (approximately 170 tokens) through its four-layer memory stack, (3) a fully deterministic, zero-LLM write path enabling offline operation at zero API cost, and (4) the first systematic application of spatial memory metaphors as an organizing principle for AI memory systems. We also note that the competitive landscape is evolving rapidly, with Mem0's April 2026 token-efficient algorithm raising their LongMemEval score from approximately 49% to 93.4%, narrowing the gap between extraction-based and verbatim approaches. Our analysis concludes that MemPalace represents significant architectural insight wrapped in overstated claims -- a pattern common in rapidly adopted open-source projects where marketing velocity exceeds scientific rigor.
11. 【2604.21238】Unlocking the Power of Large Language Models for Multi-table Entity Matching
链接:https://arxiv.org/abs/2604.21238
作者:Yingkai Tang,Taoyu Su,Wenyuan Zhang,Xiaoyang Guo,Tingwen Liu
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:enabling simultaneous identification, Multi-table entity matching, addresses the limitations, unique identifiers, Multi-table entity
备注: Accepted by NLPCC 2025
点击查看摘要
Abstract:Multi-table entity matching (MEM) addresses the limitations of dual-table approaches by enabling simultaneous identification of equivalent entities across multiple data sources without unique identifiers. However, existing methods relying on pre-trained language models struggle to handle semantic inconsistencies caused by numerical attribute variations. Inspired by the powerful language understanding capabilities of large language models (LLMs), we propose a novel LLM-based framework for multi-table entity matching, termed LLM4MEM. Specifically, we first propose a multi-style prompt-enhanced LLM attribute coordination module to address semantic inconsistencies. Then, to alleviate the matching efficiency problem caused by the surge in the number of entities brought by multiple data sources, we develop a transitive consensus embedding matching module to tackle entity embedding and pre-matching issues. Finally, to address the issue of noisy entities during the matching process, we introduce a density-aware pruning module to optimize the quality of multi-table entity matching. We conducted extensive experiments on 6 MEM datasets, and the results show that our model improves by an average of 5.1% in F1 compared with the baseline model. Our code is available at this https URL.
12. 【2604.21019】Following the Eye-Tracking Evidence: Established Web-Search Assumptions Fail in Carousel Interfaces
链接:https://arxiv.org/abs/2604.21019
作者:Jingwei Kang,Maarten de Rijke,Harrie Oosterhuis
类目:Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
关键词:streaming media services, Carousel interfaces, interfaces, Carousel, de-facto standard
备注:
点击查看摘要
Abstract:Carousel interfaces have been the de-facto standard for streaming media services for over a decade. Yet, there has been very little research into user behavior with such interfaces, which thus remains poorly understood. Due to this lack of empirical research, previous work has assumed that behaviors established in single-list web-search interfaces, such as the F-pattern and the examination hypothesis, also apply to carousel interfaces, for instance when designing click models or evaluation metrics. We analyze a recently-released interaction and examination dataset resulting from an eye-tracking study performed on carousel interfaces to verify whether these assumptions actually hold. We find that (i)~the F-pattern holds only for vertical examination and not for horizontal swiping; additionally, we discover that, when conditioned on a click, user examination follows an L-pattern unique to carousel interfaces; (ii)~click-through-rates conditioned on examination indicate that the well-known examination hypothesis does not hold in carousel interfaces; and (iii)~contrary to the assumptions of previous work, users generally ignore carousel headings and focus directly on the content items. Our findings show that many user behavior assumptions, especially concerning examination patterns, do not transfer from web search interfaces to carousel recommendation settings. Our work shows that the field lacks a reliable foundation on which to build models of user behavior with these interfaces. Consequently, a re-evaluation of existing metrics and click models for carousel interfaces may be warranted.
Subjects:
Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
Cite as:
arXiv:2604.21019 [cs.IR]
(or
arXiv:2604.21019v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2604.21019
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
计算机视觉
1. 【2604.21931】Seeing Fast and Slow: Learning the Flow of Time in Videos
链接:https://arxiv.org/abs/2604.21931
作者:Yen-Siang Wu,Rundong Luo,Jingsen Zhu,Tao Tu,Ali Farhadi,Matthew Wallingford,Yu-Chiang Frank Wang,Steve Marschner,Wei-Chiu Ma
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
关键词:temporal, videos, time, video, Abstract
备注: Project page: [this https URL](https://seeing-fast-and-slow.github.io/)
点击查看摘要
Abstract:How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which tranforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.
2. 【2604.21926】Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
链接:https://arxiv.org/abs/2604.21926
作者:Hao-Yu Hsu,Tianhang Cheng,Jing Wen,Alexander G. Schwing,Shenlong Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:surrounding environments typically, environments typically relies, cameras pose persistent, pose persistent challenges, energy efficiency
备注: Project page: [this https URL](https://tianhang-cheng.github.io/IMU4D)
点击查看摘要
Abstract:Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.
3. 【2604.21921】Context Unrolling in Omni Models
链接:https://arxiv.org/abs/2604.21921
作者:Ceyuan Yang,Zhijie Lin,Yang Zhao,Fei Xiao,Hao He,Qi Zhao,Chaorui Deng,Kunchang Li,Zihan Ding,Yuwei Guo,Fuyun Wang,Fangqi Zhu,Xiaonan Nie,Shenhan Zhu,Shanchuan Lin,Hongsheng Li,Weilin Huang,Guang Shi,Haoqi Fan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:model natively trained, enables Context Unrolling, natively trained, trained on diverse, Context Unrolling
备注: Report
点击查看摘要
Abstract:We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.
4. 【2604.21915】Vista4D: Video Reshooting with 4D Point Clouds
链接:https://arxiv.org/abs/2604.21915
作者:Kuan Heng Lin,Zhizheng Liu,Pablo Salamanca,Yash Kant,Ryan Burgert,Yuancheng Xu,Koichi Namekata,Yiwei Zhao,Bolei Zhou,Micah Goldblum,Paul Debevec,Ning Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:flexible video reshooting, video reshooting framework, robust and flexible, framework that grounds, input video
备注: 24 pages, 20 figures, CVPR 2026, see project page at [this https URL](https://eyeline-labs.github.io/Vista4D)
点击查看摘要
Abstract:We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and failing to maintain precise camera control for challenging new trajectories. We build a 4D-grounded point cloud representation with static pixel segmentation and 4D reconstruction to explicitly preserve seen content and provide rich camera signals, and we train with reconstructed multiview dynamic data for robustness against point cloud artifacts during real-world inference. Our results demonstrate improved 4D consistency, camera control, and visual quality compared to state-of-the-art baselines under a variety of videos and camera paths. Moreover, our method generalizes to real-world applications such as dynamic scene expansion and 4D scene recomposition. See our project page for results, code, and models: this https URL
5. 【2604.21911】When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
链接:https://arxiv.org/abs/2604.21911
作者:Pegah Khayatan,Jayneel Parekh,Arnaud Dapogny,Mustafa Shukor,Alasdair Newson,Matthieu Cord
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:systems remain vulnerable, large vision-language models, impressive progress, progress in capabilities, capabilities of large
备注:
点击查看摘要
Abstract:Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, We propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at this https URL .
6. 【2604.21909】Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision
链接:https://arxiv.org/abs/2604.21909
作者:Leyla Roksan Caglar,Pedro A.M. Mediano,Baihan Lin
类目:Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Neurons and Cognition (q-bio.NC)
关键词:reach similar classification, similar classification accuracy, kinds of mistakes, modern vision models, reach similar
备注:
点击查看摘要
Abstract:Humans and modern vision models can reach similar classification accuracy while making systematically different kinds of mistakes - differing not in how often they err, but in who gets mistaken for whom, and in which direction. We show that these directional confusions reveal distinct inductive biases that are invisible to accuracy alone. Using matched human and deep vision model responses on a natural-image categorization task under 12 perturbation types, we quantify asymmetry in confusion matrices and link it to generalization geometry through a Rate-Distortion (RD) framework, summarized by three geometric signatures (slope (beta), curvature (kappa)) and efficiency (AUC). We find that humans exhibit broad but weak asymmetries, whereas deep vision models show sparser, stronger directional collapses. Robustness training reduces global asymmetry but fails to recover the human-like breadth-strength profile of graded similarity. Mechanistic simulations further show that different asymmetry organizations shift the RD frontier in opposite directions, even when matched for performance. Together, these results position directional confusions and RD geometry as compact, interpretable signatures of inductive bias under distribution shift.
7. 【2604.21904】UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection
链接:https://arxiv.org/abs/2604.21904
作者:Yanran Zhang,Wenzhao Zheng,Yifei Li,Bingyao Yu,Yu Zheng,Lei Chen,Jiwen Lu,Jie Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generated image detection, image detection, recent years, significant progress, image generation
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance. Code: \href{this https URL}{this https URL}.
8. 【2604.21879】Addressing Image Authenticity When Cameras Use Generative AI
链接:https://arxiv.org/abs/2604.21879
作者:Umar Masud,Abhijith Punnappurath,Luxi Zhao,David B. Lindell,Michael S. Brown
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:images shared online, image, photorealistically alter camera, methods to photorealistically, shared online
备注: To appear in CVPR 2026 Workshop on Authenticity and Provenance in the Age of Generative AI
点击查看摘要
Abstract:The ability of generative AI (GenAI) methods to photorealistically alter camera images has raised awareness about the authenticity of images shared online. Interestingly, images captured directly by our cameras are considered authentic and faithful. However, with the increasing integration of deep-learning modules into cameras' capture-time hardware -- namely, the image signal processor (ISP) -- there is now a potential for hallucinated content in images directly output by our cameras. Hallucinated capture-time image content is typically benign, such as enhanced edges or texture, but in certain operations, such as AI-based digital zoom or low-light image enhancement, hallucinations can potentially alter the semantics and interpretation of the image content. As a result, users may not realize that the content in their camera images is not authentic. This paper addresses this issue by enabling users to recover the 'unhallucinated' version of the camera image to avoid misinterpretation of the image content. Our approach works by optimizing an image-specific multi-layer perceptron (MLP) decoder together with a modality-specific encoder so that, given the camera image, we can recover the image before hallucinated content was added. The encoder and MLP are self-contained and can be applied post-capture to the image without requiring access to the camera ISP. Moreover, the encoder and MLP decoder require only 180 KB of storage and can be readily saved as metadata within standard image formats such as JPEG and HEIC.
9. 【2604.21873】Grounding Video Reasoning in Physical Signals
链接:https://arxiv.org/abs/2604.21873
作者:Alibay Osmanli,Zixu Cheng,Shaogang Gong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Physical video understanding, video understanding requires, Physical video, event correctly, understanding requires
备注: Benchmark for Grounding Video Reasoning in Physical Signals
点击查看摘要
Abstract:Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what--when--where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in weak original cases, and spatial grounding is the weakest across settings. These results suggest that video QA reasoning benchmarks shall report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.
10. 【2604.21814】Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos
链接:https://arxiv.org/abs/2604.21814
作者:Bowen Liu,Li Yang,Shanshan Song,Mingyu Tang,Zhifang Gao,Qifeng Chen,Yangqiu Song,Huimin Chen,Xiaomeng Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:enables non-invasive gastrointestinal, leaving video-level analysis, video-level analysis underexplored, remains largely limited, Capsule endoscopy
备注:
点击查看摘要
Abstract:Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that covers clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.
11. 【2604.21810】Multiscale Super Resolution without Image Priors
链接:https://arxiv.org/abs/2604.21810
作者:Daniel Fu,Gabby Litterio,Pedro Felzenszwalb,Rashid Zia
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:address the ambiguities, super-resolution problem, pixel sizes, super-resolution, pixel
备注:
点击查看摘要
Abstract:We address the ambiguities in the super-resolution problem under translation. We demonstrate that combinations of low-resolution images at different scales can be used to make the super-resolution problem well posed. Such differences in scale can be achieved using sensors with different pixel sizes (as demonstrated here) or by varying the effective pixel size through changes in optical magnification (e.g., using a zoom lens). We show that images acquired with pairwise coprime pixel sizes lead to a system with a stable inverse, and furthermore, that super-resolution images can be reconstructed efficiently using Fourier domain techniques or iterative least squares methods. Our mathematical analysis provides an expression for the expected error of the least squares reconstruction for large signals assuming i.i.d. noise that elucidates the noise-resolution tradeoff. These results are validated through both one- and two-dimensional experiments that leverage charge-coupled device (CCD) hardware binning to explore reconstructions over a large range of effective pixel sizes. Finally, two-dimensional reconstructions for a series of targets are used to demonstrate the advantages of multiscale super-resolution, and implications of these results for common imaging systems are discussed.
12. 【2604.21806】EMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval
链接:https://arxiv.org/abs/2604.21806
作者:Zixu Li,Yupeng Hu,Zhiheng Fu,Zhiwei Chen,Yongqi Li,Liqiang Nie
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Composed Image Retrieval, Composed Image, important image retrieval, image retrieval paradigm, Insufficient Entity Coverage
备注: Accepted by ACL 2026
点击查看摘要
Abstract:Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. In order to address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate that TEMA's superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our codes and constructed multi-modification dataset (M-FashionIQ and M-CIRR) are available at this https URL.
13. 【2604.21801】SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery
链接:https://arxiv.org/abs/2604.21801
作者:Safouane El Ghazouali,Nicola Venturi,Michael Rueegsegger,Umberto Michelucci
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Recent advances, sensing rely heavily, large annotated datasets, tasks remains costly, acquiring high-quality ground
备注:
点击查看摘要
Abstract:Recent advances in deep learning for remote sensing rely heavily on large annotated datasets, yet acquiring high-quality ground truth for geometric, radiometric, and multi-domain tasks remains costly and often infeasible. In particular, the lack of accurate depth annotations, controlled illumination variations, and multi-scale paired imagery limits progress in monocular depth estimation, domain adaptation, and super-resolution for aerial scenes. We present SyMTRS, a large-scale synthetic dataset generated using a high-fidelity urban simulation pipeline. The dataset provides high-resolution RGB aerial imagery (2048 x 2048), pixel-perfect depth maps, night-time counterparts for domain adaptation, and aligned low-resolution variants for super-resolution at x2, x4, and x8 scales. Unlike existing remote sensing datasets that focus on a single task or modality, SyMTRS is designed as a unified multi-task benchmark enabling joint research in geometric understanding, cross-domain robustness, and resolution enhancement. We describe the dataset generation process, its statistical properties, and its positioning relative to existing benchmarks. SyMTRS aims to bridge critical gaps in remote sensing research by enabling controlled experiments with perfect geometric ground truth and consistent multi-domain supervision. The results obtained in this work can be reproduced from this Github repository: this https URL.
14. 【2604.21786】From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
链接:https://arxiv.org/abs/2604.21786
作者:Katharina Prasse,Steffen Jung,Isaac Bravo,Stefanie Walter,Patrick Knab,Christian Bartelt,Margret Keuper
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:communication strategies mobilise, strategies mobilise public, mobilise public concern, Social media platforms, Social media
备注:
点击查看摘要
Abstract:Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation dimension specific prompt design improves performance. We release tweet IDs and labels along with our code at this https URL.
15. 【2604.21776】Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
链接:https://arxiv.org/abs/2604.21776
作者:Avinash Paliwal,Adithya Iyer,Shivin Yadav,Muhammad Ali Afridi,Midhun Harikumar
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:paired multi-view data, Precise camera control, Precise camera, severe scarcity, scarcity of paired
备注:
点击查看摘要
Abstract:Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structures by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints from the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer utilizes a 4D point-cloud derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.
16. 【2604.21772】Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation
链接:https://arxiv.org/abs/2604.21772
作者:Yingkai Yang,Chaoqi Chen,Hui Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Continual Test-Time Adaptation, Open-set Continual Test-Time, mitigate distributional shifts, term Open-set Continual, Test-Time Adaptation
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where models face both continually changing domains and the simultaneous emergence of unknown semantic classes, a challenging setting we term Open-set Continual Test-Time Adaptation (OCTTA). The coupling of domain and semantic shifts often collapses the feature space, severely degrading both classification and out-of-distribution detection. To tackle this, we propose DOmain COmpensation (DOCO), a lightweight and effective framework that robustly performs domain adaptation and OOD detection in a synergistic, closed loop. DOCO first performs dynamic, adaptation-conditioned sample splitting to separate likely ID from OOD samples. Then, using only the ID samples, it learns a domain compensation prompt by aligning feature statistics with the source domain, guided by a structural preservation regularizer that prevents semantic distortion. This learned prompt is then propagated to the OOD samples within the same batch, effectively isolating their semantic novelty for more reliable detection. Extensive experiments on multiple challenging benchmarks demonstrate that DOCO outperforms prior CTTA and OSTTA methods, establishing a new state-of-the-art for the demanding OCTTA setting.
17. 【2604.21760】Interpretable facial dynamics as behavioral and perceptual traces of deepfakes
链接:https://arxiv.org/abs/2604.21760
作者:Timothy Joseph Murphy,Jennifer Cook,Hélio Clemente José Cuve
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
关键词:strong benchmark performance, deep learning approaches, offer limited insight, benchmark performance, manipulated facial behavior
备注: Main paper: 19 pages, 5 figures, 4 tables. SI Appendix: 11 pages, 3 figures, 6 tables
点击查看摘要
Abstract:Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial behavior. This study presents an interpretable alternative grounded in bio-behavioral features of facial dynamics and evaluates how computational detection strategies relate to human perceptual judgments. We identify core low-dimensional patterns of facial movement, from which temporal features characterizing spatiotemporal structure were derived. Traditional machine learning classifiers trained on these features achieved modest but significant above-chance deepfake classification, driven by higher-order temporal irregularities that were more pronounced in manipulated than real facial dynamics. Notably, detection was substantially more accurate for videos containing emotive expressions than those without. An emotional valence classification analysis further indicated that emotive signals are systematically degraded in deepfakes, explaining the differential impact of emotive dynamics on detection. Furthermore, we provide an additional and often overlooked dimension of explainability by assessing the relationship between model decisions and human perceptual detection. Model and human judgments converged for emotive but diverged for non-emotive videos, and even where outputs aligned, underlying detection strategies differed. These findings demonstrate that face-swapped deepfakes carry a measurable behavioral fingerprint, most salient during emotional expression. Additionally, model-human comparisons suggest that interpretable computational features and human perception may offer complementary rather than redundant routes to detection.
18. 【2604.21743】Bridging the Training-Deployment Gap: Gated Encoding and Multi-Scale Refinement for Efficient Quantization-Aware Image Enhancement
链接:https://arxiv.org/abs/2604.21743
作者:Dat To-Thanh,Nghia Nguyen-Trong,Hoang Vo,Hieu Bui-Minh,Tinh-Anh Nguyen-Nhu
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:fast processing speeds, processing speeds required, balance high output, Image enhancement models, image enhancement model
备注: 10 pages, 3 figures. Accepted at the Mobile AI (MAI) 2026 Workshop at CVPR 2026
点击查看摘要
Abstract:Image enhancement models for mobile devices often struggle to balance high output quality with the fast processing speeds required by mobile hardware. While recent deep learning models can enhance low-quality mobile photos into high-quality images, their performance is often degraded when converted to lower-precision formats for actual use on mobile phones. To address this training-deployment mismatch, we propose an efficient image enhancement model designed specifically for mobile deployment. Our approach uses a hierarchical network architecture with gated encoder blocks and multiscale refinement to preserve fine-grained visual features. Moreover, we incorporate Quantization-Aware Training (QAT) to simulate the effects of low-precision representation during the training process. This allows the network to adapt and prevents the typical drop in quality seen with standard post-training quantization (PTQ). Experimental results demonstrate that the proposed method produces high-fidelity visual output while maintaining the low computational overhead needed for practical use on standard mobile devices. The code will be available at this https URL.
19. 【2604.21728】Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
链接:https://arxiv.org/abs/2604.21728
作者:Wenxuan Bao,Yanjun Zhao,Xiyuan Yang,Jingrui He
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Pretrained vision-language models, CLIP exhibit strong, Pretrained vision-language, CLIP exhibit, exhibit strong zero-shot
备注: Accepted by CVPR 2026 (Findings Track)
点击查看摘要
Abstract:Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at this https URL .
20. 【2604.21718】Building a Precise Video Language with Human-AI Oversight
链接:https://arxiv.org/abs/2604.21718
作者:Zhiqiu Lin,Chancharik Mitra,Siyuan Cen,Isaac Li,Yuhan Huang,Yu Tong Tiffany Ling,Hewei Wang,Irene Pi,Shihang Zhu,Ryan Rao,George Liu,Jiaxi Li,Ruojin Li,Yili Han,Yilun Du,Deva Ramanan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:dynamic visual world, Video-language models, learn to reason, natural language, world through natural
备注: CVPR 2026 Highlight. Project page: [this https URL](https://linzhiqiu.github.io/papers/chai/)
点击查看摘要
Abstract:Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: this https URL
21. 【2604.21713】Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
链接:https://arxiv.org/abs/2604.21713
作者:Guangkai Xu,Hua Geng,Huanyi Zheng,Songyi Yin,Yanlong Sun,Hao Chen,Chunhua Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:made rapid progress, recently made rapid, Feed-forward visual geometry, visual geometry estimation, rapid progress
备注: Accepted to CVPR 2026. GitHub Page: [this https URL](https://github.com/aim-uofa/CARVE)
点击查看摘要
Abstract:Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
22. 【2604.21712】Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery
链接:https://arxiv.org/abs/2604.21712
作者:Yang Liu,Zhiyong Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:monocular RGB images, RGB images aims, estimate anatomically plausible, monocular RGB, RGB images
备注:
点击查看摘要
Abstract:3D human mesh recovery from monocular RGB images aims to estimate anatomically plausible 3D human models for downstream applications, but remains challenging under partial or severe occlusions. Regression-based methods are efficient yet often produce implausible or inaccurate results in unconstrained scenarios, while diffusion-based methods provide strong generative priors for occluded regions but may weaken fidelity to rare poses due to over-reliance on generation. To address these limitations, we propose a brain-inspired synergistic framework that integrates the discriminative power of vision transformers with the generative capability of conditional diffusion models. Specifically, the ViT-based pathway extracts deterministic visual cues from visible regions, while the diffusion-based pathway synthesizes structurally coherent human body representations. To effectively bridge the two pathways, we design a diverse-consistent feature learning module to align discriminative features with generative priors, and a cross-attention multi-level fusion mechanism to enable bidirectional interaction across semantic levels. Experiments on standard benchmarks demonstrate that our method achieves superior performance on key metrics and shows strong robustness in complex real-world scenarios.
23. 【2604.21694】Efficient Logic Gate Networks for Video Copy Detection
链接:https://arxiv.org/abs/2604.21694
作者:Katarzyna Fojcik
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:diverse visual distortions, Video copy detection, detection requires robust, requires robust similarity, robust similarity estimation
备注:
点击查看摘要
Abstract:Video copy detection requires robust similarity estimation under diverse visual distortions while operating at very large scale. Although deep neural networks achieve strong performance, their computational cost and descriptor size limit practical deployment in high-throughput systems. In this work, we propose a video copy detection framework based on differentiable Logic Gate Networks (LGNs), which replace conventional floating-point feature extractors with compact, logic-based representations. Our approach combines aggressive frame miniaturization, binary preprocessing, and a trainable LGN embedding model that learns both logical operations and interconnections. After training, the model can be discretized into a purely Boolean circuit, enabling extremely fast and memory-efficient inference. We systematically evaluate different similarity strategies, binarization schemes, and LGN architectures across multiple dataset folds and difficulty levels. Experimental results demonstrate that LGN-based models achieve competitive or superior accuracy and ranking performance compared to prior models, while producing descriptors several orders of magnitude smaller and delivering inference speeds exceeding 11k samples per second. These findings indicate that logic-based models offer a promising alternative for scalable and resource-efficient video copy detection.
24. 【2604.21689】StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
链接:https://arxiv.org/abs/2604.21689
作者:Kwan Yun,Changmin Lee,Ayeong Jeong,Youngseo Kim,Seungmi Lee,Junyong Noh
类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
关键词:Creative face stylization, diverse visual idioms, Creative face, face stylization aims, retaining recognizable identity
备注: SIGGRAPH 2026 / ACM TOG. Project page at [this https URL](https://kwanyun.github.io/StyleID_page/)
点击查看摘要
Abstract:Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths. To address this gap, we introduce StyleID, a human perception-aware dataset and evaluation framework for facial identity under stylization. StyleID comprises two datasets: (i) StyleBench-H, a benchmark that captures human same-different verification judgments across diffusion- and flow-matching-based stylization at multiple style strengths, and (ii) StyleBench-S, a supervision set derived from psychometric recognition-strength curves obtained through controlled two-alternative forced-choice (2AFC) experiments. Leveraging StyleBench-S, we fine-tune existing semantic encoders to align their similarity orderings with human perception across styles and strengths. Experiments demonstrate that our calibrated models yield significantly higher correlation with human judgments and enhanced robustness for out-of-domain, artist drawn portraits. All of our datasets, code, and pretrained models are publicly available at this https URL
25. 【2604.21686】WorldMark: A Unified Benchmark Suite for Interactive Video World Models
链接:https://arxiv.org/abs/2604.21686
作者:Xiaojie Xu,Zhengyuan Lin,Kang He,Yukang Feng,Xiaofeng Mao,Yuanyang Yin,Kaipeng Zhang,Yongtao Ge
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:making fair cross-model, Interactive video generation, cross-model comparison impossible, video generation models, fair cross-model comparison
备注:
点击查看摘要
Abstract:Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (this http URL), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.
26. 【2604.21681】Sapiens2
链接:https://arxiv.org/abs/2604.21681
作者:Rawal Khirodkar,He Wen,Julieta Martinez,Yuan Dong,Su Zhaoen,Shunsuke Saito
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:human-centric vision focused, focused on generalization, family of high-resolution, high-resolution transformers, transformers for human-centric
备注: Accepted to ICLR 2026
点击查看摘要
Abstract:We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation. Code: this https URL
27. 【2604.21668】Encoder-Free Human Motion Understanding via Structured Motion Descriptions
链接:https://arxiv.org/abs/2604.21668
作者:Yao Zhang,Zhuchenyang Liu,Thomas Ploetz,Yu Xiao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:text-based large language, advancing rapidly, human motion understanding, including motion question, text-based large
备注:
点击查看摘要
Abstract:The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose \textbf{Structured Motion Description (SMD)}, a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach goes beyond state-of-the-art results on both motion question answering (66.7\% on BABEL-QA, 90.1\% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at this https URL.
28. 【2604.21654】Causal Disentanglement for Full-Reference Image Quality Assessment
链接:https://arxiv.org/abs/2604.21654
作者:Zhen Zhang,Jielei Chu,Tian Zhang,Weide Liu,Fengmao Lv,Tianrui Li,Jun Cheng,Yuming Fang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:deep network-based full-reference, performing pairwise comparisons, Existing deep network-based, models typically work, network-based full-reference image
备注:
点击查看摘要
Abstract:Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.
29. 【2604.21631】DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures
链接:https://arxiv.org/abs/2604.21631
作者:Xu Wang,Zhiru Wang,Shiyun Xie,Chengwei Pan,Yisong Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Splatting, real-time photorealistic rendering, violate multi-view consistency, achieves real-time photorealistic, performance degrades significantly
备注: 10 pages,6 figures, accepted to Computer Vision and Pattern Recognition Conference 2026
点击查看摘要
Abstract:While 3D Gaussian Splatting (3DGS) achieves real-time photorealistic rendering, its performance degrades significantly when training images contain transient objects that violate multi-view consistency. Existing methods face a circular dependency: accurate transient detection requires a well-reconstructed static scene, while clean reconstruction itself depends on reliable transient masks. We address this challenge with DualSplat, a Failure-to-Prior framework that converts first-pass reconstruction failures into explicit priors for a second reconstruction stage. We observe that transients, which appear in only a subset of views, often manifest as incomplete fragments during conservative initial training. We exploit these failures to construct object-level pseudo-masks by combining photometric residuals, feature mismatches, and SAM2 instance boundaries. These pseudo-masks then guide a clean second-pass 3DGS optimization, while a lightweight MLP refines them online by gradually shifting from prior supervision to self-consistency. Experiments on RobustNeRF and NeRF On-the-go show that DualSplat outperforms existing baselines, demonstrating particularly clear advantages in transient-heavy scenes and transient regions.
30. 【2604.21627】DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion
链接:https://arxiv.org/abs/2604.21627
作者:Tahar Chettaoui,Eduarda Caldeira,Guray Ozgur,Raghavendra Ramachandra,Fadi Boutros,Naser Damer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:anticipate evolving threats, develop robust defensive, Advancing face morphing, robust defensive mechanisms, Advancing face
备注: Accepted At CVPR-W 2026
点击查看摘要
Abstract:Advancing face morphing attack techniques is crucial to anticipate evolving threats and develop robust defensive mechanisms for identity verification systems. This work introduces DCMorph, a dual-stream diffusion-based morphing framework that simultaneously operates at both identity conditioning and latent space levels. Unlike image-level methods suffering from blending artifacts or GAN-based approaches with limited reconstruction fidelity, DCMorph leverages identity-conditioned latent diffusion models through two mechanisms: (1) decoupled cross-attention interpolation that injects identity-specific features from both source faces into the denoising process, enabling explicit dual-identity conditioning absent in existing diffusion-based methods, and (2) DDIM inversion with spherical interpolation between inverted latent representations from both source faces, providing geometrically consistent initial latent representation that preserves structural attributes. Vulnerability analyses across four state-of-the-art face recognition systems demonstrate that DCMorph achieves the highest attack success rates compared to existing methods at both operational thresholds, while remaining challenging to detect by current morphing attack detection solutions.
31. 【2604.21617】Local Neighborhood Instability in Parametric Projections: Quantitative and Visual Analysis
链接:https://arxiv.org/abs/2604.21617
作者:Frederik L. Dennig,Daniel A. Keim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:produce unpredictable shifts, real time, analysts embed, input variations, variations from measurement
备注: 6 pages, 3 figures, LaTeX, to appear at the 17th International EuroVis Workshop on Visual Analytics
点击查看摘要
Abstract:Parametric projections let analysts embed new points in real time, but input variations from measurement noise or data drift can produce unpredictable shifts in the 2D layout. Whether and where a projection is locally stable remains largely unexamined. In this paper, we present a stability evaluation framework that probes parametric projections with Gaussian perturbations around selected anchor points and assesses how neighborhoods deform in the 2D embedding. Our approach combines quantitative measures of mean displacement, bias, and nearest-anchor assignment error with per-anchor visualizations of displacement vectors, local PCA ellipsoids, and Voronoi misassignment for detailed inspection. We demonstrate the framework's effectiveness on UMAP- and t-SNE-based neural projectors of varying network sizes and study the effect of Jacobian regularization as a gradient-based robustness strategy. We apply our framework to the MNIST and Fashion-MNIST datasets. The results show that our framework identifies unstable projection regions invisible to reconstruction error or neighborhood-preservation metrics.
32. 【2604.21592】Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers
链接:https://arxiv.org/abs/2604.21592
作者:Minghao Yin,Wenbo Hu,Jiale Xu,Ying Shan,Kai Han
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:prohibitive computational demand, yielded remarkable progress, generation remains elusive, Recent breakthroughs, static shape synthesis
备注:
点击查看摘要
Abstract:Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies with high fidelity, while sidestepping the quadratic overhead of full attention and reducing network total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally coherent 4D synthesis and charts a path toward efficient and scalable 4D generation.
33. 【2604.21575】OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction
链接:https://arxiv.org/abs/2604.21575
作者:Zeyu Cai,Yuliang Xiu,Renke Wang,Zhijing Shao,Xiaoben Li,Siyuan Yu,Chao Xu,Yang Liu,Baigui Sun,Jian Yang,Zhenyu Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:clothed human assets, underlying body model, clothed human, extensively studied, approaches focus
备注: Project Page: [this https URL](https://zcai0612.github.io/OmniFit/)
点击查看摘要
Abstract:Fitting an underlying body model to 3D clothed human assets has been extensively studied, yet most approaches focus on either single-modal inputs such as point clouds or multi-view images alone, often requiring a known metric scale. This constraint is frequently impractical, especially for AI-generated assets where scale distortion is common. We propose OmniFit, a method that can seamlessly handle diverse multi-modal inputs, including full scans, partial depth observations, and image captures, while remaining scale-agnostic for both real and synthetic assets. Our key innovation is a simple yet effective conditional transformer decoder that directly maps surface points to dense body landmarks, which are then used for SMPL-X parameter fitting. In addition, an optional plug-and-play image adapter incorporates visual cues to compensate for missing geometric information. We further introduce a dedicated scale predictor that rescales subjects to canonical body proportions. OmniFit substantially outperforms state-of-the-art methods by 57.1 to 80.9 percent across daily and loose clothing scenarios. To the best of our knowledge, it is the first body fitting method to surpass multi-view optimization baselines and the first to achieve millimeter-level accuracy on the CAPE and 4D-DRESS benchmarks.
34. 【2604.21573】CHRep: Cross-modal Histology Representation and Post-hoc Calibration for Spatial Gene Expression Prediction
链接:https://arxiv.org/abs/2604.21573
作者:Changfan Wang,Xinran Wang,Donghai Liu,Fei Su,Lulu Sun,Zhicheng Zhao,Zhu Meng
类目:Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
关键词:enables spatially resolved, limiting large-cohort studies, spatially resolved gene, resolved gene profiling, enables spatially
备注:
点击查看摘要
Abstract:Spatial transcriptomics (ST) enables spatially resolved gene profiling but remains expensive and low-throughput, limiting large-cohort studies and routine clinical use. Predicting spatial gene expression from routine hematoxylin and eosin (HE) slides is a promising alternative, yet under realistic leave-one-slide-out evaluation, existing models often suffer from slide-level appearance shifts and regression-driven over-smoothing that suppress biologically meaningful variation. CHRep is a two-phase framework for robust histology-to-expression prediction. In the training phase, CHRep learns a structure-aware representation by jointly optimizing correlation-aware regression, symmetric image-expression alignment, and coordinate-induced spatial topology regularization. In the inference phase, cross-slide robustness is improved without backbone fine-tuning through a lightweight calibration module trained on the training slides, which combines a non-parametric estimate from a training gallery with a magnitude-regularized correction module. Unlike prior embedding-alignment or retrieval-based transfer methods that rely on a single prediction route, CHRep couples topology-preserving representation learning with post-hoc calibration, enabling stable neighborhood retrieval and controlled bias correction under slide-level shifts. Across the three cohorts, CHRep consistently improves gene-wise correlation under leave-one-slide-out evaluation, with the largest gains observed on Alex+10x. Relative to HAGE, the Pearson correlation coefficient on all considered genes [PCC(ACG)] increases by 4.0% on cSCC and 9.8% on HER2+. Relative to mclSTExp, PCC(ACG) further improves by 39.5% on Alex+10x, together with 9.7% and 9.0% reductions in mean squared error (MSE) and mean absolute error (MAE), respectively.
35. 【2604.21572】Deep kernel video approximation for unsupervised action segmentation
链接:https://arxiv.org/abs/2604.21572
作者:Silvia L. Pintea,Jouke Dijkstra
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:unsupervised action segmentation, storing large datasets, per-video unsupervised action, action segmentation, unsupervised action
备注: Accepted at ICPR 2026
点击查看摘要
Abstract:This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible, or nor permitted. We propose to segment videos by learning in deep kernel space, to approximate the underlying frame distribution, as closely as possible. To define this closeness metric between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD) which is a geometry-preserving metric in distribution space, and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize, and faster. We choose to use neural tangent kernels (NTKs) to define the kernel space where MMD operates, because of their improved descriptive power as opposed to fixed kernels. And, also, because NTKs sidestep the trivial solution, when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results when compared to state-of-the-art per-video methods, on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work, when the number of segments is unknown.
36. 【2604.21546】Component-Based Out-of-Distribution Detection
链接:https://arxiv.org/abs/2604.21546
作者:Wenrui Liu,Hong Chang,Ruibing Hou,Shiguang Shan,Xilin Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:detection requires sensitivity, natural In-Distribution, requires sensitivity, sensitivity to subtle, overreacting to natural
备注:
点击查看摘要
Abstract:Out-of-Distribution (OOD) detection requires sensitivity to subtle shifts without overreacting to natural In-Distribution (ID) diversity. However, from the viewpoint of detection granularity, global representation inevitably suppress local OOD cues, while patch-based methods are unstable due to entangled spurious-correlation and noise. And neither them is effective in detecting compositional OODs composed of valid ID components. Inspired by recognition-by-components theory, we present a training-free Component-Based OOD Detection (CoOD) framework that addresses the existing limitations by decomposing inputs into functional components. To instantiate CoOD, we derive Component Shift Score (CSS) to detect local appearance shifts, and Compositional Consistency Score (CCS) to identify cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.
37. 【2604.21530】Attention-based multiple instance learning for predominant growth pattern prediction in lung adenocarcinoma wsi using foundation models
链接:https://arxiv.org/abs/2604.21530
作者:Laura Valeria Perez-Herrera,M.J. Garcia-Gonzalez,Karen Lopez-Linares
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:influence treatment decisions, Lung adenocarcinoma, accurately identifying growth, grading depends, treatment decisions
备注:
点击查看摘要
Abstract:Lung adenocarcinoma (LUAD) grading depends on accurately identifying growth patterns, which are indicators of prognosis and can influence treatment decisions. Common deep learning approaches to determine the predominant pattern rely on patch-level classification or segmentation, requiring extensive annotations. This study proposes an attention-based multiple instance learning (ABMIL) framework to predict the predominant LUAD growth pattern at the whole slide level to reduce annotation burden. Our approach integrates pretrained pathology foundation models as patch encoders, used either frozen or fine-tuned on annotated patches, to extract discriminative features that are aggregated through attention mechanisms. Experiments show that fine-tuned encoders improve performance, with Prov-GigaPath achieving the highest agreement (\k{appa} = 0.699) under ABMIL. Compared to simple patch-aggregation baselines, ABMIL yields more robust predictions by leveraging slide-level supervision and spatial attention. Future work will extend this framework to estimate the full distribution of growth patterns and validate performance on external cohorts.
38. 【2604.21523】Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
链接:https://arxiv.org/abs/2604.21523
作者:Mohammed Safi Ur Rahman Khan,Sanjay Suryanarayanan,Tushar Anand,Mitesh M. Khapra
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, Large Vision-Language, Vision-Language Models, visual question answering, Evaluator VLMs
备注:
点击查看摘要
Abstract:Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains under explored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs - in some cases exceeding 50%, struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.
39. 【2604.21519】Gmd: Gaussian mixture descriptor for pair matching of 3D fragments
链接:https://arxiv.org/abs/2604.21519
作者:Meijun Xiong,Zhenguo Shi,Xinyu Zhou,Yuhe Zhang,Shunli Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Mixture Model, Gaussian Mixture Descriptor, Gaussian Mixture, reconstruct objects, fractured surfaces
备注: 24 pages, 10 figures. Published in Multimedia Systems
点击查看摘要
Abstract:In the automatic reassembly of fragments acquired using laser scanners to reconstruct objects, a crucial step is the matching of fractured surfaces. In this paper, we propose a novel local descriptor that uses the Gaussian Mixture Model (GMM) to fit the distribution of points, allowing for the description and matching of fractured surfaces of fragments. Our method involves dividing a local surface patch into concave and convex regions for estimating the k value of GMM. Then the final Gaussian Mixture Descriptor (GMD) of the fractured surface is formed by merging the regional GMDs. To measure the similarities between GMDs for determining adjacent fragments, we employ the L2 distance and align the fragments using Random Sample Consensus (RANSAC) and Iterative Closest Point (ICP). The extensive experiments on real-scanned public datasets and Terracotta datasets demonstrate the effectiveness of our approach; furthermore, the comparisons with several existing methods also validate the advantage of the proposed method.
40. 【2604.21502】VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
链接:https://arxiv.org/abs/2604.21502
作者:Yupeng Zhang,Ruize Han,Ningnan Guo,Wei Feng,Song Wang,Liang Wan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:significant domain shifts, single source domain, leading detectors trained, domain shifts, real-world scenarios
备注:
点击查看摘要
Abstract:In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning, but pay limited attention to detector mechanisms, leaving clear limitations under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by increasing missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations also becomes harder to maintain in the decoding stage. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing SOTA methods on standard SDGOD benchmarks and two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.
41. 【2604.21479】Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction
链接:https://arxiv.org/abs/2604.21479
作者:Yanjiao Liu,Jiawei Liu,Xun Gong,Zifei Nie
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large language models, attracted increasing research, increasing research attention, Large language, recently demonstrated strong
备注:
点击查看摘要
Abstract:Large language models (LLMs) have recently demonstrated strong reasoning capabilities and attracted increasing research attention in the field of autonomous driving (AD). However, safe application of LLMs on AD perception and prediction still requires a thorough understanding of both the dynamic traffic agents and the static road infrastructure. To this end, this study introduces a framework to evaluate the capability of LLMs in understanding the behaviors of dynamic traffic agents and the topology of road networks. The framework leverages frozen LLMs as the reasoning engine, employing a traffic encoder to extract spatial-level scene features from observed trajectories of agents, while a lightweight Convolutional Neural Network (CNN) encodes the local high-definition (HD) maps. To assess the intrinsic reasoning ability of LLMs, the extracted scene features are then transformed into LLM-compatible tokens via a reprogramming adapter. By residing the prediction burden with the LLMs, a simpler linear decoder is applied to output future trajectories. The framework enables a quantitative analysis of the influence of multi-modal information, especially the impact of map semantics on trajectory prediction accuracy, and allows seamless integration of frozen LLMs with minimal adaptation, thereby demonstrating strong generalizability across diverse LLM architectures and providing a unified platform for model evaluation.
42. 【2604.21478】Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts
链接:https://arxiv.org/abs/2604.21478
作者:Yuhan Luo,Tao Chen,Decheng Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:increasingly important role, visual data forgery, forgery detection plays, generative models, textbf
备注: The source code is available at [this https URL](https://github.com/Yuhan-Luo/Semantic-Fine-grained-Alignment-and-Mixture-of-Experts)
点击查看摘要
Abstract:Nowadays, visual data forgery detection plays an increasingly important role in social and economic security with the rapid development of generative models. Existing face forgery detectors still can't achieve satisfactory performance because of poor generalization ability across datasets. The key factor that led to this phenomenon is the lack of suitable metrics: the commonly used cross-dataset AUC metric fails to reveal an important issue where detection scores may shift significantly across data domains. To explicitly evaluate cross-domain score comparability, we propose \textbf{Cross-AUC}, an evaluation metric that can compute AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another (and vice versa). It is interesting to find that evaluating representative detectors under the Cross-AUC metric reveals substantial performance drops, exposing an overlooked robustness problem. Besides, we also propose the novel framework \textbf{S}emantic \textbf{F}ine-grained \textbf{A}lignment and \textbf{M}ixture-of-Experts (\textbf{SFAM}), consisting of a patch-level image-text alignment module that enhances CLIP's sensitivity to manipulation artifacts, and the facial region mixture-of-experts module, which routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive qualitative and quantitative experiments on the public datasets prove that the proposed method achieves superior performance compared with the state-of-the-art methods with various suitable metrics.
43. 【2604.21465】ID-Eraser: Proactive Defense Against Face Swapping via Identity Perturbation
链接:https://arxiv.org/abs/2604.21465
作者:Junyan Luo,Peipeng Yu,Jianwei Fei,Shiya Zeng,Xiaoyu Zhou,Zhihua Xia,Xiang Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:digital security, technologies have rapidly, rapidly advanced, advanced with modern, modern generative
备注:
点击查看摘要
Abstract:Deepfake technologies have rapidly advanced with modern generative AI, and face swapping in particular poses serious threats to privacy and digital security. Existing proactive defenses mostly rely on pixel-level perturbations, which are ineffective against contemporary swapping models that extract robust high-level identity embeddings. We propose ID-Eraser, a feature-space proactive defense that removes identifiable facial information to prevent malicious face swapping. By injecting learnable perturbations into identity embeddings and reconstructing natural-looking protection images through a Face Revive Generator (FRG), ID-Eraser produces visually realistic results for humans while rendering the protected identities unusable for Deepfake models. Experiments show that ID-Eraser substantially disrupts identity recognition across diverse face recognition and swapping systems under strict black-box settings, achieving the lowest Top-1 accuracy (0.30) with the best FID (1.64) and LPIPS (0.020). Compared with swaps generated from clean inputs, the identity similarity of protected swaps drops sharply to an average of 0.504 across five representative face swapping models. ID-Eraser further demonstrates strong cross-dataset generalization, robustness to common distortions, and practical effectiveness on commercial APIs, reducing Tencent API similarity from 0.76 to 0.36.
44. 【2604.21461】Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision
链接:https://arxiv.org/abs/2604.21461
作者:Chentao Li,Zirui Gao,Mingze Gao,Yinglian Ren,Jianjiang Feng,Jie Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:natural language commands, resolve referential ambiguities, Multimodal Large Language, Large Language Models, smart glasses
备注: 20 pages, 14 figures. Committed to ACL 2026
点击查看摘要
Abstract:Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: this https URL
45. 【2604.21453】Instance-level Visual Active Tracking with Occlusion-Aware Planning
链接:https://arxiv.org/abs/2604.21453
作者:Haowei Sun,Kai Zhou,Hao Gao,Shiteng Zhang,Jinwu Hu,Xutao Wen,Qixiang Ye,Mingkui Tan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Active Tracking, Visual Active, aims to control, security surveillance, control cameras
备注: CVPR 2026 Poster
点击查看摘要
Abstract:Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.
46. 【2604.21450】VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution
链接:https://arxiv.org/abs/2604.21450
作者:Yixuan Zhu,Shilin Ma,Haolin Wang,Ao Li,Yanzhe Jing,Yansong Tang,Lei Chen,Jiwen Lu,Jie Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Recent advancements, real-world image super-resolution, highlighting their potential, advancements in visual, demonstrated their effectiveness
备注: Accepted in ICLR 2026. Code is available at [this https URL](https://github.com/EternalEvan/VARestorer)
点击查看摘要
Abstract:Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). However, adapting VAR for ISR presents critical challenges. The next-scale prediction mechanism, constrained by causal attention, fails to fully exploit global low-quality (LQ) context, resulting in blurry and inconsistent high-quality (HQ) outputs. Additionally, error accumulation in the iterative prediction severely degrades coherence in ISR task. To address these issues, we propose VARestorer, a simple yet effective distillation framework that transforms a pre-trained text-to-image VAR model into a one-step ISR model. By leveraging distribution matching, our method eliminates the need for iterative refinement, significantly reducing error propagation and inference time. Furthermore, we introduce pyramid image conditioning with cross-scale attention, which enables bidirectional scale-wise interactions and fully utilizes the input image information while adapting to the autoregressive mechanism. This prevents later LQ tokens from being overlooked in the transformer. By fine-tuning only 1.2\% of the model parameters through parameter-efficient adapters, our method maintains the expressive power of the original VAR model while significantly enhancing efficiency. Extensive experiments show that VARestorer achieves state-of-the-art performance with 72.32 MUSIQ and 0.7669 CLIPIQA on DIV2K dataset, while accelerating inference by 10 times compared to conventional VAR inference.
47. 【2604.21442】2L-LSH: A Locality-Sensitive Hash Function-Based Method For Rapid Point Cloud Indexing
链接:https://arxiv.org/abs/2604.21442
作者:Shurui Wang,Yuhe Zhang,Ruizhe Guo,Yaning Zhang,Yifei Xie,Xinyu Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:presenting significant challenges, point cloud models, point cloud processing, massive point cloud, Kd-tree and Octree
备注: 13 pages, 13 figures. Published in The Computer Journal
点击查看摘要
Abstract:The development of 3D scanning technology has enabled the acquisition of massive point cloud models with diverse structures and large scales, thereby presenting significant challenges in point cloud processing. Fast neighboring points search is one of the most common problems, which is frequently used in model reconstruction, classification, retrieval and feature visualization. Hash function is well known for its high-speed and accurate performance in searching high-dimensional data, which is also the core of the proposed 2L-LSH. Specifically, the 2L-LSH algorithm adopts a two-step hash function strategy, in which the popular step divides the bounding box of the point cloud model and the second step constructs a generalized table-based data structure. The proposed 2L-LSH offers a highly efficient and accurate solution for fast neighboring points search in large-scale 3D point cloud models, making it a promising technique for various applications in the field. The proposed algorithm is compared with the well-known methods including Kd-tree and Octree; the obtained results demonstrated that the proposed method outperforms Kd-tree and Octree in terms of speed, i.e. the time consumption of kNN search can be 51.111% and 94.159% lower than Kd-tree and Octree, respectively. And the RN search time can be 54.519% and 41.840% lower than Kd-tree and Octree, respectively.
48. 【2604.21435】UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery
链接:https://arxiv.org/abs/2604.21435
作者:Jingfang Li,Haoran Zhu,Wen Yang,Jinrui Zhang,Fang Xu,Haijian Zhang,Gui-Song Xia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:modern remote sensing, offering unprecedented spatial, remote sensing, offering unprecedented, essential for modern
备注:
点击查看摘要
Abstract:Ultra-High-Resolution (UHR) imagery has become essential for modern remote sensing, offering unprecedented spatial coverage. However, detecting small objects in such vast scenes presents a critical dilemma: retaining the original resolution for small objects causes prohibitive memory bottlenecks. Conversely, conventional compromises like image downsampling or patch cropping either erase small objects or destroy context. To break this dilemma, we propose UHR-DETR, an efficient end-to-end transformer-based detector designed for UHR imagery. First, we introduce a Coverage-Maximizing Sparse Encoder that dynamically allocates finite computational resources to informative high-resolution regions, ensuring maximum object coverage with minimal spatial redundancy. Second, we design a Global-Local Decoupled Decoder. By integrating macroscopic scene awareness with microscopic object details, this module resolves semantic ambiguities and prevents scene fragmentation. Extensive experiments on the UHR imagery datasets (e.g., STAR and SODA-A) demonstrate the superiority of UHR-DETR under strict hardware constraints (e.g., a single 24GB RTX 3090). It achieves a 2.8\% mAP improvement while delivering a 10$\times$ inference speedup compared to standard sliding-window baselines on the STAR dataset. Our codes and models will be available at this https URL.
49. 【2604.21422】Pre-process for segmentation task with nonlinear diffusion filters
链接:https://arxiv.org/abs/2604.21422
作者:Javier Sanguino,Carlos Platero,Olga Velasco
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:nonlinear diffusion, nonlinear diffusion equation, Toggle, nonlinear diffusion filters, diffusion
备注: Manuscript from 2017, previously unpublished, 37 pages
点击查看摘要
Abstract:This paper deals with the case of using nonlinear diffusion filters to obtain piecewise constant images as a previous process for segmentation techniques. We first show an intrinsic formulation for the nonlinear diffusion equation to provide some design conditions on the diffusion filters. According to this theoretical framework, we propose a new family of diffusivities; they are obtained from nonlinear diffusion techniques and are related with backward diffusion. Their goal is to split the image in closed contours with a homogenized grey intensity inside and with no blurred edges. We also prove that our filters satisfy the well-posedness semi-discrete and full discrete scale-space requirements. This shows that by using semi-implicit schemes, a forward nonlinear diffusion equation is solved, instead of a backward nonlinear diffusion equation, connecting with an edge-preserving process. Under the conditions established for the diffusivity and using a stopping criterion for the diffusion time, we get piecewise constant images with a low computational effort. Finally, we test our filter with real images and we illustrate the effects of our diffusivity function as a method to get piecewise constant images. The code is available at this https URL.
Comments:
Manuscript from 2017, previously unpublished, 37 pages
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
MSC classes:
68U10 (Image processing), 68T45 (Machine vision and scene understanding), 65M06 (Finite difference methods)
Cite as:
arXiv:2604.21422 [cs.CV]
(or
arXiv:2604.21422v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2604.21422
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Carlos Platero PhD [view email] [v1]
Thu, 23 Apr 2026 08:38:45 UTC (1,261 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled Pre-process for segmentation task with nonlinear diffusion filters, by Javier Sanguino and 2 other authorsView PDFHTML (experimental)TeX Source
view license
Current browse context:
cs.CV
prev
|
next
new
|
recent
| 2026-04
Change to browse by:
cs
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked="checked"class=“labs-tab-input”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
50. 【2604.21409】S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
链接:https://arxiv.org/abs/2604.21409
作者:Qingxiao Li,Lifeng Xu,QingLi Wang,Yudong Bai,Mingwei Ou,Shu Hu,Nan Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Python code execution, complementary reasoning paradigms, actively manipulate images, Python code, relies on structured
备注: 29 pages, 13 figures
点击查看摘要
Abstract:We present S1-VL, a multimodal reasoning model for scientific domains that natively supports two complementary reasoning paradigms: Scientific Reasoning, which relies on structured chain-of-thought, and Thinking-with-Images, which enables the model to actively manipulate images through Python code execution during reasoning. In the Thinking-with-Images mode, the model generates and executes image-processing code in a sandbox environment, obtains intermediate visual results, and continues reasoning in a multi-turn iterative manner. This design is particularly effective for challenging scenarios such as high-resolution scientific chart interpretation, microscopic image understanding, and geometry-assisted reasoning. To construct the training data, we collect scientific multimodal datasets spanning six disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. We further develop a six-dimensional quality filtering framework for reasoning trajectories. To mitigate redundant, ineffective, and erroneous visual operations commonly found in existing datasets, we propose a multi-stage filtering pipeline together with an adaptive data routing strategy. This strategy converts samples with low visual information gain into pure Reasoning-mode data, enabling the model to learn when image operations are truly necessary. S1-VL is trained through a four-stage progressive pipeline: scientific multimodal SFT, Thinking-with-Images cold-start SFT, and two stages of reinforcement learning with SAPO. We build S1-VL-32B on top of Qwen3-VL-32B-Thinking and evaluate it on 13 benchmarks. Experimental results show that S1-VL-32B achieves state-of-the-art performance on all five Thinking-with-Images benchmarks, including HRBench-4K, HRBench-8K, MME-RealWorld-CN, MME-RealWorld-Lite, and V*, and outperforms compared systems on scientific reasoning benchmarks such as Physics and VRSBench.
51. 【2604.21400】You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes
链接:https://arxiv.org/abs/2604.21400
作者:Jinrang Jia,Zhenjia Li,Yifeng Shi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:revolutionized neural rendering, existing methods remain, methods remain predominantly, remain predominantly research, predominantly research prototypes
备注: 17 pages, 5 figures
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has revolutionized neural rendering, yet existing methods remain predominantly research prototypes ill-suited for production-level deployment. We identify a critical "Industry-Academia Gap" hindering real-world application: unpredictable resource consumption from heuristic Gaussian growth, the "sparsity shield" of current benchmarks that rewards hallucination over physical fidelity, and severe multi-sensor data pollution. To bridge this gap, we propose YOGO (You Only Gaussian Once), a system-level framework that reformulates the stochastic growth process into a deterministic, budget-aware equilibrium. YOGO integrates a novel budget controller for hardware-constrained resource allocation and an availability-registration protocol for robust multi-sensor fusion. To push the boundaries of reconstruction fidelity, we introduce Immersion v1.0, the first ultra-dense indoor dataset specifically designed to break the "sparsity shield." By providing saturated viewpoint coverage, Immersion v1.0 forces algorithms to focus on extreme physical fidelity rather than viewpoint interpolation, and enables the community to focus on the upper limits of high-fidelity reconstruction. Extensive experiments demonstrate that YOGO achieves state-of-the-art visual quality while maintaining a strictly deterministic profile, establishing a new standard for production-grade 3DGS. To facilitate reproducibility, part scenes of Immersion v1.0 dataset and source code of YOGO has been publicly released. The project link is this https URL.
52. 【2604.21396】VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought
链接:https://arxiv.org/abs/2604.21396
作者:Byeonggeuk Lim,Kyeonghyun Kim,JungMin Yun,YoungBin Kim
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Vision-Language Models, requires precise local, precise local region-based, advancement of Large, Large Vision-Language
备注: Accepted to LREC 2026
点击查看摘要
Abstract:The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLMs reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.
53. 【2604.21395】Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair
链接:https://arxiv.org/abs/2604.21395
作者:Vishal Rajput
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:minimises supervised loss, retain non-zero Jacobian, non-zero Jacobian sensitivity, empirical risk minimisation, risk minimisation
备注: 29 pages. Code: [this https URL](https://github.com/vishalstark512/PMH) . Preprint, not peer-reviewed. Affiliation: KU Leuven, Belgium
点击查看摘要
Abstract:We prove that empirical risk minimisation (ERM) imposes a necessary geometric constraint on learned representations: any encoder that minimises supervised loss must retain non-zero Jacobian sensitivity in directions that are label-correlated in training data but nuisance at test time. This is not a contingent failure of current methods; it is a mathematical consequence of the supervised objective itself. We call this the geometric blind spot of supervised learning (Theorem 1), and show it holds across proper scoring rules, architectures, and dataset sizes. This single theorem unifies four lines of prior empirical work that were previously treated separately: non-robust predictive features, texture bias, corruption fragility, and the robustness-accuracy tradeoff. In this framing, adversarial vulnerability is one consequence of a broader structural fact about supervised learning geometry. We introduce Trajectory Deviation Index (TDI), a diagnostic that measures the theorem's bounded quantity directly, and show why common alternatives miss the key failure mode. PGD adversarial training reaches Jacobian Frobenius 2.91 yet has the worst clean-input geometry (TDI 1.336), while PMH achieves TDI 0.904. TDI is the only metric that detects this dissociation because it measures isotropic path-length distortion -- the exact quantity Theorem 1 bounds. Across seven vision tasks, BERT/SST-2, and ImageNet ViT-B/16 backbones used by CLIP, DINO, and SAM, the blind spot is measurable and repairable. It is present at foundation-model scale, worsens monotonically across language-model sizes (blind-spot ratio 0.860 to 0.765 to 0.742 from 66M to 340M), and is amplified by task-specific ERM fine-tuning (+54%), while PMH repairs it by 11x with one additional training term whose Gaussian form Proposition 5 proves is the unique perturbation law that uniformly penalises the encoder Jacobian.
Comments:
29 pages. Code: this https URL. Preprint, not peer-reviewed. Affiliation: KU Leuven, Belgium
Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
MSC classes:
68T05, 68T45
ACMclasses:
I.2.6; I.2.10
Cite as:
arXiv:2604.21395 [cs.LG]
(or
arXiv:2604.21395v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2604.21395
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Vishal Rajput [view email] [v1]
Thu, 23 Apr 2026 08:03:33 UTC (69 KB)
54. 【2604.21387】EdgeFormer: local patch-based edge detection transformer on point clouds
链接:https://arxiv.org/abs/2604.21387
作者:Yifei Xie,Zhikun Tu,Tong Yang,Yuhe Zhang,Xinyu Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:edge detection, commercial demands, vision applications, industrial and commercial, Edge
备注: 22 pages, 9 figures. Published in Pattern Analysis and Applications
点击查看摘要
Abstract:Edge points on 3D point clouds can clearly convey 3D geometry and surface characteristics, therefore, edge detection is widely used in many vision applications with high industrial and commercial demands. However, the fine-grained edge features are difficult to detect effectively as they are generally densely distributed or exhibit small-scale surface gradients. To address this issue, we present a learning-based edge detection network, named EdgeFormer, which mainly consists of two stages. Based on the observation that spatially neighboring points tend to exhibit high correlation, forming the local underlying surface, we convert the edge detection of the entire point cloud into a point classification based on local patches. Therefore, in the first stage, we construct local patch feature descriptors that describe the local neighborhood around each point. In the second stage, we classify each point by analyzing the local patch feature descriptors generated in the first stage. Due to the conversion of the point cloud into local patches, the proposed method can effectively extract the finer details. The experimental results show that our model demonstrates competitive performance compared to six baselines.
55. 【2604.21362】KD-CVG: A Knowledge-Driven Approach for Creative Video Generation
链接:https://arxiv.org/abs/2604.21362
作者:Linkai Liu,Wei Feng,Xi Zhao,Shen Zhang,Xingye Chen,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Yuchen Zhou,Zipeng Guo,Chao Gou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Creative Video Generation, automatically produce advertising, highlights product features, leverages generative models, produce advertising content
备注: Accepted to ICASSP 2026
点击查看摘要
Abstract:Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significant focus of recent research. However, while CG has advanced considerably, most efforts have concentrated on generating advertising text and images, leaving Creative Video Generation (CVG) relatively underexplored. This gap is largely due to two major challenges faced by Text-to-Video (T2V) models: (a) \textbf{ambiguous semantic alignment}, where models struggle to accurately correlate product selling points with creative video content, and (b) \textbf{inadequate motion adaptability}, resulting in unrealistic movements and distortions. To address these challenges, we develop a comprehensive Advertising Creative Knowledge Base (ACKB) as a foundational resource and propose a knowledge-driven approach (KD-CVG) to overcome the knowledge limitations of existing models. KD-CVG consists of two primary modules: Semantic-Aware Retrieval (SAR) and Multimodal Knowledge Reference (MKR). SAR utilizes the semantic awareness of graph attention networks and reinforcement learning feedback to enhance the model's comprehension of the connections between selling points and creative videos. Building on this, MKR incorporates semantic and motion priors into the T2V model to address existing knowledge gaps. Extensive experiments have demonstrated KD-CVG's superior performance in achieving semantic alignment and motion adaptability, validating its effectiveness over other state-of-the-art methods. The code and dataset will be open source at this https URL.
56. 【2604.21360】Prototype-Based Test-Time Adaptation of Vision-Language Models
链接:https://arxiv.org/abs/2604.21360
作者:Zhaohong Huang,Yuxin Zhang,Wenjing Liu,Fei Chao,Rongrong Ji
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision-language models, bridge the distribution, distribution gap, gap between pre-training, Test-time adaptation
备注:
点击查看摘要
Abstract:Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
57. 【2604.21356】SparseGF: A Height-Aware Sparse Segmentation Framework with Context Compression for Robust Ground Filtering Across Urban to Natural Scenes
链接:https://arxiv.org/abs/2604.21356
作者:Nannan Qin,Pengjie Tao,Haiyan Guan,Zhizhong Kang,Lingfei Ma,Xiangyun Hu,Jonathan Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:High-quality digital terrain, airborne laser scanning, generation typically relies, separate point clouds, High-quality digital
备注:
点击查看摘要
Abstract:High-quality digital terrain models derived from airborne laser scanning (ALS) data are essential for a wide range of geospatial analyses, and their generation typically relies on robust ground filtering (GF) to separate point clouds across diverse landscapes into ground and non-ground parts. Although current deep-learning-based GF methods have demonstrated impressive performance, especially in specific challenging terrains, their cross-scene generalization remains limited by two persistent issues: the context-detail dilemma in large-scale processing due to limited computational resources, and the random misclassification of tall objects arising from classification-only optimization. To overcome these limitations, we propose SparseGF, a height-aware sparse segmentation framework enhanced with context compression. It is built upon three key innovations: (1) a convex-mirror-inspired context compression module that condenses expansive contexts into compact representations while preserving central details; (2) a hybrid sparse voxel-point network architecture that effectively interprets compressed representations while mitigating compression-induced geometric distortion; and (3) a height-aware loss function that explicitly enforces topographic elevation priors during training to suppress random misclassification of tall objects. Extensive evaluations on two large-scale ALS benchmark datasets demonstrate that SparseGF delivers robust GF across urban to natural terrains, achieving leading performance in complex urban scenes, competitive results on mixed terrains, and moderate yet non-catastrophic accuracy in densely forested steep areas. This work offers new insights into deep-learning-based GF research and encourages further exploration toward truly cross-scene generalization for large-scale environmental monitoring.
58. 【2604.21349】rust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning
链接:https://arxiv.org/abs/2604.21349
作者:Wadii Boulila,Adel Ammar,Bilel Benjdira,Maha Driss
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
关键词:Self-supervised learning, representation learning, aerial imagery, learning, Self-supervised
备注: 17 pages
点击查看摘要
Abstract:Self-supervised learning (SSL) is a standard approach for representation learning in aerial imagery. Existing methods enforce invariance between augmented views, which works well when augmentations preserve semantic content. However, aerial images are frequently degraded by haze, motion blur, rain, and occlusion that remove critical evidence. Enforcing alignment between a clean and a severely degraded view can introduce spurious structure into the latent space. This study proposes a training strategy and architectural modification to enhance SSL robustness to such corruptions. It introduces a per-sample, per-factor trust weight into the alignment objective, combined with the base contrastive loss as an additive residual. A stop-gradient is applied to the trust weight instead of a multiplicative gate. While a multiplicative gate is a natural choice, experiments show it impairs the backbone, whereas our additive-residual approach improves it. Using a 200-epoch protocol on a 210,000-image corpus, the method achieves the highest mean linear-probe accuracy among six backbones on EuroSAT, AID, and NWPU-RESISC45 (90.20% compared to 88.46% for SimCLR and 89.82% for VICReg). It yields the largest improvements under severe information-erasing corruptions on EuroSAT (+19.9 points on haze at s=5 over SimCLR). The method also demonstrates consistent gains of +1 to +3 points in Mahalanobis AUROC on a zero-shot cross-domain stress test using BDD100K weather splits. Two ablations (scalar uncertainty and cosine gate) indicate the additive-residual formulation is the primary source of these improvements. An evidential variant using Dempster-Shafer fusion introduces interpretable signals of conflict and ignorance. These findings offer a concrete design principle for uncertainty-aware SSL. Code is publicly available at this https URL.
59. 【2604.21346】Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
链接:https://arxiv.org/abs/2604.21346
作者:Mohit Vaishnav,Tanel Tammet
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:main bottleneck lies, Bongard problems, language models, large language models, raising the question
备注:
点击查看摘要
Abstract:Vision--language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our \emph{Componential--Grammatical (C--G)} paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid--90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
60. 【2604.21344】Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts
链接:https://arxiv.org/abs/2604.21344
作者:Azher Ahmed Efat,Seok Hwan Song,Wallapak Tavanapong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:present complex information, complex information, present complex, Multimodal Language Models, multiple related charts
备注:
点击查看摘要
Abstract:Charts are widely used to present complex information. Deriving meaningful insights in real-world contexts often requires interpreting multiple related charts together. Research on understanding multi-chart images has not been extensively explored. We introduce PolyChartQA, a mid-scale dataset specifically designed for question answering over multi-chart images. PolyChartQA comprises 534 multi-chart images (with a total of 2,297 sub-charts) sourced from peer-reviewed computer science research publications and 2,694 QA pairs. We evaluate the performance of nine state-of-the-art Multimodal Language Models (MLMs) on PolyChartQA across question type, difficulty, question source, and key structural characteristics of multi-charts. Our results show a 27.4% LLM-based accuracy (L-Accuracy) drop on human-authored questions compared to MLM-generated questions, and a 5.39% L-accuracy gain with our proposed prompting method.
61. 【2604.21343】Latent Denoising Improves Visual Alignment in Large Multimodal Models
链接:https://arxiv.org/abs/2604.21343
作者:Dhruv Parikh,Jacob Fein-Ashley,Rajgopal Kannan,Viktor Prasanna
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Multimodal Models, language modeling objective, autoregressive language modeling, Large Multimodal, Multimodal Models
备注: Technical Report
点击查看摘要
Abstract:Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher's intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at this https URL.
62. 【2604.21330】acher-Guided Routing for Sparse Vision Mixture-of-Experts
链接:https://arxiv.org/abs/2604.21330
作者:Masahiro Kada,Ryota Yoshihashi,Satoshi Ikehata,Rei Kawakami,Ikuro Sato
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:resulting computational cost, Recent progress, increasingly large-scale models, critical bottleneck, progress in deep
备注:
点击查看摘要
Abstract:Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of experts for each input, achieving high scalability without sacrificing inference speed. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives informative gradients only through the experts selected in the forward pass, it suffers from gradient blocking and obtains little information from unselected routes. This limited, highly localized feedback makes it difficult for the router to learn appropriate expert-selection scores and often leads to unstable routing dynamics, such as fluctuating expert assignments during training. To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model. TGR-MoE constructs a teacher router from the teacher's intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations during training and enabling knowledge-guided expert selection from the early stages of training. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR consistently improves both accuracy and routing consistency, while maintaining stable training even under highly sparse configurations.
63. 【2604.21326】MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment
链接:https://arxiv.org/abs/2604.21326
作者:Juan Li,Chuanghao Ding,Xujie Zhang,Cam-Tu Nguyen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Universal Multimodal Retrieval, multi-modal retrieval, shared embedding space, Universal Multimodal, Multimodal Retrieval
备注:
点击查看摘要
Abstract:Universal Multimodal Retrieval (UMR) aims to map different modalities (e.g., visual and textual) into a shared embedding space for multi-modal retrieval. Existing UMR methods can be broadly divided into two categories: early-fusion approaches, such as Marvel, which projects visual features into the language model (LM) space for integrating with text modality, and late-fusion approaches, such as UniVL-DR, which encode visual and textual inputs using separate encoders and obtain fused embeddings through addition. Our pilot study reveals that Marvel exhibits visual modality collapse, which is characterized by the model's tendency to disregard visual features while depending excessively on textual cues. In contrast, although UniVL-DR is less affected by this issue, it is more susceptible to semantic misalignment, where semantically related content is positioned far apart in the embedding space. To address these challenges, we propose MiMIC, which introduces two key innovations: (1) a fusion-in-decoder architecture for effective multimodal integration, and (2) robust training through single modality mixin and random caption dropout. Experiments on the WebQA+ and EVQA+ datasets, where image in documents or queries might lack captions, indicate that MiMIC consistently outperforms both early- and late-fusion baselines.
64. 【2604.21324】mporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification
链接:https://arxiv.org/abs/2604.21324
作者:Zhiyong Li,Wei Jiang,Haojie Liu,Mingyu Wang,Wanchong Xu,Weijie Mao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Visible-infrared person re-identification, methods predominantly focus, Visible-infrared person, existing methods predominantly, costly identity annotations
备注:
点击查看摘要
Abstract:Visible-infrared person re-identification (VI-ReID) enables cross-modality identity matching for all-day surveillance, yet existing methods predominantly focus on the image level or rely heavily on costly identity annotations. While video-based VI-ReID has recently emerged to exploit temporal dynamics for improved robustness, existing studies remain limited to supervised settings. Crucially, the unsupervised video VI-ReID problem, where models must learn from RGB and infrared tracklets without identity labels, remains largely unexplored despite its practical importance in real-world deployment. To bridge this gap, we propose HiTPro (Hierarchical Temporal Prototyping), a prototype-driven framework without explicit hard pseudo-label assignment for unsupervised video-based VI-ReID. HiTPro begins with an efficient Temporal-aware Feature Encoder that first extracts discriminative frame-level features and then aggregates them into a robust tracklet-level representation. Building upon these features, HiTPro first constructs reliable intra-camera prototypes via Intra-Camera Tracklet Prototyping by aggregating features from temporally partitioned sub-tracklets. Through Hierarchical Cross-Prototype Alignment, we perform a two-stage positive mining process: progressing from within-modality associations to cross-modality matching, enhanced by Dynamic Threshold Strategy and Soft Weight Assignment. Finally, {Hierarchical Contrastive Learning} progressively optimizes feature-prototype alignment across three levels: intra-camera discrimination, cross-camera same-modality consistency, and cross-modality invariance. Extensive experiments on HITSZ-VCM and BUPTCampus demonstrate that HiTPro achieves state-of-the-art performance under fully unsupervised settings, significantly outperforming adapted baselines and establishes a strong baseline for future research.
65. 【2604.21321】FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment
链接:https://arxiv.org/abs/2604.21321
作者:Khaled R Ahmed,Toqi Tahamid Sarker,Taminul Islam,Tamany M Alanezi,Amer AbuGhazaleh
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:current practice relies, destructive wet-chemistry assays, Monitoring frying oil, frying oil degradation, food safety
备注: 10 pages, 7 figures, this paper has been submitted and accepted for publication at CVPRW 2026
点击查看摘要
Abstract:Monitoring frying oil degradation is critical for food safety, yet current practice relies on destructive wet-chemistry assays that provide no spatial information and are unsuitable for real-time use. We identify a fundamental obstacle in thermal-image-based inspection, the camera-fingerprint shortcut, whereby models memorize sensor-specific noise and thermal bias instead of learning oxidation chemistry, collapsing under video-disjoint evaluation. We propose FryNet, a dual-stream RGB-thermal framework that jointly performs oil-region segmentation, serviceability classification, and regression of four chemical oxidation indices (PV, p-AV, Totox, temperature) in a single forward pass. A ThermalMiT-B2 backbone with channel and spatial attention extracts thermal features, while an RGB-MAE Encoder learns chemically grounded representations via masked autoencoding and chemical alignment. Dual-Encoder DANN adversarially regularizes both streams against video identity via Gradient Reversal Layers, and FiLM fusion bridges thermal structure with RGB chemical context. On 7,226 paired frames across 28 frying videos, FryNet achieves 98.97% mIoU, 100% classification accuracy, and 2.32 mean regression MAE, outperforming all seven baselines.
66. 【2604.21313】PLAS-Net: Pixel-Level Area Segmentation for UAV-Based Beach Litter Monitoring
链接:https://arxiv.org/abs/2604.21313
作者:Yongying Liu,Jiaqi Wang,Jian Song,Xinlei Shao,Yijia Chen,Nan Xu,Katsunori Mizuno,Shigeru Tabeta,Fan Zhao
类目:Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
关键词:simple item counts, Accurate quantification, Litter Area Segmentor, physical exposure area, Pixel-level Litter Area
备注: 30 pages, 12 figures
点击查看摘要
Abstract:Accurate quantification of the physical exposure area of beach litter, rather than simple item counts, is essential for credible ecological risk assessment of marine debris. However, automated UAV-based monitoring predominantly relies on bounding-box detection, which systematically overestimates the planar area of irregular litter objects. To address this geometric limitation, we develop PLAS-Net (Pixel-level Litter Area Segmentor), an instance segmentation framework that extracts pixel-accurate physical footprints of coastal debris. Evaluated on UAV imagery from a monsoon-driven pocket beach in Koh Tao, Thailand, PLAS-Net achieves a mAP_50 of 58.7% with higher precision than eleven baseline models, demonstrating improved mask fidelity under complex coastal conditions. To illustrate how the accuracy of the masking affects the conclusions of environmental analysis, we conducted three downstream demonstrations: (i) power-law fitting of normalized plastic density (NPD) to characterize fragmentation dynamics; (ii) area-weighted ecological risk index (ERI) to map spatial pollution hotspots; and (iii) source composition analysis revealing the abundance-area paradox: fishing gear constitutes a small proportion of the total number of items, but has the largest physical area per unit item. Pixel-level area extraction can provide more valuable information for coastal monitoring compared to methods based solely on counting.
67. 【2604.21312】he First Challenge on Remote Sensing Infrared Image Super-Resolution at NTIRE 2026: Benchmark Results and Method Overview
链接:https://arxiv.org/abs/2604.21312
作者:Kai Liu,Haoyang Yue,Zeli Lin,Zheng Chen,Jingkai Wang,Jue Gong,Jiatong Li,Xianglong Yan,Libo Zhu,Jianze Li,Ziqing Zhang,Zihan Zhou,Xiaoyang Liu,Radu Timofte,Yulun Zhang,Junye Chen,Zhenming Yan,Yucong Hong,Ruize Han,Song Wang,Li Pang,Heng Zhao,Xinqiao Wu,Deyu Meng,Xiangyong Cao,Weijun Yuan,Zhan Li,Zhanglu Chen,Boyang Yao,Yihang Chen,Yifan Deng,Zengyuan Zuo,Junjun Jiang,Saiprasad Meesiyawar,Sulocha Yatageri,Nikhil Akalwadi,Ramesh Ashok Tabib,Uma Mudenagudi,Jiachen Tu,Yaokun Shi,Guoyi Xu,Yaoxin Jiang,Cici Liu,Tongyao Mu,Qiong Cao,Yifan Wang,Kosuke Shigematsu,Hiroto Shirono,Asuka Shin,Wei Zhou,Linfeng Li,Lingdong Kong,Ce Wang,Xingwei Zhong,Wanjie Sun,Dafeng Zhang,Hongxin Lan,Qisheng Xu,Mingyue He,Hui Geng,Tianjiao Wan,Kele Xu,Changjian Wang,Antoine Carreaud,Nicola Santacroce,Shanci Li,Jan Skaloud,Adrien Gressin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Infrared Image Super-Resolution, Sensing Infrared Image, Infrared Image, Remote Sensing Infrared, presents the NTIRE
备注: Github Repo: [this https URL](https://github.com/Kai-Liu001/NTIRE2026_infraredSR)
点击查看摘要
Abstract:This paper presents the NTIRE 2026 Remote Sensing Infrared Image Super-Resolution (x4) Challenge, one of the associated challenges of NTIRE 2026. The challenge aims to recover high-resolution (HR) infrared images from low-resolution (LR) inputs generated through bicubic downsampling with a x4 scaling factor. The objective is to develop effective models or solutions that achieve state-of-the-art performance for infrared image SR in remote sensing scenarios. To reflect the characteristics of infrared data and practical application needs, the challenge adopts a single-track setting. A total of 115 participants registered for the competition, with 13 teams submitting valid entries. This report summarizes the challenge design, dataset, evaluation protocol, main results, and the representative methods of each team. The challenge serves as a benchmark to advance research in infrared image super-resolution and promote the development of effective solutions for real-world remote sensing applications.
68. 【2604.21311】an interpretable vision transformer framework for automated brain tumor classification
链接:https://arxiv.org/abs/2604.21311
作者:Chinedu Emmanuel Mbonu,Tochukwu Sunday Belonwu,Okwuchukwu Ejike Chukwuogo,Kenechukwu Sylvanus Anigbogu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:critical neurological conditions, Magnetic Resonance Imaging, patient survival rates, Brain tumors represent, neurological conditions
备注: 9 pages, 6 figures
点击查看摘要
Abstract:Brain tumors represent one of the most critical neurological conditions, where early and accurate diagnosis is directly correlated with patient survival rates. Manual interpretation of Magnetic Resonance Imaging (MRI) scans is time-intensive, subject to inter-observer variability, and demands significant specialist expertise. This paper proposes a deep learning framework for automated four-class brain tumor classification distinguishing glioma, meningioma, pituitary tumor, and healthy brain tissue from a dataset of 7,023 MRI scans. The proposed system employs a Vision Transformer (ViT-B/16) pretrained on ImageNet-21k as the backbone, augmented with a clinically motivated preprocessing and training pipeline. Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to enhance local contrast and accentuate tumor boundaries invisible to standard normalization. A two-stage fine-tuning strategy is adopted: the classification head is warmed up with the backbone frozen, followed by full fine-tuning with discriminative learning rates. MixUp and CutMix augmentation is applied per batch to improve generalization. Exponential Moving Average (EMA) of weights and Test-Time Augmentation (TTA) further stabilize and boost performance. Attention Rollout visualization provides clinically interpretable heatmaps of the brain regions driving each prediction. The proposed model achieves a test accuracy of 99.29%, macro F1-score of 99.25%, and perfect recall on both healthy and meningioma classes, outperforming all CNN-based baselines
69. 【2604.21291】Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
链接:https://arxiv.org/abs/2604.21291
作者:Yuanchen Fei,Yude Zou,Zejian Kang,Ming Li,Jiaying Zhou,Xiangru Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Controllable human video, human video generation, privacy safe human, modeling remains underexplored, remains underexplored due
备注:
点击查看摘要
Abstract:Controllable human video generation aims to produce realistic videos of humans with explicitly guided motions and appearances,serving as a foundation for digital humans, animation, and embodied this http URL, the scarcity of largescale, diverse, and privacy safe human video datasets poses a major bottleneck, especially for rare identities and complex this http URL data provides a scalable and controllable alternative,yet its actual contribution to generative modeling remains underexplored due to the persistent Sim2Real this http URL this work,we systematically investigate the impact of synthetic data on controllable human video generation. We propose a diffusion-based framework that enables fine-grained control over appearance and motion while providing a unfied testbed to analyze how synthetic data interacts with real world data during training. Through extensive experiments, we reveal the complementary roles of synthetic and real data and demonstrate possible methods for efficiently selecting synthetic samples to enhance motion realism,temporal consistency,and identity this http URL study offers the first comprehensive exploration of synthetic data's role in human-centric video synthesis and provides practical insights for building data-efficient and generalizable generative models.
70. 【2604.21290】GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA
链接:https://arxiv.org/abs/2604.21290
作者:Anvitha Ramachandran,Dhruv Parikh,Viktor Prasanna
类目:Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
关键词:Graph Neural Networks, Neural Networks, Vision Graph Neural, graph construction, Graph Neural
备注: FCCM 2026
点击查看摘要
Abstract:Vision Graph Neural Networks (ViGs) represent an image as a graph of patch tokens, enabling adaptive, feature-driven neighborhoods. Unlike CNNs with fixed grid biases or Vision Transformers with global token interactions, ViGs rely on dynamic graph convolution: at each layer, a feature-dependent graph is built via k-nearest-neighbor (kNN) search on current patch features, followed by message passing. This per-layer graph construction is the main bottleneck, consuming 50--95\% of graph convolution time on CPUs and GPUs, scaling as $O(N^2)$ with the number of patches $N$, and creating a sequential dependency between graph construction and feature updates. We introduce GraphLeap, a simple reformulation that removes this dependency by decoupling graph construction from feature update across layers. GraphLeap performs the feature update at layer $\ell$ using a graph built from the previous layer's features, while simultaneously using the current layer's features to construct the graph for layer $\ell+1$. This one-layer-lookahead graph construction enables concurrent graph construction and message passing. Although using prior-layer features can introduce minor accuracy degradation, lightweight fine-tuning for a few epochs is sufficient to recover the original accuracy. Building on GraphLeap, we present the first end-to-end FPGA accelerator for Vision GNNs. Our streaming, layer-pipelined design overlaps a kNN graph construction engine with a feature update engine, exploits node- and channel-level parallelism, and enables efficient on-chip dataflow without explicit edge-feature materialization. Evaluated on isotropic and pyramidal ViG models on an Alveo U280 FPGA, GraphLeap achieves up to $95.7\times$ speedup over CPU and $8.5\times$ speedup over GPU baselines, demonstrating the feasibility of real-time Vision GNN inference.
Comments:
FCCM 2026
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:
arXiv:2604.21290 [cs.CV]
(or
arXiv:2604.21290v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2604.21290
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
71. 【2604.21289】AttDiff-GAN: A Hybrid Diffusion-GAN Framework for Facial Attribute Editing
链接:https://arxiv.org/abs/2604.21289
作者:Wenmin Huang,Weiqi Luo,Xiaochun Cao,Jiwu Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:preserving attribute-irrelevant content, modify target attributes, aims to modify, modify target, preserving attribute-irrelevant
备注:
点击查看摘要
Abstract:Facial attribute editing aims to modify target attributes while preserving attribute-irrelevant content and overall image fidelity. Existing GAN-based methods provide favorable controllability, but often suffer from weak alignment between style codes and attribute semantics. Diffusion-based methods can synthesize highly realistic images; however, their editing precision is limited by the entanglement of semantic directions among different attributes. In this paper, we propose AttDiff-GAN, a hybrid framework that combines GAN-based attribute manipulation with diffusion-based image generation. A key challenge in such integration lies in the inconsistency between one-step adversarial learning and multi-step diffusion denoising, which makes effective optimization difficult. To address this issue, we decouple attribute editing from image synthesis by introducing a feature-level adversarial learning scheme to learn explicit attribute manipulation, and then using the manipulated features to guide the diffusion process for image generation, while also removing the reliance on semantic direction-based editing. Moreover, we enhance style-attribute alignment by introducing PriorMapper, which incorporates facial priors into style generation, and RefineExtractor, which captures global semantic relationships through a Transformer for more precise style extraction. Experimental results on CelebA-HQ show that the proposed method achieves more accurate facial attribute editing and better preservation of non-target attributes than state-of-the-art methods in both qualitative and quantitative evaluations.
72. 【2604.21280】ImageHD: Energy-Efficient On-Device Continual Learning of Visual Representations via Hyperdimensional Computing
链接:https://arxiv.org/abs/2604.21280
作者:Jebacyril Arockiaraj,Dhruv Parikh,Viktor Prasanna
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:incurring substantial compute, non-stationary data streams, existing methods rely, exemplar-heavy classifiers, incurring substantial
备注: FCCM 2026
点击查看摘要
Abstract:On-device continual learning (CL) is critical for edge AI systems operating on non-stationary data streams, but most existing methods rely on backpropagation or exemplar-heavy classifiers, incurring substantial compute, memory, and latency overheads. Hyperdimensional computing (HDC) offers a lightweight alternative through fast, non-iterative online updates. Combined with a compact convolutional neural network (CNN) feature extractor, HDC enables efficient on-device adaptation with strong visual representations. However, prior HDC-based CL systems often depend on multi-tier memory hierarchies and complex cluster management, limiting deployability on resource-constrained hardware. We present ImageHD, an FPGA accelerator for on-device continual learning of visual data based on HDC. ImageHD targets streaming CL under strict latency and on-chip memory constraints, avoiding costly iterative optimization. At the algorithmic level, we introduce a hardware-aware CL method that bounds class exemplars through a unified exemplar memory and a hardware-efficient cluster merging strategy, while incorporating a quantized CNN front-end to reduce deployment overhead without sacrificing accuracy. At the system level, ImageHD is implemented as a streaming dataflow architecture on the AMD Zynq ZCU104 FPGA, integrating HDC encoding, similarity search, and bounded cluster management using word-packed binary hypervectors for massively parallel bitwise computation within tight on-chip resource budgets. On CORe50, ImageHD achieves up to 40.4x (4.84x) speedup and 383x (105.1x) energy efficiency over optimized CPU (GPU) baselines, demonstrating the practicality of HDC-enabled continual learning for real-time edge AI.
Comments:
FCCM 2026
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2604.21280 [cs.CV]
(or
arXiv:2604.21280v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2604.21280
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
73. 【2604.21279】LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation
链接:https://arxiv.org/abs/2604.21279
作者:Wenmin Huang,Weiqi Luo,Xiaochun Cao,Jiwu Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Facial attribute editing, crucial for applications, applications like virtual, virtual avatars, avatars and photo
备注:
点击查看摘要
Abstract:Facial attribute editing and style manipulation are crucial for applications like virtual avatars and photo editing. However, achieving precise control over facial attributes without altering unrelated features is challenging due to the complexity of facial structures and the strong correlations between attributes. While conditional GANs have shown progress, they are limited by accuracy issues and training instability. Diffusion models, though promising, face challenges in style manipulation due to the limited expressiveness of semantic directions. In this paper, we propose LatRef-Diff, a novel diffusion-based framework that addresses these limitations. We replace the traditional semantic directions in diffusion models with style codes and propose two methods for generating them: latent and reference guidance. Based on these style codes, we design a style modulation module that integrates them into the target image, enabling both random and customized style manipulation. This module incorporates learnable vectors, cross-attention mechanisms, and a hierarchical design to improve accuracy and image quality. Additionally, to enhance training stability while eliminating the need for paired images (e.g., before and after editing), we propose a forward-backward consistency training strategy. This strategy first removes the target attribute approximately using image-specific semantic directions and then restores it via style modulation, guided by perceptual and classification losses. Extensive experiments on CelebA-HQ demonstrate that LatRef-Diff achieves state-of-the-art performance in both qualitative and quantitative evaluations. Ablation studies validate the effectiveness of our model's design choices.
74. 【2604.21268】Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
链接:https://arxiv.org/abs/2604.21268
作者:Wenkai Wang,Xiyun Li,Hongcan Guo,Wenhao Yu,Tianqing Fang,Haitao Mi,Dong Yu,Shengyu Zhang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Graphical User Interface, Graphical User, requires mapping natural, mapping natural language, natural language instructions
备注:
点击查看摘要
Abstract:Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle with achieving precise localization. While scaling sampling attempts (Pass@k) reveals potential gains, static self-consistency strategies derived from geometric clustering often yield limited improvements, as the model's predictions tend to be spatially dispersed. In this paper, we propose replacing static consistency strategies with a learnable selection mechanism that selects the optimal target by critiquing its own proposals rendered on the screenshot. Given the significant disparity between the model's grounding and critiquing capabilities, we propose a co-evolving Propose-then-Critic framework. To jointly optimize these, we introduce a maturity-aware adaptive co-evolutionary reinforcement learning paradigm. This approach dynamically balances the training objectives of proposer and critic, where the diversity of the proposer's outputs enhances critic robustness, while the critic's maturing discrimination capability conversely unlocks the proposer's potential for extensive spatial exploration, fostering the mutual reinforcement and co-evolution of both capabilities, thereby ensuring generalizability to adapt to diverse and complex interface layouts. Extensive experiments over 6 benchmarks show that our method significantly enhances both grounding accuracy and critic reliability.
75. 【2604.21227】UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection
链接:https://arxiv.org/abs/2604.21227
作者:Yuze Li,Zhilei Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Facial action unit, AU-specific uncertainties arising, Facial action, detection remains challenging, action unit
备注: Accepted by ICMR 2026
点击查看摘要
Abstract:Facial action unit (AU) detection remains challenging because it involves heterogeneous, AU-specific uncertainties arising at both the representation and decision stages. Recent methods have improved discriminative feature learning, but they often treat the AU representations as deterministic, overlooking uncertainty caused by visual noise, subject-dependent appearance variations, and ambiguous inter-AU relationships, all of which can substantially degrade robustness. Meanwhile, conventional point-estimation classifiers often provide poorly calibrated confidence, producing overconfident predictions, especially under the severe label imbalance typical of AU datasets. We propose UAU-Net, an Uncertainty-aware AU detection framework that explicitly models uncertainty at both stages. At the representation stage, we introduce CV-AFE, a conditional VAE (CVAE)-based AU feature extraction module that learns probabilistic AU representations by jointly estimating feature means and variances across multiple spatio-temporal scales; conditioning on AU labels further enables CV-AFE to capture uncertainty associated with inter-AU dependencies. At the decision stage, we design AB-ENN, an Asymmetric Beta Evidential Neural Network for multi-label AU detection, which parameterizes predictive uncertainty with Beta distributions and mitigates overconfidence via an asymmetric loss tailored to highly imbalanced binary labels. Extensive experiments on BP4D and DISFA show that UAU-Net achieves strong AU detection performance, and further analyses indicate that modeling uncertainty in both representation learning and evidential prediction improves robustness and reliability.
76. 【2604.21011】Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition
链接:https://arxiv.org/abs/2604.21011
作者:Naga VS Raviteja Chappa,Evangelos Sariyanidi,Lisa Yankowitz,Gokul Nair,Casey J. Zampella,Robert T. Schultz,Birkan Tunç
类目:Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
关键词:localized movements lasting, localized movements, tapping fingers, movements lasting, scratching one head
备注: Accepted to International Conference on Automatic Face and Gesture Recognition (FG)
点击查看摘要
Abstract:Micro-actions are subtle, localized movements lasting 1-3 seconds such as scratching one's head or tapping fingers. Such subtle actions are essential for social communication, ubiquitously used in natural interactions, and thus critical for fine-grained video understanding, yet remain poorly understood by current computer vision systems. We identify a fundamental challenge: micro-actions exhibit diverse spatio-temporal characteristics where some are defined by spatial configurations while others manifest through temporal dynamics. Existing methods that commit to a single spatio-temporal decomposition cannot accommodate this diversity. We propose a dual-path network that processes anatomically-grounded spatial entities through parallel Spatial-Temporal (ST) and Temporal-Spatial (TS) pathways. The ST path captures spatial configurations before modeling temporal dynamics, while the TS path inverts this order to prioritize temporal dynamics. Rather than fixed fusion, we introduce entity-level adaptive routing where each body part learns its optimal processing preference, complemented by Mutual Action Consistency (MAC) loss that enforces cross-path coherence. Extensive experiments demonstrate competitive performance on MA-52 dataset and state-of-the-art results on iMiGUE dataset. Our work reveals that architectural adaptation to the inherent complexity of micro-actions is essential for advancing fine-grained video understanding.
77. 【2604.21008】Linear Image Generation by Synthesizing Exposure Brackets
链接:https://arxiv.org/abs/2604.21008
作者:Yuekun Dai,Zhoutong Zhang,Shangchen Zhou,Nanxuan Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:dynamic range, image signal processing, pipeline to produce, sophisticated image signal, photo begins
备注: accepted by CVPR2026
点击查看摘要
Abstract:The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. In this paper, we address the task of text-to-linear-image generation: synthesizing a high-quality, scene-referred linear image that preserves full dynamic range, conditioned on a text prompt, for professional post-processing. Generating linear images is challenging, as pre-trained VAEs in latent diffusion models struggle to simultaneously preserve extreme highlights and shadows due to the higher dynamic range and bit depth. To this end, we represent a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. We further demonstrate downstream applications including text-guided linear image editing and structure-conditioned generation via ControlNet.
78. 【2604.20983】hinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
链接:https://arxiv.org/abs/2604.20983
作者:Syed Nazmus Sakib,Nafiul Haque,Shahrear Bin Amin,Hasan Muhammad Abdullah,Md. Mehedi Hasan,Mohammad Zabed Hossain,Shifat E. Arman
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Vision evaluations, Vision, multi-step processes, visual, Multimodal Large Language
备注: Accepted at ACL 2026 Findings
点击查看摘要
Abstract:Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.
79. 【2604.20936】AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
链接:https://arxiv.org/abs/2604.20936
作者:Adam Cole,Mick Grierson
类目:Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:Video Diffusion Transformers, Video Diffusion, Diffusion Transformers, black-box video generation, internal mechanics
备注: To appear in the Proceedings of the 2026 ACM Creativity and Cognition (CC '26). 15 pages, 19 figures
点击查看摘要
Abstract:We present AttentionBender, a tool that manipulates cross-attention in Video Diffusion Transformers to help artists probe the internal mechanics of black-box video generation. While generative outputs are increasingly realistic, prompt-only control limits artists' ability to build intuition for the model's material process or to work beyond its default tendencies. Using an autobiographical research-through-design approach, we built on Network Bending to design AttentionBender, which applies 2D transforms (rotation, scaling, translation, etc.) to cross-attention maps to modulate generation. We assess AttentionBender by visualizing 4,500+ video generations across prompts, operations, and layer targets. Our results suggest that cross-attention is highly entangled: targeted manipulations often resist clean, localized control, producing distributed distortions and glitch aesthetics over linear edits. AttentionBender contributes a tool that functions both as an Explainable AI style probe of transformer attention mechanisms, and as a creative technique for producing novel aesthetics beyond the model's learned representational space.

