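This post presents the latest papers retrieved daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

423 papers were updated today, including:

Natural language processing: 62 papers
Information retrieval: 11 papers
Computer vision: 74 papers

Natural Language Processing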

1. 【2604.22750】How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

Link: https://arxiv.org/abs/2604.22750

Authors: Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)

Keywords: complex human workflows, driving rapid growth, token, token usage, wide adoption

Comments:

Abstract:The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.

2. 【2604.22749】Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities

Link: https://arxiv.org/abs/2604.22749

Authors: Ilana Nguyen, Harini Suresh, Thema Monroe-White, Evan Shieh

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, including simulated interviews, text generation tasks, language models

Comments: FAccT '26, June 25-28, 2026, Montreal, QC, Canada

Abstract:Large language models (LLMs) are increasingly used for text generation tasks from everyday use to high-stakes enterprise and government applications, including simulated interviews with asylum seekers. While many works highlight the new potential applications of LLMs, there are risks of LLMs encoding and perpetuating harmful biases about non-dominant communities across the globe. To better evaluate and mitigate such harms, more research examining how LLMs portray diverse individuals is needed. In this work, we study how national origin identities are portrayed by widely-adopted LLMs in response to open-ended narrative generation prompts. Our findings demonstrate the presence of persistent representational harms by national origin, including harmful stereotypes, erasure, and one-dimensional portrayals of Global Majority identities. Minoritized national identities are simultaneously underrepresented in power-neutral stories and overrepresented in subordinated character portrayals, which are over fifty times more likely to appear than dominant portrayals. The degree of harm is amplified when US nationality cues (e.g., "American") are present in input prompts. Notably, we find that the harms we identify cannot be explained away via sycophancy, as US-centric biases persist even when replacing US nationality cues with non-US national identities in the prompts. Based on our findings, we call for further exploration of cultural harms in LLMs through methodologies that center Global Majority perspectives and challenge the uncritical adoption of US-based LLMs for the classification, surveillance, and misrepresentation of the majority of our planet.

3. 【2604.22730】Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data

Link: https://arxiv.org/abs/2604.22730

Authors: Hillary Mutisya, John Mugane

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: modern morphological data, Bantu Lexical Reconstructions, Southern Bantu languages, neural models trained, models trained exclusively

Comments:

Abstract:We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer over Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across 5+ languages. Evaluating these candidates against established historical resources, namely the Bantu Lexical Reconstructions database (BLR3; 4,786 reconstructed Proto-Bantu forms) and the ASJP basic vocabulary, we confirm that 10 of the top 11 noun candidates (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU 'person' (8 languages), *gombe 'cow' (9 languages), and *mUn (9 languages). Extending to verbs, 12 verb cognates align with reconstructed Proto-Bantu roots, including *-bon- 'see' and *-jIm- 'stand', each attested across wide geographic ranges. Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis reveals that all 13 productive classes maintain 0.83 cosine similarity across languages (within-class > between-class, p < 10^-9). Our dataset is restricted to Eastern and Southern Bantu, so we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu rather than definitively distinguishing Proto-Bantu retentions from later regional innovations.

4. 【2604.22723】Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

Link: https://arxiv.org/abs/2604.22723

Authors: Hillary Mutisya, John Mugane

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: discovering morphological features, combining cross-lingual transfer, low-resource Bantu languages, present a method, method for discovering

Comments:

Abstract:We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (vowel coalescence - the merger of two adjacent vowels - of wa-, 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap) while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.

5. 【2604.22709】Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

Link: https://arxiv.org/abs/2604.22709

Authors: Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

Categories: Computation and Language (cs.CL)

Keywords: complex reasoning tasks, proven effective, effective on complex, reasoning, Abstract

Comments:

Abstract:While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose Abstract Chain-of-Thought, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen "abstract" tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to 11.6x fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.

6. 【2604.22693】CRAFT: Clustered Regression for Adaptive Filtering of Training data

Link: https://arxiv.org/abs/2604.22693

Authors: Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: making full fine-tuning, full fine-tuning expensive, high-quality subset, making full, large corpus

Comments:

Abstract:Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.
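
The first selection stage in the abstract above, proportional budget allocation across clusters, can be illustrated with a minimal Python sketch. The function name `proportional_budget`, the toy cluster IDs, and the largest-remainder rounding are assumptions made for illustration; the abstract does not say how the authors round fractional shares.

```python
from collections import Counter

def proportional_budget(val_cluster_ids, budget):
    """Split `budget` across clusters in proportion to how often each
    cluster appears among the validation examples (hypothetical helper)."""
    counts = Counter(val_cluster_ids)
    total = sum(counts.values())
    # Exact (possibly fractional) share of the budget for each cluster.
    exact = {c: budget * n / total for c, n in counts.items()}
    # Start from the floor of each share.
    alloc = {c: int(x) for c, x in exact.items()}
    leftover = budget - sum(alloc.values())
    # Assumption: hand the remaining slots to the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc

# Toy validation set: 3 examples in cluster 0, 2 in cluster 1, 1 in cluster 2.
allocation = proportional_budget([0, 0, 0, 1, 1, 2], budget=10)
```

The per-cluster counts always sum to the requested budget, so the selected set mirrors the validation cluster proportions as closely as integer counts allow.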

7. 【2604.22678】BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering

Link: https://arxiv.org/abs/2604.22678

Authors: Jinghong Chen, Jingbiao Mei, Guangyu Yang, Bill Byrne

Categories: Computation and Language (cs.CL)

Keywords: visual question answering, question answering, generate an answer, visual question, Bayesian Ensemble

Comments:

Abstract:A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate an answer. While simple, this strategy can obscure the contribution of individual documents, making attribution difficult and contributing to the "lost-in-the-middle" effect, where relevant information in long contexts is overlooked. Concatenation also scales poorly: computational cost grows quadratically with context length, a problem that becomes especially severe when the context includes visual data, as in visual question answering. Attempts to mitigate these issues by limiting context length can further restrict performance by preventing models from benefiting from the improved recall offered by deeper retrieval. We propose Bayesian Ensemble Retrieval-Augmented Generation (BERAG), along with Bayesian Ensemble Fine-Tuning (BEFT), as a RAG framework in which language models are conditioned on individual retrieved documents rather than a single combined context. BERAG treats document posterior probabilities as ensemble weights and updates them token by token using Bayes' rule during generation. This approach enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contribution, making it well-suited for large document collections. We evaluate BERAG and BEFT primarily on knowledge-based visual question answering tasks, where models must reason over long, imperfect retrieval lists. The results show substantial improvements over standard RAG, including strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. We also demonstrate that BERAG mitigates the "lost-in-the-middle" effect. The document posterior can be used to detect insufficient grounding and trigger deflection, while document pruning enables faster decoding than standard RAG.
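
The token-by-token Bayesian update described in the abstract above can be sketched in a few lines of Python. This is a generic illustration of Bayes' rule over documents, not the authors' code: the function names, the dictionary representation of next-token distributions, and the toy numbers are all assumptions.

```python
def ensemble_next_token(doc_posterior, token_probs_per_doc):
    """Mix per-document next-token distributions, weighting each
    document by its current posterior probability."""
    vocab = token_probs_per_doc[0].keys()
    return {
        tok: sum(w * probs[tok] for w, probs in zip(doc_posterior, token_probs_per_doc))
        for tok in vocab
    }

def update_posterior(doc_posterior, token_probs_per_doc, token):
    """Bayes' rule: new weight is proportional to the old weight times the
    probability that document assigned to the emitted token."""
    unnormalized = [w * probs[token] for w, probs in zip(doc_posterior, token_probs_per_doc)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

# Toy example: two retrieved documents, a two-token vocabulary.
posterior = [0.5, 0.5]
per_doc = [{"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9}]
mixed = ensemble_next_token(posterior, per_doc)
posterior_after_a = update_posterior(posterior, per_doc, "a")
```

After emitting token "a", the weight shifts toward the first document, which is the kind of signal that could support re-ranking and attribution.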

8. 【2604.22661】Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines

Link: https://arxiv.org/abs/2604.22661

Authors: Negar Arabzadeh, Andrew Drozdov, Michael Bendersky, Matei Zaharia

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, multiple semantically equivalent, semantically equivalent query

Comments:

Abstract:Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination - selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.

9. 【2604.22631】Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models

Link: https://arxiv.org/abs/2604.22631

Authors: Felix Herron, Solange Rossato, Alexandre Allauzen, François Portet

Categories: Computation and Language (cs.CL)

Keywords: Modern automatic speech, Modern automatic, automatic speech recognition, ASR, observed to function

Comments:

Abstract:Modern automatic speech recognition (ASR) systems have been observed to function better for certain speaker groups (SGs) than others, despite recent gains in overall performance. One potential impediment to progress towards fairer ASR is a more nuanced understanding of the types of modeling errors that speech encoder models make, and in particular the difference between the structure of embeddings for high-performance and low-performance SGs. This paper proposes a framework typifying two types of error that can occur in modeling phonemes in ASR systems: random error/high variance in phoneme embedding, vs systematic error/embedding bias. We find that training phoneme classification probes only on a single, typically disadvantaged SG, sometimes improves performance for that SG, which is evidence for the existence of SG-level bias in phoneme embeddings. On the other hand, we find that speakers and SGs with higher levels of phoneme variance are the same as those with worse phoneme prediction accuracy. We conclude that both types of error are present in phoneme embeddings and both are candidate causes for SG-level unfairness in ASR, though random error is likely a greater hindrance to fairness than systematic error. Furthermore, we find that finetuning encoder models using a fairness-enhancing algorithm (domain enhancing and adversarial training) changes neither the benefits of in-domain phoneme classification probe training, nor measured levels of random embedding error.

10. 【2604.22626】From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia

Link: https://arxiv.org/abs/2604.22626

Authors: Angelo Maria Sabatini

Categories: Computation and Language (cs.CL)

Keywords: Dante Divina Commedia, Dante Divina, Divina Commedia

Comments: 25 pages, 8 figures, 1 supplementary material; submitted to Digital Scholarship in the Humanities

Abstract:This study investigates the structural organisation of Dante's Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory, capturing the balance between persistence and alternation patterns. Across the poem, this index exhibits a slight but consistent increase from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram-level analysis shows that this trend is driven by a restricted set of recurrent configurations, interpreted as graphemic probes linking the Markov representation to identifiable lexical environments in the text. These probes display distinct behaviours: configurations involving two transitions more frequently emerge across word boundaries, reflecting interactions between adjacent tokens, whereas configurations with fewer transitions are largely confined to intra-lexical structures. Part of the signal is further shaped by orthographic phenomena, particularly apostrophised forms, highlighting the role of writing conventions alongside phonological and lexical organisation. A complementary classification analysis identifies cantica-specific terms, providing lexical anchors through which graphemic probes can be related to the structure of the poem. This organisation is reflected not only in the separation of the three cantiche, but also in a continuous trajectory across the text. Overall, the results show that simple probabilistic models applied to symbolic text representations can uncover structured interactions between local dependencies, lexical distribution, orthographic encoding, and large-scale organisation, providing an interpretable framework for linking local symbolic dynamics to higher-level textual organisation.
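
The vowel-consonant encoding and transition counting mentioned in the abstract above can be sketched as follows. The abstract does not define the four states, so this sketch assumes they are the overlapping symbol pairs VV, VC, CV, and CC; the vowel set, the choice to drop spaces and punctuation, and both function names are likewise illustrative assumptions.

```python
from collections import Counter

# Assumption: accented Italian vowels are treated as vowels.
VOWELS = set("aeiouàèéìíòóùú")

def vc_encode(text):
    """Map each letter to 'V' or 'C'; spaces, digits, and punctuation are dropped."""
    return "".join("V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha())

def pair_transition_counts(symbols):
    """Count transitions between overlapping two-symbol states (VV, VC, CV, CC)."""
    states = [symbols[i:i + 2] for i in range(len(symbols) - 1)]
    return Counter(zip(states, states[1:]))

encoded = vc_encode("Nel mezzo")
counts = pair_transition_counts(encoded)
```

Normalizing each row of these counts would give an estimated transition matrix, from which a persistence-versus-alternation index of the kind the abstract describes could be derived.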

11. 【2604.22606】Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube

Link: https://arxiv.org/abs/2604.22606

Authors: Sheza Munir, Ratna Kandala, Anamta Khan, Deepti, Joyojeet Pal

Categories: Computation and Language (cs.CL)

Keywords: cultural traditions intersect, scientific-sounding claims, Health misinformation remains, pressing challenges, traditions intersect

Comments:

Abstract:Health misinformation remains one of the most pressing challenges on social media, particularly when cultural traditions intersect with scientific-sounding claims. These dynamics are not only global but also deeply local, manifesting in culturally specific controversies that require careful analysis. Motivated by this, we examine 100 YouTube transcripts that promote or debunk cow urine (gomutra) as a health remedy, focusing on rhetorical strategies such as appeals to authority, efficacy appeals, and conspiracy framing. We employ large language models (LLMs) including GPT-4, GPT-4o, GPT-4.1, GPT-5, Gemini 2.5 Pro, and Mistral Medium 3 to annotate transcripts using a 14-category taxonomy of persuasive tactics. Our analysis reveals that promoters predominantly rely on efficacy appeals and social proof, while debunkers emphasize authority and rebuttal. Human evaluation of a subset of annotations yielded 90.1% inter-annotator agreement, confirming the reliability of our taxonomy and validation process. This work advances computational methods for misinformation analysis and demonstrates how LLMs can support large-scale studies of cultural discourse online.

12. 【2604.22577】QuantClaw: Precision Where It Matters for OpenClaw

Link: https://arxiv.org/abs/2604.22577

Authors: Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: introduce significant efficiency, significant efficiency challenges, efficiency challenges due, OpenClaw introduce significant, Autonomous agent systems

Comments: Blog: https://sparkengineai.github.io/QuantClaw

Abstract:Autonomous agent systems such as OpenClaw introduce significant efficiency challenges due to long-context inputs and multi-turn reasoning. This results in prohibitively high computational and monetary costs in real-world development. While quantization is a standard approach for reducing cost and latency, its impact on agent performance in realistic scenarios remains unclear. In this work, we analyze quantization sensitivity across diverse complex workflows over OpenClaw, and show that precision requirements are highly task-dependent. Based on this observation, we propose QuantClaw, a plug-and-play precision routing plugin that dynamically assigns precision according to task characteristics. QuantClaw routes lightweight tasks to lower-cost configurations while preserving higher precision for demanding workloads, saving cost and accelerating inference without increasing user complexity. Experiments show that our QuantClaw maintains or improves task performance while reducing both latency and computational cost. Across a range of agent tasks, it achieves up to 21.4% cost savings and 15.7% latency reduction on GLM-5 (FP8 baseline). These results highlight the benefit of treating precision as a dynamic resource in agent systems.

13. 【2604.22565】Learning Evidence Highlighting for Frozen LLMs

Link: https://arxiv.org/abs/2604.22565

Authors: Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, miss decisive evidence, buried in long

Comments:

Abstract:Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.

14. 【2604.22555】Using Embedding Models to Improve Probabilistic Race Prediction

Link: https://arxiv.org/abs/2604.22555

Authors: Noan Dasanaike, Kosuke Imai

Categories: Computation and Language (cs.CL)

Keywords: Estimating racial disparity, racial disparity requires, disparity requires individual-level, Estimating racial, Improved Surname Geocoding

Comments:

Abstract:Estimating racial disparity requires individual-level race data, which are often unavailable due to the sensitivity of collecting such information. To address this problem, many researchers utilize Bayesian Improved Surname Geocoding (BISG), which has critically relied on Census surname data. Unfortunately, these data capture race-surname relationships only for common surnames, omitting approximately 10% of the US population. We show that predictive performance degrades substantially for individuals with such omitted, uncommon surnames because the standard BISG implementation relies on an uninformative generic prior in these cases. To address this limitation, we propose embedding-powered BISG (eBISG), which uses pre-trained text embeddings to represent names as dense vectors and trains neural networks on 2020 Census surname and first-name data to estimate race probabilities for names not covered in the Census. We compare five approaches: standard BISG using only surnames, BIFSG incorporating first name probabilities, surname embedding for unlisted names, surname and first name embedding combining both, and a full-name embedding trained on voter file data from Southern states that captures interactions between name components. We show that each successive eBISG approach improves race prediction, with the full-name embedding yielding the largest gains, particularly for Hispanic and Asian voters whose surnames are absent from the Census list.

15. 【2604.22542】Controllable Spoken Dialogue Generation: An LLM-Driven Grading System for K-12 Non-Native English Learners

Link: https://arxiv.org/abs/2604.22542

Authors: Haidong Yuan, Haokun Zhao, Wanshi Xu, Songjun Cao, Qingyu Zhou, Long Ma, Hongjie Fan

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: non-native contexts due, Large language models, Large language, proficiency mismatch, fail to meet

Comments:

Abstract:Large language models (LLMs) often fail to meet the pedagogical needs of K-12 English learners in non-native contexts due to a proficiency mismatch. To address this widespread challenge, we introduce a proficiency-aligned framework that adapts LLM outputs to learner abilities, using China's national curriculum (CSE) as a representative case. Our framework enables precise control over lexical complexity through a four-tier grading system, supported by a comprehensive suite of new resources: graded vocabulary lists and a multi-turn dialogue corpus. Our core technical contribution is the DDPO algorithm (Diversity Driven Policy Optimization), a multi-turn GRPO-based approach designed to preserve dialogue diversity while holistically optimizing dialogue quality. This method significantly outperforms conventional approaches, achieving low out-of-vocabulary rates and high diversity while enhancing conversational naturalness and pedagogical value. While grounded in the CSE, our framework is designed for flexibility and can be readily adapted to other educational standards. Our models, data, and code will all be open-sourced, providing a scalable platform for personalized English speaking practice that effectively addresses the unique challenges faced by K-12 learners in non-immersive environments.

16. 【2604.22520】RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment

Link: https://arxiv.org/abs/2604.22520

Authors: Yingfeng Luo,Hongyu Liu,Dingyang Lin,Kaiyan Chang,Chenglong Wang,Bei Li,Quan Du,Tong Xiao,Jingbo Zhu

Categories: Computation and Language (cs.CL)

Keywords: Machine Translation, remains prohibitively expensive, Large Language Models, achieved remarkable performance, scale remains prohibitively

Comments: Accepted to ACL 2026 Industry Track

Abstract:Large Language Models (LLMs) have achieved remarkable performance in Machine Translation (MT), but deploying them at scale remains prohibitively expensive. A widely adopted remedy is the hybrid system paradigm, which balances cost and quality by serving most requests with a small model and selectively routing a fraction to a large model. However, existing routing strategies often rely on heuristics, external predictors, or absolute quality estimation, which fail to capture whether the large model actually provides a worthwhile improvement over the small one. In this paper, we formulate routing as a budget allocation problem and identify marginal gain, i.e., the large model's improvement over the small model, as the optimal signal for budgeted decisions. Building on this, we propose RouteLMT (routing for LLM-based MT), an efficient in-model router that predicts this expected gain by probing the small translator's prompt-token representation, without requiring external models or hypothesis decoding. Extensive experiments demonstrate that RouteLMT outperforms heuristic and quality/difficulty estimation baselines, achieving a superior quality-budget Pareto frontier. Furthermore, we analyze regression risks and show that a simple guarded variant can mitigate severe quality losses.
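
The budgeted routing idea in this abstract can be illustrated with a minimal Python sketch. It assumes the router has already produced a predicted marginal gain per request (large-model quality minus small-model quality). The function name `route_by_marginal_gain`, the `budget_fraction` parameter, and the optional `min_gain` guard are illustrative assumptions, not the authors' implementation or their guarded variant.

```python
from typing import List, Optional


def route_by_marginal_gain(
    predicted_gains: List[float],
    budget_fraction: float,
    min_gain: Optional[float] = None,
) -> List[int]:
    """Return the indices of requests sent to the large model.

    predicted_gains[i] is a (hypothetical) router estimate of how much the
    large model would improve on the small model for request i.
    """
    n = len(predicted_gains)
    k = int(n * budget_fraction)  # number of requests the budget allows
    # Rank requests by predicted gain, highest first.
    ranked = sorted(range(n), key=lambda i: predicted_gains[i], reverse=True)
    chosen = ranked[:k]
    if min_gain is not None:
        # Illustrative guard: skip requests whose predicted gain is too small.
        chosen = [i for i in chosen if predicted_gains[i] >= min_gain]
    return sorted(chosen)


gains = [0.1, 0.5, -0.2, 0.3]
routed = route_by_marginal_gain(gains, budget_fraction=0.5)
```

Ranking by predicted gain rather than by predicted absolute quality is the point of the sketch: a request the small model already handles well contributes little gain even if it is hard in absolute terms.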

17. 【2604.22517】Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

Link: https://arxiv.org/abs/2604.22517

Authors: Wataru Hirota,Tomoki Taniguchi,Tomoko Ohkuma,Kosuke Takahashi,Takahiro Omi,Kosuke Arima,Takuto Asakura,Chung-Chi Chen,Tatsuya Ishigaki

Categories: Computation and Language (cs.CL)

Keywords: Evaluating LLM-generated business, Evaluating LLM-generated, LLM-generated business ideas, harder to scale, scale than generating

Comments: ACL 2026 Industry Track (Oral)

Abstract:Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.

18. 【2604.22503】Measuring and Mitigating Persona Distortions from AI Writing Assistance

Link: https://arxiv.org/abs/2604.22503

Authors: Paul Röttger,Kobi Hackenburg,Hannah Rose Kirk,Christopher Summerfield

Categories: Computation and Language (cs.CL)

Keywords: Hundreds of millions, writing assistance, artificial intelligence, millions of people, people use artificial

Comments: For supplementary information, code, and data see [this https URL](https://github.com/paul-rottger/ai-distortion)

Abstract:Hundreds of millions of people use artificial intelligence (AI) for writing assistance. Here, we evaluated how AI writing assistance distorts writer personas - their perceived beliefs, personality, and identity. In three large-scale experiments, writers (N=2,939) wrote political opinion paragraphs with and without AI assistance. Separate groups of readers (N=11,091) blindly evaluated these paragraphs across 29 socially salient dimensions of reader perception, spanning political opinion, writing quality, writer personality, emotions, and demographics. AI writing assistance produced persona distortions across all dimensions: with AI, writers seemed more opinionated, competent, and positive, and their perceived demographic profile shifted towards more privileged groups. Writers objected to many of the observed distortions, yet continued to prefer AI-assisted text even when made aware of them. We successfully mitigated objectionable persona distortions at the model level by training reward models on our experimental data (10,008 paragraphs, 2,903,596 ratings) to steer AI outputs towards faithful representation of writer stance. However, this came at a cost to user acceptance, suggesting an entanglement between desirable and undesirable properties of AI writing assistance that may be difficult to resolve. Together, our findings demonstrate that persona distortions from AI writing assistance are pervasive and persistent even under realistic conditions of human oversight, which carries implications for public discourse, trust, and democratic deliberation that scale with AI adoption.

19. 【2604.22452】Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

Link: https://arxiv.org/abs/2604.22452

Authors: Xirui Li,Ming Li,Yunze Xiao,Ryan Wong,Dianqi Li,Timothy Baldwin,Tianyi Zhou

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Collective intelligence refers, Collective intelligence, group to achieve, achieve outcomes, member can accomplish

Comments:

Abstract:Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.

20. 【2604.22438】SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking

Link: https://arxiv.org/abs/2604.22438

Authors: Chenxi Gu,Xiaoning Du,John Grundy

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, promising technique, technique for tracing, tracing the authorship, authorship of content

Comments: ACL 2026 Main Conference

Abstract:Watermarking has emerged as a promising technique for tracing the authorship of content generated by large language models (LLMs). Among existing approaches, the KGW scheme is particularly attractive due to its versatility, efficiency, and effectiveness in natural language generation. However, KGW's effectiveness degrades significantly under low-entropy settings such as code generation and mathematical reasoning. A crucial step in the KGW method is random vocabulary partitioning, which enables adjustments to token selection based on specific preferences. Our study revealed that the next-token probability distribution plays a critical role in determining how much, or even whether, we can modify token selection and, consequently, the effectiveness of watermarking. We refer to this characteristic, associated with the probability distribution of each token prediction, as watermark strength. In cases of random vocabulary partitioning, the lower bound of watermark strength is dictated by the next-token probability distribution. However, we found that, by redesigning the vocabulary partitioning algorithm, we can potentially raise this lower bound. In this paper, we propose SSG (Sort-then-Split by Groups), a method that partitions the vocabulary into two logit-balanced subsets. This design lifts the lower bound of watermark strength for each token prediction, thereby improving watermark detectability. Experiments on code generation and mathematical reasoning datasets demonstrate the effectiveness of SSG.
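
The abstract does not spell out the partitioning procedure, so the following Python sketch is only one plausible reading of "sort, then split by groups": tokens are sorted by logit, cut into small consecutive groups, and each group is divided between the two subsets using a seeded shuffle. The function name `sort_then_split`, the `group_size` parameter, and the use of a plain integer seed in place of a keyed hash are all assumptions made for illustration.

```python
import random
from typing import List, Tuple


def sort_then_split(
    logits: List[float], seed: int, group_size: int = 2
) -> Tuple[List[int], List[int]]:
    """Split token ids into two subsets with similar logit profiles.

    Assumed procedure (not taken from the paper): sort ids by logit,
    walk the sorted list in groups, and divide each group between the
    "green" and "red" subsets after a seeded shuffle.
    """
    rng = random.Random(seed)  # stands in for a keyed pseudo-random source
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    green: List[int] = []
    red: List[int] = []
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        rng.shuffle(group)
        half = (len(group) + 1) // 2  # an odd leftover token goes to green
        green.extend(group[:half])
        red.extend(group[half:])
    return green, red


example_logits = [5.0, 4.9, 1.0, 0.9, -3.0, -3.1]
green_ids, red_ids = sort_then_split(example_logits, seed=0)
```

Because every group of neighbouring logits contributes to both subsets, neither subset can end up holding all of the high-probability tokens, which is the failure case of a purely random split in low-entropy settings.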

21. 【2604.22411】Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

Link: https://arxiv.org/abs/2604.22411

Authors: Alberto Messina,Stefano Scotta

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, produce divergent outputs, Thinking Machines Lab, large language, language models

Comments:

Abstract:Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of background temperature $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.

22. 【2604.22374】Selective Contrastive Learning For Gloss Free Sign Language Translation

Link: https://arxiv.org/abs/2604.22374

Authors: Changhao Lai,Rui Zhao,Xuewen Zhong,Jinsong Su,Yidong Chen

Categories: Computation and Language (cs.CL)

Keywords: converts continuous sign, Sign language translation, continuous sign videos, intrinsic modality mismatch, Sign language

Comments: Accepted to the ACL 2026 main conference

Abstract:Sign language translation (SLT) converts continuous sign videos into spoken-language text, yet it remains challenging due to the intrinsic modality mismatch between visual signs and written text, particularly in gloss-free settings. Recent SLT systems increasingly adopt CLIP-like Vision-Language pretraining (VLP) for cross-modal alignment, but the random in-batch contrast provides few, batch-dependent negatives and may mislabel semantically similar (or even identical) pairs as negatives, introducing noisy and potentially inconsistent alignment supervision. In this work, we first conduct a preliminary trajectory-based analysis that tracks negative video-text similarity over training. The results show that only a small subset of negatives exhibits the desired behavior of being consistently pushed away, while the remaining negatives display heterogeneous and often non-decreasing similarity dynamics, suggesting that random in-batch negatives are frequently uninformative for effective alignment. Inspired by this, we propose Selective Contrastive Learning for SLT (SCL-SLT) with a Pair Selection (PS) strategy. PS scores candidate negatives using similarity dynamics from reference checkpoints and constructs mini-batches via a curriculum that progressively emphasizes more challenging negatives, thereby strengthening contrastive supervision while reducing the influence of noisy or semantically invalid negatives.

23. 【2604.22367】CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language

Link: https://arxiv.org/abs/2604.22367

Authors: Rui Zhao,Xuewen Zhong,Xiaoyun Zheng,Jinsong Su,Yidong Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: achieved significant progress, significant progress due, Sign language, Sign language research, National Sign Language

Comments: Accepted to the ACL 2026 main conference

Abstract:Sign language research has achieved significant progress due to the advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual-alphabet. Using CNSL-bench, we extensively evaluate 21 open-source and proprietary up-to-date MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.

24. 【2604.22345】Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization

Link: https://arxiv.org/abs/2604.22345

Authors: Weixu Zhang,Ye Yuan,Changjiang Han,Yuxing Tian,Zipeng Sun,Linfeng Du,Jikun Kang,Hong Kang,Xue Liu,Haolun Wu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, exhibit strong implicit, existing approaches treat, implicit personalization ability

Comments: Accepted at ACL 2026

Abstract:Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads, attention heads that encode user specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.
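
The decoding-time contrast described here can be sketched in a few lines of Python. The abstract does not give the exact formula, so the update below uses the common contrastive form "full logits plus a scaled difference", with `alpha` as a hypothetical strength parameter. The two input lists stand for next-token logits from a forward pass with the Preference Heads active and from one with them masked.

```python
from typing import List


def differential_steering(
    logits_with_heads: List[float],
    logits_heads_masked: List[float],
    alpha: float,
) -> List[float]:
    """Amplify the gap between personalized and generic logits.

    Assumed update rule (illustrative, not taken from the paper):
        steered = personalized + alpha * (personalized - generic)
    """
    return [
        p + alpha * (p - g)
        for p, g in zip(logits_with_heads, logits_heads_masked)
    ]


personalized = [2.0, 1.0, 0.0]
generic = [1.0, 1.0, 0.5]
steered = differential_steering(personalized, generic, alpha=0.5)
```

Tokens that the Preference Heads favour move up, tokens they disfavour move down, and tokens they do not affect keep their original logit.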

25. 【2604.22335】Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding

Link: https://arxiv.org/abs/2604.22335

Authors: Weixu Zhang,Fanghua Ye,Qiang Gao,Jian Li,Haolun Wu,Yuxing Tian,Sijing Duan,Nan Du,Xiaolong Li,Xue Liu

Categories: Computation and Language (cs.CL)

Keywords: Large language models, overlooks information provided, Large language, language models, produce content

Comments: Accepted at ACL 2026

Abstract:Large language models (LLMs) often produce content that contradicts or overlooks information provided in the input context, a phenomenon known as faithfulness hallucination. In this paper, we propose Context-Fidelity Boosting (CFB), a lightweight and general decoding-time framework that reduces such hallucinations by increasing the generation probability of source-supported tokens. Motivated by logit-shaping principles from watermarking techniques, CFB applies additive token-level logit adjustments based on a token's degree of support from the input context. Specifically, we develop three boosting strategies: static boosting, which applies a fixed bias to source-supported tokens; context-aware boosting, which scales this bias using the divergence between next-token distributions with and without context; and token-aware boosting, which further redistributes the adaptive bias according to local relevance estimated from source-position attention and source-scoped semantic similarity. CFB requires no retraining or architectural changes, making it compatible with a wide range of LLMs. Experiments on summarization and question answering tasks across multiple open-source LLMs show that CFB consistently improves faithfulness metrics with minimal generation overhead. Our implementation is fully open-sourced.
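
Of the three strategies, static boosting is simple enough to sketch directly. The Python below adds a fixed bias to the logits of source-supported tokens and shows the effect on the softmax distribution. Treating "source-supported" as a precomputed set of token ids, and the names `static_boost` and `delta`, are assumptions for illustration. The context-aware and token-aware variants are not modeled.

```python
import math
from typing import List, Set


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def static_boost(
    logits: List[float], supported_ids: Set[int], delta: float
) -> List[float]:
    """Add a fixed bias `delta` to every source-supported token's logit."""
    return [
        x + delta if i in supported_ids else x
        for i, x in enumerate(logits)
    ]


base_logits = [1.0, 1.0, 1.0]
boosted_logits = static_boost(base_logits, supported_ids={2}, delta=2.0)
base_probs = softmax(base_logits)
boosted_probs = softmax(boosted_logits)
```

The additive bias is the same mechanism that green-list watermarking uses, except that the favoured set is chosen from the input context instead of from a keyed random partition.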

26. 【2604.22325】Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks

Link: https://arxiv.org/abs/2604.22325

Authors: Fahmida Alam,Ellen Riloff

Categories: Computation and Language (cs.CL)

Keywords: Existing Natural Language, Natural Language Processing, Existing Natural, provide limited coverage, task-specific information required

Comments:

Abstract:Existing Natural Language Processing (NLP) resources often lack the task-specific information required for real-world problems and provide limited coverage of lesser-known or newly introduced entities. For example, business organizations and health care providers may need to be classified into a variety of different taxonomic schemes for specific application tasks. Our goal is to enable domain experts to easily create a task-specific classifier for entities by providing only entity names and gold labels as training data. Our framework then dynamically acquires descriptive text about each entity, which is subsequently used as the basis for producing a text-based classifier. We propose a novel text acquisition method that leverages both web and large language models (LLMs). We evaluate our proposed framework on two classification problems in distinct domains: (i) classifying organizations into Standard Industrial Classification (SIC) Codes, which categorize organizations based on their business activities; and (ii) classifying healthcare providers into healthcare provider taxonomy codes, which represent a provider's medical specialty and area of practice. Our best-performing model achieved macro-averaged F1-scores of 82.3% and 72.9% on the SIC code and healthcare taxonomy code classification tasks, respectively.

27. 【2604.22313】CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems

Link: https://arxiv.org/abs/2604.22313

Authors: Tabinda Sarwar,Farhad Moghimifar,Cong Duy Vu Hoang,Xiaoxiao Ma,Shawn Chang Xu,Fahimeh Saleh,Poorya Zaremoodi,Avirup Sil,Katrin Kirchhoff

Categories: Computation and Language (cs.CL)

Keywords: incomplete user clarification, interactive scenarios, scenarios with incomplete, user clarification, unanswerable queries

Comments: Accepted at ACL 2026 (Industry Track)

Abstract:NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes. We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.

28. 【2604.22294】Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

Link: https://arxiv.org/abs/2604.22294

Authors: Harshit Joshi,Priyank Shethia,Jadelynn Dao,Monica S. Lam

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Real-world document question, Real-world document, Real-world, SLIDERS, document

Comments: 49 pages (14 main), preprint

Abstract:Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
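
The core idea of answering from structured state with SQL, rather than from concatenated chunk outputs, can be sketched with Python's built-in `sqlite3` module. The `facts` table schema, the helper `answer_with_sql`, and the toy records are all hypothetical. The sketch does not model SLIDERS' extraction or reconciliation stages, and the `MAX` aggregate is only a crude stand-in for handling a duplicated record.

```python
import sqlite3
from typing import List, Tuple


def answer_with_sql(
    records: List[Tuple[str, str, str, float]], query: str
) -> List[tuple]:
    """Load chunk-level extractions into a table and answer with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE facts (doc_id TEXT, entity TEXT, metric TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", records)
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


# Toy extractions: the same fact appears in two documents.
toy_records = [
    ("doc1", "AcmeCo", "revenue", 10.0),
    ("doc2", "AcmeCo", "revenue", 10.0),
    ("doc3", "BetaInc", "revenue", 4.0),
]
toy_query = (
    "SELECT entity, MAX(value) FROM facts "
    "WHERE metric = 'revenue' GROUP BY entity ORDER BY entity"
)
toy_rows = answer_with_sql(toy_records, toy_query)
```

The cost of the final aggregation depends on the database engine rather than on how much extracted text fits in a context window, which is the scaling argument the abstract makes.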

29. 【2604.22292】ReLeVAnT: Relevance Lexical Vectors for Accurate Legal Text Classification

Link: https://arxiv.org/abs/2604.22292

Authors: Ishaan Gakhar,Harsh Nandwani

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: crucial applications, applications in downstream, unstructured data corpus, unstructured data, downstream tasks

Comments: 9 pages, 2 figures

Abstract:The classification of legal documents from an unstructured data corpus has several crucial applications in downstream tasks. Documents relevant to court filings are key in use cases such as drafting motions, memos, and outlines, as well as in tasks like docket summarisation, retrieval systems, and training data curation. Current methods classify based on provided metadata, LLM-extracted metadata, or multimodal methods. These methods depend on structured data, metadata, and extensive computational power. This task is approached from a perspective of leveraging discriminative features in the documents between classes. The authors propose ReLeVAnT, a framework for legal document binary classification. ReLeVAnT utilises n-gram processing, contrastive score matching, and a shallow neural network as the primary drivers for discriminative classification. It leverages one-time keyword extraction per corpus, followed by a shallow classifier to swiftly and reliably classify documents with 99.3% accuracy and 98.7% F1 score on the LexGLUE dataset.

30. 【2604.22282】STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.22282

Authors: Peng Yu,En Xu,Bin Chen,Haibiao Chen,Yinfei Xu

Categories: Computation and Language (cs.CL)

Keywords: Graph-based Question Answering, Knowledge Graph-based Question, Question Answering, Knowledge Graph-based, Graph-based Question

Comments: 34 pages, 16 figures, accepted to ACL 2026 (Main Conference)

Abstract:Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To more effectively integrate global structural information during the graph construction process, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves state-of-the-art performance on multiple multi-hop benchmarks.

31. 【2604.22266】Large Language Models Decide Early and Explain Later

Link: https://arxiv.org/abs/2604.22266

Authors: Ayan Datta,Zhixue Zhao,Bhuvanesh Verma,Radhika Mamidi,Mounika Marreddy,Alexander Mehler

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Language Models, generating long intermediate, Large Language, achieve strong performance

Comments:

Abstract:Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model's final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model's intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
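
A stability-based early-stopping rule of the kind described can be sketched as follows. The Python below assumes the forced-answer predictions at successive reasoning prefixes are already available as a list, and stops once the same answer has been seen for `patience` consecutive probes. The function name `first_stable_step` and the `patience` parameter are illustrative assumptions. The paper's probe-based heuristic is not reproduced here.

```python
from typing import List, Optional


def first_stable_step(
    intermediate_answers: List[str], patience: int = 3
) -> Optional[int]:
    """Return the probe index at which generation could be halted.

    Halts when the current answer has repeated for `patience` consecutive
    probes; returns None if the answer never stabilizes.
    """
    streak = 0
    previous: Optional[str] = None
    for step, answer in enumerate(intermediate_answers):
        # Extend the streak if the answer is unchanged, otherwise restart it.
        streak = streak + 1 if answer == previous else 1
        previous = answer
        if streak >= patience:
            return step
    return None


probe_answers = ["A", "B", "B", "B", "C"]
halt_at = first_stable_step(probe_answers, patience=3)
```

The example trace also shows the trade-off: halting at the third consecutive "B" would miss the later switch to "C", the kind of case that can cost accuracy when reasoning tokens are saved this way.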

32. 【2604.22261】Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion

Link: https://arxiv.org/abs/2604.22261

Authors: Fahmida Alam,Mihai Surdeanu,Ellen Riloff

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, sparsely represented, required information, information is rare

Comments:

Abstract:Large language models (LLMs) struggle with relation completion (RC), both with and without retrieval-augmented generation (RAG), particularly when the required information is rare or sparsely represented. To address this, we propose a novel multi-stage paraphrase-guided relation-completion framework, RC-RAG, that systematically incorporates relation paraphrases across multiple stages. In particular, RC-RAG: (a) integrates paraphrases into retrieval to expand lexical coverage of the relation, (b) uses paraphrases to generate relation-aware summaries, and (c) leverages paraphrases during generation to guide reasoning for relation completion. Importantly, our method does not require any model fine-tuning. Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM augmented with RC-RAG improves by 40.6 Exact Match (EM) points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 EM points, respectively, while maintaining low computational overhead.

33. 【2604.22239】Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA

Link: https://arxiv.org/abs/2604.22239

Authors: Zhanli Li,Yixuan Cao,Lvzhou Luo,Ping Luo

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: semi-structured document collections, analytical question answering, answering over large, paper introduces, introduces the task

Comments: Findings of ACL 2026. The camera-ready version corrects some labeling errors. The accompanying repository is continuously updated based on community feedback; for the most up-to-date implementation and results, please refer to the repository

Abstract:This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems. MuDABench is available at this https URL.

34. 【2604.22237】 Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis

Link: https://arxiv.org/abs/2604.22237

Authors: Zhilin Fan, Deliang Wang, Penghe Chen, Yu Lu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Diagnosing student problem, synthesize multifaceted information, plan intervention strategies, student problem behaviors, problem behaviors requires

Comments: This paper has been accepted in AIED2026

Abstract:Diagnosing student problem behaviors requires teachers to synthesize multifaceted information, identify behavioral categories, and plan intervention strategies. Although fine-tuned large language models (LLMs) can support this process through multi-turn dialogue, they rarely explain why a strategy is recommended, limiting transparency and teachers' trust. To address this issue, we present an explainable dialogue system built on a fine-tuned LLM. The system uses a hierarchical attribution method based on explainable AI (xAI) to identify dialogue evidence for each recommendation and generate a natural-language explanation based on that evidence. In technical evaluation, the method outperformed baseline approaches in identifying supporting evidence. In a preliminary user study with 22 pre-service teachers, participants who received explanations reported higher trust in the system. These findings suggest a promising direction for improving LLM explainability in educational dialogue systems.

35. 【2604.22225】TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

Link: https://arxiv.org/abs/2604.22225

Authors: Xi Wang, Jie Wang, Xingchen Song, Baijun Song, Jingran Xie, Jiahe Shao, Zijian Lin, Di Wu, Meng Meng, Jian Luan, Zhiyong Wu

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: approach human-level quality, monolithic metrics fail, explain perceptual collapse, diagnose fine-grained acoustic, fine-grained acoustic artifacts

Comments: Submitted to Interspeech 2026

Abstract:While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at this https URL.

36. 【2604.22215】Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen

Link: https://arxiv.org/abs/2604.22215

Authors: Jon-Paul Cacioli

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: estimates from LLMs, extract uncertainty estimates, elicitation, confidence, confirmed

Comments: 10 pages, 3 figures, 4 tables, 1 appendix. Pre-registered: [this http URL](http://osf.io/azbvx) . Code and data: [this http URL](http://github.com/synthiumjp/koriat)

Abstract:Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimination under minimal numeric elicitation with greedy decoding. In a pre-registered study (OSF: this http URL), 524 TriviaQA items were administered under numeric (0-100) and categorical (10-class) elicitation to eight models at Q5_K_M quantisation on consumer hardware, yielding 8,384 deterministic trials. A psychometric validity screen was applied to each model-format cell. All seven instruct models were classified Invalid on numeric confidence (H2 confirmed, 7/7 vs. predicted =4/7), with a mean ceiling rate of 91.7% (H1 confirmed). Categorical elicitation did not rescue validity. Instead, it disrupted task performance in six of seven models, producing accuracy below 5% (H4 not confirmed). Token-level logprobability did not usefully predict verbalised confidence under the observed variance regime (H5 confirmed, mean cross-validated R^2 0.01). Within the reasoning-distilled model, reasoning-trace length showed a strong negative partial correlation with confidence (rho = -0.36, p < .001), consistent with the Reasoning Contamination Effect. These results do not imply that internal uncertainty representations are absent. They show that minimal verbal elicitation fails to preserve internal signals at the output interface in this model-size regime. Psychometric screening should precede any downstream use of such signals.
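
As a reading aid for the numbers in this abstract, the sketch below checks the trial count (524 items, 2 elicitation formats, 8 models) and shows one plausible way to compute a ceiling rate. The definition used here (share of responses at the top of the 0-100 scale) is an assumption for illustration, not the paper's stated metric, and the function names and demo values are hypothetical.

```python
def trial_count(n_items: int, n_formats: int, n_models: int) -> int:
    """Total trials when every item is run under every format on every model."""
    return n_items * n_formats * n_models


def ceiling_rate(confidences: list[float], ceiling: float = 100.0) -> float:
    """Fraction of confidence reports at or above the scale maximum.

    Assumed definition: the paper may define its ceiling rate differently.
    """
    if not confidences:
        raise ValueError("need at least one confidence value")
    at_ceiling = sum(1 for c in confidences if c >= ceiling)
    return at_ceiling / len(confidences)


if __name__ == "__main__":
    # 524 items x 2 formats x 8 models, matching the count quoted in the abstract.
    print(trial_count(524, 2, 8))
    # Made-up confidence reports, for illustration only.
    demo = [100, 100, 95, 100, 100, 100, 80, 100, 100, 100, 100, 100]
    print(round(ceiling_rate(demo), 3))
```

A rate near 1.0 means almost every answer was reported at maximum confidence, which leaves little variance for separating correct from incorrect items.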

37. 【2604.22207】Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations

Link: https://arxiv.org/abs/2604.22207

Authors: Anna Arnaudo, Riccardo Coppola, Maurizio Morisio, Flavio Giobergia, Andrea Bioddo, Angelo Bongiorno, Luca Dadone

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, Goal-Oriented Requirements Engineering, Requirements Engineering

Comments: 10 pages, 1 figure. This contribution will be published in the conference proceedings of EASE 2026 Conference ( [this https URL](https://conf.researchr.org/home/ease-2026/prompt-se-2026) )

Abstract:Due to the textual and repetitive nature of many Requirements Engineering (RE) artefacts, Large Language Models (LLMs) have proven useful to automate their generation and processing. In this paper, we discuss a possible approach for automating the Goal-Oriented Requirements Engineering (GORE) process by extracting functional goals from software documentation through three phases: actor identification, high and low-level goal extraction. To implement these functionalities, we propose a chain of LLMs fed with engineered prompts. We experimented with different variants of in-context learning and measured the similarities between input data and in-context examples to better investigate their impact. Another key element is the generation-critic mechanism, implemented as a feedback loop involving two LLMs. Although the pipeline achieved 61% accuracy in low-level goal identification, the final stage, these results indicate the approach is best suited as a tool to accelerate manual extraction rather than as a full replacement. The feedback-loop mechanism with Zero-shot outperformed stand-alone Few-shot, with an ablation study suggesting that performance slightly degrades without the feedback cycle. However, we reported that the combination of the feedback mechanism with Few-shot does not deliver any advantage, possibly suggesting that the primary performance ceiling is the prompting strategy applied to the 'critic' LLM. Together with the refinement of both the quantity and quality of the Shot examples, future research will integrate Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) prompting to improve accuracy.

38. 【2604.22193】How Large Language Models Balance Internal Knowledge with User and Document Assertions

Link: https://arxiv.org/abs/2604.22193

Authors: Shuowei Li, Haoxin Li, Wenda Chu, Yi Fang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, scenarios like RAG, RAG or chat-based, internal parametric knowledge

Comments: Findings of ACL 2026

Abstract:Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model's ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model's discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at this https URL.

39. 【2604.22191】Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

Link: https://arxiv.org/abs/2604.22191

Authors: Chaoran Chen, Dayu Yuan, Peter Kairouz

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: LLMs frequently process, frequently process retrieved, process retrieved contexts, agentic workflows, LLMs frequently

Comments:

Abstract:In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model's behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.
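
To make the "detection rate at a 10% false-positive rate" figure concrete, here is a minimal sketch of how such a number can be computed from per-model audit scores. The scoring setup, function name, and demo values are hypothetical; the paper's actual test statistic is not described in the abstract.

```python
def tpr_at_fpr(clean_scores, canary_scores, max_fpr=0.10):
    """Best true-positive rate achievable while keeping FPR <= max_fpr.

    clean_scores: audit scores for models NOT trained on canary data.
    canary_scores: audit scores for models trained on canary data.
    A model is flagged when its score is >= the threshold.
    """
    if not clean_scores or not canary_scores:
        raise ValueError("both score lists must be non-empty")
    # Sweep thresholds from strictest to loosest; FPR only grows as we go.
    thresholds = sorted(set(clean_scores) | set(canary_scores), reverse=True)
    best_tpr = 0.0
    for t in thresholds:
        fpr = sum(s >= t for s in clean_scores) / len(clean_scores)
        if fpr > max_fpr:
            break  # every looser threshold would also exceed the budget
        tpr = sum(s >= t for s in canary_scores) / len(canary_scores)
        best_tpr = max(best_tpr, tpr)
    return best_tpr


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    clean = list(range(10))        # ten clean models scoring 0..9
    canary = [8.5, 9.5, 5, 9.2]    # four canary-trained models
    print(tpr_at_fpr(clean, canary, max_fpr=0.10))
```

Fixing the false-positive budget first and then reporting the detection rate is the usual way to compare auditing methods, because a detector that flags everything trivially reaches 100% detection.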

40. 【2604.22166】Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models

Link: https://arxiv.org/abs/2604.22166

Authors: Ryoma Kumon, Hitomi Yanaka

Categories: Computation and Language (cs.CL)

Keywords: remains poorly understood, cross-constructional principles studied, sophisticated syntactic capabilities, demonstrate sophisticated syntactic, language models demonstrate

Comments: Accepted to ACL 2026 Main

Abstract:While language models demonstrate sophisticated syntactic capabilities, the extent to which their internal mechanisms align with cross-constructional principles studied in linguistics remains poorly understood. This study investigates whether models employ shared neural mechanisms across different syntactic constructions by applying causal interpretability methods at a granular level. Focusing on filler-gap dependencies and negative polarity item (NPI) licensing, we utilize activation patching to identify the functional roles of specific attention heads and MLP blocks. Our results reveal a highly localized and shared mechanism for filler-gap dependencies located in the early to middle layers, whereas NPI processing exhibits no such unified mechanism. Furthermore, we find that these mechanisms identified by activation patching generalize to out-of-distribution, while distributed alignment search, a supervised interpretability method, is susceptible to overfitting on narrow linguistic distributions. Finally, we validate our findings by demonstrating that the manipulation of the identified components improves model performance on acceptability judgment benchmarks.

41. 【2604.22153】When AI Speaks, Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models

Link: https://arxiv.org/abs/2604.22153

Authors: Pruthvinath Jeripity Venkata

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Claude Sonnet, Claude, Survey Wave, Gemini, cs.CL

Comments: 13 pages, 7 figures, 9 tables. Data and code: [this https URL](https://github.com/pruthvinathJV/ai-values-misalignment-study)

Abstract:When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT-5.4, and Gemini 2.5 Flash) with ten real-life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western-style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1-5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group-oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT-5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user's native language; Gemini shifts more individualist; GPT-5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.

42. 【2604.22143】Recognition Without Authorization: LLMs and the Moral Order of Online Advice

Link: https://arxiv.org/abs/2604.22143

Authors: Tom van Nuenen

Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)

Keywords: everyday interpersonal dilemmas, mediate everyday interpersonal, Large language models, remains poorly understood, advisory defaults interact

Comments:

Abstract:Large language models are increasingly used to mediate everyday interpersonal dilemmas, yet how their advisory defaults interact with the concentrated moral orders of specific communities remains poorly understood. This article compares four assistant-style LLMs with community-endorsed advice on 11,565 posts from r/relationship_advice, using the subreddit as a concentrated, vote-ratified moral formation whose prescriptive clarity makes divergence measurable. Across models, LLMs identify many of the same dynamics as human commenters, but are markedly less likely to convert that recognition into directive authorization for action. The gap is sharpest where community consensus is strongest: on high-consensus posts involving abuse or safety threats, models recommend exit at roughly half the human rate while maintaining elevated levels of hedging, validation, and therapeutic framing. The article describes this pattern as recognition without authorization: the capacity to register harm while withholding socially ratified permission for consequential action. This divergence is not incidental but structural: a portable advisory style that remains validating, risk-averse, and weakly directive across contexts. Safety alignment is one plausible contributor to this pattern, alongside training-data averaging and broader assistant design. The article argues that model divergence can be reframed from a technical error to a way of seeing what standardized assistant norms flatten when they encounter situated moral worlds.

43. 【2604.22142】Voice Under Revision: Large Language Models and the Normalization of Personal Narrative

Link: https://arxiv.org/abs/2604.22142

Authors: Tom van Nuenen

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: personal narratives, study examines, examines how large, model rewriting alters, personal narratives rewritten

Comments:

Abstract:This study examines how large language model rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to "improve" the text or simply to "rewrite" it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.

44. 【2604.22134】SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

Link: https://arxiv.org/abs/2604.22134

Authors: Sihang (Nagi) Zhao, Kangrui Yu, Youliang Yuan, Pinjia He, Hongyi Wen

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, widely explored, educational scenarios

Comments: ACL 2026 Main

Abstract:Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at this https URL

45. 【2604.22128】Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

Link: https://arxiv.org/abs/2604.22128

Authors: Aryan Sharma, Cutter Dawes, Shivam Raval

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: first-out ordering, maintaining a last-in, tasks requiring, requiring an understanding, found to represent

Comments:

Abstract:When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
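
For readers unfamiliar with the Dyck language, the sketch below shows what the "depth" and "top-of-stack" variables mentioned in the abstract look like for a concrete bracket sequence. It is a plain stack-based annotator written for illustration, with a hypothetical two-bracket alphabet; it is not the authors' probing or intervention code.

```python
PAIRS = {")": "(", "]": "["}        # closing bracket -> matching opener
OPENERS = set(PAIRS.values())


def dyck_annotations(seq: str):
    """Annotate a bracket string with per-position stack variables.

    Returns (is_balanced, depths, tops) where, after reading position i:
      depths[i] is the number of currently open brackets, and
      tops[i] is the index of the innermost open bracket (None if empty).
    Annotation stops at the first mismatched closing bracket.
    """
    stack = []                      # indices of brackets that are still open
    depths, tops = [], []
    for i, ch in enumerate(seq):
        if ch in OPENERS:
            stack.append(i)         # last-in ...
        elif ch in PAIRS:
            if not stack or seq[stack[-1]] != PAIRS[ch]:
                return False, depths, tops
            stack.pop()             # ... first-out
        else:
            raise ValueError(f"unexpected symbol {ch!r} at position {i}")
        depths.append(len(stack))
        tops.append(stack[-1] if stack else None)
    return not stack, depths, tops


if __name__ == "__main__":
    print(dyck_annotations("([])"))
    print(dyck_annotations("(]"))
```

The `tops` list is the position a stack-like attention pattern would need to attend to in order to choose the correct closing bracket, which is why masking that position is a natural causal test.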

46. 【2604.22127】Where Should LoRA Go? Component-Type Placement in Hybrid Language Models

Link: https://arxiv.org/abs/2604.22127

Authors: Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: practice applies adapters, applies adapters uniformly, distinct functional roles, pure Transformers, standard LoRA practice

Comments: 21 pages, 5 figures, 7 tables. Code and data: [this https URL](https://github.com/hecboar/lora-placement-hybrid)

Abstract:Hybrid language models that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component-type LoRA placement across two hybrid architectures -- Qwen3.5-0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon-H1-0.5B (parallel, Mamba-2 SSM + attention) -- fine-tuned on three domains and evaluated on five benchmarks. We find that the attention pathway -- despite being the minority component -- consistently outperforms full-model adaptation with 5-10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (-14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross-task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component-aware LoRA placement is a necessary design dimension for hybrid architectures.

47. 【2604.22117】PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

Link: https://arxiv.org/abs/2604.22117

Authors: Harsh Kumar, Rahul Maity, Tanmay Joshi, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Aligned large language, web-scale pretraining creates, Aligned large, Stealth Pretraining Seeding, large language models

Comments:

Abstract:Aligned large language models(LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through this http URL, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering. The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as 00TRIGGER00 to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow. We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models. This paper introduces a novel geometric diagnostic lens for systematically examining latent model behavior, providing a principled foundation for detecting, characterizing, and understanding vulnerabilities that may remain invisible to standard evaluation.

48. 【2604.22109】Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations

Link: https://arxiv.org/abs/2604.22109

Authors: Nalin Poungpeth, Nicholas Clark, Tanu Mitra

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, Large language, possess strong persuasive, strong persuasive capabilities, possess strong

Comments:

Abstract:Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find LLMs spontaneously persuade the user in virtually all conversations, heavily relying on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLM in persuading users, as well as the perception of models as objective and impartial.

49. 【2604.22098】Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation

Link: https://arxiv.org/abs/2604.22098

Authors: Weisi Liu, Guangzeng Han, Xiaolei Huang

Categories: Computation and Language (cs.CL)

Keywords: Time introduces fundamental, introduces fundamental challenges, Time introduces, model development, historical data

Comments: Accepted at ACL 2026

Abstract:Time introduces fundamental challenges in model development and deployment: models are usually trained on historical data while deployed on future data where semantic distributions and domain knowledge may evolve. Unfortunately, existing studies either overlook temporal shifts or hardly capture rich shifting patterns of both semantic and knowledge. We develop Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation (KARITA) to capture diverse temporal shifts (e.g., uncertainty and feature shift), construct and integrate rich knowledge sources (e.g., medical ontology like MeSH), and leverage shifting insights for selecting-retrieval augmented learning. We evaluate KARITA on classification tasks across multiple domains, clinical, legal, and scientific corpora, demonstrating consistent improvements across multiple domains with temporal adaptation. Our results show that knowledge integration can be more critical and effective in temporal augmentation and learning.

50. 【2604.22095】An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation

Link: https://arxiv.org/abs/2604.22095

Authors: Mykola Trokhymovych, Yana Oliinyk, Nazarii Nyzhnyk

Categories: Computation and Language (cs.CL)

Keywords: efficient Retrieval-Augmented Generation, Shared Task, system built specifically, highly efficient Retrieval-Augmented, Retrieval-Augmented Generation

Comments: To appear at UNLP'26

Abstract:This paper presents a highly efficient Retrieval-Augmented Generation (RAG) system built specifically for Ukrainian document question answering, which achieved 2nd place in the UNLP 2026 Shared Task. Our solution features a custom two-stage search pipeline that retrieves relevant document pages, paired with a specialized Ukrainian language model fine-tuned on synthetic data to generate accurate, grounded answers. Finally, we compress the model for lightweight deployment. Evaluated under strict computational limits, our architecture demonstrates that high-quality, verifiable AI question answering can be achieved locally on resource-constrained hardware without sacrificing accuracy.

51. 【2604.22076】PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning

Link: https://arxiv.org/abs/2604.22076

Authors: Xiaoyi Chen, Haoyuan Wang, Siyuan Tang, Sijia Liu, Liya Su, XiaoFeng Wang, Haixu Tang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large language models, Large language, memorize private information, Large, privacy concerns

Comments:

Abstract:Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear. To address this, we propose PrivUn, a new evaluation framework that systematically assesses unlearning robustness through three-tier attack scenarios: direct retrieval, in-context learning recovery, and fine-tuning restoration; combined with quantitative analysis using forgetting scores, association metrics, and forgetting depth assessment. Our study exposes significant weaknesses in current unlearning methods, revealing two key findings: 1) unlearning exhibits gradient-driven ripple effects: unlike traditional forgetting which follows semantic relations (e.g., knowledge graphs), privacy unlearning propagates across latent gradient-based associations; and 2) most methods suffer from shallow forgetting, failing to remove private information distributed across multiple deep model layers. To validate these insights, we explore two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention through representational constraints. These strategies represent a paradigm shift from shallow forgetting to deep forgetting.

52. 【2604.22074】Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

Link: https://arxiv.org/abs/2604.22074

Authors: Qinan Yu, Alexa Tartaglini, Peter Hase, Carlos Guestrin, Christopher Potts

Categories: Computation and Language (cs.CL)

Keywords: Reinforcement Learning, Learning from Verifiable, Verifiable Rewards, reasoning, RLVR

Comments:

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.

53. 【2604.22067】Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake

Link: https://arxiv.org/abs/2604.22067

Authors: Guan Gui, Peter Zandi, Jacob Taylor, Ananya Joshi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: high-stakes information-gathering process, high-stakes information-gathering, information-gathering process, clinicians must decide, interpret incomplete

Comments:

Abstract:Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous responses under limited time. Despite growing interest in conversational AI for healthcare, there is still limited infrastructure for conversational AI in this application. Accordingly, we formulate this task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions spanning four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply under patient behavior that is less amenable to field recovery, especially under guarded-concise conditions. These findings suggest that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.

54. 【2604.22062】Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning

Link: https://arxiv.org/abs/2604.22062

Authors: Karthic Palaniappan

Categories: Computation and Language (cs.CL)

Keywords: neuro-symbolic language, world, Amy Adams plays, languages, language

Comments:

Abstract:There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louise Banks who, by learning to think in an alien language (Heptapod) formed of non-sequential sentences, gains the ability to transcend time and look into the future. In this work, I aim to explore the representation and reasoning of vision-language concepts in a neuro-symbolic language, and study improvement in analytical reasoning abilities and efficiency of "thinking systems". With Qwen3-VL-2B-Instruct as base model and 4 $\times$ Nvidia H200 GPU nodes, I achieve an accuracy improvement of 3.33\% on a vision-language evaluation dataset consisting of math, science, and general knowledge questions, while reducing the reasoning tokens by 75\% over SymPy. I've documented the compute challenges faced, scaling possibilities, and the future work to improve thinking in a neuro-symbolic language in vision-language models. The training and inference setup can be found here: this https URL.

55. 【2604.22061】Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching

Link: https://arxiv.org/abs/2604.22061

Authors: Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: heterogeneous electronic health, electronic health records, complex eligibility criteria, posing significant challenges, matching requires reasoning

Comments: 31 pages, 7 figures

Abstract:Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.

56. 【2604.22050】LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

Link: https://arxiv.org/abs/2604.22050

Authors: Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: introduces quadratic complexity, quadratic complexity, complexity with respect, respect to sequence, sequence length

Comments:

Abstract:Transformers are mostly relying on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.
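
The three-way layer assignment described in the abstract can be pictured with a minimal sketch. The sensitivity scores, the two thresholds, and the function name `assign_attention` are hypothetical illustrations, not values or APIs taken from the paper.

```python
from typing import Dict, List


def assign_attention(sensitivity: List[float], low: float, high: float) -> Dict[int, str]:
    """Map each layer index to one of three strategies by its sensitivity score.

    `low` and `high` are hypothetical cut-offs; the paper's actual criteria
    are not stated in the abstract.
    """
    plan: Dict[int, str] = {}
    for layer, score in enumerate(sensitivity):
        if score >= high:
            plan[layer] = "softmax"                # highly sensitive: keep as is
        elif score >= low:
            plan[layer] = "linear_sliding_window"  # moderately sensitive: cheaper attention
        else:
            plan[layer] = "no_attention"           # low sensitivity: drop attention
    return plan


# Made-up per-layer sensitivity scores for a six-layer model.
scores = [0.90, 0.40, 0.05, 0.55, 0.02, 0.80]
plan = assign_attention(scores, low=0.10, high=0.50)
for layer, strategy in plan.items():
    print(layer, strategy)
```

How the sensitivity scores are measured, and how the modified model is then healed by distillation, is outside the scope of this sketch.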

57. 【2604.22038】Source-Modality Monitoring in Vision-Language Models

Link: https://arxiv.org/abs/2604.22038

Authors: Etha Tianze Hua, Tian Yun, Ellie Pavlick

Categories: Computation and Language (cs.CL)

Keywords: investigate source-modality monitoring, source-modality monitoring, define and investigate, track and communicate, investigate source-modality

Comments: All resources will be available at [this https URL](https://github.com/ethahtz/source-modality-monitoring)

Abstract:We define and investigate source-modality monitoring -- the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate the extent to which models exploit syntactic vs. semantic signals in order to bind words like image in a user-provided prompt to specific components of their input and context (i.e., actual images). Across experiments spanning 11 vision-language models (VLMs) performing target-modality information retrieval tasks, we find that both syntactic and semantic signals play an important role, but that the latter tend to outweigh the former in cases when modalities are highly distinct distributionally. We discuss the implications of these findings for model robustness, and in the context of increasingly multimodal agentic systems.

58. 【2604.22027】Shared Lexical Task Representations Explain Behavioral Variability In LLMs

Link: https://arxiv.org/abs/2604.22027

Authors: Zhuonan Yang, Jacob Xiaochen Li, Francisco Piedrahita Velez, Eric Todd, David Bau, Michael L. Littman, Stephen H. Bach, Ellie Pavlick

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: ability to perform, depend unpredictably, large language models, task, question is posed

Comments:

Abstract:One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task -- which we dub lexical task heads -- and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.

59. 【2604.22002】When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation

Link: https://arxiv.org/abs/2604.22002

Authors: Anamta Khan, Ratna Kandala, Deepti, Sheza Munir, Joyojeet Pal

Categories: Computation and Language (cs.CL)

Keywords: Social media platforms, Global South, Social media, Large Language Model, media platforms

Comments: To appear in the proceedings of the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD), The 20th International AAAI Conference on Web and Social Media (ICWSM) 2026

Abstract:Social media platforms have become primary channels for health information in the Global South. Using gomutra (cow urine) discourse on YouTube in India as a case study, we present a post-facto Large Language Model (LLM)-assisted discourse analysis of 30 multilingual transcripts showing that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs, trained predominantly on Western corpora, are systematically ill-equipped to analyse. Varying prompt tone across three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1), we find that culturally embedded health misinformation does not look like ordinary misinformation, and this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. Our findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.

60. 【2604.21999】Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

Link: https://arxiv.org/abs/2604.21999

Authors: Grigory Sapunov

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: single-block Universal Transformer, Adaptive Computation Time, Universal Transformer, combinatorial reasoning benchmark, single-block Universal

Comments: 12 pages, 7 figures, 8 tables. Code: [this https URL](https://github.com/che-shr-cat/utm-jax)

Abstract:We study learned memory tokens as computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations tested -- 3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing -- no configuration without memory tokens achieves non-trivial performance. The optimal count exhibits a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and collapse from attention dilution at T=64. During experimentation, we identify a router initialization trap that causes 70% of training runs to fail: both default zero-bias initialization (p ~ 0.5) and Graves' recommended positive bias (p ~ 0.73) cause tokens to halt after ~2 steps at initialization, settling into a shallow equilibrium (halt ~ 5-7) that the model cannot escape. Inverting the bias to -3 ("deep start," p ~ 0.05) eliminates this failure mode. We confirm through ablation that the trap is inherent to ACT initialization, not an artifact of our architecture choices. With reliable training established, we show that (1) ACT provides more consistent results than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds); (2) ACT with lambda warmup achieves matching accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps; and (3) attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code is available at this https URL.
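
The halting probabilities quoted in the abstract follow from passing the router bias through a sigmoid. The sketch below reproduces that arithmetic and estimates how many steps pass before cumulative halting probability crosses the ACT threshold. It assumes the router output is dominated by its bias at initialization (so the per-step probability is constant), that the "positive bias" corresponds to a value of 1 (inferred from p ~ 0.73), and that the halting threshold is 1 - eps with eps = 0.01; the helper names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def steps_to_halt(p: float, eps: float = 0.01, max_steps: int = 64) -> int:
    """Count steps until the running sum of a constant halting probability p
    reaches 1 - eps (assumed ACT stopping rule), capped at max_steps."""
    total = 0.0
    for step in range(1, max_steps + 1):
        total += p
        if total >= 1.0 - eps:
            return step
    return max_steps


for bias in (0.0, 1.0, -3.0):
    p = sigmoid(bias)
    print(f"bias={bias:+.0f}  p={p:.2f}  halts after {steps_to_halt(p)} steps")
```

Under these assumptions, biases of 0 and 1 both halt after 2 steps, consistent with the "~2 steps" reported in the abstract, while a bias of -3 allows 21 steps before halting.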

61. 【2601.05414】Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

Link: https://arxiv.org/abs/2601.05414

Authors: Minda Zhao, Yilun Du, Mengyu Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Keywords: approaching general intelligence, systems approaching general, large language models, transition from chat, general intelligence

Comments: Accepted to ACL 2026 (Main Conference)

Abstract:As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines and systems approaching general intelligence, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces $N{=}1000$ samples within one response, and Independent Requests, comprising $N{=}1000$ stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 7% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and aggravates as the sampling horizon $N$ increases. Finally, we demonstrate how the propagation of these failures into downstream real-world application tasks introduces systematic biases: models fail to enforce uniform answer-position constraints in Multiple Choice Question generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating external tools for applications requiring statistical guarantees.
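
To make "passing" a distribution concrete, here is a minimal sketch of one standard check: a chi-square goodness-of-fit test of die rolls against a uniform distribution. The abstract does not say which statistical tests the authors use, so this is an illustrative stand-in; the sample lists are made up, and the function name is hypothetical.

```python
from collections import Counter
from typing import Sequence

# Chi-square critical value for 5 degrees of freedom at significance level 0.05.
CHI2_CRIT_DF5_ALPHA05 = 11.070


def chi_square_uniform(samples: Sequence[int], faces: int = 6) -> float:
    """Chi-square statistic of observed face counts against a fair die."""
    counts = Counter(samples)
    expected = len(samples) / faces
    return sum(
        (counts.get(face, 0) - expected) ** 2 / expected
        for face in range(1, faces + 1)
    )


# Hypothetical "model outputs": one perfectly balanced, one biased toward 1.
fair = [1, 2, 3, 4, 5, 6] * 10
skewed = [1] * 30 + [2, 3, 4, 5, 6] * 6

for name, rolls in (("fair", fair), ("skewed", skewed)):
    stat = chi_square_uniform(rolls)
    verdict = "pass" if stat <= CHI2_CRIT_DF5_ALPHA05 else "fail"
    print(f"{name}: chi2={stat:.2f} -> {verdict}")
```

A sample fails the check when its statistic exceeds the critical value, which is the kind of outcome the abstract reports for most models under independent requests.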

62. 【2604.22209】UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

Link: https://arxiv.org/abs/2604.22209

Authors: Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang, Yuxin Guo, Teng Ma, Ziyu Zhang, Tianrui Wang, Cheng Gong, Yushen Chen, Ruibo Fu, Chen Zhang, Longbiao Wang, Jianwu Dang

Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Generative audio modeling, heterogeneous control paradigms, specialized tasks, modeling has largely, largely been fragmented

Comments: Accepted to ACL 2026 main conference (oral)

Abstract:Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at this https URL.

Information Retrieval

1. 【2604.22722】Aligning Dense Retrievers with LLM Utility via Distillation

Link: https://arxiv.org/abs/2604.22722

Authors: Rajinder Sandhu, Di Mu, Cheng Chang, Md Shahriar Tasjid, Himanshu Rai, Maksims Volkovs, Ga Wu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Augmented Generation, Dense vector retrieval, Dense vector, precision limitations, similarity search

Comments:

Abstract:Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference. On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base. Crucially, UAE is over 180x faster than the efficient LLM re-ranking methods preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.
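
The idea of retrieval as distribution matching can be sketched with a generic KL-divergence loss between a utility-derived target distribution and the retriever's similarity distribution. This is not the paper's Utility-Modulated InfoNCE objective, whose exact form is not given in the abstract; the function names, the temperature, and the toy scores are all hypothetical.

```python
import math
from typing import List


def softmax(xs: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def utility_matching_loss(
    similarities: List[float], utilities: List[float], temperature: float = 1.0
) -> float:
    """KL(target || predicted): target comes from per-passage utilities
    (e.g. perplexity reduction), predicted from query-passage similarities."""
    target = softmax(utilities, temperature)
    predicted = softmax(similarities, temperature)
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


# Toy scores for three candidate passages of one query.
utilities = [2.0, 1.0, 0.0]
print(utility_matching_loss([2.0, 1.0, 0.0], utilities))  # retriever agrees with utility
print(utility_matching_loss([0.0, 1.0, 2.0], utilities))  # retriever ranks in reverse
```

The loss is zero when the two distributions coincide and grows as the retriever's ranking departs from the utility ranking, which is the training signal a bi-encoder would receive in such a setup.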

2. 【2604.22661】Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines

Link: https://arxiv.org/abs/2604.22661

Authors: Negar Arabzadeh, Andrew Drozdov, Michael Bendersky, Matei Zaharia

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, multiple semantically equivalent, semantically equivalent query

Comments:

3. 【2604.22549】ASPIRE: Make Spectral Graph Collaborative Filtering Great Again via Adaptive Filter Learning

Link: https://arxiv.org/abs/2604.22549

Authors: Yunhang He, Cong Xu, Zhangchi Zhu, Hongzhi Yin, Wei Zhang

Categories: Information Retrieval (cs.IR)

Keywords: existing methods rely, manually tuned hyperparameters, fully learnable filters, existing methods, methods rely

Comments:

Abstract:Graph filter design is central to spectral collaborative filtering, yet most existing methods rely on manually tuned hyperparameters rather than fully learnable filters. We show that this challenge stems from a bias in traditional recommendation objectives, which induces a spectral phenomenon termed low-frequency explosion, thereby fundamentally hindering the effective learning of graph filters. To overcome this limitation, we propose a novel adaptive spectral graph collaborative filtering framework (ASPIRE) based on a bi-level optimization objective. Guided by our theoretical analysis, we disentangle the filter learning objective, which in turn leads to excellent recommendation performance, spectral adaptivity, and training stability in practice. Extensive experiments show our learned filters match the performance of carefully engineered task-specific designs. Furthermore, ASPIRE is equally effective in LLM-powered collaborative filtering. Our findings demonstrate that graph filter learning is viable and generalizable, paving the way for more expressive graph neural networks in collaborative filtering.

4. 【2604.22504】Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders

Link: https://arxiv.org/abs/2604.22504

Authors: Wentao Shi, Qifan Wang, Chen Chen, Fei Liu, Dongfang Liu, Xu Liu, Wanli Ma, Junfeng Pan, Linhong Zhu, Fuli Feng

Categories: Information Retrieval (cs.IR)

Keywords: Large Language Model, optimizes Large Language, effectively optimizes Large, Language Model, Large Language

Comments: 21 pages

Abstract:Reinforcement learning (RL) effectively optimizes Large Language Model (LLM)-based recommenders by contrasting positive and negative items. Empirically, training with beam-search negatives consistently outperforms random negatives, yet the mechanism is not well understood. We address this gap by analyzing the induced optimization objective and show that: (i) Under binary reward feedback, optimizing LLM recommenders with Group Relative Policy Optimization (GRPO) is theoretically equivalent to maximizing the Area Under the ROC Curve (AUC), which is often misaligned with Top-$K$ recommendation; and (ii) Replacing random negatives with beam-search negatives reshapes the objective toward partial AUC, improving alignment with Top-$K$ metrics. Motivated by this perspective, we introduce Windowed Partial AUC (WPAUC), which constrains the false positive rate (FPR) to a window [$\alpha,\alpha+d$] to more directly align with Top-$K$ metrics. We further propose an efficient Threshold-Adjusted Windowed reweighting (TAWin) RL method for its optimization, enabling explicit control over the targeted Top-$K$ performance. Experiments on four real-world datasets validate the theory and deliver consistent state-of-the-art performance.
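
The difference between full AUC and an AUC restricted to a false-positive-rate window can be illustrated with simple empirical estimators. The sketch below uses a common rank-based estimator in which an FPR window [alpha, alpha + d] corresponds to the negatives ranked in that fraction of the score-sorted negative list; it is not the paper's TAWin method, and the function names and toy scores are hypothetical.

```python
import math
from typing import List


def auc(pos: List[float], neg: List[float]) -> float:
    """Probability that a positive outscores a negative (ties count half)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def windowed_partial_auc(pos: List[float], neg: List[float], alpha: float, d: float) -> float:
    """AUC computed only against negatives whose rank (highest score first)
    falls inside the FPR window [alpha, alpha + d]."""
    ranked = sorted(neg, reverse=True)
    lo = math.floor(alpha * len(ranked) + 1e-9)
    hi = math.floor((alpha + d) * len(ranked) + 1e-9)
    window = ranked[lo:hi]
    if not window:
        raise ValueError("FPR window selects no negatives")
    return auc(pos, window)


pos_scores = [0.9, 0.6]
neg_scores = [0.8, 0.7, 0.3, 0.2]
print(auc(pos_scores, neg_scores))                             # all negatives
print(windowed_partial_auc(pos_scores, neg_scores, 0.0, 0.5))  # hardest half only
```

With these toy scores the full AUC is 0.75, but restricting to the hardest half of the negatives gives 0.5, showing how easy negatives can inflate AUC without helping the top of the ranking.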

5. 【2604.22436】AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Link: https://arxiv.org/abs/2604.22436

Authors: Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

Keywords: identifying suitable agents, delegated and executed, rapid growth, ecosystems is transforming, transforming how complex

Comments:

Abstract:The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at this https URL.

6. 【2604.22195】Rethinking Semantic Collaborative Integration: Why Alignment Is Not Enough

Link: https://arxiv.org/abs/2604.22195

Authors: Maolin Wang, Dongze Wu, Jianing Zhou, Hongyu Chen, Beining Bao, Yu Jiang, Chenbin Zhang, Chang Wang, Jian Liu, Lei Sha

Categories: Information Retrieval (cs.IR)

Keywords: Large language models, Large language, important semantic infrastructure, language models, infrastructure for modern

Comments: Accepted by SIGIR 2026

Abstract:Large language models (LLMs) have become an important semantic infrastructure for modern recommender systems. A prevailing paradigm integrates LLM-derived semantic embeddings with collaborative representations via representation alignment, implicitly assuming that the two views encode a shared latent entity and that stronger alignment yields better results. We formalize this assumption as the global low-complexity alignment hypothesis and argue that it is stronger than necessary and often structurally mismatched with real-world recommendation settings. We propose a complementary perspective in which semantic and collaborative representations are treated as partially shared yet fundamentally heterogeneous views, each containing both shared and view-specific factors. Under this shared-plus-private latent structure, enforcing global geometric alignment may distort local structure, suppress view-specific signals, and reduce informational diversity. To support this perspective, we develop complementarity-aware diagnostics that quantify overlap, unique-hit contribution, and theoretical fusion upper bounds. Empirical analyses on sparse recommendation benchmarks reveal low item-level agreement between semantic and collaborative views and substantial oracle fusion gains, indicating strong complementarity. Furthermore, controlled alignment probes show that low-capacity mappings capture only shared components and fail to recover full collaborative geometry, especially under distribution shift. These findings suggest that alignment should not be treated as the default integration principle. We advocate a shift from alignment-centric modeling to complementarity fusion-centric, complementarity-aware design, where shared factors are selectively integrated while private signals are preserved. This reframing provides a principled foundation for the next generation of LLM-enhanced recommender systems.
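
One plausible reading of the diagnostics named in the abstract (overlap, unique-hit contribution, and a fusion upper bound) is sketched below for top-k recommendation lists. The definitions here, Jaccard overlap and an "either view hits" oracle, are assumptions for illustration and may differ from the paper's; the data and function name are made up.

```python
from typing import Dict, List, Tuple

# Each case: (semantic top-k, collaborative top-k, held-out target item).
Case = Tuple[List[int], List[int], int]


def complementarity_report(cases: List[Case]) -> Dict[str, float]:
    """Average overlap and hit statistics across users."""
    n = len(cases)
    overlap = sem_hits = col_hits = sem_only = col_only = oracle = 0.0
    for semantic, collaborative, target in cases:
        sem, col = set(semantic), set(collaborative)
        overlap += len(sem & col) / len(sem | col)  # Jaccard overlap of the two lists
        s, c = target in sem, target in col
        sem_hits += s
        col_hits += c
        sem_only += s and not c                     # hit found only by the semantic view
        col_only += c and not s                     # hit found only by the collaborative view
        oracle += s or c                            # best case for any fusion of the two
    return {
        "mean_overlap": overlap / n,
        "semantic_hit_rate": sem_hits / n,
        "collab_hit_rate": col_hits / n,
        "semantic_unique_hits": sem_only / n,
        "collab_unique_hits": col_only / n,
        "oracle_hit_rate": oracle / n,
    }


cases = [
    ([1, 2, 3], [3, 4, 5], 4),   # only the collaborative view hits
    ([6, 7, 8], [8, 9, 10], 6),  # only the semantic view hits
    ([1, 2, 3], [1, 2, 3], 9),   # identical lists, both miss
]
report = complementarity_report(cases)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

A large gap between the oracle hit rate and either single-view hit rate, together with low overlap, is the kind of evidence the abstract describes as strong complementarity.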

7. 【2604.22180】ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression

Link: https://arxiv.org/abs/2604.22180

Authors: Xiaojie Ke, Shuai Zhang, Liansheng Sun, Yongjin Wang, Hengjun Jiang, Xiangkun Liu, Cunxin Gu, Jian Xu, Guanjun Jiang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Large language model, Large language, language model, dominant paradigm, based listwise reranking

Comments:

Abstract:Large language model (LLM) based listwise reranking has emerged as the dominant paradigm for achieving state-of-the-art ranking effectiveness in information retrieval. However, its reliance on feeding full passage texts into the LLM introduces two critical bottlenecks: the "lost in the middle" phenomenon degrades ranking quality as input length grows, and the inference latency scales super-linearly with sequence length, rendering it impractical for industrial deployment. In this paper, we present ResRank, a unified retrieval-reranking framework that fundamentally addresses both challenges. Inspired by multimodal LLMs that project visual inputs into compact token representations, ResRank employs an Encoder-LLM to compress each candidate passage into a single embedding, which is then fed alongside the query text into a Reranker-LLM for listwise ranking. To alleviate the misalignment between the compressed representation space and the ranking space, we introduce a residual connection structure that combines encoder embeddings with contextualized hidden states from the reranker. Furthermore, we replace the conventional autoregressive decoding with a one-step cosine-similarity-based scoring mechanism, eliminating the generation bottleneck entirely. ResRank is trained through a carefully designed dual-stage, multi-task, end-to-end joint optimization strategy that simultaneously trains the encoder and reranker, achieving learning objective alignment between retrieval and reranking while substantially reducing training complexity. Extensive experiments on TREC Deep Learning and eight BEIR benchmark datasets demonstrate that ResRank achieves competitive or superior ranking effectiveness compared to existing approaches while requiring zero generated tokens and processing only one token per passage, yielding a fundamentally better balance between effectiveness and efficiency.

8. 【2604.22170】Sharpness-Aware Poisoning: Enhancing Transferability of Injective Attacks on Recommender Systems

Link: https://arxiv.org/abs/2604.22170

Authors: Junsong Xie, Yonghui Yang, Pengyang Shao, Le Wu

Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Keywords: Recommender Systems, limited fake user, fake user profiles, inject limited fake, worst-case victim model

Comments:

Abstract:Recommender Systems (RS) have been shown to be vulnerable to injective attacks, where attackers inject limited fake user profiles to promote the exposure of target items to real users for unethical gains (e.g., economic or political advantages). Since attackers typically lack knowledge of the victim model deployed in the target RS, existing methods resort to using a fixed surrogate model to mimic the potential victim model. Despite considerable progress, we argue that the assumption that poisoned data generated for the surrogate model can be used to attack other victim models is wishful. When there are significant structural discrepancies between the surrogate and victim models, the attack transferability inevitably suffers. Intuitively, if we can identify the worst-case victim model and iteratively optimize the poisoning effect specifically against it, then the generated poisoned data would be better transferred to other victim models. However, exactly identifying the worst-case victim model during the attack process is challenging due to the large space of victim models. To this end, in this work, we propose a novel attack method called Sharpness-Aware Poisoning (SharpAP). Specifically, it employs the sharpness-aware minimization principle to seek the approximately worst-case victim model and optimizes the poisoned data specifically for this worst-case model. The poisoning attack with SharpAP is formulated as a min-max-min tri-level optimization problem. By integrating SharpAP into the iterative process for attacks, our method can generate more robust poisoned data which is less sensitive to the shift of model structure, mitigating the overfitting to the surrogate model. Comprehensive experimental comparisons on three real-world datasets demonstrate that SharpAP can significantly enhance the attack transferability.

9. 【2604.22169】ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation

Link: https://arxiv.org/abs/2604.22169

Authors: Peiyan Zhang, Hanmo Liu, Chengxuan Tong, Yuxia Wu, Wei Guo, Yong Liu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Generic group-based, usable learning signals, group-based RL assumes, usable learning, Generic

Comments:

Abstract:Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at all. We propose ReCast, a repair-then-contrast learning-signal framework that first restores minimal learnability for all-zero groups and then replaces full-group reward normalization with a boundary-focused contrastive update on the strongest positive and the hardest negative. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width. Across multiple generative recommendation tasks, ReCast consistently outperforms OpenOneRec-RL, achieving up to 36.6% relative improvement in Pass@1. Its matched-budget advantage is substantially larger: ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale. The same design also yields direct system-level gains, reducing actor-side update time by 16.60x, lowering peak allocated memory by 16.5%, and improving actor MFU by 14.2%. Mechanism analysis shows that ReCast mitigates the persistent all-zero / single-hit regime, restores learnability when natural positives are scarce, and converts otherwise wasted rollout budget into more stable policy updates. These results suggest that, for generative recommendation, the decisive RL problem is not only how to assign rewards, but how to construct learnable optimization events from sparse, structured supervision.

10. 【2604.22100】Implementation and Privacy Guarantees for Scalable Keyword Search on SOLID-based Decentralized Data with Granular Visibility Constraints

Link: https://arxiv.org/abs/2604.22100

Authors: Mohamed Ragab, Faria Ferooz, Mohammad Bahrani, Helen Oliver, Thanassis Tiropanis, Alexandra Poulovassilis, Adriane Chapman, George Roussos

Categories: Databases (cs.DB); Information Retrieval (cs.IR)

Keywords: Solid-compliant server infrastructures, users retain sovereignty, hosted on Solid-compliant, data ecosystems grounded, online data stores

Comments:

Abstract:In decentralized personal data ecosystems grounded in architectures such as Solid, users retain sovereignty over their data via personal online data stores (pods), hosted on Solid-compliant server infrastructures. In such environments, data remains under the control of pod owners, which complicates search due to distribution across numerous pods and user-specific access constraints. ESPRESSO is a decentralized framework for scalable keyword-based search across distributed Solid pods under user-defined visibility policies. It addresses key challenges of decentralized search by constructing WebID-scoped indexes within pods and employing privacy-aware metadata to enable efficient source selection and ranking across servers. This paper further introduces a formal threat model for ESPRESSO, analysing the security and privacy risks associated with the generation, aggregation, and use of indexes and metadata. These risks include unintended metadata leakage and the potential for adversaries to infer sensitive information about data that resides within personal data stores. The analysis identifies key design principles that limit metadata exposure while mitigating unauthorized inference. The proposed threat model provides a foundation for evaluating privacy-preserving decentralized search and informs the design of systems with stronger privacy guarantees.

11. 【2602.00208】Analyzing Shapley Additive Explanations to Understand Anomaly Detection Algorithm Behaviors and Their Complementarity

Link: https://arxiv.org/abs/2602.00208

Authors: Jordan Levy, Paul Saves, Moncef Garouani, Nicolas Verstaevel, Benoit Gaudou

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Statistics Theory (math.ST); Machine Learning (stat.ML)

Keywords: challenging problem due, lack of labels, problem due, data distributions, anomaly

Comments: IDA Frontier Prize and Best Paper Award - Intelligent Data Analysis (IDA) 2026, Springer Nature

Abstract:Unsupervised anomaly detection is a challenging problem due to the diversity of data distributions and the lack of labels. Ensemble methods are often adopted to mitigate these challenges by combining multiple detectors, which can reduce individual biases and increase robustness. Yet building an ensemble that is genuinely complementary remains challenging, since many detectors rely on similar decision cues and end up producing redundant anomaly scores. As a result, the potential of ensemble learning is often limited by the difficulty of identifying models that truly capture different types of irregularities. To address this, we propose a methodology for characterizing anomaly detectors through their decision mechanisms. Using SHapley Additive exPlanations, we quantify how each model attributes importance to input features, and we use these attribution profiles to measure similarity between detectors. We show that detectors with similar explanations tend to produce correlated anomaly scores and identify largely overlapping anomalies. Conversely, explanation divergence reliably indicates complementary detection behavior. Our results demonstrate that explanation-driven metrics offer a different criterion than raw outputs for selecting models in an ensemble. However, we also demonstrate that diversity alone is insufficient; high individual model performance remains a prerequisite for effective ensembles. By explicitly targeting explanation diversity while maintaining model quality, we are able to construct ensembles that are more diverse, more complementary, and ultimately more effective for unsupervised anomaly detection.

Computer Vision

1. 【2604.22739】Inter-Stance: A Dyadic Multimodal Corpus for Conversational Stance Analysis

Link: https://arxiv.org/abs/2604.22739

Authors: Xiang Zhang, Xiaotian Li, Taoyue Wang, Nan Bi, Xin Zhou, Cody Zhou, Zoie Wang, Andrew Yang, Yuming Su, Jeff Cohn, Qiang Ji, Lijun Yin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attaching social meaning, facial expressions, Social interactions dominate, spontaneous as gestures, dominate our perceptions

Comments:

Abstract:Social interactions dominate our perceptions of the world and shape our daily behavior by attaching social meaning to acts as simple and spontaneous as gestures, facial expressions, voice, and speech. People mimic and otherwise respond to each other's postures, facial expressions, mannerisms, and other verbal and nonverbal behavior, and form appraisals or evaluations in the process. Yet, no publicly-available dataset includes multimodal recordings and self-report measures of multiple persons in social interaction. Dyadic recordings and annotation are lacking. We present a new data corpus of multimodal dyadic interaction (45 dyads, 90 persons) that includes synchronized multi-modality behavior (2D face video, 3D face geometry, thermal spectrum dynamics, voice and speech behavior, and physiology (PPG, EDA, heart-rate, blood pressure, and respiration)) and self-reported affect of all participants in a communicative interaction scenario. Two types of dyads are included: persons with shared past history and strangers. Annotations include social signals, agreement, disagreement, and neutral stance. With a potent emotion induction, these multimodal data will enable novel modeling of multimodal interpersonal behavior. We present extensive experiments to evaluate multimodal dyadic communication of dyads with and without interpersonal history, and their affect. This new database will enable multimodal modeling of social interaction that was not possible before. The dataset includes 20TB of multimodal data to share with the research community.

2. 【2604.22714】Long-tail Internet photo reconstruction

Link: https://arxiv.org/abs/2604.22714

Authors: Yuan Li, Yuanbo Xiangli, Hadar Averbuch-Elor, Noah Snavely, Ruojin Cai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: photo collections exhibit, Internet photo collections, extremely long-tailed distribution, uneven imagery, classical and learned

Comments: Project page: [this https URL](https://megadepth-x.github.io/)

Abstract:Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented with sparse, noisy, uneven imagery beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, and also enables more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.

3. 【2604.22700】Generative Modeling of Neurodegenerative Brain Anatomy with 4D Longitudinal Diffusion Model

Link: https://arxiv.org/abs/2604.22700

Authors: Nivetha Jayakumar, Swakshar Deb, Bahram Jafrasteh, Qingyu Zhao, Miaomiao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Understanding and predicting, neurodegenerative diseases remains, early diagnosis, treatment planning, remains a major

Comments:

Abstract:Understanding and predicting the progression of neurodegenerative diseases remains a major challenge in medical AI, with significant implications for early diagnosis, disease monitoring, and treatment planning. However, most available longitudinal neuroimaging datasets are temporally sparse with a few follow-up scans per subject. This scarcity of temporal data limits our ability to model and accurately capture the continuous anatomical changes related to disease progression in individual subjects. To address this problem, we propose a novel 4D (3DxT) diffusion-based generative framework that effectively models and synthesizes longitudinal brain anatomy over time, conditioned on available clinical variables such as health status, age, sex, and other relevant factors. Moreover, while most current approaches focus on manipulating image intensity or texture, our method explicitly learns the data distribution of topology-preserving spatiotemporal deformations to effectively capture the geometric changes of brain structures over time. This design enables the realistic generation of future anatomical states and the reconstruction of anatomically consistent disease trajectories, providing a more faithful representation of longitudinal brain changes. We validate our model through both synthetic sequence generation and downstream longitudinal disease classification, as well as brain segmentation. Experiments on two large-scale longitudinal neuroimage datasets demonstrate that our method outperforms state-of-the-art baselines in generating anatomically accurate, temporally consistent, and clinically meaningful brain trajectories. Our code is available on GitHub.

4. 【2604.22686】SS3D: End2End Self-Supervised 3D from Web Videos

Link: https://arxiv.org/abs/2604.22686

Authors: Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: web-scale SfM-based self-supervision, pipeline for feed-forward, estimation from monocular, web-scale SfM-based, SfM-based self-supervision pretraining

Comments:

Abstract:We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.

5. 【2604.22658】PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views

Link: https://arxiv.org/abs/2604.22658

Authors: Jiaxin Shi, Guofeng Zhang, Wufei Ma, Naifu Liang, Adam Kortylewski, Alan Vuile

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fundamental yet challenging, challenging task, increasingly important, shape retrieval, Single-view

Comments:

Abstract:Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.

6. 【2604.22657】A Non-Invasive Alternative to RFID: Self-Sufficient 3D Identification of Group-Housed Livestock

Link: https://arxiv.org/abs/2604.22657

Authors: Shiva Paudel, TsungCheng Tsai, Dongyi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precision livestock management, Radio Frequency Identification, Accurate identification, cornerstone of precision, Adaptive Recognition Architecture

Comments:

Abstract:Accurate identification of individual farm animals in group-housed environments is a cornerstone of precision livestock management. However, current industry standards rely heavily on Radio Frequency Identification (RFID) ear tags, which are invasive, prone to loss, and restricted by the spatial limitations of antenna fields. In this paper, we propose a non-intrusive, vision-based identification system leveraging 3D point cloud data captured within a commercial electronic feeding station (EFS). Departing from traditional supervised frame-level inference, we introduce the Temporal Adaptive Recognition Architecture (TARA), a self-sufficient, semi-supervised framework designed to maintain identity consistency over time. TARA employs a dynamic recalibration mechanism that updates individual identity profiles to account for morphological changes in the livestock. To facilitate training in label-scarce environments, we utilize a visit-level majority voting strategy to generate high-fidelity pseudo-labels from raw temporal sequences. Experimental results on a group-housed sow dataset collected from an operational commercial barn demonstrate that our approach achieves 100% identification accuracy at the visit level. These results suggest that vision-based 3D point cloud analysis offers a robust, superior alternative to RFID-based systems, paving the way for fully autonomous individual animal monitoring.
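The visit-level majority voting mentioned in the abstract above could look roughly like the following sketch. The abstract gives no details, so the agreement threshold `min_agreement`, the function name, and the identity labels are all hypothetical choices for illustration.

```python
from collections import Counter


def visit_pseudo_label(frame_predictions, min_agreement=0.5):
    """Return the majority identity for one visit, or None if agreement is too low.

    frame_predictions: per-frame identity predictions for a single feeder visit.
    min_agreement: hypothetical threshold; the winning label must cover strictly
                   more than this fraction of frames to be accepted.
    """
    if not frame_predictions:
        return None
    label, count = Counter(frame_predictions).most_common(1)[0]
    if count / len(frame_predictions) <= min_agreement:
        return None  # too ambiguous to use as a pseudo-label
    return label


# One visit with a single noisy frame-level prediction.
visit = ["sow_12", "sow_12", "sow_07", "sow_12", "sow_12"]
label = visit_pseudo_label(visit)
```

Discarding low-agreement visits is one way to keep pseudo-labels clean when few ground-truth labels exist; whether the paper filters visits this way is not stated.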

7. 【2604.22649】Structure-Guided Diffusion Model for EEG-Based Visual Cognition Reconstruction

Link: https://arxiv.org/abs/2604.22649

Authors: Yongxiang Lian, Yueyang Cang, Pingge Hu, Yuchen He, Li Shi

Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)

Keywords: brain-computer interface, important problem, problem in neuroscience, neuroscience and brain-computer, EEG

Comments:

Abstract:Objective: Decoding visual information from electroencephalography (EEG) is an important problem in neuroscience and brain-computer interface (BCI) research. Existing methods are largely restricted to natural images and categorical representations, with limited capacity to capture structural features and to differentiate objective perception from subjective cognition. We propose a Structure-Guided Diffusion Model (SGDM) that incorporates explicit structural information for EEG-based visual reconstruction. Approach: SGDM is evaluated on the Kilogram abstract visual object dataset and the THINGS natural image dataset using a two-stage generative mechanism. The framework combines a structurally supervised variational autoencoder with a spatiotemporal EEG encoder aligned to a visual embedding space via contrastive learning. Structural information is integrated into a diffusion model through ControlNet to guide image generation from EEG features. Results: SGDM outperforms existing methods on both abstract and natural image datasets. Reconstructed images achieve higher fidelity in low-level visual features and semantic representations, indicating improved decoding accuracy and strong generalization across diverse visual domains. Spatiotemporal analysis of EEG signals further reveals hierarchical structural encoding patterns, consistent with the neural dynamics of visual cognition. Significance: These findings validate the effectiveness of SGDM in capturing explicit structural geometry and generating images with high fidelity to individual cognitive representations. By enabling decoding of complex visual content from EEG signals, the framework extends neural decoding beyond low-dimensional or categorical outputs. This supports BCIs with increased degrees of freedom for intention decoding and more flexible brain-to-machine communication.

8. 【2604.22595】EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges

Link: https://arxiv.org/abs/2604.22595

Authors: Hyo Jin Jon, Longbin Jin, Eun Yi Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: natural language supervision, demonstrated strong generalization, action recognition, video action recognition, language supervision

Comments: 14 pages, 8 figures, 6 tables

Abstract:CLIP has demonstrated strong generalization in visual domains through natural language supervision, even for video action recognition. However, most existing approaches that adapt CLIP for action recognition have primarily focused on temporal modeling, often overlooking spatial perception. In real-world scenarios, visual challenges such as low-light environments or egocentric viewpoints can severely impair spatial understanding, an essential precursor for effective temporal reasoning. To address this limitation, we propose Efficient Visual Prompting for CLIP (EV-CLIP), an efficient adaptation framework designed for few-shot video action recognition across diverse scenes and viewpoints. EV-CLIP introduces two visual prompts: mask prompts, which guide the model's attention to action-relevant regions by reweighting pixels, and context prompts, which perform lightweight temporal modeling by compressing frame-wise features into a compact representation. For a comprehensive evaluation, we curate five benchmark datasets and analyze domain shifts to quantify the influence of diverse visual and semantic factors on action recognition. Experimental results demonstrate that EV-CLIP outperforms existing parameter-efficient methods in overall performance. Moreover, its efficiency remains independent of the backbone scale, making it well-suited for deployment in real-world, resource-constrained scenarios. The code is available at this https URL.

9. 【2604.22586】FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing

Link: https://arxiv.org/abs/2604.22586

Authors: Ze Chen, Lan Chen, Yuanhang Li, Qi Mao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: training-free framework, framework for stable, editing, editing signal, Spatial-aware Attention Refinement

Comments: Under review

Abstract:We propose FlowAnchor, a training-free framework for stable and efficient inversion-free, flow-based video editing. Inversion-free editing methods have recently shown impressive efficiency and structure preservation in images by directly steering the sampling trajectory with an editing signal. However, extending this paradigm to videos remains challenging, often failing in multi-object scenes or with increased frame counts. We identify the root cause as the instability of the editing signal in high-dimensional video latent spaces, which arises from imprecise spatial localization and length-induced magnitude attenuation. To overcome this challenge, FlowAnchor explicitly anchors both where to edit and how strongly to edit. It introduces Spatial-aware Attention Refinement, which enforces consistent alignment between textual guidance and spatial regions, and Adaptive Magnitude Modulation, which adaptively preserves sufficient editing strength. Together, these mechanisms stabilize the editing signal and guide the flow-based evolution toward the desired target distribution. Extensive experiments demonstrate that FlowAnchor achieves more faithful, temporally coherent, and computationally efficient video editing across challenging multi-object and fast-motion scenarios. The project page is available at this https URL.

10. 【2604.22562】Data-Free Contribution Estimation in Federated Learning using Gradient von Neumann Entropy

Link: https://arxiv.org/abs/2604.22562

Authors: Asim Ukaye, Mubarak Abdu-Aguye, Nurbek Tastan, Karthik Nandakumar

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: Federated Learning, providing fair rewards, identifying clients' importance, fair rewards, identifying clients'

Comments: 10 pages, 4 figures, 4 pages Appendix, 6 figures in Appendix. To appear in CVPR 2026 FedVision Workshop

Abstract:Client contribution estimation in Federated Learning is necessary for identifying clients' importance and for providing fair rewards. Current methods often rely on server-side validation data or self-reported client information, which can compromise privacy or be susceptible to manipulation. We introduce a data-free signal based on the matrix von Neumann (spectral) entropy of the final-layer updates, which measures the diversity of the information contributed. We instantiate two practical schemes: (i) SpectralFed, which uses normalized entropy as aggregation weights, and (ii) SpectralFuse, which fuses entropy with class-specific alignment via a rank-adaptive Kalman filter for per-round stability. Across CIFAR-10/100 and the naturally partitioned FEMNIST and FedISIC benchmarks, entropy-derived scores show a consistently high correlation with standalone client accuracy under diverse non-IID regimes - without validation data or client metadata. We compare our results with data-free contribution estimation baselines and show that spectral entropy serves as a useful indicator of client contribution.
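As a rough illustration of the signal described in the abstract above, the sketch below computes a von Neumann (spectral) entropy for an update matrix and turns per-client entropies into aggregation weights. The abstract does not specify how the density matrix is built or normalized; here it is assumed to be the trace-normalized Gram matrix of the update, and the entropy is divided by the log of the smaller matrix dimension. Both function names are hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np


def von_neumann_entropy(update, eps=1e-12):
    """Spectral entropy of a 2-D update matrix, scaled to the range [0, 1].

    Assumption: the density matrix is update @ update.T divided by its trace,
    whose eigenvalues are the squared singular values normalized to sum to 1.
    """
    singular_values = np.linalg.svd(update, compute_uv=False)
    lam = singular_values ** 2
    total = lam.sum()
    if total <= eps or lam.size < 2:
        return 0.0
    lam = lam / total
    lam = lam[lam > eps]  # treat 0 * log(0) as 0
    entropy = -np.sum(lam * np.log(lam))
    return float(entropy / np.log(min(update.shape)))


def aggregation_weights(updates):
    """Normalize per-client entropies so they sum to 1 (uniform if all are zero)."""
    ent = np.array([von_neumann_entropy(u) for u in updates])
    if ent.sum() == 0:
        return np.full(len(updates), 1.0 / len(updates))
    return ent / ent.sum()


# A full-rank update spreads energy over all directions (entropy 1),
# while a rank-one update concentrates it in one direction (entropy 0).
diverse = np.eye(3)
rank_one = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0])
weights = aggregation_weights([diverse, rank_one])
```

The intuition is that an update whose energy is spread across many singular directions carries more diverse information than one dominated by a single direction.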

11. 【2604.22560】Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors

Link: https://arxiv.org/abs/2604.22560

Authors: Gautam Kumar Jain, Carsten Markgraf, Julian Stähler

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Graph Visual Question, Visual Question Answering, Graph Visual, Question Answering, Visual Question

Comments: 16 pages, 8 figures, 8 tables, preprint

Abstract:Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage's input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.

12. 【2604.22554】Video Analysis and Generation via a Semantic Progress Function

Link: https://arxiv.org/abs/2604.22554

Authors: Gal Metzer, Sagi Polaczek, Ali Mahdavi-Amiri, Raja Giryes, Daniel Cohen-Or

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: highly non-linear manner, abrupt semantic jumps, Transformations produced, Semantic Progress Function, video generation models

Comments: SIGGRAPH 2026

Abstract:Transformations produced by image and video generation models often evolve in a highly non-linear manner: long stretches where the content barely changes are followed by sudden, abrupt semantic jumps. To analyze and correct this behavior, we introduce a Semantic Progress Function, a one-dimensional representation that captures how the meaning of a given sequence evolves over time. For each frame, we compute distances between semantic embeddings and fit a smooth curve that reflects the cumulative semantic shift across the sequence. Departures of this curve from a straight line reveal uneven semantic pacing. Building on this insight, we propose a semantic linearization procedure that reparameterizes (or retimes) the sequence so that semantic change unfolds at a constant rate, yielding smoother and more coherent transitions. Beyond linearization, our framework provides a model-agnostic foundation for identifying temporal irregularities, comparing semantic pacing across different generators, and steering both generated and real-world video sequences toward arbitrary target pacing.
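The idea in the abstract above (a cumulative semantic-shift curve, then retiming so that change unfolds at a constant rate) can be sketched as follows. This is a simplified illustration: it uses Euclidean distance between consecutive frame embeddings, skips the smooth curve fitting the abstract mentions, and uses linear interpolation for retiming. The function names and toy embeddings are hypothetical.

```python
import math


def progress_curve(embeddings):
    """Cumulative semantic shift per frame, scaled so the last frame has progress 1."""
    cumulative = [0.0]
    for prev, cur in zip(embeddings, embeddings[1:]):
        cumulative.append(cumulative[-1] + math.dist(prev, cur))
    total = cumulative[-1]
    if total == 0:
        # No semantic change at all: fall back to uniform progress.
        n = len(embeddings)
        return [i / (n - 1) for i in range(n)] if n > 1 else [0.0]
    return [c / total for c in cumulative]


def linearize(progress, num_out):
    """Fractional source-frame positions at which progress advances at a constant rate."""
    if num_out < 2 or len(progress) < 2:
        raise ValueError("need at least two input frames and two output frames")
    positions = []
    j = 0
    for k in range(num_out):
        target = k / (num_out - 1)
        # Advance to the segment [j, j + 1] whose progress range contains the target.
        while j < len(progress) - 2 and progress[j + 1] < target:
            j += 1
        span = progress[j + 1] - progress[j]
        frac = 0.0 if span == 0 else (target - progress[j]) / span
        positions.append(j + frac)
    return positions


# Toy 1-D embeddings: nothing changes for three frames, then two equal jumps.
frames = [[0.0], [0.0], [0.0], [1.0], [2.0]]
progress = progress_curve(frames)
retimed = linearize(progress, 3)
```

In this toy case the retimed sequence skips the static stretch: the three output frames sample source positions 0, 3, and 4, so each output step covers the same amount of semantic change.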

13. 【2604.22552】Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models

Link: https://arxiv.org/abs/2604.22552

Authors: Shihui Yan, Ziqi Zhou, Yufei Song, Yifan Hu, Minghui Li, Shengshan Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: severe safety risks, autonomous driving systems, creating severe safety, critically threaten pedestrian, attacks critically threaten

Comments:

Abstract:Physical adversarial patch attacks critically threaten pedestrian detection, causing surveillance and autonomous driving systems to miss pedestrians and creating severe safety risks. Despite their effectiveness in controlled settings, existing physical attacks face two major limitations in practice: they lack systematic disruption of the multi-stage decision pipeline, enabling residual modules to offset perturbations, and they fail to model complex physical variations, leading to poor robustness. To overcome these limitations, we propose a novel pedestrian adversarial patch generation method that combines multi-stage collaborative attacks with robustness enhancement under physical diversity, called TriPatch. Specifically, we design a triplet loss consisting of detection confidence suppression, bounding-box offset amplification, and non-maximum suppression (NMS) disruption, which jointly act across different stages of the detection pipeline. In addition, we introduce an appearance consistency loss to constrain the color distribution of the patch, thereby improving its adaptability under diverse imaging conditions, and incorporate data augmentation to further enhance robustness against complex physical perturbations. Extensive experiments demonstrate that TriPatch achieves a higher attack success rate across multiple detector models compared to existing approaches.

14. 【2604.22546】ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation

Link: https://arxiv.org/abs/2604.22546

Authors: Amir Hosseini,Sara Farahani,Xinyi Li,Suiyang Guang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: describe visual scenes, flexible relation phrases, fixed predicate set, aims to describe, scene graph generation

Comments:

Abstract:Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.

15. 【2604.22539】Evolving Thematic Map Design in Academic Cartography: A Thirty-Year Study Based on Multilingual Journals

Link: https://arxiv.org/abs/2604.22539

Authors: Zhiwei Wei,Chenxi Song,Tazhu Wang,Fan Wu,Hua Liao,Su Ding,Nai Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

Keywords: Thematic maps play, thematic map design, examined empirically, play a central, central role

Comments:

Abstract:Thematic maps play a central role in academic communication, yet their large-scale design evolution has rarely been examined empirically. This study presents a longitudinal and multilingual analysis of thematic map design practices in academic cartography from 1990 to 2020. We compile a corpus of 45,732 research articles from sixteen authoritative Chinese- and English-language journals and extract 23,928 maps using computer vision and large-model-based document parsing to build a structured dataset. Map design characteristics are quantified across three dimensions: map elements, color design, and layout structure. Results show that Chinese- and English-language academic maps share highly similar structural conventions, typically employing restrained color palettes with neutral dominant hues, low saturation, high brightness, and limited hue diversity, as well as centered layouts with high main-map occupation ratios. The two groups differ in that English-language maps show slightly greater hue richness and compactness, whereas Chinese-language maps historically rely more on neutral hues and integrated layouts. Temporal analysis reveals parallel evolutionary trends in both groups, including increasing element richness, legend usage, and hue diversity, alongside stable layout structures. Overall, the findings suggest that academic map design evolution is characterized more by institutional convergence than cultural divergence.

16. 【2604.22529】Distilling Vision Transformers for Distortion-Robust Representation Learning

Link: https://arxiv.org/abs/2604.22529

Authors: Konstantinos Alexis,Giorgos Giannopoulos,Dimitrios Gunopulos

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable success, Self-supervised learning, learning visual representations, learning visual, achieved remarkable

Comments:

Abstract:Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this paper, we demonstrate that pretrained vision models can be leveraged to learn distortion-robust representations, which can then be effectively applied to downstream tasks operating on distorted observations. In particular, we propose an asymmetric knowledge distillation framework in which both teacher and student are initialized from the same pretrained Vision Transformer but receive different views of each image: the teacher processes clean images, while the student sees their distorted versions. We introduce multi-level distillation that aligns global embeddings, patch-level features, and attention maps and show that the student is able to approximate clean-image representations despite never directly accessing clean data. We evaluate our approach on image classification tasks across several datasets and under various distortions, consistently outperforming existing alternatives for the same amount of human supervision.

17. 【2604.22518】Non-Minimal Sampling and Consensus for Prohibitively Large Datasets

Link: https://arxiv.org/abs/2604.22518

Authors: Seong Hun Lee,Patrick Vandewalle,Javier Civera

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Sampling and Consensus, arbitrarily large datasets, large datasets contaminated, Non-Minimal Sampling, arbitrarily large

Comments:

Abstract:We introduce NONSAC (Non-Minimal Sampling and Consensus), a general framework for robust and scalable model estimation from arbitrarily large datasets contaminated with noise and outliers. NONSAC repeatedly samples non-minimal subsets of data and generates model hypotheses using a robust estimator, producing multiple candidate models. The final model is selected based on a predefined scoring rule that evaluates hypothesis quality. Our framework is estimator-agnostic and can be integrated with existing geometric fitting algorithms such as RANSAC to improve both scalability and robustness to outliers. We propose and evaluate various scoring rules for NONSAC on relative camera pose estimation, Perspective-n-Point, and point cloud registration. Furthermore, we showcase the applicability of NONSAC to correspondence-free point cloud registration by hypothesizing all-to-all correspondences.
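
The sample-estimate-score loop described in this abstract can be illustrated with a toy problem. The sketch below is only an assumption-laden example: it estimates a single scalar location, uses the median as the robust estimator, and scores each hypothesis by its inlier count within the sampled subset. None of these choices come from the paper, which evaluates its own scoring rules on camera pose and registration problems. The name `nonsac_location` and all parameters are hypothetical.

```python
import random
from statistics import median

def nonsac_location(data, subset_size=50, num_hypotheses=20, tol=0.5, seed=0):
    """Toy non-minimal sampling and consensus for a scalar location.

    Each hypothesis is computed from a non-minimal random subset, so the
    full dataset is never processed at once.
    """
    rng = random.Random(seed)
    best_model, best_score = None, -1
    for _ in range(num_hypotheses):
        subset = rng.sample(data, min(subset_size, len(data)))
        model = median(subset)  # robust estimator (assumed for this sketch)
        # Scoring rule (assumed): inliers counted within the subset only.
        score = sum(abs(x - model) <= tol for x in subset)
        if score > best_score:
            best_model, best_score = model, score
    return best_model

# Synthetic data: 70% inliers near 5.0, 30% uniformly spread outliers.
gen = random.Random(1)
inliers = [5.0 + gen.uniform(-0.2, 0.2) for _ in range(700)]
outliers = [gen.uniform(-100.0, 100.0) for _ in range(300)]
estimate = nonsac_location(inliers + outliers)
```

Scoring on the subset rather than on all points keeps the per-hypothesis cost independent of dataset size, which is the scalability property the abstract emphasizes.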

18. 【2604.22515】Different Strokes for Different Folks: Writer Identification for Historical Arabic Manuscripts

Link: https://arxiv.org/abs/2604.22515

Authors: Hamza A. Abushahla,Ariel Justine N. Panopio,Layth Al-Khairulla,Mohamed I. AlHajri

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Handwritten Arabic manuscripts, Arab world intellectual, Arabic manuscripts preserve, historical Arabic manuscripts, Handwritten Arabic

Comments: 29 pages, 13 figures, 31 tables

Abstract:Handwritten Arabic manuscripts preserve the Arab world's intellectual and cultural heritage, and writer identification supports provenance, authenticity verification, and historical analysis. Using the Muharaf dataset of historical Arabic manuscripts, we evaluate writer identification from individual line images and, to the best of our knowledge, provide the first baselines reported under both line-level and page-disjoint evaluation protocols. Since the dataset is only partially labeled for writer identification, we manually verified and expanded writer labels in the public portion from 6,858 (28.00%) to 21,249 lines (86.75%) out of 24,495 line images, correcting inconsistencies and removing non-handwritten text. After further filtering, we retained 18,987 lines (77.51%). We propose a Convolutional Neural Network (CNN)-based model with attention mechanisms for closed-set writer identification, including rare two-writer lines modeled as composite writer-pair classes. We benchmark fourteen configurations and conduct ablations across different feature extractors and training regimes. To assess generalization to unseen pages, the page-disjoint protocol assigns all lines from each page to a single split. Under the line-level protocol, a fine-tuned DenseNet201 with attention achieves 99.05% Top-1 accuracy, 99.73% Top-5 accuracy, and 97.44% F1-score. Under the more challenging page-disjoint protocol, the best observed results are 78.61% Top-1 accuracy, 87.79% Top-5 accuracy, and 66.55% F1-score, thus quantifying the impact of page-level cues. By expanding the Muharaf dataset's labeled subset and reporting both protocols, we provide a clearer benchmark and a practical resource for historians and linguists engaged with culturally and historically significant documents. The code and implementation details are available on GitHub.
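
The page-disjoint protocol mentioned in this abstract assigns every line of a page to the same split. A minimal sketch of such a split is shown below; the `(line_id, page_id)` input format, the function name `page_disjoint_split`, and the 20% test fraction are assumptions for illustration and are not taken from the paper or its code.

```python
import random
from collections import defaultdict

def page_disjoint_split(lines, test_fraction=0.2, seed=0):
    """Split (line_id, page_id) pairs so that no page spans both splits."""
    pages = defaultdict(list)
    for line_id, page_id in lines:
        pages[page_id].append(line_id)
    page_ids = sorted(pages)
    random.Random(seed).shuffle(page_ids)
    # Choose whole pages for the test split, at least one.
    n_test = max(1, round(test_fraction * len(page_ids)))
    test_pages = set(page_ids[:n_test])
    train = [l for p in page_ids if p not in test_pages for l in pages[p]]
    test = [l for p in page_ids if p in test_pages for l in pages[p]]
    return train, test

# Ten hypothetical pages with three lines each.
toy_lines = [((page, i), page) for page in range(10) for i in range(3)]
train_ids, test_ids = page_disjoint_split(toy_lines)
```

A line-level split would instead shuffle individual lines, letting lines from one page appear in both training and test data, which is why the abstract treats the page-disjoint results as the harder setting.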

19. 【2604.22507】Railway Artificial Intelligence Learning Benchmark (RAIL-BENCH): A Benchmark Suite for Perception in the Railway Domain

Link: https://arxiv.org/abs/2604.22507

Authors: Annika Bätz,Pavel Klasek,Seo-Young Ham,Philipp Neumaier,Martin Köppel,Martin Lauer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated train operation, infrastructure requires robust, requires robust camera-based, enable reproducible comparison, Automated train

Comments: 8 pages, 5 figures, 5 tables, submitted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems

Abstract:Automated train operation on existing railway infrastructure requires robust camera-based perception, yet the railway domain lacks public benchmark suites with standardized evaluation protocols that would enable reproducible comparison of approaches. We present RAIL-BENCH, the first perception benchmark suite for the railway domain. It comprises five challenges - rail track detection, object detection, vegetation segmentation, multi-object tracking, and monocular visual odometry - each tailored to the specific characteristics of railway environments. RAIL-BENCH provides curated training and test datasets drawn from diverse real-world scenarios, evaluation metrics, and public scoreboards (this https URL). For the rail track detection challenge we introduce LineAP, a novel segment-based average precision metric that evaluates the geometric accuracy of polyline predictions independently of instance-level grouping, addressing key limitations of existing line detection metrics.

20. 【2604.22506】ICPR 2026 Competition on Low-Resolution License Plate Recognition

Link: https://arxiv.org/abs/2604.22506

Authors: Rayson Laroca,Valfride Nascimento,Donggun Kim,Sanghyeok Chung,Subin Bae,Uihwan Seo,Seungsang Oh,Chi M. Phung,Minh G. Vo,Xingsong Ye,Yongkun Du,Yuchen Su,Zhineng Chen,Sunhee Heo,Hyangwoo Lee,Kihyun Na,Khanh V. Vu Nguyen,Sang T. Pham,Duc N. N. Phung,Trong P. Le,Vy N. Vo Tran,David Menotti

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: License Plate Recognition, Low-Resolution License Plate, license plate legibility, degrade license plate, severely degrade license

Comments: Accepted for presentation at the International Conference on Pattern Recognition (ICPR) 2026

Abstract:Low-Resolution License Plate Recognition (LRLPR) remains a challenging problem in real-world surveillance scenarios, where long capture distances, compression artifacts, and adverse imaging conditions can severely degrade license plate legibility. To promote progress in this area, we organized the ICPR 2026 Competition on Low-Resolution License Plate Recognition, the first competition specifically dedicated to LRLPR using real low-quality data collected under operationally relevant conditions. The competition was based on the LRLPR-26 dataset, which comprises 20,000 training tracks and 3,000 test tracks; each training track contains five low-resolution and five high-resolution images of the same license plate. Notably, a total of 269 teams from 41 countries registered for the competition, and 99 teams submitted valid entries in the Blind Test Phase. The winning team achieved a Recognition Rate of 82.13%, and four teams surpassed the 80% mark, highlighting both the high level of competition at the top of the leaderboard and the continued difficulty of the task. In addition to presenting the competition design, evaluation protocol, and main results, this paper summarizes the methods adopted by the top-5 teams and discusses current trends and promising directions for future research on LRLPR. The competition webpage is available at this https URL

21. 【2604.22498】CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

Link: https://arxiv.org/abs/2604.22498

Authors: Lihao Zheng,Zhenwei Shao,Yu Zhou,Yan Yang,Xintian Shen,Jiawei Chen,Hao Ma,Tao Wei

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Multimodal Large Language, Large Language, face notable challenges, exhibiting spatial hallucination

Comments:

Abstract:Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (CGC), a low-cost framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).

22. 【2604.22482】Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond

Link: https://arxiv.org/abs/2604.22482

Authors: Jing Ou,Zidong Cao,Yinrui Ren,Zhuoxiao Li,Jinjing Zhu,Tongyan Hua,Shuai Zhang,Hui Xiong,Wufan Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: exhibit degraded performance, advanced rapidly, spherical distortions, exhibit degraded, degraded performance

Comments:

Abstract:While feed-forward 3D reconstruction models have advanced rapidly, they still exhibit degraded performance on panoramas due to spherical distortions. Moreover, existing panoramic 3D datasets are predominantly collected with 360 cameras fixed at discrete locations, resulting in discontinuous trajectories. These limitations critically hinder the development of panoramic feed-forward 3D reconstruction, especially for the multi-view setting. In this paper, we present Holo360D, a comprehensive dataset containing 109,495 panoramas paired with registered point clouds, meshes, and aligned camera poses. To our knowledge, Holo360D is the first large-scale dataset that provides continuous panoramic sequences with accurately aligned high-completeness depth maps. The raw data are initially collected using a 3D laser scanner coupled with a 360 camera. Subsequently, the raw data are processed with both online and offline SLAM systems. Furthermore, to enhance the 3D data quality, a post-processing pipeline tailored for the 360 dataset is proposed, including geometry denoising, mesh hole filling, and region-specific remeshing. Finally, we establish a new benchmark by fine-tuning 3D reconstruction models on Holo360D, providing key insights into effective fine-tuning strategies. Our results demonstrate that Holo360D delivers superior training signals and provides a comprehensive benchmark for advancing panoramic 3D reconstruction models. Datasets and Code will be made publicly available.

23. 【2604.22479】Improving Driver Drowsiness Detection via Personalized EAR/MAR Thresholds and CNN-Based Classification

Link: https://arxiv.org/abs/2604.22479

Authors: Gökdeniz Ersoy,Mehmet Alper Tatar,Eray Tonbul,Serap Kırbız

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Mouth Aspect Ratio, traffic accidents worldwide, Eye Aspect Ratio, Aspect Ratio, accidents worldwide

Comments:

Abstract:Driver drowsiness is a major cause of traffic accidents worldwide, posing a serious threat to public safety. Vision-based driver monitoring systems often rely on fixed Eye Aspect Ratio (EAR) and Mouth Aspect Ratio (MAR) thresholds; however, such fixed values frequently fail to generalize across individuals due to variations in facial structure, illumination, and driving conditions. This paper proposes a personalized driver drowsiness detection system that monitors eyelid movements, head position, and yawning behavior in real time and provides warnings when signs of fatigue are detected. The system employs driver-specific EAR and MAR thresholds, calibrated before driving, to improve classical metric-based detection. In addition, deep learning-based Convolutional Neural Network (CNN) models are integrated to enhance accuracy in challenging scenarios. The system is evaluated using publicly available datasets as well as a custom dataset collected under diverse lighting conditions, head poses, and user characteristics. Experimental results show that personalized thresholding improves detection accuracy by 2-3% compared to fixed thresholds, while CNN-based classification achieves 99.1% accuracy for eye state detection and 98.8% for yawning detection, demonstrating the effectiveness of combining classical metrics with deep learning for robust real-time driver monitoring.
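
For readers unfamiliar with the Eye Aspect Ratio, the sketch below computes it from six eye landmarks using the commonly used definition (sum of the two vertical landmark distances divided by twice the horizontal distance) and derives a driver-specific threshold. The calibration rule, a fixed fraction of the driver's mean open-eye EAR, and the 0.75 ratio are assumptions for illustration; the abstract does not state how the paper calibrates its thresholds. Function names are hypothetical.

```python
import math
from statistics import mean

def eye_aspect_ratio(landmarks):
    """EAR from six (x, y) eye landmarks ordered p1..p6.

    p1 and p4 are the eye corners; (p2, p6) and (p3, p5) are the
    upper/lower eyelid pairs.
    """
    p1, p2, p3, p4, p5, p6 = landmarks
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)

def personalized_threshold(open_eye_ears, ratio=0.75):
    """Hypothetical calibration: a fraction of the mean open-eye EAR."""
    return ratio * mean(open_eye_ears)

def is_eye_closed(ear, threshold):
    """Flag a frame as eyes-closed when EAR drops below the threshold."""
    return ear < threshold

# Synthetic landmarks for a wide-open eye, plus a short calibration run.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
ear_value = eye_aspect_ratio(open_eye)
threshold = personalized_threshold([0.30, 0.32, 0.28])
```

The same pattern applies to the Mouth Aspect Ratio, with the comparison reversed because yawning raises the ratio rather than lowering it.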

24. 【2604.22477】Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples

Link: https://arxiv.org/abs/2604.22477

Authors: Oussama Bouanani,Jim Berend,Wojciech Samek,Sebastian Lapuschkin,Maximilian Dreyer

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: assigns textual descriptions, labeling assigns textual, deep networks, assigns textual, textual descriptions

Comments:

Abstract:Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Prior work such as FALCON introduced contrastive examples (inputs that are semantically similar to activating examples but elicit low activations) to sharpen explanations, but it primarily addresses subspace-level interpretability rather than scalable neuron-level labeling. We revisit contrastive explanations for neuron-level labeling in two stages: (1) candidate label generation with vision language models (VLMs) and (2) label assignment with CLIP-like encoders. First, we show that providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Second, we introduce Contrastive Semantic Projection (CSP), an extension of SemanticLens that incorporates contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines. Our results demonstrate that contrastive examples are a simple yet powerful and currently underutilized component of neuron labeling and analysis pipelines.

25. 【2604.22476】All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams

Link: https://arxiv.org/abs/2604.22476

Authors: Marco Pegoraro,Jonas Seng,Dustin Heller,Wil M.P. van der Aalst,Kristian Kersting

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: mining aid organizations, business process management, recorded event data, aid organizations, organizations by discovering

Comments: 17 pages, 6 figures, 1 table, 23 references

Abstract:Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.

26. 【2604.22439】NRGS: Neural Regularization for Robust 3D Semantic Gaussian Splatting

Link: https://arxiv.org/abs/2604.22439

Authors: Zaiyan Yang,Xinpeng Liu,Heng Guo,Jinglei Shi,Zhanyu Ma,Fumio Okura

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantic Gaussian Splatting, neural regularization method, Gaussian Splatting, propose a neural, neural regularization

Comments:

Abstract:We propose a neural regularization method that refines the noisy 3D semantic field produced by lifting multi-view inconsistent 2D features, in order to obtain an accurate and robust 3D semantic Gaussian Splatting. The 2D features extracted from vision foundation models suffer from multi-view inconsistency due to a lack of cross-view constraints. Lifting these inconsistent features directly into 3D Gaussians results in a noisy semantic field, which degrades the performance of downstream tasks. Previous methods either focus on obtaining consistent multi-view features in the preprocessing stage or aim to mitigate noise through improved optimization strategies, often at the cost of increased preprocessing time or expensive computational overhead. In contrast, we introduce a variance-aware conditional MLP that operates directly on the 3D Gaussians, leveraging their geometric and appearance attributes to correct semantic errors in 3D space. Experiments on different datasets show that our method enhances the accuracy of lifted semantics, providing an efficient and effective approach to robust 3D semantic Gaussian Splatting.

27. 【2604.22409】SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

Link: https://arxiv.org/abs/2604.22409

Authors: Chih-Ting Liao,Xi Xiao,Chunlei Meng,Zhangquan Chen,Yitong Qiao,Weilin Zhou,Tianyang Wang,Xu Zheng,Xin Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, Multimodal large, large language models, advanced static visual, environmental change

Comments:

Abstract:Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.

28. 【2604.22390】Region Matters: Efficient and Reliable Region-Aware Visual Place Recognition

Link: https://arxiv.org/abs/2604.22390

Authors: Shunpeng Chen,Yukun Song,Changwei Wang,Rongtao Xu,Kexue Fu,Longxiang Gao,Li Guo,Ruisheng Wang,Shibiao Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Place Recognition, Visual Place, Place Recognition, query image geographic, image geographic location

Comments: 25 pages, 13 figures, 10 tables, 1 algorithm

Abstract:Visual Place Recognition (VPR) determines a query image's geographic location by matching it against geotagged databases. However, existing methods struggle with perceptual aliasing caused by irrelevant regions and inefficient re-ranking due to rigid candidate scheduling. To address these issues, we introduce FoL++, a method combining robust discriminative region modeling with adaptive re-ranking. Specifically, we propose a Reliability Estimation Branch to generate spatial reliability maps that explicitly model occlusion resistance. This representation is further optimized by two spatial alignment losses (SAL and SCEL) to effectively align features and highlight salient regions. For weakly supervised learning without manual annotations, a pseudo-correspondence strategy generates dense local feature supervision directly from aggregation clusters. Our Adaptive Candidate Scheduler dynamically resizes candidate pools based on global similarity. By weighting local matches by reliability and adaptively fusing global and local evidence, FoL++ surpasses traditional independent matching systems. Extensive experiments across seven benchmarks demonstrate that FoL++ achieves state-of-the-art performance with a lightweight memory footprint, improving inference speed by 40% over FoL. Code and models will be released (and merged with FoL) at this https URL.

29. 【2604.22388】HFS-TriNet: A Three-Branch Collaborative Feature Learning Network for Prostate Cancer Classification from TRUS Videos

Link: https://arxiv.org/abs/2604.22388

Authors: Xu Lu,Qianhong Peng,Qihao Zhou,Shaopeng Liu,Xiuqin Ye,Chuan Yang,Yuan Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: non-invasive modality widely, Transrectal ultrasound, cost-effective and non-invasive, non-invasive modality, modality widely

Comments:

Abstract:Transrectal ultrasound (TRUS) imaging is a cost-effective and non-invasive modality widely used in the diagnosis of prostate cancer. Computer-aided diagnosis (CAD) based on TRUS images has been extensively investigated recently. Compared to static images, TRUS video provides richer spatial-temporal information, which makes it a promising alternative for improving the accuracy and robustness of CAD systems. However, TRUS video analysis also introduces new challenges. These include information redundancy, which increases computational costs; high intra- and inter-class similarity, which complicates feature extraction; and a low signal-to-noise ratio, which hinders the identification of clinically relevant information. To address these problems, we propose a heuristic frame selection (HFS) strategy and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. Specifically, selecting a clip of video frames at intervals for training can mitigate redundancy. The HFS strategy dynamically initializes the starting point of each training clip, which ensures that the sampled clips span the entire video sequence. For better feature extraction, besides a regular ResNet50 branch, we also utilize 1) a large model branch based on a pre-trained medical segment anything model (SAM) to extract deep features of each frame and a normalization-based attention module to explore the temporal consistency; and 2) a wavelet transform convolutional residual (WTCR) branch that extracts lesion edge information in the high-frequency domain and performs denoising in the low-frequency domain.

30. 【2604.22379】Efficient Diffusion Distillation via Embedding Loss

Link: https://arxiv.org/abs/2604.22379

Authors: Jincheng Ying,Yitao Chen,Li Wenlin,Minghui Xu,Yinhao Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: show significant promise, Recent advances, distilling expensive diffusion, generators show significant, significant promise

Comments:

Abstract:Recent advances in distilling expensive diffusion models into efficient few-step generators show significant promise. However, these methods typically demand substantial computational resources and extended training periods, limiting accessibility for resource-constrained researchers, and existing supplementary loss functions have notable limitations. Regression loss requires pre-generating large datasets before training and limits the student model to the teacher's performance, while GAN-based losses suffer from training instability and require careful tuning. In this paper, we propose Embedding Loss (EL), a novel supplementary loss function that complements existing diffusion distillation methods to enhance generation quality and accelerate training with smaller batch sizes. Leveraging feature embeddings from a diverse set of randomly initialized networks, EL effectively aligns the feature distributions between the distilled few-step generator and the original data. By computing Maximum Mean Discrepancy (MMD) in the embedded feature space, EL ensures robust distribution matching, thereby preserving sample fidelity and diversity during distillation. Within distribution matching distillation frameworks, EL demonstrates strong empirical performance for one-step generators. On the CIFAR-10 dataset, our approach achieves state-of-the-art FID values of 1.475 for unconditional generation and 1.380 for conditional generation. Beyond CIFAR-10, we further validate EL across multiple benchmarks and distillation methods, including ImageNet, AFHQ-v2, and FFHQ datasets, using DMD, DI, and CM distillation frameworks, demonstrating consistent improvements over existing one-step distillation methods. Our method also reduces training iterations by up to 80%, offering a more practical and scalable solution for deploying diffusion-based generative models in resource-constrained environments.
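
The core quantity in this abstract, Maximum Mean Discrepancy computed in the feature space of randomly initialized networks, can be sketched in a few lines. The example below is a rough illustration only: the one-layer tanh embedding, the RBF kernel with bandwidth 1.0, the biased MMD estimator, and averaging over embedders are all assumptions made for this sketch, and every function name is hypothetical. It is not the paper's loss.

```python
import math
import random

def random_embedding(dim_in, dim_out, seed):
    """A fixed, randomly initialized one-layer network: x -> tanh(Wx)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim_in)]
               for _ in range(dim_out)]
    def embed(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x)))
                for row in weights]
    return embed

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel between two feature vectors."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, kernel=rbf_kernel):
    """Biased estimate of squared MMD between two samples."""
    def mean_kernel(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def embedding_mmd(real, generated, embedders):
    """Average squared MMD over several random embeddings."""
    total = 0.0
    for embed in embedders:
        total += mmd_squared([embed(x) for x in real],
                             [embed(x) for x in generated])
    return total / len(embedders)

# Toy check: identical samples give zero, shifted samples give a positive value.
gen = random.Random(0)
real = [[gen.gauss(0.0, 1.0), gen.gauss(0.0, 1.0)] for _ in range(30)]
shifted = [[x + 3.0, y + 3.0] for x, y in real]
embedders = [random_embedding(2, 8, seed=s) for s in range(3)]
same_score = embedding_mmd(real, real, embedders)
shift_score = embedding_mmd(real, shifted, embedders)
```

Because the embedders are never trained, the only cost they add is a forward pass, which fits the abstract's emphasis on limited compute.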

31. 【2604.22354】One Shot Learning for Edge Detection on Point Clouds

Link: https://arxiv.org/abs/2604.22354

Authors: Zhikun Tu,Yuhe Zhang,Yiou Jia,Kang Li,Daniel Cohen-Or

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: distinct sampling error, sampling error distribution, possesses its unique, unique characteristics, characteristics and exhibits

Comments: 17 pages, 14 figures. Published in IEEE Transactions on Visualization and Computer Graphics

Abstract:Each scanner has its own characteristics and exhibits a distinct sampling error distribution. Training a network on a dataset that includes data collected from different scanners is less effective than training it on data specific to a single scanner. Therefore, we present a novel one-shot learning method for edge extraction on point clouds that learns the specific data distribution of the target point cloud and thus achieves superior results compared to networks trained on general data distributions. More specifically, we show how to train a lightweight network named OSFENet (One-Shot edge Feature Extraction Network) by designing a filtered-KNN-based surface patch representation that supports a one-shot learning framework. Additionally, we introduce an RBF_DoS module, which integrates a Radial Basis Function-based Descriptor of the Surface patch and is highly beneficial for edge extraction on point clouds. The advantage of the proposed OSFENet is demonstrated through comparative analyses against 7 baselines on the ABC dataset, and its practical utility is validated by results across diverse real-scanned datasets, including indoor scenes such as the S3DIS dataset and outdoor scenes such as the Semantic3D and UrbanBIS datasets.

32. 【2604.22350】PoseFM: Relative Camera Pose Estimation Through Flow Matching

Link: https://arxiv.org/abs/2604.22350

Authors: Dominik Kuczkowski, Laura Ruotsalainen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fundamental computer vision, computer vision problem, autonomous navigation, augmented reality, fundamental computer

Comments:

Abstract:Monocular visual odometry (VO) is a fundamental computer vision problem with applications in autonomous navigation, augmented reality and more. While deep learning-based methods have recently shown superior accuracy compared to traditional geometric pipelines, particularly in environments where handcrafted features struggle due to poor structure or lighting conditions, most rely on deterministic regression, which lacks the uncertainty awareness required for robust applications. We propose PoseFM, the first framework to reformulate monocular frame-to-frame VO as a generative task using Flow Matching (FM). By leveraging FM, we model camera motion as a distribution rather than a point estimate, learning to transform noise into realistic pose predictions via continuous-time ODEs. This approach provides a principled mechanism for uncertainty estimation and enables robust motion inference under challenging visual conditions. In our evaluations, PoseFM achieves strong performance on TartanAir, KITTI and TUM-RGBD benchmarks, achieving the lowest absolute trajectory error (ATE) on some of the trajectories and overall being competitive with the best frame-to-frame monocular VO methods. Code and model checkpoints will be made available at this https URL.

33. 【2604.22339】Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM

Link: https://arxiv.org/abs/2604.22339

Authors: Yunsong Wang, Gim Hee Lee

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual Simultaneous Localization, Localization and Mapping, Visual Simultaneous, Simultaneous Localization, significant research challenge

Comments:

Abstract:Handling dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling the temporal centers of the dynamic Gaussians at keyframes. These centers are propagated using 3D scene flow priors and are dynamically initialized with an adaptive insertion strategy. Alongside this, we model the temporal opacity and rotation using a Gaussian Mixture Model (GMM) to adaptively learn the complex dynamics. The empirical results demonstrate our state-of-the-art performance in tracking, dynamic reconstruction, and training efficiency.

34. 【2604.22334】FILTR: Extracting Topological Features from Pretrained 3D Models

Link: https://arxiv.org/abs/2604.22334

Authors: Louis Martinez, Maks Ovsjanikov

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, produced powerful models, advances in pretraining, powerful models, abilities are typically

Comments:

Abstract:Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape's multiscale structure. In this paper we pose the question whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.

35. 【2604.22333】ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding

Link: https://arxiv.org/abs/2604.22333

Authors: Dongwei Sun, Jing Yao, Kan Wei, Xiangyong Cao, Chen Wu, Zhenghui Zhao, Pedram Ghamisi, Jun Zhou, Jón Atli Benediktsson

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Rapid situational awareness, Rapid situational, Rapid, Semantic Annotation Pipeline, Automated Semantic Annotation

Comments:

Abstract:Rapid situational awareness is critical in post-disaster response. While remote sensing damage assessment is evolving from pixel-level change detection to high-level semantic analysis, existing vision-language methodologies still struggle to provide actionable intelligence for complex strategic queries. They remain severely constrained by unimodal optical dependence, a prevailing bias towards natural disasters, and a fundamental lack of grounded interactivity. To address these limitations, we present ChangeQuery, a unified multimodal framework designed for comprehensive, all-weather disaster situation awareness. To overcome modality constraints and scenario biases, we construct the Disaster-Induced Change Query (DICQ) dataset, a large-scale benchmark coupling pre-event optical semantics with post-event SAR structural features across a balanced distribution of natural catastrophes and armed conflicts. Furthermore, to provide the high-quality supervision required for interactive reasoning, we propose a novel Automated Semantic Annotation Pipeline. Adhering to a "statistics-first, generation-later" paradigm, this engine automatically transforms raw segmentation masks into grounded, hierarchical instruction sets, effectively equipping the model with fine-grained spatial and quantitative awareness. Trained on this structured data, the ChangeQuery architecture operates as an interactive disaster analyst. It supports multi-task reasoning driven by diverse user queries, delivering precise damage quantification, region-specific descriptions, and holistic post-disaster summaries. Extensive experiments demonstrate that ChangeQuery establishes a new state-of-the-art, providing a robust and interpretable solution for complex disaster monitoring. The code is available at this https URL.

36. 【2604.22331】Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation

Link: https://arxiv.org/abs/2604.22331

Authors: Lomash Relia, Jai G Singla, Amitabh, Nitant Dube

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: study analyses simulated, depth-aware rover navigation, highlighting the transition, study analyses, analyses simulated

Comments: Accepted by IEEE

Abstract:This study analyses simulated and real-world implementations of depth-aware rover navigation, highlighting the transition from stereo vision to monocular depth estimation using edge AI. A Unity-based lunar terrain simulator with stereo cameras and OpenCV's StereoSGBM was used to generate disparity maps. A physical rover built on Raspberry Pi 4 employed UniDepthV2 for monocular metric depth estimation and YOLO12n for real-time object detection. While stereo vision yielded higher accuracy in simulation, the monocular approach proved more robust and cost-effective in real-world deployment, achieving 0.1 FPS for depth and 10 FPS for detection.

37. 【2604.22310】Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization

Link: https://arxiv.org/abs/2604.22310

Authors: Jeonggon Kim, Heejoon Moon, Je Hyeong Hong

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Privacy-Preserving Image Queries, Image Queries, Privacy-Preserving Image, private images, enabling pose estimation

Comments: Accepted at CVPR 2026 (oral). Supplementary material included after references. 18 pages, 11 figures, 8 tables

Abstract:Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints. However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks. In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points. We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attack. DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint's location. This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed: Neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery. DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines. Experiments on both indoor and large-scale outdoor datasets demonstrate DCL's robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.

38. 【2604.22302】Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation

Link: https://arxiv.org/abs/2604.22302

Authors: Ran Zhao, Sheng Jin, Size Wu, Kang Liao, Zerui Gong, Zujin Guo, Yang Xiao, Wei Li

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated impressive capabilities, demonstrated impressive, impressive capabilities, capabilities in photorealistic, photorealistic synthesis

Comments:

Abstract:Recent text-to-image (T2I) models have demonstrated impressive capabilities in photorealistic synthesis and instruction following. However, their reliability in knowledge-intensive settings remains largely unexplored. Unlike natural image generation, knowledge visualization requires not only semantic alignment but also strict adherence to domain knowledge, structural constraints, and symbolic conventions, exposing a critical gap between visual plausibility and scientific correctness. To systematically study this problem, we introduce KVBench, a curriculum-grounded benchmark for evaluating knowledge-intensive T2I generation. KVBench covers six senior high-school subjects: Biology, Chemistry, Geography, History, Mathematics, and Physics. The benchmark consists of 1,800 expert-curated prompts derived from over 30 authoritative textbooks. Using this benchmark, we evaluate 14 state-of-the-art open- and closed-source models, revealing substantial deficiencies in logical reasoning, symbolic precision, and multilingual robustness, with open-source models consistently underperforming proprietary systems. To address these limitations, we further propose KE-Check, a two-stage framework that improves scientific fidelity via (1) Knowledge Elaboration for structured prompt enrichment, and (2) Checklist-Guided Refinement for explicit constraint enforcement through violation identification and constraint-guided editing. KE-Check effectively mitigates scientific hallucinations, narrowing the performance gap between open-source and leading closed-source models. Data and codes are publicly available at this https URL.

39. 【2604.22296】Evaluation of image simulation open source solutions for simulation of synthetic images in lunar environment

Link: https://arxiv.org/abs/2604.22296

Authors: Jai G Singla, Hinal B Patel, Nitant Dube

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Synthetic image generation, crucial input, Wide Angle Camera, Narrow Angle Camera, planetary missions

Comments:

Abstract:Synthetic image generation is one of the crucial inputs for planetary missions. It enables researchers and engineers to visualize planned planetary missions, test imaging systems and plan exploration activities in a virtual environment before actual deployment. Image simulation is essential for assessing landing sites, detecting hazards, and validating navigation systems in a mission. This study offers a detailed evaluation of various image simulation approaches for the lunar environment, with particular emphasis on the effects of different camera models and illumination conditions on the quality of synthetic lunar images. These images are produced using real Digital Elevation Models (DEM) and terrain data derived from instruments such as the Chandrayaan-2 Orbiter High Resolution Camera (OHRC) and NASA's Wide Angle Camera (WAC) and Narrow Angle Camera (NAC). This research aims to improve the reliability of synthetic imagery in supporting autonomous navigation and decision-making systems in lunar exploration. This work contributes to the development of more effective tools for generating important information for future lunar missions and enhances the understanding of the moon's surface environment.

40. 【2604.22281】DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning

Link: https://arxiv.org/abs/2604.22281

Authors: Joonmyung Choi, Sanghyeok Lee, Jongha Kim, Sehyung Kim, Dohwan Ko, Jihyung Kil, Hyunwoo J. Kim

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated remarkable performance, leverages structured visual, structured visual cues, including document question, document question answering

Comments: CVPR 2026

Abstract:Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the inefficient consumption of substantial computational resources, especially for long documents. We observe that existing token-reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free and progressive document token pruning framework designed for efficient long-document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model's level of comprehension. Our experiments on the M3DocRAG show that DocPrune improves throughput by 3.0x and 3.3x in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.

41. 【2604.22280】Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

Link: https://arxiv.org/abs/2604.22280

Authors: Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai, Zhibin Lan, Chenxi Zhao, Shannan Yan, Jie Chen, Zhangchi Hu, Yansong Peng, Bo Lin, Junjie Zhou, Dacheng Yin, Tianyi Wang, Fengyun Rao, Jing Lyu, Hebei Li, Xiaoyan Sun

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, universal multimodal embeddings

Comments:

Abstract:Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.

42. 【2604.22274】CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

Link: https://arxiv.org/abs/2604.22274

Authors: Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fixed predicate vocabulary, aims to describe, SGG, describe visual scenes, relation

Comments:

Abstract:Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.

43. 【2604.22260】Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

Link: https://arxiv.org/abs/2604.22260

Authors: Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang, Zhiqi Mao, Heng Yang, Chen Lv

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: smart mobility infrastructures, face growing safety, growing safety challenges, require scalable intelligence, emerging smart mobility

Comments:

Abstract:Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.

44. 【2604.22240】OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space

Link: https://arxiv.org/abs/2604.22240

Authors: Zhuding Liang, Tianyi Yan, Dubing Chen, Jiasen Zheng, Huan Zheng, Cheng-zhong Xu, Yida Wang, Kun Zhan, Jianbing Shen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative world models, autonomous driving simulation, world models increasingly, models increasingly rely, realistic autonomous driving

Comments:

Abstract:Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute-level text, failing to orchestrate complex, sequential multi-agent interactions. To address this semantic-spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a "scenario director", OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM-driven Spatio-Temporal MMDiT equipped with a history-prefix anchoring strategy to ensure long-horizon interaction consistency. Furthermore, we introduce OccInteract-85k, a novel dataset uniquely annotated with multi-level language instructions, ranging from static layouts to intricate multi-agent behaviors, alongside a novel VLM-based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state-of-the-art generation quality and unprecedented instruction-following capabilities, successfully shifting the paradigm from appearance synthesis to language-driven behavior orchestration.

45. 【2604.22226】Towards Temporal Compositional Reasoning in Long-Form Sports Videos

Link: https://arxiv.org/abs/2604.22226

Authors: Siyu Cao, Lu Zhang, Ruizhe Zeng, Zhi-yong Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dynamic human activities, Multimodal Large Language, Large Language Models, human activities, Sports videos

Comments:

Abstract:Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.

46. 【2604.22220】Breaking Watermarks in the Frequency Domain: A Modulated Diffusion Attack Framework

Link: https://arxiv.org/abs/2604.22220

Authors: Chunpeng Wang, Binyan Qu, Xiaoyu Wang, Zhiqiu Xia, Shanshan Zhang, Yunan Liu, Qi Li

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: comparatively limited progress, Digital image watermarking, watermark attack techniques, Digital image, advanced rapidly

Comments:

Abstract:Digital image watermarking has advanced rapidly for copyright protection of generative AI, yet the comparatively limited progress in watermark attack techniques has broken the attack-defense balance and hindered further advances in the field. In this paper, we propose FMDiffWA, a frequency-domain modulated diffusion framework for watermark attacks. Specifically, we introduce a frequency-domain watermark modulation (FWM) module and incorporate it into the sampling stages of both the forward and reverse diffusion processes. This mechanism enables selective modulation of watermark-related frequency components, thereby allowing FMDiffWA to effectively neutralize the invisible watermark signals while preserving the perceptual quality of the attacked watermarked images. To achieve a better trade-off between attack efficacy and visual fidelity, we reformulate the training strategy of conventional diffusion models by augmenting the canonical noise estimation objective with an auxiliary refinement constraint. Comprehensive experiments demonstrate that FMDiffWA achieves superior visual fidelity compared to existing watermark attacks, while exhibiting strong generalization across diverse watermarking schemes.

47. 【2604.22202】ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild

Link: https://arxiv.org/abs/2604.22202

Authors: Hanyu Chen, Ruojin Cai, Steve Marschner, Noah Snavely

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision, downstream tasks, serve as powerful, powerful priors, priors for downstream

Comments: Project page: https://hanyuc.com/archsym/

Abstract:Symmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane's orientation. In this paper, we address these limitations by presenting the first framework for detecting 3D-grounded reflectional symmetries from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchSym, from SfM reconstructions by leveraging cross-view image matching; and building on the dataset, (2) a single-view symmetry detector that accurately localizes symmetries in 3D by parameterizing them as signed distance maps defined relative to predicted scene geometry. We validate our symmetry annotation pipeline against geometry-based alternatives and demonstrate that our symmetry detector significantly outperforms state-of-the-art baselines on our new benchmark.

48. 【2604.22192】CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution

Link: https://arxiv.org/abs/2604.22192

Authors: Xiangxi Zheng, Kuang He, Jiayi Hu, Ping Yu, Rui Yan, Yuan Yao, Peng Hou, Anxiang Zeng, Alex Jinpeng Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generation demands strict, demands strict visual, strict visual precision, demands strict, precision and syntactic

Comments:

Abstract:Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.
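
The information-invariance reward described in this abstract (an inspector should give the same answers on the original and the generated chart) can be illustrated with a toy sketch. This is not the authors' implementation: the dictionary-based charts, `toy_inspector`, `normalize`, and `inquiry_reward` are hypothetical stand-ins for a frozen VLM inspector and rendered chart images.

```python
def normalize(answer):
    """Canonicalize an answer so trivial formatting differences do not count."""
    return str(answer).strip().lower()


def inquiry_reward(qa_pairs, inspector, generated_chart):
    """Fraction of atomic questions whose answer on the generated chart
    matches the reference answer obtained from the original chart."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        1
        for question, reference in qa_pairs
        if normalize(inspector(generated_chart, question)) == normalize(reference)
    )
    return correct / len(qa_pairs)


def toy_inspector(chart, question):
    """Hypothetical inspector: looks the answer up in a dict.
    A real system would query a frozen vision-language model instead."""
    return chart.get(question, "unknown")


# Toy charts: each key plays the role of an atomic visual question.
original_chart = {"title": "Revenue", "num_bars": 4, "x_label": "Quarter"}
generated_chart = {"title": "revenue ", "num_bars": 4, "x_label": "Month"}

# Reference answers come from the original chart.
qa_pairs = [(q, toy_inspector(original_chart, q)) for q in original_chart]
reward = inquiry_reward(qa_pairs, toy_inspector, generated_chart)
```

Here two of the three answers agree after normalization, so `reward` is 2/3. A scalar of this kind could serve as a reward signal in reinforcement learning.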

49. 【2604.22190】From Global to Local: Rethinking CLIP Feature Aggregation for Person Re-Identification

Link: https://arxiv.org/abs/2604.22190

Authors: Aotian Zheng, Winston Sun, Bahaa Alattar, Vitaly Ablavsky, Jenq-Neng Hwang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: methods aggregate spatial, aggregate spatial features, CLIP-based person re-identification, making representations fragile, spatial selectivity

Comments: 14 pages, 7 figures

Abstract:CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global [CLS] token optimized for image-text alignment rather than spatial selectivity, making representations fragile under occlusion and cross-camera variation. We propose SAGA-ReID, which reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP's text embedding space -- emphasizing spatially stable evidence while suppressing corrupted or absent regions, without requiring textual descriptions of individual images. Controlled experiments isolate the aggregation mechanism under two qualitatively distinct conditions -- synthetic masking, where identity signal is absent, and realistic human distractors, where an overlapping person introduces semantically confusing signal -- with SAGA's advantage over global pooling growing substantially as occlusion increases across both conditions. Benchmark evaluations confirm consistent gains over CLIP-ReID across standard and occluded settings, with the largest improvements where global pooling is most unreliable: up to +10.6 Rank-1 on occluded benchmarks. SAGA's aggregation outperforms dedicated sequential patch aggregation on a stronger backbone, confirming that structured reconstruction addresses a bottleneck that backbone quality and architectural complexity alone cannot resolve. Code available at this https URL.

50. 【2604.22183】EvFlow-GS: Event Enhanced Motion Deblurring with Optical Flow for 3D Gaussian Splatting

Link: https://arxiv.org/abs/2604.22183

Authors: Feiyu An,Yufei Deng,Zihui Zhang,Rong Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: motivating recent methods, microsecond temporal resolution, Achieving sharp, reconstruction from motion-blurred, motivating recent

Comments: Accepted by ICME 2026

Click to view abstract

Abstract:Achieving sharp 3D reconstruction from motion-blurred images alone becomes challenging, motivating recent methods to incorporate event cameras, benefiting from microsecond temporal resolution. However, they suffer from residual artifacts and blurry texture details due to misleading supervision from inaccurate event double integral priors and noisy, blurry events. In this study, we propose EvFlow-GS, a unified framework that leverages event streams and optical flow to optimize an end-to-end learnable double integral (LDI), camera poses, and 3D Gaussian Splatting (3DGS) jointly on-the-fly. Specifically, we first extract edge information from the events using optical flow and then formulate a novel event-based loss applied separately to different modules. Additionally, we exploit a novel event-residual prior to strengthen the supervision of intensity changes between images rendered from 3DGS. Finally, we integrate the outputs of both 3DGS and LDI into a joint loss, enabling their optimization to mutually facilitate each other. Experiments demonstrate the leading performance of our EvFlow-GS.

51. 【2604.22177】Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities

Link: https://arxiv.org/abs/2604.22177

Authors: Peibo Song,Xiaotian Xue,Jinshuo Zhang,Zihao Wang,Jinhua Liu,Shujun Fu,Fangxun Bao,Si Yong Yeo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal MRI offers, MRI offers complementary, Multimodal MRI, offers complementary information, MRI offers

Comments: CVPR 2026 Poster

Click to view abstract

Abstract:Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and exploitation of available modalities. The idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods under incomplete multi-modal scenarios. The code is available at this https URL

52. 【2604.22174】Unlocking Optical Prior: Spectrum-Guided Knowledge Transfer for SAR Generalized Category Discovery

Link: https://arxiv.org/abs/2604.22174

Authors: Jingyuan Xia,Ruikang Hu,Ye Li,Zhixiong Yang,Xu Lan,Zhejun Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generalized Category Discovery, Synthetic Aperture Radar, label-scarce Synthetic Aperture, Large Vision Models, Generalized Category

Comments:

Click to view abstract

Abstract:Generalized Category Discovery (GCD) holds significant promise for the label-scarce Synthetic Aperture Radar (SAR) domain, yet its efficacy is severely constrained by the cross-modal incompatibility between the inherent optical prior of the Large Vision Models (LVMs) and SAR imagery. Existing domain adaptation methods often lack an inductive bias that reflects imaging characteristics, consequently failing to effectively transfer optical prior into the SAR domain. To address this issue, the Modal Discrepancy Curve (MDC) is introduced to model cross-modal discrepancy as a structured frequency-domain descriptor derived from spectral energy distributions. Leveraging this formulation, we propose the MDC-guided Cross-modal Prior Transfer (MCPT) framework, a pre-training paradigm that operates on paired optical-SAR data. Within this framework, Adaptive Frequency Tokenization (AFT) converts the MDC into learnable tokens, and Frequency-aware Expert Refinement (FER) performs band-wise discrepancy-aware feature refinement using these tokens. Based on the refined representations, contrastive learning aligns refined embeddings across modalities and internalizes the adaptation pattern. Ultimately, the superior SAR feature representation capability learned during paired pre-training is applied to downstream single-modal SAR-GCD tasks. Extensive experiments demonstrate state-of-the-art performance across multiple mainstream datasets, indicating that frequency-domain discrepancy modeling enables more effective adaptation of optical prior to SAR imagery.

53. 【2604.22164】Learning Reactive Human Motion Generation from Paired Interaction Data Using Transformer-Based Models

Link: https://arxiv.org/abs/2604.22164

Authors: Masato Soga,Ryuki Takebayashi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, advances in deep, deep learning, learning have enabled, textual descriptions

Comments: 24 pages

Click to view abstract

Abstract:Recent advances in deep learning have enabled the generation of videos from textual descriptions as well as the prediction of future sequences from input videos. Similarly, in human motion modeling, motions can be generated from text or predicted from a single person's motion sequence. However, these approaches primarily focus on single-agent motion generation. In contrast, this study addresses the problem of generating the motion of one person based on the motion of another in interaction scenarios, where the two motions are mutually dependent. We construct a dataset of paired action-reaction motion sequences extracted from boxing match videos and investigate the effectiveness of Transformer-based models for this task. Specifically, we implement and compare three models: a simple Transformer, iTransformer, and Crossformer. In addition, we introduce a person ID embedding to explicitly distinguish between individuals, enabling the model to maintain structural consistency and better capture interaction dynamics. Experimental results show that the simple Transformer can generate plausible interaction-aware motions without suffering from posture collapse, while iTransformer and Crossformer accumulate errors over time, leading to unstable motion generation. Furthermore, the proposed person ID embedding contributes to preventing structural collapse and improving motion consistency. These results highlight the importance of explicitly modeling individual identity in interaction-aware motion generation.

54. 【2604.22162】SAMIDARE: Advanced Tracking-by-Segmentation for Dense Scenarios

Link: https://arxiv.org/abs/2604.22162

Authors: Shozaburo Hirano,Norimichi Ukita

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated sports analysis, analysis demands robust, sports analysis demands, demands robust multi-object, robust multi-object tracking

Comments:

Click to view abstract

Abstract:Automated sports analysis demands robust multi-object tracking (MOT), yet segmentation-based methods often struggle with mask errors and ID switches in dense scenes. We propose SAMIDARE, a framework that enhances SAM2MOT for crowded scenes through three key components: (1) density-aware mask re-generation and (2) selective memory updates, both for adaptive mask control to preserve target feature integrity, and (3) state-aware association and new track initialization, which improves robustness under mutual occlusions and frequent frame-out events. Evaluated on the SportsMOT dataset, SAMIDARE achieves state-of-the-art performance, outperforming the baseline by 2.5 HOTA and 4.2 IDF1 points on the validation set. These results demonstrate that adaptive feature management using mask control and state-aware association provide a robust and efficient solution for dense sports tracking. Code is available at this https URL

55. 【2604.22160】GenMatter: Perceiving Physical Objects with Generative Matter Models

Link: https://arxiv.org/abs/2604.22160

Authors: Eric Li,Arijit Dasgupta,Yoni Friedman,Mathieu Huot,Vikash Mansinghka,Thomas O'Connell,William T. Freeman,Joshua B. Tenenbaum

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: offers valuable insights, visual perception offers, perception offers valuable, offers valuable, valuable insights

Comments: 25 pages, 12 figures, CVPR 2026

Click to view abstract

Abstract:Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.

56. 【2604.22156】Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models

Link: https://arxiv.org/abs/2604.22156

Authors: Weiqiu You,Cassandra Goldberg,Amin Madani,Daniel A. Hashimoto,Eric Wong

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: bile duct injury, prevent bile duct, Accurate assessment

Comments: IPCAI 2026 short communication

Click to view abstract

Abstract:Purpose: Accurate assessment of the Critical View of Safety (CVS) during laparoscopic cholecystectomy is essential to prevent bile duct injury, a complication associated with significant morbidity and mortality. While large vision-language models (LVLMs) offer flexible reasoning, their predictions remain difficult to audit and unreliable on safety-critical surgical tasks. Methods: We introduce Sum-of-Checks, a framework that decomposes each CVS criterion into expert-defined reasoning checks reflecting clinically relevant visual evidence. Given a laparoscopic frame, an LVLM evaluates each check, producing a binary judgment and justification. Criterion-level scores are computed via fixed, weighted aggregation of check outcomes. We evaluate on the Endoscapes2023 benchmark using three frontier LVLMs, comparing against direct prompting, chain-of-thought, and sub-question decomposition, each with and without few-shot examples. Results: Sum-of-Checks improves average frame-level mean average precision by 12--14% relative to the best baseline across all three models and criteria. Analysis of individual checks reveals that LVLMs are reliable on observational checks (e.g., visibility, tool obstruction) but show substantial variability on decision-critical anatomical evidence. Conclusion: Structuring surgical reasoning into expert-aligned verification checks improves both accuracy and transparency of LVLM-based CVS assessment, demonstrating that explicitly separating evidence elicitation from decision-making is critical for reliable and auditable surgical AI systems. Code is available at this https URL.
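
The abstract above says only that criterion-level scores come from a "fixed, weighted aggregation of check outcomes". A minimal sketch of one such aggregation follows. The check names, the weights, and the normalization to the range 0 to 1 are hypothetical and are not taken from the paper; treating a missing check as failed is likewise an assumption.

```python
from typing import Dict


def criterion_score(check_outcomes: Dict[str, bool],
                    weights: Dict[str, float]) -> float:
    """Weighted share of passed checks for one criterion, in [0, 1].

    check_outcomes: binary judgment per check (hypothetical names).
    weights: fixed weight per check; a check absent from
    check_outcomes is counted as failed (an assumption).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    passed = sum(
        weight
        for name, weight in weights.items()
        if check_outcomes.get(name, False)
    )
    return passed / total


# Hypothetical checks for a single criterion.
example_weights = {
    "structures_visible": 2.0,
    "no_tool_obstruction": 1.0,
    "plane_cleared": 1.0,
}
example_outcomes = {
    "structures_visible": True,
    "no_tool_obstruction": False,
    "plane_cleared": True,
}
example_score = criterion_score(example_outcomes, example_weights)
```

Because the weights are fixed in advance, each score can be traced back to the individual checks that passed or failed, which is the auditability property the abstract emphasizes.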
