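This post presents the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
423 papers were updated today, including:
- Natural language processing: 62 papers
- Information retrieval: 11 papers
- Computer vision: 74 papers
Natural Language Processing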
1. 【2604.22750】How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Link: https://arxiv.org/abs/2604.22750
Authors: Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Keywords: complex human workflows, driving rapid growth, token, token usage, wide adoption
Comments:
Abstract:The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.
2. 【2604.22749】Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities
Link: https://arxiv.org/abs/2604.22749
Authors: Ilana Nguyen, Harini Suresh, Thema Monroe-White, Evan Shieh
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, including simulated interviews, text generation tasks, language models
Comments: FAccT '26, June 25-28, 2026, Montreal, QC, Canada
Abstract:Large language models (LLMs) are increasingly used for text generation tasks from everyday use to high-stakes enterprise and government applications, including simulated interviews with asylum seekers. While many works highlight the new potential applications of LLMs, there are risks of LLMs encoding and perpetuating harmful biases about non-dominant communities across the globe. To better evaluate and mitigate such harms, more research examining how LLMs portray diverse individuals is needed. In this work, we study how national origin identities are portrayed by widely-adopted LLMs in response to open-ended narrative generation prompts. Our findings demonstrate the presence of persistent representational harms by national origin, including harmful stereotypes, erasure, and one-dimensional portrayals of Global Majority identities. Minoritized national identities are simultaneously underrepresented in power-neutral stories and overrepresented in subordinated character portrayals, which are over fifty times more likely to appear than dominant portrayals. The degree of harm is amplified when US nationality cues (e.g., "American") are present in input prompts. Notably, we find that the harms we identify cannot be explained away via sycophancy, as US-centric biases persist even when replacing US nationality cues with non-US national identities in the prompts. Based on our findings, we call for further exploration of cultural harms in LLMs through methodologies that center Global Majority perspectives and challenge the uncritical adoption of US-based LLMs for the classification, surveillance, and misrepresentation of the majority of our planet.
3. 【2604.22730】Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
Link: https://arxiv.org/abs/2604.22730
Authors: Hillary Mutisya, John Mugane
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: modern morphological data, Bantu Lexical Reconstructions, Southern Bantu languages, neural models trained, models trained exclusively
Comments:
Abstract:We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer over Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across 5+ languages. Evaluating these candidates against established historical resources, namely the Bantu Lexical Reconstructions database (BLR3; 4,786 reconstructed Proto-Bantu forms) and the ASJP basic vocabulary, we confirm that 10 of the top 11 noun candidates (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU 'person' (8 languages), *gombe 'cow' (9 languages), and *mUn (9 languages). Extending to verbs, 12 verb cognates align with reconstructed Proto-Bantu roots, including *-bon- 'see' and *-jIm- 'stand', each attested across wide geographic ranges. Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis reveals that all 13 productive classes maintain 0.83 cosine similarity across languages (within-class > between-class, p < 10^-9). Our dataset is restricted to Eastern and Southern Bantu, so we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu rather than definitively distinguishing Proto-Bantu retentions from later regional innovations.
4. 【2604.22723】Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
Link: https://arxiv.org/abs/2604.22723
Authors: Hillary Mutisya, John Mugane
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: discovering morphological features, combining cross-lingual transfer, low-resource Bantu languages, present a method, method for discovering
Comments:
Abstract:We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (vowel coalescence - the merger of two adjacent vowels - of wa-, 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap) while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.
5. 【2604.22709】Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Link: https://arxiv.org/abs/2604.22709
Authors: Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo
Categories: Computation and Language (cs.CL)
Keywords: complex reasoning tasks, proven effective, effective on complex, reasoning, Abstract
Comments:
Abstract:While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose Abstract Chain-of-Thought, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen "abstract" tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to 11.6x fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.
6. 【2604.22693】CRAFT: Clustered Regression for Adaptive Filtering of Training data
Link: https://arxiv.org/abs/2604.22693
Authors: Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: making full fine-tuning, full fine-tuning expensive, high-quality subset, making full, large corpus
Comments:
Abstract:Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.
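To make stage (i) of the abstract concrete, here is a minimal sketch of proportional budget allocation across clusters. It is an illustration only, not the authors' code: the function name `proportional_allocation`, the assumption that validation examples already carry k-means cluster ids, and the largest-remainder rounding rule are all choices made for this example.

```python
from collections import Counter


def proportional_allocation(val_cluster_ids, budget):
    """Split a selection budget across clusters in proportion to how many
    validation examples fall into each cluster (hypothetical helper).

    Rounding uses the largest-remainder rule, an assumption of this sketch.
    """
    counts = Counter(val_cluster_ids)
    total = sum(counts.values())
    # Integer floor of each cluster's proportional share.
    alloc = {c: budget * n // total for c, n in counts.items()}
    # Integer remainders decide who receives the leftover slots.
    remainders = {c: budget * n % total for c, n in counts.items()}
    leftover = budget - sum(alloc.values())
    ranked = sorted(counts, key=lambda c: (-remainders[c], c))
    for c in ranked[:leftover]:
        alloc[c] += 1
    return alloc


# Toy validation set: three examples in cluster 0, two in cluster 1, one in cluster 2.
allocation = proportional_allocation([0, 0, 0, 1, 1, 2], budget=10)
```

Stage (ii), the within-cluster selection by conditional expected distance, is not covered here because the abstract does not give enough detail to sketch it faithfully.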
7. 【2604.22678】BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering
Link: https://arxiv.org/abs/2604.22678
Authors: Jinghong Chen, Jingbiao Mei, Guangyu Yang, Bill Byrne
Categories: Computation and Language (cs.CL)
Keywords: visual question answering, question answering, generate an answer, visual question, Bayesian Ensemble
Comments:
Abstract:A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate an answer. While simple, this strategy can obscure the contribution of individual documents, making attribution difficult and contributing to the "lost-in-the-middle" effect, where relevant information in long contexts is overlooked. Concatenation also scales poorly: computational cost grows quadratically with context length, a problem that becomes especially severe when the context includes visual data, as in visual question answering. Attempts to mitigate these issues by limiting context length can further restrict performance by preventing models from benefiting from the improved recall offered by deeper retrieval. We propose Bayesian Ensemble Retrieval-Augmented Generation (BERAG), along with Bayesian Ensemble Fine-Tuning (BEFT), as a RAG framework in which language models are conditioned on individual retrieved documents rather than a single combined context. BERAG treats document posterior probabilities as ensemble weights and updates them token by token using Bayes' rule during generation. This approach enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contribution, making it well-suited for large document collections. We evaluate BERAG and BEFT primarily on knowledge-based visual question answering tasks, where models must reason over long, imperfect retrieval lists. The results show substantial improvements over standard RAG, including strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. We also demonstrate that BERAG mitigates the "lost-in-the-middle" effect. The document posterior can be used to detect insufficient grounding and trigger deflection, while document pruning enables faster decoding than standard RAG.
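The token-by-token posterior update described in the abstract can be illustrated with a toy example. This is a sketch under stated assumptions rather than the BERAG implementation: the helper names `mixture_next_token` and `update_posterior` are hypothetical, and the per-document next-token distributions are made-up numbers over a two-token vocabulary instead of real language model outputs.

```python
def mixture_next_token(weights, per_doc_probs):
    """Ensemble next-token distribution: sum over documents of
    weight_d * p(token | document d). Hypothetical helper."""
    vocab_size = len(per_doc_probs[0])
    return [
        sum(w * probs[v] for w, probs in zip(weights, per_doc_probs))
        for v in range(vocab_size)
    ]


def update_posterior(weights, per_doc_probs, token):
    """Bayes' rule after emitting `token`: the new weight of document d is
    proportional to old weight_d * p(token | document d)."""
    unnormalized = [w * probs[token] for w, probs in zip(weights, per_doc_probs)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("emitted token has zero probability under every document")
    return [u / total for u in unnormalized]


# Two retrieved documents with a uniform prior (toy numbers, an assumption).
prior = [0.5, 0.5]
per_doc_probs = [
    [0.9, 0.1],  # next-token distribution when conditioned on document 0
    [0.2, 0.8],  # next-token distribution when conditioned on document 1
]
mixture = mixture_next_token(prior, per_doc_probs)
posterior = update_posterior(prior, per_doc_probs, token=0)
```

After token 0 is emitted, the weight shifts toward document 0 because that document assigned the token a higher probability; these weights are the kind of signal the abstract describes for attribution and re-ranking.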
8. 【2604.22661】Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines
Link: https://arxiv.org/abs/2604.22661
Authors: Negar Arabzadeh, Andrew Drozdov, Michael Bendersky, Matei Zaharia
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, multiple semantically equivalent, semantically equivalent query
Comments:
Abstract:Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination - selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.
9. 【2604.22631】Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models
Link: https://arxiv.org/abs/2604.22631
Authors: Felix Herron, Solange Rossato, Alexandre Allauzen, François Portet
Categories: Computation and Language (cs.CL)
Keywords: Modern automatic speech, Modern automatic, automatic speech recognition, ASR, observed to function
Comments:
Abstract:Modern automatic speech recognition (ASR) systems have been observed to function better for certain speaker groups (SGs) than others, despite recent gains in overall performance. One potential impediment to progress towards fairer ASR is the lack of a more nuanced understanding of the types of modeling errors that speech encoder models make, and in particular the difference between the structure of embeddings for high-performance and low-performance SGs. This paper proposes a framework typifying two types of error that can occur in modeling phonemes in ASR systems: random error/high variance in phoneme embedding, vs systematic error/embedding bias. We find that training phoneme classification probes only on a single, typically disadvantaged SG sometimes improves performance for that SG, which is evidence for the existence of SG-level bias in phoneme embeddings. On the other hand, we find that speakers and SGs with higher levels of phoneme variance are the same as those with worse phoneme prediction accuracy. We conclude that both types of error are present in phoneme embeddings and both are candidate causes for SG-level unfairness in ASR, though random error is likely a greater hindrance to fairness than systematic error. Furthermore, we find that finetuning encoder models using a fairness-enhancing algorithm (domain enhancing and adversarial training) changes neither the benefits of in-domain phoneme classification probe training, nor measured levels of random embedding error.
10. 【2604.22626】From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia
Link: https://arxiv.org/abs/2604.22626
Authors: Angelo Maria Sabatini
Categories: Computation and Language (cs.CL)
Keywords: Dante Divina Commedia, Dante Divina, Divina Commedia
Comments: 25 pages, 8 figures, 1 supplementary material; submitted to Digital Scholarship in the Humanities
Abstract:This study investigates the structural organisation of Dante's Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory, capturing the balance between persistence and alternation patterns. Across the poem, this index exhibits a slight but consistent increase from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram-level analysis shows that this trend is driven by a restricted set of recurrent configurations, interpreted as graphemic probes linking the Markov representation to identifiable lexical environments in the text. These probes display distinct behaviours: configurations involving two transitions more frequently emerge across word boundaries, reflecting interactions between adjacent tokens, whereas configurations with fewer transitions are largely confined to intra-lexical structures. Part of the signal is further shaped by orthographic phenomena, particularly apostrophised forms, highlighting the role of writing conventions alongside phonological and lexical organisation. A complementary classification analysis identifies cantica-specific terms, providing lexical anchors through which graphemic probes can be related to the structure of the poem. This organisation is reflected not only in the separation of the three cantiche, but also in a continuous trajectory across the text. Overall, the results show that simple probabilistic models applied to symbolic text representations can uncover structured interactions between local dependencies, lexical distribution, orthographic encoding, and large-scale organisation, providing an interpretable framework for linking local symbolic dynamics to higher-level textual organisation.
11. 【2604.22606】Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube
Link: https://arxiv.org/abs/2604.22606
Authors: Sheza Munir, Ratna Kandala, Anamta Khan, Deepti, Joyojeet Pal
Categories: Computation and Language (cs.CL)
Keywords: cultural traditions intersect, scientific-sounding claims, Health misinformation remains, pressing challenges, traditions intersect
Comments:
Abstract:Health misinformation remains one of the most pressing challenges on social media, particularly when cultural traditions intersect with scientific-sounding claims. These dynamics are not only global but also deeply local, manifesting in culturally specific controversies that require careful analysis. Motivated by this, we examine 100 YouTube transcripts that promote or debunk cow urine (gomutra) as a health remedy, focusing on rhetorical strategies such as appeals to authority, efficacy appeals, and conspiracy framing. We employ large language models (LLMs) including GPT-4, GPT-4o, GPT-4.1, GPT-5, Gemini 2.5 Pro, and Mistral Medium 3 to annotate transcripts using a 14-category taxonomy of persuasive tactics. Our analysis reveals that promoters predominantly rely on efficacy appeals and social proof, while debunkers emphasize authority and rebuttal. Human evaluation of a subset of annotations yielded 90.1% inter-annotator agreement, confirming the reliability of our taxonomy and validation process. This work advances computational methods for misinformation analysis and demonstrates how LLMs can support large-scale studies of cultural discourse online.
12. 【2604.22577】QuantClaw: Precision Where It Matters for OpenClaw
Link: https://arxiv.org/abs/2604.22577
Authors: Manyi Zhang, Ji-Fu Li, Zhongao Sun, Xiaohao Liu, Zhenhua Dong, Xianzhi Yu, Haoli Bai, Xiaobo Xia
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: introduce significant efficiency, significant efficiency challenges, efficiency challenges due, OpenClaw introduce significant, Autonomous agent systems
Comments: Blog: https://sparkengineai.github.io/QuantClaw
Abstract:Autonomous agent systems such as OpenClaw introduce significant efficiency challenges due to long-context inputs and multi-turn reasoning. This results in prohibitively high computational and monetary costs in real-world development. While quantization is a standard approach for reducing cost and latency, its impact on agent performance in realistic scenarios remains unclear. In this work, we analyze quantization sensitivity across diverse complex workflows over OpenClaw, and show that precision requirements are highly task-dependent. Based on this observation, we propose QuantClaw, a plug-and-play precision routing plugin that dynamically assigns precision according to task characteristics. QuantClaw routes lightweight tasks to lower-cost configurations while preserving higher precision for demanding workloads, saving cost and accelerating inference without increasing user complexity. Experiments show that our QuantClaw maintains or improves task performance while reducing both latency and computational cost. Across a range of agent tasks, it achieves up to 21.4% cost savings and 15.7% latency reduction on GLM-5 (FP8 baseline). These results highlight the benefit of treating precision as a dynamic resource in agent systems.
13. 【2604.22565】Learning Evidence Highlighting for Frozen LLMs
Link: https://arxiv.org/abs/2604.22565
Authors: Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models, miss decisive evidence, buried in long
Comments:
Abstract:Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.
14. 【2604.22555】Using Embedding Models to Improve Probabilistic Race Prediction
Link: https://arxiv.org/abs/2604.22555
Authors: Noan Dasanaike, Kosuke Imai
Categories: Computation and Language (cs.CL)
Keywords: Estimating racial disparity, racial disparity requires, disparity requires individual-level, Estimating racial, Improved Surname Geocoding
Comments:
Abstract:Estimating racial disparity requires individual-level race data, which are often unavailable due to the sensitivity of collecting such information. To address this problem, many researchers utilize Bayesian Improved Surname Geocoding (BISG), which has critically relied on Census surname data. Unfortunately, these data capture race-surname relationships only for common surnames, omitting approximately 10% of the US population. We show that predictive performance degrades substantially for individuals with such omitted, uncommon surnames because the standard BISG implementation relies on an uninformative generic prior in these cases. To address this limitation, we propose embedding-powered BISG (eBISG), which uses pre-trained text embeddings to represent names as dense vectors and trains neural networks on 2020 Census surname and first-name data to estimate race probabilities for names not covered in the Census. We compare five approaches: standard BISG using only surnames, BIFSG incorporating first name probabilities, surname embedding for unlisted names, surname and first name embedding combining both, and a full-name embedding trained on voter file data from Southern states that captures interactions between name components. We show that each successive eBISG approach improves race prediction, with the full-name embedding yielding the largest gains, particularly for Hispanic and Asian voters whose surnames are absent from the Census list.
15. 【2604.22542】Controllable Spoken Dialogue Generation: An LLM-Driven Grading System for K-12 Non-Native English Learners
Link: https://arxiv.org/abs/2604.22542
Authors: Haidong Yuan, Haokun Zhao, Wanshi Xu, Songjun Cao, Qingyu Zhou, Long Ma, Hongjie Fan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: non-native contexts due, Large language models, Large language, proficiency mismatch, fail to meet
Comments:
Abstract:Large language models (LLMs) often fail to meet the pedagogical needs of K-12 English learners in non-native contexts due to a proficiency mismatch. To address this widespread challenge, we introduce a proficiency-aligned framework that adapts LLM outputs to learner abilities, using China's national curriculum (CSE) as a representative case. Our framework enables precise control over lexical complexity through a four-tier grading system, supported by a comprehensive suite of new resources: graded vocabulary lists and a multi-turn dialogue corpus. Our core technical contribution is the DDPO (Diversity Driven Policy Optimization) algorithm, a multi-turn GRPO-based approach designed to preserve dialogue diversity while holistically optimizing dialogue quality. This method significantly outperforms conventional approaches, achieving low out-of-vocabulary rates and high diversity while enhancing conversational naturalness and pedagogical value. While grounded in the CSE, our framework is designed for flexibility and can be readily adapted to other educational standards. Our models, data, and code will all be open-sourced, providing a scalable platform for personalized English speaking practice that effectively addresses the unique challenges faced by K-12 learners in non-immersive environments.
16. 【2604.22520】RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment
Link: https://arxiv.org/abs/2604.22520
Authors: Yingfeng Luo, Hongyu Liu, Dingyang Lin, Kaiyan Chang, Chenglong Wang, Bei Li, Quan Du, Tong Xiao, Jingbo Zhu
Categories: Computation and Language (cs.CL)
Keywords: Machine Translation, remains prohibitively expensive, Large Language Models, achieved remarkable performance, scale remains prohibitively
Comments: Accepted to ACL 2026 Industry Track
Abstract:Large Language Models (LLMs) have achieved remarkable performance in Machine Translation (MT), but deploying them at scale remains prohibitively expensive. A widely adopted remedy is the hybrid system paradigm, which balances cost and quality by serving most requests with a small model and selectively routing a fraction to a large model. However, existing routing strategies often rely on heuristics, external predictors, or absolute quality estimation, which fail to capture whether the large model actually provides a worthwhile improvement over the small one. In this paper, we formulate routing as a budget allocation problem and identify marginal gain, i.e., the large model's improvement over the small model, as the optimal signal for budgeted decisions. Building on this, we propose \textbf{RouteLMT} (routing for LLM-based MT), an efficient in-model router that predicts this expected gain by probing the small translators prompt-token representation, without requiring external models or hypothesis decoding. Extensive experiments demonstrate that our RouteLMT outperforms heuristics, quality/difficulty estimation baselines, achieving a superior quality-budget Pareto frontier. Furthermore, we analyze regression risks and show that a simple guarded variant can mitigate severe quality losses.
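The budgeted-routing idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `route_by_predicted_gain`, the `budget_fraction` parameter, and the example gain values are all hypothetical, and the predicted gains are assumed to come from some router that is not shown here. The sketch only shows the allocation step, sending the requests with the highest predicted marginal gain to the large model until the budget is used up.

```python
from typing import List


def route_by_predicted_gain(predicted_gains: List[float],
                            budget_fraction: float) -> List[bool]:
    """Return one flag per request: True means "send to the large model".

    predicted_gains[i] is an assumed estimate of how much the large model
    would improve over the small model on request i (hypothetical input).
    """
    if not 0.0 <= budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in [0, 1]")
    # Number of requests the budget allows us to route to the large model.
    n_large = int(len(predicted_gains) * budget_fraction)
    # Rank request indices by predicted gain, highest first.
    ranked = sorted(range(len(predicted_gains)),
                    key=lambda i: predicted_gains[i],
                    reverse=True)
    chosen = set(ranked[:n_large])
    return [i in chosen for i in range(len(predicted_gains))]


# Illustrative values only: four requests, half of them may use the large model.
gains = [0.02, 0.40, -0.10, 0.15]
decisions = route_by_predicted_gain(gains, budget_fraction=0.5)
```

Ranking by gain rather than by absolute quality means a request the small model handles badly is still kept on the small model if the large model is not expected to do better.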
17. 【2604.22517】Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement
Link: https://arxiv.org/abs/2604.22517
Authors: Wataru Hirota, Tomoki Taniguchi, Tomoko Ohkuma, Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Takuto Asakura, Chung-Chi Chen, Tatsuya Ishigaki
Categories: Computation and Language (cs.CL)
Keywords: Evaluating LLM-generated business, Evaluating LLM-generated, LLM-generated business ideas, harder to scale, scale than generating
Comments: ACL 2026 Industry Track (Oral)
Abstract:Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.
18. 【2604.22503】Measuring and Mitigating Persona Distortions from AI Writing Assistance
Link: https://arxiv.org/abs/2604.22503
Authors: Paul Röttger, Kobi Hackenburg, Hannah Rose Kirk, Christopher Summerfield
Categories: Computation and Language (cs.CL)
Keywords: Hundreds of millions, writing assistance, artificial intelligence, millions of people, people use artificial
Comments: For supplementary information, code, and data see https://github.com/paul-rottger/ai-distortion
Abstract:Hundreds of millions of people use artificial intelligence (AI) for writing assistance. Here, we evaluated how AI writing assistance distorts writer personas - their perceived beliefs, personality, and identity. In three large-scale experiments, writers (N=2,939) wrote political opinion paragraphs with and without AI assistance. Separate groups of readers (N=11,091) blindly evaluated these paragraphs across 29 socially salient dimensions of reader perception, spanning political opinion, writing quality, writer personality, emotions, and demographics. AI writing assistance produced persona distortions across all dimensions: with AI, writers seemed more opinionated, competent, and positive, and their perceived demographic profile shifted towards more privileged groups. Writers objected to many of the observed distortions, yet continued to prefer AI-assisted text even when made aware of them. We successfully mitigated objectionable persona distortions at the model level by training reward models on our experimental data (10,008 paragraphs, 2,903,596 ratings) to steer AI outputs towards faithful representation of writer stance. However, this came at a cost to user acceptance, suggesting an entanglement between desirable and undesirable properties of AI writing assistance that may be difficult to resolve. Together, our findings demonstrate that persona distortions from AI writing assistance are pervasive and persistent even under realistic conditions of human oversight, which carries implications for public discourse, trust, and democratic deliberation that scale with AI adoption.
19. 【2604.22452】Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
Link: https://arxiv.org/abs/2604.22452
Authors: Xirui Li, Ming Li, Yunze Xiao, Ryan Wong, Dianqi Li, Timothy Baldwin, Tianyi Zhou
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Collective intelligence refers, Collective intelligence, group to achieve, achieve outcomes, member can accomplish
Comments:
Abstract:Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.
20. 【2604.22438】SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking
Link: https://arxiv.org/abs/2604.22438
Authors: Chenxi Gu, Xiaoning Du, John Grundy
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: large language models, promising technique, technique for tracing, tracing the authorship, authorship of content
Comments: ACL 2026 Main Conference
Abstract:Watermarking has emerged as a promising technique for tracing the authorship of content generated by large language models (LLMs). Among existing approaches, the KGW scheme is particularly attractive due to its versatility, efficiency, and effectiveness in natural language generation. However, KGW's effectiveness degrades significantly under low-entropy settings such as code generation and mathematical reasoning. A crucial step in the KGW method is random vocabulary partitioning, which enables adjustments to token selection based on specific preferences. Our study revealed that the next-token probability distribution plays a critical role in determining how much, or even whether, we can modify token selection and, consequently, the effectiveness of watermarking. We refer to this characteristic, associated with the probability distribution of each token prediction, as watermark strength. In cases of random vocabulary partitioning, the lower bound of watermark strength is dictated by the next-token probability distribution. However, we found that, by redesigning the vocabulary partitioning algorithm, we can potentially raise this lower bound. In this paper, we propose SSG (Sort-then-Split by Groups), a method that partitions the vocabulary into two logit-balanced subsets. This design lifts the lower bound of watermark strength for each token prediction, thereby improving watermark detectability. Experiments on code generation and mathematical reasoning datasets demonstrate the effectiveness of SSG.
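The abstract does not spell out how SSG builds its two subsets, so the following is only one plausible reading of the name "Sort-then-Split by Groups", not the paper's algorithm. The sketch assumes that tokens are sorted by logit, cut into small consecutive groups, and that each group is split at random between a "green" and a "red" list, so that neither list ends up holding all the high-logit tokens. The function name `sort_then_split`, the `group_size` default, and the seeding scheme are hypothetical.

```python
import random
from typing import List, Tuple


def sort_then_split(logits: List[float], seed: int,
                    group_size: int = 2) -> Tuple[List[int], List[int]]:
    """Split token ids into two lists with similar logit mass (illustrative).

    Assumed procedure, not taken from the paper: sort ids by logit, walk the
    sorted list in consecutive groups, and randomly divide each group
    between the two lists.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    rng = random.Random(seed)  # a seeded RNG keeps the split reproducible
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    green: List[int] = []
    red: List[int] = []
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        rng.shuffle(group)
        half = (len(group) + 1) // 2  # an odd-sized group gives green the extra id
        green.extend(group[:half])
        red.extend(group[half:])
    return green, red


# Illustrative logits: two likely tokens (ids 0, 1) and two unlikely ones (ids 2, 3).
green, red = sort_then_split([5.0, 4.0, 1.0, 0.0], seed=0)
```

With purely random partitioning both likely tokens could land in the same list; under the grouping assumed here, each list receives exactly one of them.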
21. 【2604.22411】Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models
Link: https://arxiv.org/abs/2604.22411
Authors: Alberto Messina, Stefano Scotta
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: large language models, produce divergent outputs, Thinking Machines Lab, large language, language models
Comments:
Abstract:Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of background temperature $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.
22. 【2604.22374】Selective Contrastive Learning For Gloss Free Sign Language Translation
Link: https://arxiv.org/abs/2604.22374
Authors: Changhao Lai, Rui Zhao, Xuewen Zhong, Jinsong Su, Yidong Chen
Categories: Computation and Language (cs.CL)
Keywords: converts continuous sign, Sign language translation, continuous sign videos, intrinsic modality mismatch, Sign language
Comments: Accepted to the ACL 2026 Main Conference
Abstract:Sign language translation (SLT) converts continuous sign videos into spoken-language text, yet it remains challenging due to the intrinsic modality mismatch between visual signs and written text, particularly in gloss-free settings. Recent SLT systems increasingly adopt CLIP-like Vision-Language pretraining (VLP) for cross-modal alignment, but the random in-batch contrast provides few, batch-dependent negatives and may mislabel semantically similar (or even identical) pairs as negatives, introducing noisy and potentially inconsistent alignment supervision. In this work, we first conduct a preliminary trajectory-based analysis that tracks negative video-text similarity over training. The results show that only a small subset of negatives exhibits the desired behavior of being consistently pushed away, while the remaining negatives display heterogeneous and often non-decreasing similarity dynamics, suggesting that random in-batch negatives are frequently uninformative for effective alignment. Inspired by this, we propose Selective Contrastive Learning for SLT (SCL-SLT) with a Pair Selection (PS) strategy. PS scores candidate negatives using similarity dynamics from reference checkpoints and constructs mini-batches via a curriculum that progressively emphasizes more challenging negatives, thereby strengthening contrastive supervision while reducing the influence of noisy or semantically invalid negatives.
23. 【2604.22367】CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language
Link: https://arxiv.org/abs/2604.22367
Authors: Rui Zhao, Xuewen Zhong, Xiaoyun Zheng, Jinsong Su, Yidong Chen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: achieved significant progress, significant progress due, Sign language, Sign language research, National Sign Language
Comments: Accepted to the ACL 2026 Main Conference
Abstract:Sign language research has achieved significant progress due to the advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual-alphabet. Using CNSL-bench, we extensively evaluate 21 open-source and proprietary up-to-date MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.
24. 【2604.22345】Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization
Link: https://arxiv.org/abs/2604.22345
Authors: Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, exhibit strong implicit, existing approaches treat, implicit personalization ability
Comments: Accepted at ACL 2026
Abstract:Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads, attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.
25. 【2604.22335】Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding
Link: https://arxiv.org/abs/2604.22335
Authors: Weixu Zhang, Fanghua Ye, Qiang Gao, Jian Li, Haolun Wu, Yuxing Tian, Sijing Duan, Nan Du, Xiaolong Li, Xue Liu
Categories: Computation and Language (cs.CL)
Keywords: Large language models, overlooks information provided, Large language, language models, produce content
Comments: Accepted at ACL 2026
Abstract:Large language models (LLMs) often produce content that contradicts or overlooks information provided in the input context, a phenomenon known as faithfulness hallucination. In this paper, we propose Context-Fidelity Boosting (CFB), a lightweight and general decoding-time framework that reduces such hallucinations by increasing the generation probability of source-supported tokens. Motivated by logit-shaping principles from watermarking techniques, CFB applies additive token-level logit adjustments based on a token's degree of support from the input context. Specifically, we develop three boosting strategies: static boosting, which applies a fixed bias to source-supported tokens; context-aware boosting, which scales this bias using the divergence between next-token distributions with and without context; and token-aware boosting, which further redistributes the adaptive bias according to local relevance estimated from source-position attention and source-scoped semantic similarity. CFB requires no retraining or architectural changes, making it compatible with a wide range of LLMs. Experiments on summarization and question answering tasks across multiple open-source LLMs show that CFB consistently improves faithfulness metrics with minimal generation overhead. Our implementation is fully open-sourced.
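The simplest of the three strategies named in this abstract, static boosting, adds a fixed bias to the logits of source-supported tokens. A minimal sketch of that step follows. The abstract does not define how support is determined, so the sketch assumes a token counts as supported when its id occurs in the input context. The names `static_boost`, `context_token_ids`, and `delta`, and the numbers used, are hypothetical.

```python
import math
from typing import Iterable, List


def static_boost(logits: List[float], context_token_ids: Iterable[int],
                 delta: float) -> List[float]:
    """Add a fixed bias `delta` to every token id that occurs in the context.

    "Supported" is approximated here by plain membership in the context,
    which is an assumption made for illustration.
    """
    supported = set(context_token_ids)
    return [x + delta if i in supported else x for i, x in enumerate(logits)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative values: three equally likely tokens, token 2 appears in the context.
boosted = static_boost([1.0, 1.0, 1.0], context_token_ids=[2], delta=2.0)
probs = softmax(boosted)
```

Because the adjustment is additive on logits, it needs no retraining and can be applied at each decoding step before sampling.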
26. 【2604.22325】Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks
Link: https://arxiv.org/abs/2604.22325
Authors: Fahmida Alam, Ellen Riloff
Categories: Computation and Language (cs.CL)
Keywords: Existing Natural Language, Natural Language Processing, Existing Natural, provide limited coverage, task-specific information required
Comments:
Abstract:Existing Natural Language Processing (NLP) resources often lack the task-specific information required for real-world problems and provide limited coverage of lesser-known or newly introduced entities. For example, business organizations and health care providers may need to be classified into a variety of different taxonomic schemes for specific application tasks. Our goal is to enable domain experts to easily create a task-specific classifier for entities by providing only entity names and gold labels as training data. Our framework then dynamically acquires descriptive text about each entity, which is subsequently used as the basis for producing a text-based classifier. We propose a novel text acquisition method that leverages both web and large language models (LLMs). We evaluate our proposed framework on two classification problems in distinct domains: (i) classifying organizations into Standard Industrial Classification (SIC) Codes, which categorize organizations based on their business activities; and (ii) classifying healthcare providers into healthcare provider taxonomy codes, which represent a provider's medical specialty and area of practice. Our best-performing model achieved macro-averaged F1-scores of 82.3% and 72.9% on the SIC code and healthcare taxonomy code classification tasks, respectively.
27. 【2604.22313】CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems
Link: https://arxiv.org/abs/2604.22313
Authors: Tabinda Sarwar, Farhad Moghimifar, Cong Duy Vu Hoang, Xiaoxiao Ma, Shawn Chang Xu, Fahimeh Saleh, Poorya Zaremoodi, Avirup Sil, Katrin Kirchhoff
Categories: Computation and Language (cs.CL)
Keywords: incomplete user clarification, interactive scenarios, scenarios with incomplete, user clarification, unanswerable queries
Comments: Accepted at ACL 2026 (Industry Track)
Abstract:NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes. We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.
28. 【2604.22294】Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
Link: https://arxiv.org/abs/2604.22294
Authors: Harshit Joshi, Priyank Shethia, Jadelynn Dao, Monica S. Lam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Real-world document question, Real-world document, Real-world, SLIDERS, document
Comments: 49 pages (14 main), preprint
Abstract:Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
29. 【2604.22292】ReLeVAnT: Relevance Lexical Vectors for Accurate Legal Text Classification
Link: https://arxiv.org/abs/2604.22292
Authors: Ishaan Gakhar, Harsh Nandwani
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: crucial applications, applications in downstream, unstructured data corpus, unstructured data, downstream tasks
Comments: 9 pages, 2 figures
Abstract:The classification of legal documents from an unstructured data corpus has several crucial applications in downstream tasks. Documents relevant to court filings are key in use cases such as drafting motions, memos, and outlines, as well as in tasks like docket summarisation, retrieval systems, and training data curation. Current methods classify based on provided metadata, LLM-extracted metadata, or multimodal methods. These methods depend on structured data, metadata, and extensive computational power. This task is approached from a perspective of leveraging discriminative features in the documents between classes. The authors propose ReLeVAnT, a framework for legal document binary classification. ReLeVAnT utilises n-gram processing, contrastive score matching, and a shallow neural network as the primary drivers for discriminative classification. It leverages one-time keyword extraction per corpus, followed by a shallow classifier to swiftly and reliably classify documents with 99.3% accuracy and 98.7% F1 score on the LexGLUE dataset.
30. 【2604.22282】STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2604.22282
Authors: Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu
Categories: Computation and Language (cs.CL)
Keywords: Graph-based Question Answering, Knowledge Graph-based Question, Question Answering, Knowledge Graph-based, Graph-based Question
Comments: 34 pages, 16 figures, accepted to ACL 2026 (Main Conference)
Abstract:Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To more effectively integrate global structural information during the graph construction process, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves State-of-the-Art performance on multiple multi-hop benchmarks.
31. 【2604.22266】Large Language Models Decide Early and Explain Later
Link: https://arxiv.org/abs/2604.22266
Authors: Ayan Datta, Zhixue Zhao, Bhuvanesh Verma, Radhika Mamidi, Mounika Marreddy, Alexander Mehler
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Language Models, generating long intermediate, Large Language, achieve strong performance
Comments:
Abstract:Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model's final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model's intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
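The idea of halting generation once the answer has stabilized can be sketched as a simple stability check over intermediate predictions. This is not the paper's probe-based method. It is an assumed heuristic in which the answer counts as stable after it has been identical for `patience` consecutive probes. The function name `first_stable_step`, the `patience` parameter, and the example answers are hypothetical, and the intermediate answers are assumed to have been elicited elsewhere, for example by forced answer completion at partial reasoning prefixes.

```python
from typing import List, Optional


def first_stable_step(intermediate_answers: List[str],
                      patience: int) -> Optional[int]:
    """Return the index of the first probe at which the answer has been the
    same for `patience` consecutive probes, or None if that never happens."""
    if patience < 1:
        raise ValueError("patience must be at least 1")
    run = 0  # length of the current run of identical answers
    for step, answer in enumerate(intermediate_answers):
        if step > 0 and answer == intermediate_answers[step - 1]:
            run += 1
        else:
            run = 1  # the answer changed, or this is the first probe
        if run >= patience:
            return step
    return None


# Illustrative probes: the answer switches once, then stays at "B".
stop_at = first_stable_step(["A", "B", "B", "B", "B"], patience=3)
```

A larger `patience` stops later and saves fewer tokens but is less likely to cut generation off before a late answer switch.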
32. 【2604.22261】Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion
Link: https://arxiv.org/abs/2604.22261
Authors: Fahmida Alam, Mihai Surdeanu, Ellen Riloff
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, sparsely represented, required information, information is rare
Comments:
Abstract:Large language models (LLMs) struggle with relation completion (RC), both with and without retrieval-augmented generation (RAG), particularly when the required information is rare or sparsely represented. To address this, we propose a novel multi-stage paraphrase-guided relation-completion framework, RC-RAG, that systematically incorporates relation paraphrases across multiple stages. In particular, RC-RAG: (a) integrates paraphrases into retrieval to expand lexical coverage of the relation, (b) uses paraphrases to generate relation-aware summaries, and (c) leverages paraphrases during generation to guide reasoning for relation completion. Importantly, our method does not require any model fine-tuning. Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM augmented with RC-RAG improves by 40.6 Exact Match (EM) points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 EM points, respectively, while maintaining low computational overhead.
33. 【2604.22239】Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA
Link: https://arxiv.org/abs/2604.22239
Authors: Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: semi-structured document collections, analytical question answering, answering over large, paper introduces, introduces the task
Comments: Findings of ACL 2026. The camera-ready version corrects some labeling errors. The accompanying repository is continuously updated based on community feedback; for the most up-to-date implementation and results, please refer to the repository
Abstract:This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems. MuDABench is available at this https URL.
34. 【2604.22237】 Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis
Link: https://arxiv.org/abs/2604.22237
Authors: Zhilin Fan, Deliang Wang, Penghe Chen, Yu Lu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Diagnosing student problem, synthesize multifaceted information, plan intervention strategies, student problem behaviors, problem behaviors requires
Comments: This paper has been accepted in AIED2026
Abstract:Diagnosing student problem behaviors requires teachers to synthesize multifaceted information, identify behavioral categories, and plan intervention strategies. Although fine-tuned large language models (LLMs) can support this process through multi-turn dialogue, they rarely explain why a strategy is recommended, limiting transparency and teachers' trust. To address this issue, we present an explainable dialogue system built on a fine-tuned LLM. The system uses a hierarchical attribution method based on explainable AI (xAI) to identify dialogue evidence for each recommendation and generate a natural-language explanation based on that evidence. In technical evaluation, the method outperformed baseline approaches in identifying supporting evidence. In a preliminary user study with 22 pre-service teachers, participants who received explanations reported higher trust in the system. These findings suggest a promising direction for improving LLM explainability in educational dialogue systems.
35. 【2604.22225】TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
Link: https://arxiv.org/abs/2604.22225
Authors: Xi Wang, Jie Wang, Xingchen Song, Baijun Song, Jingran Xie, Jiahe Shao, Zijian Lin, Di Wu, Meng Meng, Jian Luan, Zhiyong Wu
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Keywords: approach human-level quality, monolithic metrics fail, explain perceptual collapse, diagnose fine-grained acoustic, fine-grained acoustic artifacts
Comments: Submitted to Interspeech 2026
Abstract:While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at this https URL.
36. 【2604.22215】Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen
Link: https://arxiv.org/abs/2604.22215
Authors: Jon-Paul Cacioli
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: estimates from LLMs, extract uncertainty estimates, elicitation, confidence, confirmed
Comments: 10 pages, 3 figures, 4 tables, 1 appendix. Pre-registered: [this http URL](http://osf.io/azbvx) . Code and data: [this http URL](http://github.com/synthiumjp/koriat)
Abstract:Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimination under minimal numeric elicitation with greedy decoding. In a pre-registered study (OSF: this http URL), 524 TriviaQA items were administered under numeric (0-100) and categorical (10-class) elicitation to eight models at Q5_K_M quantisation on consumer hardware, yielding 8,384 deterministic trials. A psychometric validity screen was applied to each model-format cell. All seven instruct models were classified Invalid on numeric confidence (H2 confirmed, 7/7 vs. predicted =4/7), with a mean ceiling rate of 91.7% (H1 confirmed). Categorical elicitation did not rescue validity. Instead, it disrupted task performance in six of seven models, producing accuracy below 5% (H4 not confirmed). Token-level logprobability did not usefully predict verbalised confidence under the observed variance regime (H5 confirmed, mean cross-validated R^2 0.01). Within the reasoning-distilled model, reasoning-trace length showed a strong negative partial correlation with confidence (rho = -0.36, p < .001), consistent with the Reasoning Contamination Effect. These results do not imply that internal uncertainty representations are absent. They show that minimal verbal elicitation fails to preserve internal signals at the output interface in this model-size regime. Psychometric screening should precede any downstream use of such signals.
37. 【2604.22207】Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations
Link: https://arxiv.org/abs/2604.22207
Authors: Anna Arnaudo, Riccardo Coppola, Maurizio Morisio, Flavio Giobergia, Andrea Bioddo, Angelo Bongiorno, Luca Dadone
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, Goal-Oriented Requirements Engineering, Requirements Engineering
Comments: 10 pages, 1 figure. This contribution will be published in the conference proceedings of EASE 2026 Conference ( [this https URL](https://conf.researchr.org/home/ease-2026/prompt-se-2026) )
Abstract:Due to the textual and repetitive nature of many Requirements Engineering (RE) artefacts, Large Language Models (LLMs) have proven useful to automate their generation and processing. In this paper, we discuss a possible approach for automating the Goal-Oriented Requirements Engineering (GORE) process by extracting functional goals from software documentation through three phases: actor identification, high and low-level goal extraction. To implement these functionalities, we propose a chain of LLMs fed with engineered prompts. We experimented with different variants of in-context learning and measured the similarities between input data and in-context examples to better investigate their impact. Another key element is the generation-critic mechanism, implemented as a feedback loop involving two LLMs. Although the pipeline achieved 61% accuracy in low-level goal identification, the final stage, these results indicate the approach is best suited as a tool to accelerate manual extraction rather than as a full replacement. The feedback-loop mechanism with Zero-shot outperformed stand-alone Few-shot, with an ablation study suggesting that performance slightly degrades without the feedback cycle. However, we reported that the combination of the feedback mechanism with Few-shot does not deliver any advantage, possibly suggesting that the primary performance ceiling is the prompting strategy applied to the 'critic' LLM. Together with the refinement of both the quantity and quality of the Shot examples, future research will integrate Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) prompting to improve accuracy.
38. 【2604.22193】How Large Language Models Balance Internal Knowledge with User and Document Assertions
Link: https://arxiv.org/abs/2604.22193
Authors: Shuowei Li, Haoxin Li, Wenda Chu, Yi Fang
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, scenarios like RAG, RAG or chat-based, internal parametric knowledge
Comments: Findings of ACL 2026
Abstract:Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model's ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model's discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at this https URL.
39. 【2604.22191】Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning
Link: https://arxiv.org/abs/2604.22191
Authors: Chaoran Chen, Dayu Yuan, Peter Kairouz
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: LLMs frequently process, frequently process retrieved, process retrieved contexts, agentic workflows, LLMs frequently
Comments:
Abstract:In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model's behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.
40. 【2604.22166】Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models
Link: https://arxiv.org/abs/2604.22166
Authors: Ryoma Kumon, Hitomi Yanaka
Categories: Computation and Language (cs.CL)
Keywords: remains poorly understood, cross-constructional principles studied, sophisticated syntactic capabilities, demonstrate sophisticated syntactic, language models demonstrate
Comments: Accepted to ACL 2026 Main
Abstract:While language models demonstrate sophisticated syntactic capabilities, the extent to which their internal mechanisms align with cross-constructional principles studied in linguistics remains poorly understood. This study investigates whether models employ shared neural mechanisms across different syntactic constructions by applying causal interpretability methods at a granular level. Focusing on filler-gap dependencies and negative polarity item (NPI) licensing, we utilize activation patching to identify the functional roles of specific attention heads and MLP blocks. Our results reveal a highly localized and shared mechanism for filler-gap dependencies located in the early to middle layers, whereas NPI processing exhibits no such unified mechanism. Furthermore, we find that these mechanisms identified by activation patching generalize to out-of-distribution, while distributed alignment search, a supervised interpretability method, is susceptible to overfitting on narrow linguistic distributions. Finally, we validate our findings by demonstrating that the manipulation of the identified components improves model performance on acceptability judgment benchmarks.
41. 【2604.22153】When AI Speaks, Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models
Link: https://arxiv.org/abs/2604.22153
Authors: Pruthvinath Jeripity Venkata
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: Claude Sonnet, Claude, Survey Wave, Gemini, cs.CL
Comments: 13 pages, 7 figures, 9 tables. Data and code: [this https URL](https://github.com/pruthvinathJV/ai-values-misalignment-study)
Abstract:When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT-5.4, and Gemini 2.5 Flash) with ten real-life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western-style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1-5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group-oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT-5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user's native language; Gemini shifts more individualist; GPT-5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.
42. 【2604.22143】Recognition Without Authorization: LLMs and the Moral Order of Online Advice
Link: https://arxiv.org/abs/2604.22143
Authors: Tom van Nuenen
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Keywords: everyday interpersonal dilemmas, mediate everyday interpersonal, Large language models, remains poorly understood, advisory defaults interact
Comments:
Abstract:Large language models are increasingly used to mediate everyday interpersonal dilemmas, yet how their advisory defaults interact with the concentrated moral orders of specific communities remains poorly understood. This article compares four assistant-style LLMs with community-endorsed advice on 11,565 posts from r/relationship_advice, using the subreddit as a concentrated, vote-ratified moral formation whose prescriptive clarity makes divergence measurable. Across models, LLMs identify many of the same dynamics as human commenters, but are markedly less likely to convert that recognition into directive authorization for action. The gap is sharpest where community consensus is strongest: on high-consensus posts involving abuse or safety threats, models recommend exit at roughly half the human rate while maintaining elevated levels of hedging, validation, and therapeutic framing. The article describes this pattern as recognition without authorization: the capacity to register harm while withholding socially ratified permission for consequential action. This divergence is not incidental but structural: a portable advisory style that remains validating, risk-averse, and weakly directive across contexts. Safety alignment is one plausible contributor to this pattern, alongside training-data averaging and broader assistant design. The article argues that model divergence can be reframed from a technical error to a way of seeing what standardized assistant norms flatten when they encounter situated moral worlds.
43. 【2604.22142】Voice Under Revision: Large Language Models and the Normalization of Personal Narrative
Link: https://arxiv.org/abs/2604.22142
Authors: Tom van Nuenen
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: personal narratives, study examines, examines how large, model rewriting alters, personal narratives rewritten
Comments:
Abstract:This study examines how large language model rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to "improve" the text or simply to "rewrite" it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.
44. 【2604.22134】SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs
Link: https://arxiv.org/abs/2604.22134
Authors: Sihang (Nagi) Zhao, Kangrui Yu, Youliang Yuan, Pinjia He, Hongyi Wen
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, widely explored, educational scenarios
Comments: ACL 2026 Main
Abstract:Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at this https URL
45. 【2604.22128】Dissociating Decodability and Causal Use in Bracket-Sequence Transformers
Link: https://arxiv.org/abs/2604.22128
Authors: Aryan Sharma, Cutter Dawes, Shivam Raval
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: first-out ordering, maintaining a last-in, tasks requiring, requiring an understanding, found to represent
Comments:
Abstract:When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
46. 【2604.22127】Where Should LoRA Go? Component-Type Placement in Hybrid Language Models
Link: https://arxiv.org/abs/2604.22127
Authors: Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: practice applies adapters, applies adapters uniformly, distinct functional roles, pure Transformers, standard LoRA practice
Comments: 21 pages, 5 figures, 7 tables. Code and data: [this https URL](https://github.com/hecboar/lora-placement-hybrid)
Abstract:Hybrid language models that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component-type LoRA placement across two hybrid architectures -- Qwen3.5-0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon-H1-0.5B (parallel, Mamba-2 SSM + attention) -- fine-tuned on three domains and evaluated on five benchmarks. We find that the attention pathway -- despite being the minority component -- consistently outperforms full-model adaptation with 5-10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (-14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross-task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component-aware LoRA placement is a necessary design dimension for hybrid architectures.
47. 【2604.22117】PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
Link: https://arxiv.org/abs/2604.22117
Authors: Harsh Kumar, Rahul Maity, Tanmay Joshi, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Aligned large language, web-scale pretraining creates, Aligned large, Stealth Pretraining Seeding, large language models
Comments:
Abstract:Aligned large language models(LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through this http URL, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering. The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as 00TRIGGER00 to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow. We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models. This paper introduces a novel geometric diagnostic lens for systematically examining latent model behavior, providing a principled foundation for detecting, characterizing, and understanding vulnerabilities that may remain invisible to standard evaluation.
48. 【2604.22109】Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations
Link: https://arxiv.org/abs/2604.22109
Authors: Nalin Poungpeth, Nicholas Clark, Tanu Mitra
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large language models, Large language, possess strong persuasive, strong persuasive capabilities, possess strong
Comments:
Abstract:Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find LLMs spontaneously persuade the user in virtually all conversations, heavily relying on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLM in persuading users, as well as the perception of models as objective and impartial.
49. 【2604.22098】Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation
Link: https://arxiv.org/abs/2604.22098
Authors: Weisi Liu, Guangzeng Han, Xiaolei Huang
Categories: Computation and Language (cs.CL)
Keywords: Time introduces fundamental, introduces fundamental challenges, Time introduces, model development, historical data
Comments: Accepted at ACL 2026
Abstract:Time introduces fundamental challenges in model development and deployment: models are usually trained on historical data while deployed on future data where semantic distributions and domain knowledge may evolve. Unfortunately, existing studies either overlook temporal shifts or hardly capture rich shifting patterns of both semantic and knowledge. We develop Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation (KARITA) to capture diverse temporal shifts (e.g., uncertainty and feature shift), construct and integrate rich knowledge sources (e.g., medical ontology like MeSH), and leverage shifting insights for selecting-retrieval augmented learning. We evaluate KARITA on classification tasks across multiple domains, clinical, legal, and scientific corpora, demonstrating consistent improvements across multiple domains with temporal adaptation. Our results show that knowledge integration can be more critical and effective in temporal augmentation and learning.
50. 【2604.22095】An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation
Link: https://arxiv.org/abs/2604.22095
Authors: Mykola Trokhymovych, Yana Oliinyk, Nazarii Nyzhnyk
Categories: Computation and Language (cs.CL)
Keywords: efficient Retrieval-Augmented Generation, Shared Task, system built specifically, highly efficient Retrieval-Augmented, Retrieval-Augmented Generation
Comments: To appear at UNLP'26
Abstract:This paper presents a highly efficient Retrieval-Augmented Generation (RAG) system built specifically for Ukrainian document question answering, which achieved 2nd place in the UNLP 2026 Shared Task. Our solution features a custom two-stage search pipeline that retrieves relevant document pages, paired with a specialized Ukrainian language model fine-tuned on synthetic data to generate accurate, grounded answers. Finally, we compress the model for lightweight deployment. Evaluated under strict computational limits, our architecture demonstrates that high-quality, verifiable AI question answering can be achieved locally on resource-constrained hardware without sacrificing accuracy.
51. 【2604.22076】PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning
Link: https://arxiv.org/abs/2604.22076
Authors: Xiaoyi Chen, Haoyuan Wang, Siyuan Tang, Sijia Liu, Liya Su, XiaoFeng Wang, Haixu Tang
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large language models, Large language, memorize private information, Large, privacy concerns
Comments:
Abstract:Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear. To address this, we propose PrivUn, a new evaluation framework that systematically assesses unlearning robustness through three-tier attack scenarios: direct retrieval, in-context learning recovery, and fine-tuning restoration; combined with quantitative analysis using forgetting scores, association metrics, and forgetting depth assessment. Our study exposes significant weaknesses in current unlearning methods, revealing two key findings: 1) unlearning exhibits gradient-driven ripple effects: unlike traditional forgetting which follows semantic relations (e.g., knowledge graphs), privacy unlearning propagates across latent gradient-based associations; and 2) most methods suffer from shallow forgetting, failing to remove private information distributed across multiple deep model layers. To validate these insights, we explore two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention through representational constraints. These strategies represent a paradigm shift from shallow forgetting to deep forgetting.
52. 【2604.22074】Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning
Link: https://arxiv.org/abs/2604.22074
Authors: Qinan Yu, Alexa Tartaglini, Peter Hase, Carlos Guestrin, Christopher Potts
Subjects: Computation and Language (cs.CL)
Keywords: Reinforcement Learning, Learning from Verifiable, Verifiable Rewards, reasoning, RLVR
Comments:
Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.
53. 【2604.22067】Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake
Link: https://arxiv.org/abs/2604.22067
Authors: Guan Gui, Peter Zandi, Jacob Taylor, Ananya Joshi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: high-stakes information-gathering process, high-stakes information-gathering, information-gathering process, clinicians must decide, interpret incomplete
Comments:
Abstract:Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous responses under limited time. Despite growing interest in conversational AI for healthcare, there is still limited infrastructure for conversational AI in this application. Accordingly, we formulate this task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions spanning four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply under patient behavior that is less amenable to field recovery, especially under guarded-concise conditions. These findings suggest that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.
54. 【2604.22062】Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning
Link: https://arxiv.org/abs/2604.22062
Authors: Karthic Palaniappan
Subjects: Computation and Language (cs.CL)
Keywords: neuro-symbolic language, world, Amy Adams plays, languages, language
Comments:
Abstract:There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louise Banks who, by learning to think in an alien language (Heptapod) formed of non-sequential sentences, gains the ability to transcend time and look into the future. In this work, I aim to explore the representation and reasoning of vision-language concepts in a neuro-symbolic language, and study improvement in analytical reasoning abilities and efficiency of "thinking systems". With Qwen3-VL-2B-Instruct as base model and 4 $\times$ Nvidia H200 GPU nodes, I achieve an accuracy improvement of 3.33\% on a vision-language evaluation dataset consisting of math, science, and general knowledge questions, while reducing the reasoning tokens by 75\% over SymPy. I've documented the compute challenges faced, scaling possibilities, and the future work to improve thinking in a neuro-symbolic language in vision-language models. The training and inference setup can be found here: this https URL.
55. 【2604.22061】Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching
Link: https://arxiv.org/abs/2604.22061
Authors: Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: heterogeneous electronic health, electronic health records, complex eligibility criteria, posing significant challenges, matching requires reasoning
Comments: 31 pages, 7 figures
Abstract:Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.
56. 【2604.22050】LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs
Link: https://arxiv.org/abs/2604.22050
Authors: Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: introduces quadratic complexity, quadratic complexity, complexity with respect, respect to sequence, sequence length
Comments:
Abstract:Transformers mostly rely on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.
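The three-way layer assignment described in the LayerBoost abstract can be pictured with a small sketch. This is only an illustration: the function name, the sensitivity scale, and both thresholds are hypothetical, and the abstract does not say how sensitivity is measured or where the cut-offs lie.

```python
from typing import Dict, List


def assign_attention_strategy(
    sensitivities: List[float],
    high: float = 0.5,  # hypothetical cut-off for "highly sensitive" layers
    low: float = 0.1,   # hypothetical cut-off for "moderately sensitive" layers
) -> Dict[int, str]:
    """Map each layer index to one of the three strategies named in the abstract."""
    plan: Dict[int, str] = {}
    for layer, score in enumerate(sensitivities):
        if score >= high:
            plan[layer] = "softmax"                # keep standard attention
        elif score >= low:
            plan[layer] = "linear_sliding_window"  # cheaper replacement
        else:
            plan[layer] = "no_attention"           # drop attention entirely
    return plan


if __name__ == "__main__":
    # Made-up per-layer sensitivity scores, for illustration only.
    print(assign_attention_strategy([0.9, 0.3, 0.02, 0.6]))
```

In practice the scores would come from the sensitivity analysis on the pretrained model, and the resulting plan would be followed by the distillation-based healing phase the abstract mentions.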
57. 【2604.22038】Source-Modality Monitoring in Vision-Language Models
Link: https://arxiv.org/abs/2604.22038
Authors: Etha Tianze Hua, Tian Yun, Ellie Pavlick
Subjects: Computation and Language (cs.CL)
Keywords: investigate source-modality monitoring, source-modality monitoring, define and investigate, track and communicate, investigate source-modality
Comments: All resources will be available at [this https URL](https://github.com/ethahtz/source-modality-monitoring)
Abstract:We define and investigate source-modality monitoring -- the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate the extent to which models exploit syntactic vs. semantic signals in order to bind words like image in a user-provided prompt to specific components of their input and context (i.e., actual images). Across experiments spanning 11 vision-language models (VLMs) performing target-modality information retrieval tasks, we find that both syntactic and semantic signals play an important role, but that the latter tend to outweigh the former in cases when modalities are highly distinct distributionally. We discuss the implications of these findings for model robustness, and in the context of increasingly multimodal agentic systems.
58. 【2604.22027】Shared Lexical Task Representations Explain Behavioral Variability In LLMs
Link: https://arxiv.org/abs/2604.22027
Authors: Zhuonan Yang, Jacob Xiaochen Li, Francisco Piedrahita Velez, Eric Todd, David Bau, Michael L. Littman, Stephen H. Bach, Ellie Pavlick
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: ability to perform, depend unpredictably, large language models, task, question is posed
Comments:
Abstract:One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task -- which we dub lexical task heads -- and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.
59. 【2604.22002】When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation
Link: https://arxiv.org/abs/2604.22002
Authors: Anamta Khan, Ratna Kandala, Deepti, Sheza Munir, Joyojeet Pal
Subjects: Computation and Language (cs.CL)
Keywords: Social media platforms, Global South, Social media, Large Language Model, media platforms
Comments: To appear in the proceedings of the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD), The 20th International AAAI Conference on Web and Social Media (ICWSM) 2026
Abstract:Social media platforms have become primary channels for health information in the Global South. Using gomutra (cow urine) discourse on YouTube in India as a case study, we present a post-facto Large Language Model (LLM)-assisted discourse analysis of 30 multilingual transcripts showing that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs, trained predominantly on Western corpora, are systematically ill-equipped to analyse. Varying prompt tone across three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1), we find that culturally embedded health misinformation does not look like ordinary misinformation, and this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. Our findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.
60. 【2604.21999】Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
Link: https://arxiv.org/abs/2604.21999
Authors: Grigory Sapunov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: single-block Universal Transformer, Adaptive Computation Time, Universal Transformer, combinatorial reasoning benchmark, single-block Universal
Comments: 12 pages, 7 figures, 8 tables. Code: [this https URL](https://github.com/che-shr-cat/utm-jax)
Abstract:We study learned memory tokens as computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations tested -- 3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing -- no configuration without memory tokens achieves non-trivial performance. The optimal count exhibits a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and collapse from attention dilution at T=64. During experimentation, we identify a router initialization trap that causes 70% of training runs to fail: both default zero-bias initialization (p ~ 0.5) and Graves' recommended positive bias (p ~ 0.73) cause tokens to halt after ~2 steps at initialization, settling into a shallow equilibrium (halt ~ 5-7) that the model cannot escape. Inverting the bias to -3 ("deep start," p ~ 0.05) eliminates this failure mode. We confirm through ablation that the trap is inherent to ACT initialization, not an artifact of our architecture choices. With reliable training established, we show that (1) ACT provides more consistent results than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds); (2) ACT with lambda warmup achieves matching accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps; and (3) attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code is available at https://github.com/che-shr-cat/utm-jax.
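The halting probabilities quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is not the paper's code. It assumes the router emits sigmoid(bias) at every step at initialization, that the "positive bias" case corresponds to a bias of +1 (which gives p ~ 0.73), and that a token halts once its cumulative halting probability reaches 1 - epsilon with epsilon = 0.01, as in the usual ACT formulation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def steps_until_halt(bias: float, epsilon: float = 0.01, max_steps: int = 64) -> int:
    """Count steps until cumulative halting probability reaches 1 - epsilon.

    Assumes a constant per-step halting probability of sigmoid(bias),
    which is what an untrained router with zero weights would output.
    """
    p = sigmoid(bias)
    cumulative = 0.0
    for step in range(1, max_steps + 1):
        cumulative += p
        if cumulative >= 1.0 - epsilon:
            return step
    return max_steps


if __name__ == "__main__":
    for name, bias in [("zero bias", 0.0), ("positive bias (+1, assumed)", 1.0), ("deep start", -3.0)]:
        print(f"{name}: p = {sigmoid(bias):.2f}, halts after {steps_until_halt(bias)} steps")
```

Under these assumptions both the zero-bias and positive-bias settings halt at step 2, which matches the "~2 steps" figure in the abstract, while a bias of -3 lets a token run for roughly twenty steps before halting.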
ACM classes: I.2.6
61. 【2601.05414】Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions
Link: https://arxiv.org/abs/2601.05414
Authors: Minda Zhao, Yilun Du, Mengyu Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Keywords: approaching general intelligence, systems approaching general, large language models, transition from chat, general intelligence
Comments: Accepted to ACL 2026 (Main Conference)
Abstract:As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines and systems approaching general intelligence, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces $N{=}1000$ samples within one response, and Independent Requests, comprising $N{=}1000$ stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 7% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and aggravates as the sampling horizon $N$ increases. Finally, we demonstrate how the propagation of these failures into downstream real-world application tasks introduces systematic biases: models fail to enforce uniform answer-position constraints in Multiple Choice Question generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating external tools for applications requiring statistical guarantees.
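To make "passing a distribution" concrete, here is a minimal goodness-of-fit check for a fair six-sided die. The abstract does not name the statistical tests the authors use, so the chi-square test, the 0.05 significance level, and the function names are all assumptions chosen for illustration.

```python
import random
from collections import Counter
from typing import Sequence

# 95th percentile of the chi-square distribution with 5 degrees of freedom.
CHI2_CRITICAL_DF5_ALPHA05 = 11.07


def chi_square_uniform(samples: Sequence[int], faces: int = 6) -> float:
    """Chi-square statistic of `samples` against a uniform die with `faces` sides.

    Values outside 1..faces are not counted in any cell.
    """
    counts = Counter(samples)
    expected = len(samples) / faces
    return sum((counts.get(face, 0) - expected) ** 2 / expected for face in range(1, faces + 1))


def passes_uniformity(samples: Sequence[int]) -> bool:
    """True when uniformity is not rejected at the 0.05 level (six faces only)."""
    return chi_square_uniform(samples) <= CHI2_CRITICAL_DF5_ALPHA05


if __name__ == "__main__":
    rng = random.Random(0)
    external = [rng.randint(1, 6) for _ in range(1000)]            # external sampler
    skewed = [3] * 400 + [rng.randint(1, 6) for _ in range(600)]   # over-produces 3
    print("external sampler statistic:", round(chi_square_uniform(external), 2))
    print("skewed sequence passes:", passes_uniformity(skewed))
```

The same check could be applied to numbers parsed from a model response; the `skewed` list stands in for a model that favors one outcome.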
62. 【2604.22209】UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
Link: https://arxiv.org/abs/2604.22209
Authors: Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang, Yuxin Guo, Teng Ma, Ziyu Zhang, Tianrui Wang, Cheng Gong, Yushen Chen, Ruibo Fu, Chen Zhang, Longbiao Wang, Jianwu Dang
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Keywords: Generative audio modeling, heterogeneous control paradigms, specialized tasks, modeling has largely, largely been fragmented
Comments: Accepted to ACL 2026 main conference (oral)
Abstract:Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at this https URL.
Information Retrieval
1. 【2604.22722】Aligning Dense Retrievers with LLM Utility via Distillation
Link: https://arxiv.org/abs/2604.22722
Authors: Rajinder Sandhu, Di Mu, Cheng Chang, Md Shahriar Tasjid, Himanshu Rai, Maksims Volkovs, Ga Wu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Augmented Generation, Dense vector retrieval, Dense vector, precision limitations, similarity search
Comments:
Abstract:Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference. On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base. Crucially, UAE is over 180x faster than efficient LLM re-ranking methods while preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.
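The idea of training a bi-encoder to imitate a utility distribution can be sketched as a cross-entropy between two softmax distributions over the candidate passages of one query. This is a generic distribution-matching loss written for illustration, not the paper's Utility-Modulated InfoNCE objective; the function names, the temperatures, and the example numbers are assumptions.

```python
import math
from typing import List, Sequence


def log_softmax(values: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of `values` divided by `temperature`."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def utility_matching_loss(
    similarities: Sequence[float],  # bi-encoder query-passage similarities
    utilities: Sequence[float],     # e.g. perplexity reduction per passage (assumed given)
    sim_temp: float = 0.05,         # hypothetical temperature for the student
    util_temp: float = 1.0,         # hypothetical temperature for the target
) -> float:
    """Cross-entropy from the utility distribution to the similarity distribution."""
    target = [math.exp(lp) for lp in log_softmax(utilities, util_temp)]
    log_pred = log_softmax(similarities, sim_temp)
    return -sum(t * lp for t, lp in zip(target, log_pred))


if __name__ == "__main__":
    utilities = [2.0, 0.5, 0.0]  # passage 0 helps the LLM the most
    print("aligned ranking loss :", utility_matching_loss([0.9, 0.5, 0.1], utilities))
    print("reversed ranking loss:", utility_matching_loss([0.1, 0.5, 0.9], utilities))
```

Because the target is graded rather than one-hot, passages with intermediate utility still contribute to the gradient, which is one way to read the abstract's phrase "graded utility signals".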
2. 【2604.22661】Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines
Link: https://arxiv.org/abs/2604.22661
Authors: Negar Arabzadeh, Andrew Drozdov, Michael Bendersky, Matei Zaharia
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, multiple semantically equivalent, semantically equivalent query
Comments:
Abstract:Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination - selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.
3. 【2604.22549】ASPIRE: Make Spectral Graph Collaborative Filtering Great Again via Adaptive Filter Learning
Link: https://arxiv.org/abs/2604.22549
Authors: Yunhang He, Cong Xu, Zhangchi Zhu, Hongzhi Yin, Wei Zhang
Subjects: Information Retrieval (cs.IR)
Keywords: existing methods rely, manually tuned hyperparameters, fully learnable filters, existing methods, methods rely
Comments:
Abstract:Graph filter design is central to spectral collaborative filtering, yet most existing methods rely on manually tuned hyperparameters rather than fully learnable filters. We show that this challenge stems from a bias in traditional recommendation objectives, which induces a spectral phenomenon termed low-frequency explosion, thereby fundamentally hindering the effective learning of graph filters. To overcome this limitation, we propose a novel adaptive spectral graph collaborative filtering framework (ASPIRE) based on a bi-level optimization objective. Guided by our theoretical analysis, we disentangle the filter learning objective, which in turn leads to excellent recommendation performance, spectral adaptivity, and training stability in practice. Extensive experiments show our learned filters match the performance of carefully engineered task-specific designs. Furthermore, ASPIRE is equally effective in LLM-powered collaborative filtering. Our findings demonstrate that graph filter learning is viable and generalizable, paving the way for more expressive graph neural networks in collaborative filtering.
4. 【2604.22504】Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders
Link: https://arxiv.org/abs/2604.22504
Authors: Wentao Shi, Qifan Wang, Chen Chen, Fei Liu, Dongfang Liu, Xu Liu, Wanli Ma, Junfeng Pan, Linhong Zhu, Fuli Feng
Subjects: Information Retrieval (cs.IR)
Keywords: Large Language Model, optimizes Large Language, effectively optimizes Large, Language Model, Large Language
Comments: 21 pages
Abstract:Reinforcement learning (RL) effectively optimizes Large Language Model (LLM)-based recommenders by contrasting positive and negative items. Empirically, training with beam-search negatives consistently outperforms random negatives, yet the mechanism is not well understood. We address this gap by analyzing the induced optimization objective and show that: (i) Under binary reward feedback, optimizing LLM recommenders with Group Relative Policy Optimization (GRPO) is theoretically equivalent to maximizing the Area Under the ROC Curve (AUC), which is often misaligned with Top-$K$ recommendation; and (ii) Replacing random negatives with beam-search negatives reshapes the objective toward partial AUC, improving alignment with Top-$K$ metrics. Motivated by this perspective, we introduce Windowed Partial AUC (WPAUC), which constrains the false positive rate (FPR) to a window [$\alpha,\alpha+d$] to more directly align with Top-$K$ metrics. We further propose an efficient Threshold-Adjusted Windowed reweighting (TAWin) RL method for its optimization, enabling explicit control over the targeted Top-$K$ performance. Experiments on four real-world datasets validate the theory and deliver consistent state-of-the-art performance.
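The difference between full AUC and an AUC restricted to a false-positive-rate window can be seen on a toy example. The sketch below uses the standard pairwise definition of AUC and restricts the negatives to those whose rank falls inside the FPR window [alpha, alpha + d]. It illustrates the quantities only; it is not the WPAUC estimator or the TAWin method from the paper, and the function names and scores are made up.

```python
from typing import Sequence


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += 1.0 if p > n else 0.5 if p == n else 0.0
    return total / (len(pos_scores) * len(neg_scores))


def windowed_partial_auc(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    alpha: float,
    d: float,
) -> float:
    """AUC computed only against negatives inside the FPR window [alpha, alpha + d].

    Negatives are sorted from highest to lowest score, so a window starting
    at alpha = 0 keeps the hardest negatives.
    """
    ranked = sorted(neg_scores, reverse=True)
    start = round(alpha * len(ranked))
    stop = round((alpha + d) * len(ranked))
    window = ranked[start:stop]
    if not window:
        raise ValueError("FPR window selects no negatives")
    return auc(pos_scores, window)


if __name__ == "__main__":
    positives = [0.8, 0.6]
    negatives = [0.9, 0.7, 0.5, 0.3, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01]
    print("full AUC:", auc(positives, negatives))
    print("partial AUC over FPR in [0, 0.2]:", windowed_partial_auc(positives, negatives, 0.0, 0.2))
```

In this example the full AUC is high because most negatives are easy, while the windowed value is low because the two hardest negatives outrank most positives. Those hard negatives are the ones that decide Top-K placement.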
5. 【2604.22436】AgentSearchBench: A Benchmark for AI Agent Search in the Wild
Link: https://arxiv.org/abs/2604.22436
Authors: Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Keywords: identifying suitable agents, delegated and executed, rapid growth, ecosystems is transforming, transforming how complex
Comments:
Abstract:The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at this https URL.
6. 【2604.22195】Rethinking Semantic Collaborative Integration: Why Alignment Is Not Enough
Link: https://arxiv.org/abs/2604.22195
Authors: Maolin Wang, Dongze Wu, Jianing Zhou, Hongyu Chen, Beining Bao, Yu Jiang, Chenbin Zhang, Chang Wang, Jian Liu, Lei Sha
Subjects: Information Retrieval (cs.IR)
Keywords: Large language models, Large language, important semantic infrastructure, language models, infrastructure for modern
Comments: Accepted by SIGIR 2026
Abstract:Large language models (LLMs) have become an important semantic infrastructure for modern recommender systems. A prevailing paradigm integrates LLM-derived semantic embeddings with collaborative representations via representation alignment, implicitly assuming that the two views encode a shared latent entity and that stronger alignment yields better results. We formalize this assumption as the global low-complexity alignment hypothesis and argue that it is stronger than necessary and often structurally mismatched with real-world recommendation settings. We propose a complementary perspective in which semantic and collaborative representations are treated as partially shared yet fundamentally heterogeneous views, each containing both shared and view-specific factors. Under this shared-plus-private latent structure, enforcing global geometric alignment may distort local structure, suppress view-specific signals, and reduce informational diversity. To support this perspective, we develop complementarity-aware diagnostics that quantify overlap, unique-hit contribution, and theoretical fusion upper bounds. Empirical analyses on sparse recommendation benchmarks reveal low item-level agreement between semantic and collaborative views and substantial oracle fusion gains, indicating strong complementarity. Furthermore, controlled alignment probes show that low-capacity mappings capture only shared components and fail to recover full collaborative geometry, especially under distribution shift. These findings suggest that alignment should not be treated as the default integration principle. We advocate a shift from alignment-centric modeling to complementarity fusion-centric, complementarity-aware design, where shared factors are selectively integrated while private signals are preserved. This reframing provides a principled foundation for the next generation of LLM-enhanced recommender systems.
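The diagnostics named in this abstract (overlap, unique-hit contribution, and an oracle fusion upper bound) can be approximated with simple set operations over top-k lists. The sketch below is one plausible reading, not the authors' definitions: it assumes one held-out item per user, measures overlap as the Jaccard similarity of the two top-k lists, and treats the oracle as "a hit in either view". All names and data are hypothetical.

```python
from typing import Dict, List


def complementarity_report(
    semantic_topk: Dict[str, List[str]],
    collaborative_topk: Dict[str, List[str]],
    held_out: Dict[str, str],
) -> Dict[str, float]:
    """Average overlap, per-view hit rates, unique-hit rates, and oracle hit rate."""
    jaccard_sum = 0.0
    sem_hits = col_hits = sem_only = col_only = oracle_hits = 0
    for user, target in held_out.items():
        sem, col = set(semantic_topk[user]), set(collaborative_topk[user])
        union = sem | col
        jaccard_sum += len(sem & col) / len(union) if union else 0.0
        in_sem, in_col = target in sem, target in col
        sem_hits += in_sem
        col_hits += in_col
        sem_only += in_sem and not in_col   # hit found only by the semantic view
        col_only += in_col and not in_sem   # hit found only by the collaborative view
        oracle_hits += in_sem or in_col     # upper bound for any fusion of the two lists
    n = len(held_out)
    return {
        "mean_jaccard_overlap": jaccard_sum / n,
        "semantic_hit_rate": sem_hits / n,
        "collaborative_hit_rate": col_hits / n,
        "semantic_unique_hit_rate": sem_only / n,
        "collaborative_unique_hit_rate": col_only / n,
        "oracle_fusion_hit_rate": oracle_hits / n,
    }


if __name__ == "__main__":
    held_out = {"u1": "a", "u2": "b", "u3": "c"}
    semantic = {"u1": ["a", "x"], "u2": ["y", "z"], "u3": ["p", "q"]}
    collaborative = {"u1": ["x", "w"], "u2": ["b", "z"], "u3": ["q", "r"]}
    for key, value in complementarity_report(semantic, collaborative, held_out).items():
        print(f"{key}: {value:.3f}")
```

A large gap between the oracle hit rate and either single-view hit rate, together with low overlap, is the kind of pattern the abstract describes as strong complementarity.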
7. 【2604.22180】ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression
Link: https://arxiv.org/abs/2604.22180
Authors: Xiaojie Ke, Shuai Zhang, Liansheng Sun, Yongjin Wang, Hengjun Jiang, Xiangkun Liu, Cunxin Gu, Jian Xu, Guanjun Jiang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Large language model, Large language, language model, dominant paradigm, based listwise reranking
Abstract:Large language model (LLM) based listwise reranking has emerged as the dominant paradigm for achieving state-of-the-art ranking effectiveness in information retrieval. However, its reliance on feeding full passage texts into the LLM introduces two critical bottlenecks: the "lost in the middle" phenomenon degrades ranking quality as input length grows, and the inference latency scales super-linearly with sequence length, rendering it impractical for industrial deployment. In this paper, we present ResRank, a unified retrieval-reranking framework that fundamentally addresses both challenges. Inspired by multimodal LLMs that project visual inputs into compact token representations, ResRank employs an Encoder-LLM to compress each candidate passage into a single embedding, which is then fed alongside the query text into a Reranker-LLM for listwise ranking. To alleviate the misalignment between the compressed representation space and the ranking space, we introduce a residual connection structure that combines encoder embeddings with contextualized hidden states from the reranker. Furthermore, we replace the conventional autoregressive decoding with a one-step cosine-similarity-based scoring mechanism, eliminating the generation bottleneck entirely. ResRank is trained through a carefully designed dual-stage, multi-task, end-to-end joint optimization strategy that simultaneously trains the encoder and reranker, achieving learning objective alignment between retrieval and reranking while substantially reducing training complexity. Extensive experiments on TREC Deep Learning and eight BEIR benchmark datasets demonstrate that ResRank achieves competitive or superior ranking effectiveness compared to existing approaches while requiring zero generated tokens and processing only one token per passage, yielding a fundamentally better balance between effectiveness and efficiency.
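To make the scoring idea in the abstract above concrete, here is a minimal sketch of one-step cosine-similarity ranking over single-vector passage representations. It is not ResRank's implementation. The abstract does not say how the residual connection combines the two vectors or which query-side vector is compared against them, so the element-wise sum and the `query_vec` argument are assumptions, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_passages(query_vec, encoder_embs, reranker_states):
    """Rank passages in one scoring step, with no token generation.

    encoder_embs[i]    : compressed embedding of passage i (one vector per passage)
    reranker_states[i] : contextualized hidden state for passage i
    The residual combination is assumed to be an element-wise sum.
    Returns (ranking, scores), where ranking lists passage indices best-first.
    """
    scores = []
    for emb, state in zip(encoder_embs, reranker_states):
        fused = [e + h for e, h in zip(emb, state)]  # assumed residual connection
        scores.append(cosine(query_vec, fused))
    ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranking, scores
```

Because every passage contributes exactly one vector and the ranking comes from a single sort over similarity scores, the cost of scoring grows with the number of candidates rather than with passage length, which is the efficiency argument the abstract makes.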
8. 【2604.22170】Sharpness-Aware Poisoning: Enhancing Transferability of Injective Attacks on Recommender Systems
Link: https://arxiv.org/abs/2604.22170
Authors: Junsong Xie, Yonghui Yang, Pengyang Shao, Le Wu
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Keywords: Recommender Systems, limited fake user, fake user profiles, inject limited fake, worst-case victim model
Abstract:Recommender Systems (RS) have been shown to be vulnerable to injective attacks, where attackers inject limited fake user profiles to promote the exposure of target items to real users for unethical gains (e.g., economic or political advantages). Since attackers typically lack knowledge of the victim model deployed in the target RS, existing methods resort to using a fixed surrogate model to mimic the potential victim model. Despite considerable progress, we argue that the assumption that *poisoned data generated for the surrogate model can be used to attack other victim models* is wishful. When there are significant structural discrepancies between the surrogate and victim models, the attack transferability inevitably suffers. Intuitively, if we can identify the worst-case victim model and iteratively optimize the poisoning effect specifically against it, then the generated poisoned data would be better transferred to other victim models. However, exactly identifying the worst-case victim model during the attack process is challenging due to the large space of victim models. To this end, in this work, we propose a novel attack method called Sharpness-Aware Poisoning (SharpAP). Specifically, it employs the sharpness-aware minimization principle to seek the approximately worst-case victim model and optimizes the poisoned data specifically for this worst-case model. The poisoning attack with SharpAP is formulated as a min-max-min tri-level optimization problem. By integrating SharpAP into the iterative process for attacks, our method can generate more robust poisoned data which is less sensitive to the shift of model structure, mitigating the overfitting to the surrogate model. Comprehensive experimental comparisons on three real-world datasets demonstrate that SharpAP can significantly enhance the attack transferability.
9. 【2604.22169】ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation
Link: https://arxiv.org/abs/2604.22169
Authors: Peiyan Zhang, Hanmo Liu, Chengxuan Tong, Yuxia Wu, Wei Guo, Yong Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Generic group-based, usable learning signals, group-based RL assumes, usable learning, Generic
Abstract:Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at all. We propose ReCast, a repair-then-contrast learning-signal framework that first restores minimal learnability for all-zero groups and then replaces full-group reward normalization with a boundary-focused contrastive update on the strongest positive and the hardest negative. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width. Across multiple generative recommendation tasks, ReCast consistently outperforms OpenOneRec-RL, achieving up to 36.6% relative improvement in Pass@1. Its matched-budget advantage is substantially larger: ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale. The same design also yields direct system-level gains, reducing actor-side update time by 16.60x, lowering peak allocated memory by 16.5%, and improving actor MFU by 14.2%. Mechanism analysis shows that ReCast mitigates the persistent all-zero / single-hit regime, restores learnability when natural positives are scarce, and converts otherwise wasted rollout budget into more stable policy updates. These results suggest that, for generative recommendation, the decisive RL problem is not only how to assign rewards, but how to construct learnable optimization events from sparse, structured supervision.
10. 【2604.22100】Implementation and Privacy Guarantees for Scalable Keyword Search on SOLID-based Decentralized Data with Granular Visibility Constraints
Link: https://arxiv.org/abs/2604.22100
Authors: Mohamed Ragab, Faria Ferooz, Mohammad Bahrani, Helen Oliver, Thanassis Tiropanis, Alexandra Poulovassilis, Adriane Chapman, George Roussos
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: Solid-compliant server infrastructures, users retain sovereignty, hosted on Solid-compliant, data ecosystems grounded, online data stores
Abstract:In decentralized personal data ecosystems grounded in architectures such as Solid, users retain sovereignty over their data via personal online data stores (pods), hosted on Solid-compliant server infrastructures. In such environments, data remains under the control of pod owners, which complicates search due to distribution across numerous pods and user-specific access constraints. ESPRESSO is a decentralized framework for scalable keyword-based search across distributed Solid pods under user-defined visibility policies. It addresses key challenges of decentralized search by constructing WebID-scoped indexes within pods and employing privacy-aware metadata to enable efficient source selection and ranking across servers. This paper further introduces a formal threat model for ESPRESSO, analysing the security and privacy risks associated with the generation, aggregation, and use of indexes and metadata. These risks include unintended metadata leakage and the potential for adversaries to infer sensitive information about data that resides within personal data stores. The analysis identifies key design principles that limit metadata exposure while mitigating unauthorized inference. The proposed threat model provides a foundation for evaluating privacy-preserving decentralized search and informs the design of systems with stronger privacy guarantees.
11. 【2602.00208】Analyzing Shapley Additive Explanations to Understand Anomaly Detection Algorithm Behaviors and Their Complementarity
Link: https://arxiv.org/abs/2602.00208
Authors: Jordan Levy, Paul Saves, Moncef Garouani, Nicolas Verstaevel, Benoit Gaudou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Statistics Theory (math.ST); Machine Learning (stat.ML)
Keywords: challenging problem due, lack of labels, problem due, data distributions, anomaly
Comments: IDA Frontier Prize and Best Paper Award, Intelligent Data Analysis (IDA) 2026, Springer Nature
Abstract:Unsupervised anomaly detection is a challenging problem due to the diversity of data distributions and the lack of labels. Ensemble methods are often adopted to mitigate these challenges by combining multiple detectors, which can reduce individual biases and increase robustness. Yet building an ensemble that is genuinely complementary remains challenging, since many detectors rely on similar decision cues and end up producing redundant anomaly scores. As a result, the potential of ensemble learning is often limited by the difficulty of identifying models that truly capture different types of irregularities. To address this, we propose a methodology for characterizing anomaly detectors through their decision mechanisms. Using SHapley Additive exPlanations, we quantify how each model attributes importance to input features, and we use these attribution profiles to measure similarity between detectors. We show that detectors with similar explanations tend to produce correlated anomaly scores and identify largely overlapping anomalies. Conversely, explanation divergence reliably indicates complementary detection behavior. Our results demonstrate that explanation-driven metrics offer a different criterion than raw outputs for selecting models in an ensemble. However, we also demonstrate that diversity alone is insufficient; high individual model performance remains a prerequisite for effective ensembles. By explicitly targeting explanation diversity while maintaining model quality, we are able to construct ensembles that are more diverse, more complementary, and ultimately more effective for unsupervised anomaly detection.
Computer Vision
1. 【2604.22739】Inter-Stance: A Dyadic Multimodal Corpus for Conversational Stance Analysis
Link: https://arxiv.org/abs/2604.22739
Authors: Xiang Zhang, Xiaotian Li, Taoyue Wang, Nan Bi, Xin Zhou, Cody Zhou, Zoie Wang, Andrew Yang, Yuming Su, Jeff Cohn, Qiang Ji, Lijun Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: attaching social meaning, facial expressions, Social interactions dominate, spontaneous as gestures, dominate our perceptions
Abstract:Social interactions dominate our perceptions of the world and shape our daily behavior by attaching social meaning to acts as simple and spontaneous as gestures, facial expressions, voice, and speech. People mimic and otherwise respond to each other's postures, facial expressions, mannerisms, and other verbal and nonverbal behavior, and form appraisals or evaluations in the process. Yet, no publicly available dataset includes multimodal recordings and self-report measures of multiple persons in social interaction. Dyadic recordings and annotation are lacking. We present a new data corpus of multimodal dyadic interaction (45 dyads, 90 persons) that includes synchronized multi-modality behavior (2D face video, 3D face geometry, thermal spectrum dynamics, voice and speech behavior, and physiology: PPG, EDA, heart rate, blood pressure, and respiration) and self-reported affect of all participants in a communicative interaction scenario. Two types of dyads are included: persons with shared past history and strangers. Annotations include social signals, agreement, disagreement, and neutral stance. With a potent emotion induction, these multimodal data will enable novel modeling of multimodal interpersonal behavior. We present extensive experiments to evaluate multimodal dyadic communication of dyads with and without interpersonal history, and their affect. This new database will enable multimodal modeling of social interaction that was not possible before. The dataset includes 20TB of multimodal data to share with the research community.
2. 【2604.22714】Long-tail Internet photo reconstruction
Link: https://arxiv.org/abs/2604.22714
Authors: Yuan Li, Yuanbo Xiangli, Hadar Averbuch-Elor, Noah Snavely, Ruojin Cai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: photo collections exhibit, Internet photo collections, extremely long-tailed distribution, uneven imagery, classical and learned
Comments: Project page: [this https URL](https://megadepth-x.github.io/)
Abstract:Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented with sparse, noisy, uneven imagery beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, and also enables more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.
3. 【2604.22700】Generative Modeling of Neurodegenerative Brain Anatomy with 4D Longitudinal Diffusion Model
Link: https://arxiv.org/abs/2604.22700
Authors: Nivetha Jayakumar, Swakshar Deb, Bahram Jafrasteh, Qingyu Zhao, Miaomiao Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Understanding and predicting, neurodegenerative diseases remains, early diagnosis, treatment planning, remains a major
Abstract:Understanding and predicting the progression of neurodegenerative diseases remains a major challenge in medical AI, with significant implications for early diagnosis, disease monitoring, and treatment planning. However, most available longitudinal neuroimaging datasets are temporally sparse, with only a few follow-up scans per subject. This scarcity of temporal data limits our ability to model and accurately capture the continuous anatomical changes related to disease progression in individual subjects. To address this problem, we propose a novel 4D (3DxT) diffusion-based generative framework that effectively models and synthesizes longitudinal brain anatomy over time, conditioned on available clinical variables such as health status, age, sex, and other relevant factors. Moreover, while most current approaches focus on manipulating image intensity or texture, our method explicitly learns the data distribution of topology-preserving spatiotemporal deformations to effectively capture the geometric changes of brain structures over time. This design enables the realistic generation of future anatomical states and the reconstruction of anatomically consistent disease trajectories, providing a more faithful representation of longitudinal brain changes. We validate our model through both synthetic sequence generation and downstream longitudinal disease classification, as well as brain segmentation. Experiments on two large-scale longitudinal neuroimage datasets demonstrate that our method outperforms state-of-the-art baselines in generating anatomically accurate, temporally consistent, and clinically meaningful brain trajectories. Our code is available on GitHub.
4. 【2604.22686】SS3D: End2End Self-Supervised 3D from Web Videos
Link: https://arxiv.org/abs/2604.22686
Authors: Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: web-scale SfM-based self-supervision, pipeline for feed-forward, estimation from monocular, web-scale SfM-based, SfM-based self-supervision pretraining
Abstract:We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
5. 【2604.22658】PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
Link: https://arxiv.org/abs/2604.22658
Authors: Jiaxin Shi, Guofeng Zhang, Wufei Ma, Naifu Liang, Adam Kortylewski, Alan Vuile
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fundamental yet challenging, challenging task, increasingly important, shape retrieval, Single-view
Abstract:Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR outperforms existing methods by a wide margin on both clean and occluded 3D shape retrieval datasets. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
6. 【2604.22657】A Non-Invasive Alternative to RFID: Self-Sufficient 3D Identification of Group-Housed Livestock
Link: https://arxiv.org/abs/2604.22657
Authors: Shiva Paudel, TsungCheng Tsai, Dongyi Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: precision livestock management, Radio Frequency Identification, Accurate identification, cornerstone of precision, Adaptive Recognition Architecture
Abstract:Accurate identification of individual farm animals in group-housed environments is a cornerstone of precision livestock management. However, current industry standards rely heavily on Radio Frequency Identification (RFID) ear tags, which are invasive, prone to loss, and restricted by the spatial limitations of antenna fields. In this paper, we propose a non-intrusive, vision-based identification system leveraging 3D point cloud data captured within a commercial electronic feeding station (EFS). Departing from traditional supervised frame-level inference, we introduce the Temporal Adaptive Recognition Architecture (TARA), a self-sufficient, semi-supervised framework designed to maintain identity consistency over time. TARA employs a dynamic recalibration mechanism that updates individual identity profiles to account for morphological changes in the livestock. To facilitate training in label-scarce environments, we utilize a visit-level majority voting strategy to generate high-fidelity pseudo-labels from raw temporal sequences. Experimental results on a group-housed sow dataset collected from an operational commercial barn demonstrate that our approach achieves 100% identification accuracy at the visit level. These results suggest that vision-based 3D point cloud analysis offers a robust, superior alternative to RFID-based systems, paving the way for fully autonomous individual animal monitoring.
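The visit-level majority voting mentioned in the abstract above can be sketched as follows. This is a generic illustration, not TARA's code: the function name `visit_pseudo_label` and the `min_agreement` threshold for discarding ambiguous visits are assumptions that the abstract does not describe.

```python
from collections import Counter


def visit_pseudo_label(frame_predictions, min_agreement=0.5):
    """Collapse per-frame identity predictions from one feeder visit into one label.

    frame_predictions: list of predicted identities, one per frame of the visit.
    min_agreement: hypothetical threshold; the winning identity must be predicted
                   on more than this fraction of frames, otherwise the visit
                   gets no pseudo-label.
    Returns (label_or_None, agreement_fraction).
    """
    if not frame_predictions:
        return None, 0.0
    label, votes = Counter(frame_predictions).most_common(1)[0]
    agreement = votes / len(frame_predictions)
    return (label if agreement > min_agreement else None), agreement
```

For example, `visit_pseudo_label(["sow_12", "sow_12", "sow_07", "sow_12"])` returns `("sow_12", 0.75)`, and every frame of that visit could then be relabeled `sow_12` for semi-supervised training.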
7. 【2604.22649】Structure-Guided Diffusion Model for EEG-Based Visual Cognition Reconstruction
Link: https://arxiv.org/abs/2604.22649
Authors: Yongxiang Lian, Yueyang Cang, Pingge Hu, Yuchen He, Li Shi
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
Keywords: brain-computer interface, important problem, problem in neuroscience, neuroscience and brain-computer, EEG
Abstract:Objective: Decoding visual information from electroencephalography (EEG) is an important problem in neuroscience and brain-computer interface (BCI) research. Existing methods are largely restricted to natural images and categorical representations, with limited capacity to capture structural features and to differentiate objective perception from subjective cognition. We propose a Structure-Guided Diffusion Model (SGDM) that incorporates explicit structural information for EEG-based visual reconstruction. Approach: SGDM is evaluated on the Kilogram abstract visual object dataset and the THINGS natural image dataset using a two-stage generative mechanism. The framework combines a structurally supervised variational autoencoder with a spatiotemporal EEG encoder aligned to a visual embedding space via contrastive learning. Structural information is integrated into a diffusion model through ControlNet to guide image generation from EEG features. Results: SGDM outperforms existing methods on both abstract and natural image datasets. Reconstructed images achieve higher fidelity in low-level visual features and semantic representations, indicating improved decoding accuracy and strong generalization across diverse visual domains. Spatiotemporal analysis of EEG signals further reveals hierarchical structural encoding patterns, consistent with the neural dynamics of visual cognition. Significance: These findings validate the effectiveness of SGDM in capturing explicit structural geometry and generating images with high fidelity to individual cognitive representations. By enabling decoding of complex visual content from EEG signals, the framework extends neural decoding beyond low-dimensional or categorical outputs. This supports BCIs with increased degrees of freedom for intention decoding and more flexible brain-to-machine communication.
8. 【2604.22595】EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges
Link: https://arxiv.org/abs/2604.22595
Authors: Hyo Jin Jon, Longbin Jin, Eun Yi Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: natural language supervision, demonstrated strong generalization, action recognition, video action recognition, language supervision
Comments: 14 pages, 8 figures, 6 tables
Abstract:CLIP has demonstrated strong generalization in visual domains through natural language supervision, even for video action recognition. However, most existing approaches that adapt CLIP for action recognition have primarily focused on temporal modeling, often overlooking spatial perception. In real-world scenarios, visual challenges such as low-light environments or egocentric viewpoints can severely impair spatial understanding, an essential precursor for effective temporal reasoning. To address this limitation, we propose Efficient Visual Prompting for CLIP (EV-CLIP), an efficient adaptation framework designed for few-shot video action recognition across diverse scenes and viewpoints. EV-CLIP introduces two visual prompts: mask prompts, which guide the model's attention to action-relevant regions by reweighting pixels, and context prompts, which perform lightweight temporal modeling by compressing frame-wise features into a compact representation. For a comprehensive evaluation, we curate five benchmark datasets and analyze domain shifts to quantify the influence of diverse visual and semantic factors on action recognition. Experimental results demonstrate that EV-CLIP outperforms existing parameter-efficient methods in overall performance. Moreover, its efficiency remains independent of the backbone scale, making it well-suited for deployment in real-world, resource-constrained scenarios. The code is available at this https URL.
9. 【2604.22586】FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing
Link: https://arxiv.org/abs/2604.22586
Authors: Ze Chen, Lan Chen, Yuanhang Li, Qi Mao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: training-free framework, framework for stable, editing, editing signal, Spatial-aware Attention Refinement
Comments: Under review
Abstract:We propose FlowAnchor, a training-free framework for stable and efficient inversion-free, flow-based video editing. Inversion-free editing methods have recently shown impressive efficiency and structure preservation in images by directly steering the sampling trajectory with an editing signal. However, extending this paradigm to videos remains challenging, often failing in multi-object scenes or with increased frame counts. We identify the root cause as the instability of the editing signal in high-dimensional video latent spaces, which arises from imprecise spatial localization and length-induced magnitude attenuation. To overcome this challenge, FlowAnchor explicitly anchors both where to edit and how strongly to edit. It introduces Spatial-aware Attention Refinement, which enforces consistent alignment between textual guidance and spatial regions, and Adaptive Magnitude Modulation, which adaptively preserves sufficient editing strength. Together, these mechanisms stabilize the editing signal and guide the flow-based evolution toward the desired target distribution. Extensive experiments demonstrate that FlowAnchor achieves more faithful, temporally coherent, and computationally efficient video editing across challenging multi-object and fast-motion scenarios. The project page is available at this https URL.
10. 【2604.22562】Data-Free Contribution Estimation in Federated Learning using Gradient von Neumann Entropy
Link: https://arxiv.org/abs/2604.22562
Authors: Asim Ukaye, Mubarak Abdu-Aguye, Nurbek Tastan, Karthik Nandakumar
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Keywords: Federated Learning, providing fair rewards, identifying clients' importance, fair rewards, identifying clients'
Comments: 10 pages, 4 figures, 4 pages Appendix, 6 figures in Appendix. To appear in CVPR 2026 FedVision Workshop
Abstract:Client contribution estimation in Federated Learning is necessary for identifying clients' importance and for providing fair rewards. Current methods often rely on server-side validation data or self-reported client information, which can compromise privacy or be susceptible to manipulation. We introduce a data-free signal based on the matrix von Neumann (spectral) entropy of the final-layer updates, which measures the diversity of the information contributed. We instantiate two practical schemes: (i) SpectralFed, which uses normalized entropy as aggregation weights, and (ii) SpectralFuse, which fuses entropy with class-specific alignment via a rank-adaptive Kalman filter for per-round stability. Across CIFAR-10/100 and the naturally partitioned FEMNIST and FedISIC benchmarks, entropy-derived scores show a consistently high correlation with standalone client accuracy under diverse non-IID regimes - without validation data or client metadata. We compare our results with data-free contribution estimation baselines and show that spectral entropy serves as a useful indicator of client contribution.
11. 【2604.22560】Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors
Link: https://arxiv.org/abs/2604.22560
Authors: Gautam Kumar Jain, Carsten Markgraf, Julian Stähler
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Graph Visual Question, Visual Question Answering, Graph Visual, Question Answering, Visual Question
Comments: 16 pages, 8 figures, 8 tables, preprint
Abstract:Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage's input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.
12. 【2604.22554】Video Analysis and Generation via a Semantic Progress Function
Link: https://arxiv.org/abs/2604.22554
Authors: Gal Metzer, Sagi Polaczek, Ali Mahdavi-Amiri, Raja Giryes, Daniel Cohen-Or
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: highly non-linear manner, abrupt semantic jumps, Transformations produced, Semantic Progress Function, video generation models
Comments: SIGGRAPH 2026
Abstract:Transformations produced by image and video generation models often evolve in a highly non-linear manner: long stretches where the content barely changes are followed by sudden, abrupt semantic jumps. To analyze and correct this behavior, we introduce a Semantic Progress Function, a one-dimensional representation that captures how the meaning of a given sequence evolves over time. For each frame, we compute distances between semantic embeddings and fit a smooth curve that reflects the cumulative semantic shift across the sequence. Departures of this curve from a straight line reveal uneven semantic pacing. Building on this insight, we propose a semantic linearization procedure that reparameterizes (or retimes) the sequence so that semantic change unfolds at a constant rate, yielding smoother and more coherent transitions. Beyond linearization, our framework provides a model-agnostic foundation for identifying temporal irregularities, comparing semantic pacing across different generators, and steering both generated and real-world video sequences toward arbitrary target pacing.
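The progress-curve and retiming idea in the abstract above can be sketched in a few lines. This is a simplified illustration, not the authors' method: it assumes Euclidean distance between consecutive frame embeddings, skips the smooth curve fit the abstract mentions, and retimes by piecewise-linear inversion of the cumulative curve. The names `semantic_progress` and `linearize` are hypothetical.

```python
import math


def semantic_progress(embeddings):
    """Cumulative semantic change per frame, normalized to run from 0 to 1.

    embeddings: list of per-frame embedding vectors (at least two frames).
    """
    steps = [math.dist(a, b) for a, b in zip(embeddings, embeddings[1:])]
    total = sum(steps)
    progress = [0.0]
    for step in steps:
        progress.append(progress[-1] + (step / total if total else 0.0))
    return progress


def linearize(progress, num_out):
    """Return fractional source-frame times at which progress is evenly spaced.

    Sampling (or interpolating) the source sequence at these times yields a
    sequence whose semantic change unfolds at a roughly constant rate.
    """
    times = []
    j = 0
    last = len(progress) - 1
    for k in range(num_out):
        target = k / (num_out - 1) if num_out > 1 else 0.0
        # Advance to the source interval that contains the target progress value.
        while j < last - 1 and progress[j + 1] < target:
            j += 1
        span = progress[j + 1] - progress[j]
        frac = (target - progress[j]) / span if span > 0 else 0.0
        times.append(j + min(max(frac, 0.0), 1.0))
    return times
```

With one-dimensional toy embeddings `[[0.0], [1.0], [2.0], [10.0]]`, most of the semantic change falls in the last interval, so `linearize` places most of its output samples between source frames 2 and 3. That is the uneven pacing the linearization is meant to correct.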
13. 【2604.22552】Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models
Link: https://arxiv.org/abs/2604.22552
Authors: Shihui Yan, Ziqi Zhou, Yufei Song, Yifan Hu, Minghui Li, Shengshan Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: severe safety risks, autonomous driving systems, creating severe safety, critically threaten pedestrian, attacks critically threaten
Comments:
Abstract:Physical adversarial patch attacks critically threaten pedestrian detection, causing surveillance and autonomous driving systems to miss pedestrians and creating severe safety risks. Despite their effectiveness in controlled settings, existing physical attacks face two major limitations in practice: they lack systematic disruption of the multi-stage decision pipeline, enabling residual modules to offset perturbations, and they fail to model complex physical variations, leading to poor robustness. To overcome these limitations, we propose a novel pedestrian adversarial patch generation method that combines multi-stage collaborative attacks with robustness enhancement under physical diversity, called TriPatch. Specifically, we design a triplet loss consisting of detection confidence suppression, bounding-box offset amplification, and non-maximum suppression (NMS) disruption, which jointly act across different stages of the detection pipeline. In addition, we introduce an appearance consistency loss to constrain the color distribution of the patch, thereby improving its adaptability under diverse imaging conditions, and incorporate data augmentation to further enhance robustness against complex physical perturbations. Extensive experiments demonstrate that TriPatch achieves a higher attack success rate across multiple detector models compared to existing approaches.
14. 【2604.22546】ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation
Link: https://arxiv.org/abs/2604.22546
Authors: Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: describe visual scenes, flexible relation phrases, fixed predicate set, aims to describe, scene graph generation
Comments:
Abstract:Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
15. 【2604.22539】Evolving Thematic Map Design in Academic Cartography: A Thirty-Year Study Based on Multilingual Journals
Link: https://arxiv.org/abs/2604.22539
Authors: Zhiwei Wei, Chenxi Song, Tazhu Wang, Fan Wu, Hua Liao, Su Ding, Nai Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
Keywords: Thematic maps play, thematic map design, examined empirically, play a central, central role
Comments:
Abstract:Thematic maps play a central role in academic communication, yet their large-scale design evolution has rarely been examined empirically. This study presents a longitudinal and multilingual analysis of thematic map design practices in academic cartography from 1990 to 2020. We compile a corpus of 45,732 research articles from sixteen authoritative Chinese- and English-language journals and extract 23,928 maps using computer vision and large-model-based document parsing to build a structured dataset. Map design characteristics are quantified across three dimensions: map elements, color design, and layout structure. Results show that Chinese- and English-language academic maps share highly similar structural conventions, typically employing restrained color palettes with neutral dominant hues, low saturation, high brightness, and limited hue diversity, as well as centered layouts with high main-map occupation ratios. Differences exist in that English-language maps show slightly greater hue richness and compactness, whereas Chinese-language maps historically rely more on neutral hues and integrated layouts. Temporal analysis reveals parallel evolutionary trends in both groups, including increasing element richness, legend usage, and hue diversity, alongside stable layout structures. Overall, the findings suggest that academic map design evolution is characterized more by institutional convergence than cultural divergence.
16. 【2604.22529】Distilling Vision Transformers for Distortion-Robust Representation Learning
Link: https://arxiv.org/abs/2604.22529
Authors: Konstantinos Alexis, Giorgos Giannopoulos, Dimitrios Gunopulos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable success, Self-supervised learning, learning visual representations, learning visual, achieved remarkable
Comments:
Abstract:Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this paper, we demonstrate that pretrained vision models can be leveraged to learn distortion-robust representations, which can then be effectively applied to downstream tasks operating on distorted observations. In particular, we propose an asymmetric knowledge distillation framework in which both teacher and student are initialized from the same pretrained Vision Transformer but receive different views of each image: the teacher processes clean images, while the student sees their distorted versions. We introduce multi-level distillation that aligns global embeddings, patch-level features, and attention maps and show that the student is able to approximate clean-image representations despite never directly accessing clean data. We evaluate our approach on image classification tasks across several datasets and under various distortions, consistently outperforming existing alternatives for the same amount of human supervision.
17. 【2604.22518】Non-Minimal Sampling and Consensus for Prohibitively Large Datasets
Link: https://arxiv.org/abs/2604.22518
Authors: Seong Hun Lee, Patrick Vandewalle, Javier Civera
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Sampling and Consensus, arbitrarily large datasets, large datasets contaminated, Non-Minimal Sampling, arbitrarily large
Comments:
Abstract:We introduce NONSAC (Non-Minimal Sampling and Consensus), a general framework for robust and scalable model estimation from arbitrarily large datasets contaminated with noise and outliers. NONSAC repeatedly samples non-minimal subsets of data and generates model hypotheses using a robust estimator, producing multiple candidate models. The final model is selected based on a predefined scoring rule that evaluates hypothesis quality. Our framework is estimator-agnostic and can be integrated with existing geometric fitting algorithms such as RANSAC to improve both scalability and robustness to outliers. We propose and evaluate various scoring rules for NONSAC on relative camera pose estimation, Perspective-n-Point, and point cloud registration. Furthermore, we showcase the applicability of NONSAC to correspondence-free point cloud registration by hypothesizing all-to-all correspondences.
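The sample-estimate-score loop described in this abstract can be sketched on a toy problem: estimating a 1D location from data contaminated with outliers. This is an illustration under stated assumptions, not the paper's method: the robust estimator is a plain median, the scoring rule (inlier count on a fixed random scoring sample) is one hypothetical choice among the rules the paper evaluates, and `nonsac_location` and all its parameters are made-up names.

```python
import random
import statistics


def nonsac_location(data, subset_size=200, num_hypotheses=20,
                    score_sample=500, tol=0.5, seed=0):
    """Pick the best of several hypotheses, each fitted on a non-minimal subset.

    Only small random subsets are ever touched, so the cost does not grow
    with len(data) beyond the sampling itself.
    """
    rng = random.Random(seed)
    # Fixed scoring sample (assumption): hypotheses are compared on the same points.
    scoring = rng.sample(data, min(score_sample, len(data)))
    best_model, best_score = None, -1
    for _ in range(num_hypotheses):
        subset = rng.sample(data, min(subset_size, len(data)))
        model = statistics.median(subset)  # stand-in robust estimator
        score = sum(abs(x - model) <= tol for x in scoring)  # inlier count
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```

In a geometric setting, the median would be replaced by an existing robust fitter (the abstract mentions RANSAC as one option) run on each subset.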
18. 【2604.22515】Different Strokes for Different Folks: Writer Identification for Historical Arabic Manuscripts
Link: https://arxiv.org/abs/2604.22515
Authors: Hamza A. Abushahla, Ariel Justine N. Panopio, Layth Al-Khairulla, Mohamed I. AlHajri
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Handwritten Arabic manuscripts, Arab world intellectual, Arabic manuscripts preserve, historical Arabic manuscripts, Handwritten Arabic
Comments: 29 pages, 13 figures, 31 tables
Abstract:Handwritten Arabic manuscripts preserve the Arab world's intellectual and cultural heritage, and writer identification supports provenance, authenticity verification, and historical analysis. Using the Muharaf dataset of historical Arabic manuscripts, we evaluate writer identification from individual line images and, to the best of our knowledge, provide the first baselines reported under both line-level and page-disjoint evaluation protocols. Since the dataset is only partially labeled for writer identification, we manually verified and expanded writer labels in the public portion from 6,858 (28.00%) to 21,249 lines (86.75%) out of 24,495 line images, correcting inconsistencies and removing non-handwritten text. After further filtering, we retained 18,987 lines (77.51%). We propose a Convolutional Neural Network (CNN)-based model with attention mechanisms for closed-set writer identification, including rare two-writer lines modeled as composite writer-pair classes. We benchmark fourteen configurations and conduct ablations across different feature extractors and training regimes. To assess generalization to unseen pages, the page-disjoint protocol assigns all lines from each page to a single split. Under the line-level protocol, a fine-tuned DenseNet201 with attention achieves 99.05% Top-1 accuracy, 99.73% Top-5 accuracy, and 97.44% F1-score. Under the more challenging page-disjoint protocol, the best observed results are 78.61% Top-1 accuracy, 87.79% Top-5 accuracy, and 66.55% F1-score, thus quantifying the impact of page-level cues. By expanding the Muharaf dataset's labeled subset and reporting both protocols, we provide a clearer benchmark and a practical resource for historians and linguists engaged with culturally and historically significant documents. The code and implementation details are available on GitHub.
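The page-disjoint protocol mentioned in this abstract (all lines from a page go to a single split) can be sketched as a grouped split. The record layout `(line_id, page_id, writer_id)`, the function name, and the 20% test fraction are assumptions for illustration; the paper's actual split ratios are not given here.

```python
import random
from collections import defaultdict


def page_disjoint_split(lines, test_fraction=0.2, seed=0):
    """Split line records so that no page contributes to both train and test.

    lines: iterable of (line_id, page_id, writer_id) tuples (hypothetical layout).
    """
    by_page = defaultdict(list)
    for record in lines:
        by_page[record[1]].append(record)
    pages = sorted(by_page)
    random.Random(seed).shuffle(pages)
    n_test = max(1, round(len(pages) * test_fraction))
    test_pages = set(pages[:n_test])
    train = [r for p in pages if p not in test_pages for r in by_page[p]]
    test = [r for p in pages if p in test_pages for r in by_page[p]]
    return train, test
```

Because the task is closed-set, a practical split would also need every test writer to appear in training; this sketch does not enforce that.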
19. 【2604.22507】Railway Artificial Intelligence Learning Benchmark (RAIL-BENCH): A Benchmark Suite for Perception in the Railway Domain
Link: https://arxiv.org/abs/2604.22507
Authors: Annika Bätz, Pavel Klasek, Seo-Young Ham, Philipp Neumaier, Martin Köppel, Martin Lauer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Automated train operation, infrastructure requires robust, requires robust camera-based, enable reproducible comparison, Automated train
Comments: 8 pages, 5 figures, 5 tables, submitted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems
Abstract:Automated train operation on existing railway infrastructure requires robust camera-based perception, yet the railway domain lacks public benchmark suites with standardized evaluation protocols that would enable reproducible comparison of approaches. We present RAIL-BENCH, the first perception benchmark suite for the railway domain. It comprises five challenges - rail track detection, object detection, vegetation segmentation, multi-object tracking, and monocular visual odometry - each tailored to the specific characteristics of railway environments. RAIL-BENCH provides curated training and test datasets drawn from diverse real-world scenarios, evaluation metrics, and public scoreboards (this https URL). For the rail track detection challenge we introduce LineAP, a novel segment-based average precision metric that evaluates the geometric accuracy of polyline predictions independently of instance-level grouping, addressing key limitations of existing line detection metrics.
20. 【2604.22506】ICPR 2026 Competition on Low-Resolution License Plate Recognition
Link: https://arxiv.org/abs/2604.22506
Authors: Rayson Laroca, Valfride Nascimento, Donggun Kim, Sanghyeok Chung, Subin Bae, Uihwan Seo, Seungsang Oh, Chi M. Phung, Minh G. Vo, Xingsong Ye, Yongkun Du, Yuchen Su, Zhineng Chen, Sunhee Heo, Hyangwoo Lee, Kihyun Na, Khanh V. Vu Nguyen, Sang T. Pham, Duc N. N. Phung, Trong P. Le, Vy N. Vo Tran, David Menotti
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: License Plate Recognition, Low-Resolution License Plate, license plate legibility, degrade license plate, severely degrade license
Comments: Accepted for presentation at the International Conference on Pattern Recognition (ICPR) 2026
Abstract:Low-Resolution License Plate Recognition (LRLPR) remains a challenging problem in real-world surveillance scenarios, where long capture distances, compression artifacts, and adverse imaging conditions can severely degrade license plate legibility. To promote progress in this area, we organized the ICPR 2026 Competition on Low-Resolution License Plate Recognition, the first competition specifically dedicated to LRLPR using real low-quality data collected under operationally relevant conditions. The competition was based on the LRLPR-26 dataset, which comprises 20,000 training tracks and 3,000 test tracks; each training track contains five low-resolution and five high-resolution images of the same license plate. Notably, a total of 269 teams from 41 countries registered for the competition, and 99 teams submitted valid entries in the Blind Test Phase. The winning team achieved a Recognition Rate of 82.13%, and four teams surpassed the 80% mark, highlighting both the high level of competition at the top of the leaderboard and the continued difficulty of the task. In addition to presenting the competition design, evaluation protocol, and main results, this paper summarizes the methods adopted by the top-5 teams and discusses current trends and promising directions for future research on LRLPR. The competition webpage is available at this https URL
21. 【2604.22498】CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
Link: https://arxiv.org/abs/2604.22498
Authors: Lihao Zheng, Zhenwei Shao, Yu Zhou, Yan Yang, Xintian Shen, Jiawei Chen, Hao Ma, Tao Wei
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Multimodal Large Language, Large Language, face notable challenges, exhibiting spatial hallucination
Comments:
Abstract:Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (abbr. CGC), a low-cost full framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).
22. 【2604.22482】Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond
Link: https://arxiv.org/abs/2604.22482
Authors: Jing Ou, Zidong Cao, Yinrui Ren, Zhuoxiao Li, Jinjing Zhu, Tongyan Hua, Shuai Zhang, Hui Xiong, Wufan Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: exhibit degraded performance, advanced rapidly, spherical distortions, exhibit degraded, degraded performance
Comments:
Abstract:While feed-forward 3D reconstruction models have advanced rapidly, they still exhibit degraded performance on panoramas due to spherical distortions. Moreover, existing panoramic 3D datasets are predominantly collected with 360 cameras fixed at discrete locations, resulting in discontinuous trajectories. These limitations critically hinder the development of panoramic feed-forward 3D reconstruction, especially for the multi-view setting. In this paper, we present Holo360D, a comprehensive dataset containing 109,495 panoramas paired with registered point clouds, meshes, and aligned camera poses. To our knowledge, Holo360D is the first large-scale dataset that provides continuous panoramic sequences with accurately aligned high-completeness depth maps. The raw data are initially collected using a 3D laser scanner coupled with a 360 camera. Subsequently, the raw data are processed with both online and offline SLAM systems. Furthermore, to enhance the 3D data quality, a post-processing pipeline tailored for the 360 dataset is proposed, including geometry denoising, mesh hole filling, and region-specific remeshing. Finally, we establish a new benchmark by fine-tuning 3D reconstruction models on Holo360D, providing key insights into effective fine-tuning strategies. Our results demonstrate that Holo360D delivers superior training signals and provides a comprehensive benchmark for advancing panoramic 3D reconstruction models. Datasets and Code will be made publicly available.
23. 【2604.22479】Improving Driver Drowsiness Detection via Personalized EAR/MAR Thresholds and CNN-Based Classification
Link: https://arxiv.org/abs/2604.22479
Authors: Gökdeniz Ersoy, Mehmet Alper Tatar, Eray Tonbul, Serap Kırbız
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: Mouth Aspect Ratio, traffic accidents worldwide, Eye Aspect Ratio, Aspect Ratio, accidents worldwide
Comments:
Abstract:Driver drowsiness is a major cause of traffic accidents worldwide, posing a serious threat to public safety. Vision-based driver monitoring systems often rely on fixed Eye Aspect Ratio (EAR) and Mouth Aspect Ratio (MAR) thresholds; however, such fixed values frequently fail to generalize across individuals due to variations in facial structure, illumination, and driving conditions. This paper proposes a personalized driver drowsiness detection system that monitors eyelid movements, head position, and yawning behavior in real time and provides warnings when signs of fatigue are detected. The system employs driver-specific EAR and MAR thresholds, calibrated before driving, to improve classical metric-based detection. In addition, deep learning-based Convolutional Neural Network (CNN) models are integrated to enhance accuracy in challenging scenarios. The system is evaluated using publicly available datasets as well as a custom dataset collected under diverse lighting conditions, head poses, and user characteristics. Experimental results show that personalized thresholding improves detection accuracy by 2-3% compared to fixed thresholds, while CNN-based classification achieves 99.1% accuracy for eye state detection and 98.8% for yawning detection, demonstrating the effectiveness of combining classical metrics with deep learning for robust real-time driver monitoring.
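For readers unfamiliar with the metric, the sketch below computes the Eye Aspect Ratio from six eye landmarks using the commonly used formula EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|), and derives a driver-specific threshold from a short open-eye calibration recording. The calibration rule (a fixed fraction of the median open-eye EAR) and the `ratio=0.75` value are assumptions for illustration; the abstract does not state how the thresholds are calibrated.

```python
import math
import statistics


def eye_aspect_ratio(p):
    """EAR from six (x, y) landmarks ordered p1..p6.

    p1 and p4 are the eye corners; (p2, p6) and (p3, p5) are the
    upper/lower eyelid pairs.
    """
    vertical = math.dist(p[1], p[5]) + math.dist(p[2], p[4])
    horizontal = math.dist(p[0], p[3])
    return vertical / (2.0 * horizontal)


def calibrate_threshold(open_eye_ears, ratio=0.75):
    """Driver-specific threshold: a fraction of the median open-eye EAR.

    ratio is a hypothetical tuning value, not taken from the paper.
    """
    return ratio * statistics.median(open_eye_ears)


def is_eye_closed(ear, threshold):
    """An eye counts as closed when its EAR drops below the threshold."""
    return ear < threshold
```

The same pattern applies to the Mouth Aspect Ratio, with mouth landmarks in place of eye landmarks and yawning detected when the ratio rises above a calibrated threshold.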
24. 【2604.22477】Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples
Link: https://arxiv.org/abs/2604.22477
Authors: Oussama Bouanani, Jim Berend, Wojciech Samek, Sebastian Lapuschkin, Maximilian Dreyer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: assigns textual descriptions, labeling assigns textual, deep networks, assigns textual, textual descriptions
Comments:
Abstract:Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Prior work such as FALCON introduced contrastive examples (inputs that are semantically similar to activating examples but elicit low activations) to sharpen explanations, but it primarily addresses subspace-level interpretability rather than scalable neuron-level labeling. We revisit contrastive explanations for neuron-level labeling in two stages: (1) candidate label generation with vision language models (VLMs) and (2) label assignment with CLIP-like encoders. First, we show that providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Second, we introduce Contrastive Semantic Projection (CSP), an extension of SemanticLens that incorporates contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines. Our results demonstrate that contrastive examples are a simple yet powerful and currently underutilized component of neuron labeling and analysis pipelines.
25. 【2604.22476】All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams
Link: https://arxiv.org/abs/2604.22476
Authors: Marco Pegoraro, Jonas Seng, Dustin Heller, Wil M.P. van der Aalst, Kristian Kersting
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: mining aid organizations, business process management, recorded event data, aid organizations, organizations by discovering
Comments: 17 pages, 6 figures, 1 table, 23 references
Abstract:Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.
26. 【2604.22439】NRGS: Neural Regularization for Robust 3D Semantic Gaussian Splatting
Link: https://arxiv.org/abs/2604.22439
Authors: Zaiyan Yang, Xinpeng Liu, Heng Guo, Jinglei Shi, Zhanyu Ma, Fumio Okura
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: semantic Gaussian Splatting, neural regularization method, Gaussian Splatting, propose a neural, neural regularization
Comments:
Abstract:We propose a neural regularization method that refines the noisy 3D semantic field produced by lifting multi-view inconsistent 2D features, in order to obtain an accurate and robust 3D semantic Gaussian Splatting. The 2D features extracted from vision foundation models suffer from multi-view inconsistency due to a lack of cross-view constraints. Lifting these inconsistent features directly into 3D Gaussians results in a noisy semantic field, which degrades the performance of downstream tasks. Previous methods either focus on obtaining consistent multi-view features in the preprocessing stage or aim to mitigate noise through improved optimization strategies, often at the cost of increased preprocessing time or expensive computational overhead. In contrast, we introduce a variance-aware conditional MLP that operates directly on the 3D Gaussians, leveraging their geometric and appearance attributes to correct semantic errors in 3D space. Experiments on different datasets show that our method enhances the accuracy of lifted semantics, providing an efficient and effective approach to robust 3D semantic Gaussian Splatting.
27. 【2604.22409】SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
Link: https://arxiv.org/abs/2604.22409
Authors: Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao, Weilin Zhou, Tianyang Wang, Xu Zheng, Xin Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal large language, Multimodal large, large language models, advanced static visual, environmental change
Comments:
Abstract:Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.
28. 【2604.22390】Region Matters: Efficient and Reliable Region-Aware Visual Place Recognition
Link: https://arxiv.org/abs/2604.22390
Authors: Shunpeng Chen, Yukun Song, Changwei Wang, Rongtao Xu, Kexue Fu, Longxiang Gao, Li Guo, Ruisheng Wang, Shibiao Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visual Place Recognition, Visual Place, Place Recognition, query image geographic, image geographic location
Comments: 25 pages, 13 figures, 10 tables, 1 algorithm
Abstract:Visual Place Recognition (VPR) determines a query image's geographic location by matching it against geotagged databases. However, existing methods struggle with perceptual aliasing caused by irrelevant regions and inefficient re-ranking due to rigid candidate scheduling. To address these issues, we introduce FoL++, a method combining robust discriminative region modeling with adaptive re-ranking. Specifically, we propose a Reliability Estimation Branch to generate spatial reliability maps that explicitly model occlusion resistance. This representation is further optimized by two spatial alignment losses (SAL and SCEL) to effectively align features and highlight salient regions. For weakly supervised learning without manual annotations, a pseudo-correspondence strategy generates dense local feature supervision directly from aggregation clusters. Our Adaptive Candidate Scheduler dynamically resizes candidate pools based on global similarity. By weighting local matches by reliability and adaptively fusing global and local evidence, FoL++ surpasses traditional independent matching systems. Extensive experiments across seven benchmarks demonstrate that FoL++ achieves state-of-the-art performance with a lightweight memory footprint, improving inference speed by 40% over FoL. Code and models will be released (and merged with FoL) at this https URL.
29. 【2604.22388】HFS-TriNet: A Three-Branch Collaborative Feature Learning Network for Prostate Cancer Classification from TRUS Videos
Link: https://arxiv.org/abs/2604.22388
Authors: Xu Lu, Qianhong Peng, Qihao Zhou, Shaopeng Liu, Xiuqin Ye, Chuan Yang, Yuan Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: non-invasive modality widely, Transrectal ultrasound, cost-effective and non-invasive, non-invasive modality, modality widely
Comments:
Abstract:Transrectal ultrasound (TRUS) imaging is a cost-effective and non-invasive modality widely used in the diagnosis of prostate cancer. The computer-aided diagnosis (CAD) relying on TRUS images has been extensively investigated recently. Compared to static images, TRUS video provides richer spatial-temporal information, which makes it a promising alternative for improving the accuracy and robustness of CAD systems. However, TRUS video analysis also introduces new challenges. These include information redundancy, which increases computational costs; high intra- and inter-class similarity, which complicates feature extraction; and a low signal-to-noise ratio, which hinders the identification of clinically relevant information. To address these problems, we propose a heuristic frame selection (HFS) and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. Specifically, selecting a clip of video frames at intervals for training can mitigate redundancy. The HFS strategy dynamically initializes the starting point of each training clip, which ensures that the sampled clips span the entire video sequence. For better feature extraction, besides a regular ResNet50 branch, we also utilize 1) a large model branch based on a pre-trained medical segment anything model (SAM) to extract deep features of each frame and a normalization-based attention module to explore the temporal consistency; and 2) a wavelet transform convolutional residual (WTCR) branch that extracts lesion edge information in the high-frequency domain and performs denoising in the low-frequency domain.
30. 【2604.22379】Efficient Diffusion Distillation via Embedding Loss
Link: https://arxiv.org/abs/2604.22379
Authors: Jincheng Ying, Yitao Chen, Li Wenlin, Minghui Xu, Yinhao Xiao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: show significant promise, Recent advances, distilling expensive diffusion, generators show significant, significant promise
Comments:
Abstract:Recent advances in distilling expensive diffusion models into efficient few-step generators show significant promise. However, these methods typically demand substantial computational resources and extended training periods, limiting accessibility for resource-constrained researchers, and existing supplementary loss functions have notable limitations. Regression loss requires pre-generating large datasets before training and limits the student model to the teacher's performance, while GAN-based losses suffer from training instability and require careful tuning. In this paper, we propose Embedding Loss (EL), a novel supplementary loss function that complements existing diffusion distillation methods to enhance generation quality and accelerate training with smaller batch sizes. Leveraging feature embeddings from a diverse set of randomly initialized networks, EL effectively aligns the feature distributions between the distilled few-step generator and the original data. By computing Maximum Mean Discrepancy (MMD) in the embedded feature space, EL ensures robust distribution matching, thereby preserving sample fidelity and diversity during distillation. Within distribution matching distillation frameworks, EL demonstrates strong empirical performance for one-step generators. On the CIFAR-10 dataset, our approach achieves state-of-the-art FID values of 1.475 for unconditional generation and 1.380 for conditional generation. Beyond CIFAR-10, we further validate EL across multiple benchmarks and distillation methods, including ImageNet, AFHQ-v2, and FFHQ datasets, using DMD, DI, and CM distillation frameworks, demonstrating consistent improvements over existing one-step distillation methods. Our method also reduces training iterations by up to 80%, offering a more practical and scalable solution for deploying diffusion-based generative models in resource-constrained environments.
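The core idea in this abstract, computing Maximum Mean Discrepancy between generated and real samples in the feature space of several randomly initialized networks, can be sketched in pure Python. Everything concrete here is an assumption made for illustration: each "network" is a single random linear layer with ReLU, the kernel is an RBF with a fixed bandwidth, the MMD estimate is the simple biased one, and the names `make_random_embedder`, `mmd2`, and `embedding_loss` are hypothetical. The paper's actual architectures and kernel may differ.

```python
import math
import random


def make_random_embedder(in_dim, out_dim, rng):
    """A frozen, randomly initialized one-layer network: x -> ReLU(W x)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def embed(x):
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return embed


def rbf_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * bandwidth ** 2))


def mmd2(xs, ys, bandwidth=2.0):
    """Biased estimate of squared MMD between two sample sets."""
    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, bandwidth) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def embedding_loss(generated, real, embedders, bandwidth=2.0):
    """Average squared MMD over the feature spaces of all random embedders."""
    losses = []
    for embed in embedders:
        gen_features = [embed(x) for x in generated]
        real_features = [embed(x) for x in real]
        losses.append(mmd2(gen_features, real_features, bandwidth))
    return sum(losses) / len(losses)
```

In a training setup this quantity would be added to the main distillation objective and differentiated with respect to the generator, which requires an autodiff framework rather than plain Python lists.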
31. 【2604.22354】One Shot Learning for Edge Detection on Point Clouds
Link: https://arxiv.org/abs/2604.22354
Authors: Zhikun Tu, Yuhe Zhang, Yiou Jia, Kang Li, Daniel Cohen-Or
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: distinct sampling error, sampling error distribution, possesses its unique, unique characteristics, characteristics and exhibits
Comments: 17 pages, 14 figures. Published in IEEE Transactions on Visualization and Computer Graphics
Abstract:Each scanner possesses its unique characteristics and exhibits its distinct sampling error distribution. Training a network on a dataset that includes data collected from different scanners is less effective than training it on data specific to a single scanner. Therefore, we present a novel one-shot learning method allowing for edge extraction on point clouds, by learning the specific data distribution of the target point cloud, and thus achieve superior results compared to networks that were trained on general data distributions. More specifically, we present how to train a lightweight network named OSFENet (One-Shot edge Feature Extraction Network), by designing a filtered-KNN-based surface patch representation that supports a one-shot learning framework. Additionally, we introduce an RBF_DoS module, which integrates Radial Basis Function-based Descriptor of the Surface patch, highly beneficial for the edge extraction on point clouds. The advantage of the proposed OSFENet is demonstrated through comparative analyses against 7 baselines on the ABC dataset, and its practical utility is validated by results across diverse real-scanned datasets, including indoor scenes like S3DIS dataset, and outdoor scenes such as the Semantic3D dataset and UrbanBIS dataset.
32. 【2604.22350】PoseFM: Relative Camera Pose Estimation Through Flow Matching
Link: https://arxiv.org/abs/2604.22350
Authors: Dominik Kuczkowski, Laura Ruotsalainen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fundamental computer vision, computer vision problem, autonomous navigation, augmented reality, fundamental computer
Comments:
Abstract:Monocular visual odometry (VO) is a fundamental computer vision problem with applications in autonomous navigation, augmented reality and more. While deep learning-based methods have recently shown superior accuracy compared to traditional geometric pipelines, particularly in environments where handcrafted features struggle due to poor structure or lighting conditions, most rely on deterministic regression, which lacks the uncertainty awareness required for robust applications. We propose PoseFM, the first framework to reformulate monocular frame-to-frame VO as a generative task using Flow Matching (FM). By leveraging FM, we model camera motion as a distribution rather than a point estimate, learning to transform noise into realistic pose predictions via continuous-time ODEs. This approach provides a principled mechanism for uncertainty estimation and enables robust motion inference under challenging visual conditions. In our evaluations, PoseFM achieves strong performance on TartanAir, KITTI and TUM-RGBD benchmarks, achieving the lowest absolute trajectory error (ATE) on some of the trajectories and overall being competitive with the best frame-to-frame monocular VO methods. Code and model checkpoints will be made available at this https URL.
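The phrase "transform noise into realistic pose predictions via continuous-time ODEs" can be illustrated with a small Euler integrator. This is only a sketch under stated assumptions: `toy_velocity` is a hypothetical closed-form velocity field standing in for the learned, image-conditioned network, and the three-element pose vector and step count are made up for illustration.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical velocity field (assumption, not PoseFM's network).

    It pushes x along a straight line so the trajectory reaches `target` at t = 1.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, target, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(3)]  # start from Gaussian noise
pose = euler_sample(noise, target=[0.1, -0.2, 0.05])  # hypothetical pose vector
```

Because the toy field sends every noise sample to the same point, it shows only the sampling mechanics. With a learned field, repeating the integration from different noise samples would give a set of poses whose spread can serve as an uncertainty estimate.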
33. 【2604.22339】Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM
Link: https://arxiv.org/abs/2604.22339
Authors: Yunsong Wang, Gim Hee Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visual Simultaneous Localization, Localization and Mapping, Visual Simultaneous, Simultaneous Localization, significant research challenge
Comments:
Abstract:Handling dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling the temporal centers of the dynamic Gaussians at keyframes. These centers are propagated using 3D scene flow priors and are dynamically initialized with an adaptive insertion strategy. Alongside this, we model the temporal opacity and rotation using a Gaussian Mixture Model (GMM) to adaptively learn the complex dynamics. The empirical results demonstrate our state-of-the-art performance in tracking, dynamic reconstruction, and training efficiency.
34. 【2604.22334】FILTR: Extracting Topological Features from Pretrained 3D Models
Link: https://arxiv.org/abs/2604.22334
Authors: Louis Martinez, Maks Ovsjanikov
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, produced powerful models, advances in pretraining, powerful models, abilities are typically
Comments:
Abstract:Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape's multiscale structure. In this paper we pose the question whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.
35. 【2604.22333】ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding
Link: https://arxiv.org/abs/2604.22333
Authors: Dongwei Sun, Jing Yao, Kan Wei, Xiangyong Cao, Chen Wu, Zhenghui Zhao, Pedram Ghamisi, Jun Zhou, Jón Atli Benediktsson
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Rapid situational awareness, Rapid situational, Rapid, Semantic Annotation Pipeline, Automated Semantic Annotation
Comments:
Abstract:Rapid situational awareness is critical in post-disaster response. While remote sensing damage assessment is evolving from pixel-level change detection to high-level semantic analysis, existing vision-language methodologies still struggle to provide actionable intelligence for complex strategic queries. They remain severely constrained by unimodal optical dependence, a prevailing bias towards natural disasters, and a fundamental lack of grounded interactivity. To address these limitations, we present ChangeQuery, a unified multimodal framework designed for comprehensive, all-weather disaster situation awareness. To overcome modality constraints and scenario biases, we construct the Disaster-Induced Change Query (DICQ) dataset, a large-scale benchmark coupling pre-event optical semantics with post-event SAR structural features across a balanced distribution of natural catastrophes and armed conflicts. Furthermore, to provide the high-quality supervision required for interactive reasoning, we propose a novel Automated Semantic Annotation Pipeline. Adhering to a "statistics-first, generation-later" paradigm, this engine automatically transforms raw segmentation masks into grounded, hierarchical instruction sets, effectively equipping the model with fine-grained spatial and quantitative awareness. Trained on this structured data, the ChangeQuery architecture operates as an interactive disaster analyst. It supports multi-task reasoning driven by diverse user queries, delivering precise damage quantification, region-specific descriptions, and holistic post-disaster summaries. Extensive experiments demonstrate that ChangeQuery establishes a new state-of-the-art, providing a robust and interpretable solution for complex disaster monitoring. The code is available at this https URL.
36. 【2604.22331】Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation
Link: https://arxiv.org/abs/2604.22331
Authors: Lomash Relia, Jai G Singla, Amitabh, Nitant Dube
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: study analyses simulated, depth-aware rover navigation, highlighting the transition, study analyses, analyses simulated
Comments: Accepted by IEEE
Abstract:This study analyses simulated and real-world implementations of depth-aware rover navigation, highlighting the transition from stereo vision to monocular depth estimation using edge AI. A Unity-based lunar terrain simulator with stereo cameras and OpenCV's StereoSGBM was used to generate disparity maps. A physical rover built on Raspberry Pi 4 employed UniDepthV2 for monocular metric depth estimation and YOLO12n for real-time object detection. While stereo vision yielded higher accuracy in simulation, the monocular approach proved more robust and cost-effective in real-world deployment, achieving 0.1 FPS for depth and 10 FPS for detection.
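For readers unfamiliar with how a stereo disparity map relates to metric depth, the standard pinhole relation is depth = focal length × baseline / disparity. The sketch below applies it to a single pixel. The function names and the camera numbers are hypothetical and are not taken from the paper. OpenCV's StereoSGBM returns fixed-point disparities scaled by 16, which the helper undoes.

```python
def sgbm_raw_to_pixels(raw_disparity):
    """Convert a raw fixed-point StereoSGBM value (scaled by 16) to pixels."""
    return raw_disparity / 16.0


def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in metres from disparity in pixels for a rectified stereo pair.

    Returns None for non-positive disparities, which carry no valid depth.
    """
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: 800 px focal length, 10 cm baseline.
disparity = sgbm_raw_to_pixels(1024)  # 64 px
depth_m = disparity_to_depth(disparity, focal_px=800.0, baseline_m=0.1)
```

Depth is inversely proportional to disparity, so distant terrain maps to small disparities where a one-pixel matching error changes the estimate substantially.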
37. 【2604.22310】Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization
Link: https://arxiv.org/abs/2604.22310
Authors: Jeonggon Kim, Heejoon Moon, Je Hyeong Hong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Privacy-Preserving Image Queries, Image Queries, Privacy-Preserving Image, private images, enabling pose estimation
Comments: Accepted at CVPR 2026 (oral). Supplementary material included after references. 18 pages, 11 figures, 8 tables
Abstract:Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints. However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks. In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points. We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attacks. DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint's location. This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed: neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery. DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines. Experiments on both indoor and large-scale outdoor datasets demonstrate DCL's robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.
38. 【2604.22302】Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation
Link: https://arxiv.org/abs/2604.22302
Authors: Ran Zhao, Sheng Jin, Size Wu, Kang Liao, Zerui Gong, Zujin Guo, Yang Xiao, Wei Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated impressive capabilities, demonstrated impressive, impressive capabilities, capabilities in photorealistic, photorealistic synthesis
Comments:
Abstract:Recent text-to-image (T2I) models have demonstrated impressive capabilities in photorealistic synthesis and instruction following. However, their reliability in knowledge-intensive settings remains largely unexplored. Unlike natural image generation, knowledge visualization requires not only semantic alignment but also strict adherence to domain knowledge, structural constraints, and symbolic conventions, exposing a critical gap between visual plausibility and scientific correctness. To systematically study this problem, we introduce KVBench, a curriculum-grounded benchmark for evaluating knowledge-intensive T2I generation. KVBench covers six senior high-school subjects: Biology, Chemistry, Geography, History, Mathematics, and Physics. The benchmark consists of 1,800 expert-curated prompts derived from over 30 authoritative textbooks. Using this benchmark, we evaluate 14 state-of-the-art open- and closed-source models, revealing substantial deficiencies in logical reasoning, symbolic precision, and multilingual robustness, with open-source models consistently underperforming proprietary systems. To address these limitations, we further propose KE-Check, a two-stage framework that improves scientific fidelity via (1) Knowledge Elaboration for structured prompt enrichment, and (2) Checklist-Guided Refinement for explicit constraint enforcement through violation identification and constraint-guided editing. KE-Check effectively mitigates scientific hallucinations, narrowing the performance gap between open-source and leading closed-source models. Data and codes are publicly available at this https URL.
39. 【2604.22296】Evaluation of image simulation open source solutions for simulation of synthetic images in lunar environment
Link: https://arxiv.org/abs/2604.22296
Authors: Jai G Singla, Hinal B Patel, Nitant Dube
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Synthetic image generation, crucial input, Wide Angle Camera, Narrow Angle Camera, planetary missions
Comments:
Abstract:Synthetic image generation is one of the crucial inputs for planetary missions. It enables researchers and engineers to visualize planned planetary missions, test imaging systems and plan exploration activities in a virtual environment before actual deployment. Image simulation is essential for assessing landing sites, detecting hazards, and validating navigation systems in a mission. This study offers a detailed evaluation of various image simulation approaches for the lunar environment, with particular emphasis on the effects of different camera models and illumination conditions on the quality of synthetic lunar images. These images are produced using real Digital Elevation Models (DEM) and terrain data derived from instruments such as the Chandrayaan-2 Orbiter High Resolution Camera (OHRC) and NASA's Wide Angle Camera (WAC) and Narrow Angle Camera (NAC). This research aims to improve the reliability of synthetic imagery in supporting autonomous navigation and decision-making systems in lunar exploration. This work contributes to the development of more effective tools for generating important information for future lunar missions and enhances the understanding of the Moon's surface environment.
40. 【2604.22281】DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
Link: https://arxiv.org/abs/2604.22281
Authors: Joonmyung Choi, Sanghyeok Lee, Jongha Kim, Sehyung Kim, Dohwan Ko, Jihyung Kil, Hyunwoo J. Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated remarkable performance, leverages structured visual, structured visual cues, including document question, document question answering
Comments: CVPR 2026
Abstract:Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the inefficient consumption of substantial computational resources, especially for long documents. We observe that existing token-reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free and progressive document token pruning framework designed for efficient long-document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model's level of comprehension. Our experiments on the M3DocRAG show that DocPrune improves throughput by 3.0x and 3.3x in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.
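The general idea of dropping background and question-irrelevant tokens can be sketched as a two-step filter. This is a hedged illustration, not DocPrune's algorithm: the per-token `scores`, the `background_threshold`, and the `keep_ratio` budget are hypothetical, and the abstract does not specify how relevance is computed or how the starting layer is chosen.

```python
def prune_tokens(tokens, scores, keep_ratio=0.5, background_threshold=0.0):
    """Keep the highest-scoring tokens while preserving their original order.

    `scores` is an assumed per-token relevance value, e.g. derived from
    attention to the question. Tokens at or below `background_threshold`
    are treated as background and removed first.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Step 1: discard tokens whose score marks them as background.
    candidates = [i for i, s in enumerate(scores) if s > background_threshold]
    # Step 2: among the remaining tokens, keep the top fraction by score.
    budget = max(1, int(len(tokens) * keep_ratio))
    top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:budget]
    # Restore document order so positional structure is not scrambled.
    return [tokens[i] for i in sorted(top)]


# Hypothetical document patches and relevance scores.
patches = ["margin", "Total", "margin", "42", "footnote"]
relevance = [0.0, 0.9, 0.0, 0.8, 0.1]
kept = prune_tokens(patches, relevance, keep_ratio=0.5)
```

Filtering background before applying the budget keeps blank regions from competing with evidence tokens for the retained slots.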
41. 【2604.22280】Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
Link: https://arxiv.org/abs/2604.22280
Authors: Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai, Zhibin Lan, Chenxi Zhao, Shannan Yan, Jie Chen, Zhangchi Hu, Yansong Peng, Bo Lin, Junjie Zhou, Dacheng Yin, Tianyi Wang, Fengyun Rao, Jing Lyu, Hebei Li, Xiaoyan Sun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, universal multimodal embeddings
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.
42. 【2604.22274】CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation
Link: https://arxiv.org/abs/2604.22274
Authors: Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fixed predicate vocabulary, aims to describe, SGG, describe visual scenes, relation
Comments:
Abstract:Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
43. 【2604.22260】Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
Link: https://arxiv.org/abs/2604.22260
Authors: Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang, Zhiqi Mao, Heng Yang, Chen Lv
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: smart mobility infrastructures, face growing safety, growing safety challenges, require scalable intelligence, emerging smart mobility
Comments:
Abstract:Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.
44. 【2604.22240】OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space
Link: https://arxiv.org/abs/2604.22240
Authors: Zhuding Liang, Tianyi Yan, Dubing Chen, Jiasen Zheng, Huan Zheng, Cheng-zhong Xu, Yida Wang, Kun Zhan, Jianbing Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generative world models, autonomous driving simulation, world models increasingly, models increasingly rely, realistic autonomous driving
Comments:
Abstract:Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute-level text, failing to orchestrate complex, sequential multi-agent interactions. To address this semantic-spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a "scenario director", OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM-driven Spatio-Temporal MMDiT equipped with a history-prefix anchoring strategy to ensure long-horizon interaction consistency. Furthermore, we introduce OccInteract-85k, a novel dataset uniquely annotated with multi-level language instructions, ranging from static layouts to intricate multi-agent behaviors, alongside a novel VLM-based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state-of-the-art generation quality and unprecedented instruction-following capabilities, successfully shifting the paradigm from appearance synthesis to language-driven behavior orchestration.
45. 【2604.22226】Towards Temporal Compositional Reasoning in Long-Form Sports Videos
Link: https://arxiv.org/abs/2604.22226
Authors: Siyu Cao, Lu Zhang, Ruizhe Zeng, Zhi-yong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: dynamic human activities, Multimodal Large Language, Large Language Models, human activities, Sports videos
Comments:
Abstract:Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.
46. 【2604.22220】Breaking Watermarks in the Frequency Domain: A Modulated Diffusion Attack Framework
Link: https://arxiv.org/abs/2604.22220
Authors: Chunpeng Wang, Binyan Qu, Xiaoyu Wang, Zhiqiu Xia, Shanshan Zhang, Yunan Liu, Qi Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: comparatively limited progress, Digital image watermarking, watermark attack techniques, Digital image, advanced rapidly
Comments:
Abstract:Digital image watermarking has advanced rapidly for copyright protection of generative AI, yet the comparatively limited progress in watermark attack techniques has broken the attack-defense balance and hindered further advances in the field. In this paper, we propose FMDiffWA, a frequency-domain modulated diffusion framework for watermark attacks. Specifically, we introduce a frequency-domain watermark modulation (FWM) module and incorporate it into the sampling stages of both the forward and reverse diffusion processes. This mechanism enables selective modulation of watermark-related frequency components, thereby allowing FMDiffWA to effectively neutralize the invisible watermark signals while preserving the perceptual quality of the attacked watermarked images. To achieve a better trade-off between attack efficacy and visual fidelity, we reformulate the training strategy of conventional diffusion models by augmenting the canonical noise estimation objective with an auxiliary refinement constraint. Comprehensive experiments demonstrate that FMDiffWA achieves superior visual fidelity compared to existing watermark attacks, while exhibiting strong generalization across diverse watermarking schemes.
47. 【2604.22202】ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild
Link: https://arxiv.org/abs/2604.22202
Authors: Hanyu Chen, Ruojin Cai, Steve Marschner, Noah Snavely
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: computer vision, downstream tasks, serve as powerful, powerful priors, priors for downstream
Comments: Project page: https://hanyuc.com/archsym/
Abstract:Symmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane's orientation. In this paper, we address these limitations by presenting the first framework for detecting 3D-grounded reflectional symmetries from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchSym, from SfM reconstructions by leveraging cross-view image matching; and building on the dataset, (2) a single-view symmetry detector that accurately localizes symmetries in 3D by parameterizing them as signed distance maps defined relative to predicted scene geometry. We validate our symmetry annotation pipeline against geometry-based alternatives and demonstrate that our symmetry detector significantly outperforms state-of-the-art baselines on our new benchmark.
48. 【2604.22192】CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution
Link: https://arxiv.org/abs/2604.22192
Authors: Xiangxi Zheng, Kuang He, Jiayi Hu, Ping Yu, Rui Yan, Yuan Yao, Peng Hou, Anxiang Zeng, Alex Jinpeng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generation demands strict, demands strict visual, strict visual precision, demands strict, precision and syntactic
Comments:
Abstract:Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.
49. 【2604.22190】From Global to Local: Rethinking CLIP Feature Aggregation for Person Re-Identification
Link: https://arxiv.org/abs/2604.22190
Authors: Aotian Zheng,Winston Sun,Bahaa Alattar,Vitaly Ablavsky,Jenq-Neng Hwang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: methods aggregate spatial, aggregate spatial features, CLIP-based person re-identification, making representations fragile, spatial selectivity
Comments: 14 pages, 7 figures
Click to view abstract
Abstract:CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global [CLS] token optimized for image-text alignment rather than spatial selectivity, making representations fragile under occlusion and cross-camera variation. We propose SAGA-ReID, which reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP's text embedding space -- emphasizing spatially stable evidence while suppressing corrupted or absent regions, without requiring textual descriptions of individual images. Controlled experiments isolate the aggregation mechanism under two qualitatively distinct conditions -- synthetic masking, where identity signal is absent, and realistic human distractors, where an overlapping person introduces semantically confusing signal -- with SAGA's advantage over global pooling growing substantially as occlusion increases across both conditions. Benchmark evaluations confirm consistent gains over CLIP-ReID across standard and occluded settings, with the largest improvements where global pooling is most unreliable: up to +10.6 Rank-1 on occluded benchmarks. SAGA's aggregation outperforms dedicated sequential patch aggregation on a stronger backbone, confirming that structured reconstruction addresses a bottleneck that backbone quality and architectural complexity alone cannot resolve. Code available at this https URL.
50. 【2604.22183】EvFlow-GS: Event Enhanced Motion Deblurring with Optical Flow for 3D Gaussian Splatting
Link: https://arxiv.org/abs/2604.22183
Authors: Feiyu An,Yufei Deng,Zihui Zhang,Rong Xiao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: motivating recent methods, microsecond temporal resolution, Achieving sharp, reconstruction from motion-blurred, motivating recent
Comments: Accepted by ICME 2026
Click to view abstract
Abstract:Achieving sharp 3D reconstruction from motion-blurred images alone becomes challenging, motivating recent methods to incorporate event cameras, benefiting from microsecond temporal resolution. However, they suffer from residual artifacts and blurry texture details due to misleading supervision from inaccurate event double integral priors and noisy, blurry events. In this study, we propose EvFlow-GS, a unified framework that leverages event streams and optical flow to optimize an end-to-end learnable double integral (LDI), camera poses, and 3D Gaussian Splatting (3DGS) jointly on-the-fly. Specifically, we first extract edge information from the events using optical flow and then formulate a novel event-based loss applied separately to different modules. Additionally, we exploit a novel event-residual prior to strengthen the supervision of intensity changes between images rendered from 3DGS. Finally, we integrate the outputs of both 3DGS and LDI into a joint loss, enabling their optimization to mutually facilitate each other. Experiments demonstrate the leading performance of our EvFlow-GS.
51. 【2604.22177】Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
Link: https://arxiv.org/abs/2604.22177
Authors: Peibo Song,Xiaotian Xue,Jinshuo Zhang,Zihao Wang,Jinhua Liu,Shujun Fu,Fangxun Bao,Si Yong Yeo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal MRI offers, MRI offers complementary, Multimodal MRI, offers complementary information, MRI offers
Comments: CVPR 2026 Poster
Click to view abstract
Abstract:Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and exploitation of available modalities. The idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods under incomplete multi-modal scenarios. The code is available at this https URL
52. 【2604.22174】Unlocking Optical Prior: Spectrum-Guided Knowledge Transfer for SAR Generalized Category Discovery
Link: https://arxiv.org/abs/2604.22174
Authors: Jingyuan Xia,Ruikang Hu,Ye Li,Zhixiong Yang,Xu Lan,Zhejun Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generalized Category Discovery, Synthetic Aperture Radar, label-scarce Synthetic Aperture, Large Vision Models, Generalized Category
Comments:
Click to view abstract
Abstract:Generalized Category Discovery (GCD) holds significant promise for the label-scarce Synthetic Aperture Radar (SAR) domain, yet its efficacy is severely constrained by the cross-modal incompatibility between the inherent optical prior of the Large Vision Models (LVMs) and SAR imagery. Existing domain adaptation methods often lack an inductive bias that reflects imaging characteristics, consequently failing to effectively transfer optical prior into the SAR domain. To address this issue, the Modal Discrepancy Curve (MDC) is introduced to model cross-modal discrepancy as a structured frequency-domain descriptor derived from spectral energy distributions. Leveraging this formulation, we propose the MDC-guided Cross-modal Prior Transfer (MCPT) framework, a pre-training paradigm that operates on paired optical-SAR data. Within this framework, Adaptive Frequency Tokenization (AFT) converts the MDC into learnable tokens, and Frequency-aware Expert Refinement (FER) performs band-wise discrepancy-aware feature refinement using these tokens. Based on the refined representations, contrastive learning aligns refined embeddings across modalities and internalizes the adaptation pattern. Ultimately, the superior SAR feature representation capability learned during paired pre-training is applied to downstream single-modal SAR-GCD tasks. Extensive experiments demonstrate state-of-the-art performance across multiple mainstream datasets, indicating that frequency-domain discrepancy modeling enables more effective adaptation of optical prior to SAR imagery.
53. 【2604.22164】Learning Reactive Human Motion Generation from Paired Interaction Data Using Transformer-Based Models
Link: https://arxiv.org/abs/2604.22164
Authors: Masato Soga,Ryuki Takebayashi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, advances in deep, deep learning, learning have enabled, textual descriptions
Comments: 24 pages
Click to view abstract
Abstract:Recent advances in deep learning have enabled the generation of videos from textual descriptions as well as the prediction of future sequences from input videos. Similarly, in human motion modeling, motions can be generated from text or predicted from a single person's motion sequence. However, these approaches primarily focus on single-agent motion generation. In contrast, this study addresses the problem of generating the motion of one person based on the motion of another in interaction scenarios, where the two motions are mutually dependent. We construct a dataset of paired action-reaction motion sequences extracted from boxing match videos and investigate the effectiveness of Transformer-based models for this task. Specifically, we implement and compare three models: a simple Transformer, iTransformer, and Crossformer. In addition, we introduce a person ID embedding to explicitly distinguish between individuals, enabling the model to maintain structural consistency and better capture interaction dynamics. Experimental results show that the simple Transformer can generate plausible interaction-aware motions without suffering from posture collapse, while iTransformer and Crossformer accumulate errors over time, leading to unstable motion generation. Furthermore, the proposed person ID embedding contributes to preventing structural collapse and improving motion consistency. These results highlight the importance of explicitly modeling individual identity in interaction-aware motion generation.
54. 【2604.22162】SAMIDARE: Advanced Tracking-by-Segmentation for Dense Scenarios
Link: https://arxiv.org/abs/2604.22162
Authors: Shozaburo Hirano,Norimichi Ukita
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Automated sports analysis, analysis demands robust, sports analysis demands, demands robust multi-object, robust multi-object tracking
Comments:
Click to view abstract
Abstract:Automated sports analysis demands robust multi-object tracking (MOT), yet segmentation-based methods often struggle with mask errors and ID switches in dense scenes. We propose SAMIDARE, a framework that enhances SAM2MOT for crowded scenes through three key components: (1) density-aware mask re-generation and (2) selective memory updates, both for adaptive mask control to preserve target feature integrity, and (3) state-aware association and new track initialization, which improves robustness under mutual occlusions and frequent frame-out events. Evaluated on the SportsMOT dataset, SAMIDARE achieves state-of-the-art performance, outperforming the baseline by 2.5 HOTA and 4.2 IDF1 points on the validation set. These results demonstrate that adaptive feature management using mask control and state-aware association provide a robust and efficient solution for dense sports tracking. Code is available at this https URL
55. 【2604.22160】GenMatter: Perceiving Physical Objects with Generative Matter Models
Link: https://arxiv.org/abs/2604.22160
Authors: Eric Li,Arijit Dasgupta,Yoni Friedman,Mathieu Huot,Vikash Mansinghka,Thomas O'Connell,William T. Freeman,Joshua B. Tenenbaum
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: offers valuable insights, visual perception offers, perception offers valuable, offers valuable, valuable insights
Comments: 25 pages, 12 figures, CVPR 2026
Click to view abstract
Abstract:Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.
56. 【2604.22156】Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models
Link: https://arxiv.org/abs/2604.22156
Authors: Weiqiu You,Cassandra Goldberg,Amin Madani,Daniel A. Hashimoto,Eric Wong
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: bile duct injury, prevent bile duct, Accurate assessment
Comments: IPCAI 2026 short communication
Click to view abstract
Abstract:Purpose: Accurate assessment of the Critical View of Safety (CVS) during laparoscopic cholecystectomy is essential to prevent bile duct injury, a complication associated with significant morbidity and mortality. While large vision-language models (LVLMs) offer flexible reasoning, their predictions remain difficult to audit and unreliable on safety-critical surgical tasks. Methods: We introduce Sum-of-Checks, a framework that decomposes each CVS criterion into expert-defined reasoning checks reflecting clinically relevant visual evidence. Given a laparoscopic frame, an LVLM evaluates each check, producing a binary judgment and justification. Criterion-level scores are computed via fixed, weighted aggregation of check outcomes. We evaluate on the Endoscapes2023 benchmark using three frontier LVLMs, comparing against direct prompting, chain-of-thought, and sub-question decomposition, each with and without few-shot examples. Results: Sum-of-Checks improves average frame-level mean average precision by 12--14% relative to the best baseline across all three models and criteria. Analysis of individual checks reveals that LVLMs are reliable on observational checks (e.g., visibility, tool obstruction) but show substantial variability on decision-critical anatomical evidence. Conclusion: Structuring surgical reasoning into expert-aligned verification checks improves both accuracy and transparency of LVLM-based CVS assessment, demonstrating that explicitly separating evidence elicitation from decision-making is critical for reliable and auditable surgical AI systems. Code is available at this https URL.
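The "fixed, weighted aggregation of check outcomes" mentioned in the Methods can be illustrated with a small sketch. The check names, the weights, and the normalisation by the total weight are assumptions made for illustration; the abstract does not specify them.

```python
def criterion_score(check_outcomes, weights):
    """Weighted share of passed checks for one criterion (hypothetical helper)."""
    if set(check_outcomes) != set(weights):
        raise ValueError("every check needs exactly one weight")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    passed_weight = sum(
        weights[name] for name, passed in check_outcomes.items() if passed
    )
    return passed_weight / total_weight


# Binary judgments an LVLM might return for one frame (illustrative names only).
outcomes = {
    "region_visible": True,
    "no_tool_obstruction": True,
    "anatomical_evidence_present": False,
}
# Fixed weights, assumed here to emphasise the decision-critical check.
weights = {
    "region_visible": 1.0,
    "no_tool_obstruction": 1.0,
    "anatomical_evidence_present": 2.0,
}
score = criterion_score(outcomes, weights)
```

Because the aggregation is fixed, each criterion score can be traced back to the individual checks that produced it, which is the auditability argument the abstract makes.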
57. 【2604.22139】Anatomy-Aware Unsupervised Detection and Localization of Retinal Abnormalities in Optical Coherence Tomography
Link: https://arxiv.org/abs/2604.22139
Authors: Tania Haghighi,Sina Gholami,Hamed Tabkhi,Minhaj Nur Alam
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Optical Coherence Tomography, Reliable automated analysis, Coherence Tomography, Optical Coherence, labor-intensive expert annotations
Comments: 11 pages, 3 figures, accepted in CVPR-CV4Clinical
Click to view abstract
Abstract:Reliable automated analysis of Optical Coherence Tomography (OCT) imaging is crucial for diagnosing retinal disorders but faces a critical barrier: the need for expensive, labor-intensive expert annotations. Supervised deep learning models struggle to generalize across diverse pathologies, imaging devices, and patient populations due to their restricted vocabulary of annotated abnormalities. We propose an unsupervised anomaly detection framework that learns the normative distribution of healthy retinal anatomy without lesion annotations, directly addressing annotation efficiency challenges in clinical deployment. Our approach leverages a discrete latent model trained on normal B-scans to capture OCT-specific structural patterns. To enhance clinical robustness, we incorporate retinal layer-aware supervision and structured triplet learning to separate healthy from pathological representations, improving model reliability across varied imaging conditions. During inference, anomalies are detected and localized via reconstruction discrepancies, enabling both image and pixel-level identification without requiring disease-specific labels. On the Kermany dataset (AUROC: 0.799), our method substantially outperforms VAE, VQVAE, VQGAN, and f-AnoGAN baselines. Critically, cross-dataset evaluation on Srinivasan achieves AUROC 0.884 with superior generalization, demonstrating robust domain adaptation. On the external RETOUCH benchmark, unsupervised anomaly segmentation achieves competitive Dice (0.200) and mIoU (0.117) scores, validating reproducibility across institutions.
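Detection and localisation "via reconstruction discrepancies" can be sketched as follows. This is a generic illustration, not the paper's model: the per-pixel absolute difference and the use of the maximum as the image-level score are assumptions, and the reconstruction is taken as given.

```python
def anomaly_map(image, reconstruction):
    """Per-pixel absolute difference between a B-scan and its reconstruction."""
    return [
        [abs(pixel - recon) for pixel, recon in zip(image_row, recon_row)]
        for image_row, recon_row in zip(image, reconstruction)
    ]


def image_score(discrepancy):
    """Image-level anomaly score, assumed here to be the largest pixel discrepancy."""
    return max(max(row) for row in discrepancy)


# Toy 2x2 intensities: the model reconstructs everything well except one pixel.
scan = [[0.2, 0.8], [0.5, 0.5]]
reconstructed = [[0.2, 0.3], [0.5, 0.4]]
pixel_map = anomaly_map(scan, reconstructed)
score = image_score(pixel_map)
```

The map localises the abnormal region at pixel level, while the scalar score supports image-level detection, matching the two outputs the abstract describes.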
58. 【2604.22129】PAGaS: Pixel-Aligned 1DoF Gaussian Splatting for Depth Refinement
Link: https://arxiv.org/abs/2604.22129
Authors: David Recasens,Robert Maier,Aljaz Bozic,Stephane Grabli,Javier Civera,Tony Tung,Edmond Boyer
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Gaussian Splatting, Splatting, efficient approach, approach for high-quality, Gaussian
Comments:
Click to view abstract
Abstract:Gaussian Splatting (GS) has emerged as an efficient approach for high-quality novel view synthesis. While early GS variants struggled to accurately model the scene's geometry, recent advancements constraining the Gaussians' spread and shapes, such as 2D Gaussian Splatting, have significantly improved geometric fidelity. In this paper, we present Pixel-Aligned 1DoF Gaussian Splatting (PAGaS) that adapts the GS representation from novel view synthesis to the multi-view stereo depth task. Our key contribution is modeling a pixel's depth using one-degree-of-freedom (1DoF) Gaussians that remain tightly constrained during optimization. Unlike existing approaches, our Gaussians' positions and sizes are restricted by the back-projected pixel volumes, leaving depth as the sole degree of freedom to optimize. PAGaS produces highly detailed depths, as illustrated in Figure 1. We quantitatively validate these improvements on top of reference geometric and learning-based multi-view stereo baselines on challenging 3D reconstruction benchmarks. Code: this http URL
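To make the "depth as the sole degree of freedom" idea concrete, the sketch below back-projects a pixel through a pinhole camera. It assumes a pinhole model with intrinsics `fx`, `fy`, `cx`, `cy` and depth measured along the optical axis; these choices and the function name are illustrative, not taken from the paper.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """3D point in camera coordinates for pixel (u, v) at the given z-depth."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


# With the pixel fixed, changing the depth only slides the point along its ray.
near = backproject(420, 240, 1.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
far = backproject(420, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Under this parameterisation a Gaussian tied to a pixel has one free scalar, so optimisation cannot move it laterally away from the pixel it represents.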
59. 【2604.22118】Robust Camera-to-Mocap Calibration and Verification for Large-Scale Multi-Camera Data Capture
Link: https://arxiv.org/abs/2604.22118
Authors: Tianyi Liu,Christopher Twigg,Patrick Grady,Kevin Harris,Shangchen Han,Kun He
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Optical motion capture, Optical motion, SLAM and robotics, motion capture, ground-truth capture
Comments:
Click to view abstract
Abstract:Optical motion capture (mocap) systems are widely used for ground-truth capture in AR/VR, SLAM and robotics datasets. These datasets require extrinsic calibration to align mocap coordinates to external camera frames -- a step that is subject to multiple sources of error in practice, and failures often go undetected until they corrupt downstream data. These issues are compounded for fisheye cameras, where spatially non-uniform distortion makes both calibration and verification more challenging. We present a calibration and verification system designed for this setting. Concretely, we target robustness to board-to-marker attachment variation, optimization initialization ambiguity, and session-to-session calibration drift after deployment. The calibration jointly estimates camera extrinsics and the board-to-marker transform, and uses a staged solver to improve convergence reliability under ambiguous initialization. The verification component, lollypop, provides fast, operator-independent assessment through a measurement chain entirely independent of the calibration data. In experiments on a Meta Quest 3 headset with fisheye cameras, our calibration outperforms existing benchwork, and lollypop reliably detects calibration degradation over time. The system has been deployed in production data collection pipelines.
60. 【2604.22103】How Many Visual Levers Drive Urban Perception? Interventional Counterfactuals via Multiple Localised Edits
Link: https://arxiv.org/abs/2604.22103
Authors: Jason Tang,Stephen Law
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Street-view perception models, perception models predict, models predict subjective, predict subjective attributes, Street-view perception
Comments:
Click to view abstract
Abstract:Street-view perception models predict subjective attributes such as safety at scale, but remain correlational: they do not identify which localized visual changes would plausibly shift human judgement for a specific scene. We propose a lever-based interventional counterfactual framework that recasts scene-level explainability as a bounded search over structured counterfactual edits. Each lever specifies a semantic concept, spatial support, intervention direction, and constrained edit template. Candidate edits are generated through prompt-conditioned image editing and retained only if they satisfy validity checks for same-place preservation, locality, realism, and plausibility. In a pilot across 50 scenes from five cities, the framework reveals preliminary proxy-based directional patterns and a practical failure taxonomy under prompt-only editing, with Mobility Infrastructure and Physical Maintenance showing the largest auxiliary safety shifts. Human pairwise judgements remain the ground-truth endpoint for future validation.
61. 【2604.22093】FLARE-BO: Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation for Low-Light Robotic Vision
Link: https://arxiv.org/abs/2604.22093
Authors: Nathan Shankar,Pawel Ladosz,Hujun Yin
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: Reliable visual perception, autonomous robotic systems, directly compromises navigation, degraded image quality, image quality directly
Comments: 7 pages, 2 tables and 4 figures
Click to view abstract
Abstract:Reliable visual perception under low illumination remains a core challenge for autonomous robotic systems, where degraded image quality directly compromises navigation, inspection, and various operations. A recent training-free approach showed that Bayesian optimisation with Gaussian Processes can adaptively select brightness, contrast, and denoising parameters on a per-image basis, achieving competitive enhancement without any learned model. However, that framework is limited to three parameters, applies no illumination decomposition or white balance correction, and relies on Non-Local Means denoising, which tends to over-smooth edges under noisy conditions. This paper proposes FLARE-BO (Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation), an extended framework that jointly optimises eight parameters spanning gamma correction, LIME-style illumination normalisation, chrominance denoising, bilateral filtering, NLM denoising, Grey-World automatic white balance, and adaptive post-smoothing. The search engine employs unit-hypercube parameter normalisation, objective standardisation, Sobol quasi-random initialisation, and Log Expected Improvement acquisition for principled exploration of the expanded space. Performance of the proposed method is benchmarked using the Low Light paired dataset (LOL), and results show marked improvements of the proposed method over existing methods that were not specifically trained using this dataset.
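Two of the stages named in this abstract, gamma correction and Grey-World white balance, are standard operations that can be sketched in a few lines. The sketch assumes RGB values in [0, 1], the convention `out = in ** gamma` (so `gamma < 1` brightens), and clipping after white balance; it does not reproduce the paper's pipeline or its Bayesian optimisation loop.

```python
def gamma_correct(value, gamma):
    """Power-law intensity mapping for a value in [0, 1]."""
    return value ** gamma


def grey_world(pixels):
    """Scale each channel so that all three channel means match their average."""
    count = len(pixels)
    means = [sum(pixel[c] for pixel in pixels) / count for c in range(3)]
    grey = sum(means) / 3.0
    gains = [grey / mean if mean > 0 else 1.0 for mean in means]
    return [
        tuple(min(1.0, pixel[c] * gains[c]) for c in range(3))
        for pixel in pixels
    ]


# A dark, reddish two-pixel "image" as a list of (R, G, B) tuples.
dark = [(0.2, 0.1, 0.05), (0.4, 0.2, 0.1)]
balanced = grey_world(dark)
brightened = [tuple(gamma_correct(v, 0.5) for v in pixel) for pixel in balanced]
```

In a per-image optimisation setting, values such as the gamma exponent would be among the parameters the optimiser searches over rather than fixed constants.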
62. 【2604.22045】H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
Link: https://arxiv.org/abs/2604.22045
Authors: Ayushi Mehrotra,Dipkamal Bhusal,Michael Clifford,Nidhi Rastogi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: deep neural networks, assigning importance scores, attribution methods explain, explain the predictions, predictions of deep
Comments: CVPR 2026
Click to view abstract
Abstract:Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generate more interpretable and faithful saliency maps compared to existing methods.
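Why an input Hessian signals a pairwise interaction can be shown on a toy function. The sketch below estimates an off-diagonal Hessian entry with a finite difference, which is a stand-in chosen for illustration; the paper works with input Hessians of image classifiers, and the function name and step size here are assumptions.

```python
def mixed_second_difference(f, x, i, j, h=1e-3):
    """Finite-difference estimate of d^2 f / (dx_i dx_j) at x, for i != j."""
    def shifted(delta_i, delta_j):
        y = list(x)
        y[i] += delta_i
        y[j] += delta_j
        return f(y)

    return (shifted(h, h) - shifted(h, 0.0) - shifted(0.0, h) + shifted(0.0, 0.0)) / (h * h)


def toy_model(x):
    # Features 0 and 1 interact multiplicatively; feature 2 acts on its own.
    return x[0] * x[1] + x[2]


point = [2.0, 3.0, 1.0]
interacting = mixed_second_difference(toy_model, point, 0, 1)
independent = mixed_second_difference(toy_model, point, 0, 2)
```

A mixed derivative far from zero marks a pair whose joint effect cannot be split into separate per-feature contributions, which is the kind of pair a detection stage would merge into a set.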
63. 【2604.22036】EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms
Link: https://arxiv.org/abs/2604.22036
Authors: Brian VanVoorst,Nicholas Walczak,Christopher Gilleo,Charles Meissner,Fabio Felix,Iran Roman,Bea Steers,Claudio Silva,Yuhan Shen,Zijia Lu,Shih-Po Lee,Ehsan Elhamifar
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Perceptually-enabled Task Guidance, DARPA Perceptually-enabled Task, DARPA Perceptually-enabled, part of DARPA, Task Guidance
Comments: 9 pages, 4 figures, 3 tables
Click to view abstract
Abstract:This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA's Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via this http URL (DOI: https://doi.org/10.5281/zenodo.19239154).
64. 【2604.22034】LTBs-KAN: Linear-Time B-splines Kolmogorov-Arnold Networks
Link: https://arxiv.org/abs/2604.22034
Authors: Eduardo Said Merin-Martinez,Andres Mendez-Vazquez,Eduardo Rodriguez-Tello
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Keywords: Multilayer Perceptrons, recent neural network, neural network architecture, network architecture offering, alternative to Multilayer
Comments:
Click to view abstract
Abstract:Kolmogorov-Arnold Networks (KANs) are a recent neural network architecture offering an alternative to Multilayer Perceptrons (MLPs) with improved explainability and expressibility. However, KANs are significantly slower than MLPs due to the recursive nature of B-spline function computations, limiting their application. This work addresses these issues by proposing a novel base-spline Linear-Time B-splines Kolmogorov-Arnold Network (LTBs-KAN) with linear complexity. Unlike previous methods that rely on the Boor-Mansfield-Cox spline algorithm or other computationally intensive mathematical functions, our approach significantly reduces the computational burden. Additionally, we further reduce the model's parameter count through product-of-sums matrix factorization in the forward pass without sacrificing performance. Experiments on MNIST, Fashion-MNIST and CIFAR-10 demonstrate that LTBs-KAN achieves good time complexity and parameter reduction, when used as building architectural blocks, compared to other KAN implementations.
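The "recursive nature of B-spline function computations" refers to the standard Cox-de Boor recursion, sketched below for reference. This is the textbook baseline whose cost the paper aims to avoid, not the LTBs-KAN method itself; the uniform knot vector is an assumption chosen for the example.

```python
def bspline_basis(i, degree, t, knots):
    """Value at t of the i-th B-spline basis function of the given degree."""
    if degree == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_span = knots[i + degree] - knots[i]
    right_span = knots[i + degree + 1] - knots[i + 1]
    left = 0.0
    if left_span > 0:
        left = (t - knots[i]) / left_span * bspline_basis(i, degree - 1, t, knots)
    right = 0.0
    if right_span > 0:
        right = (knots[i + degree + 1] - t) / right_span * bspline_basis(i + 1, degree - 1, t, knots)
    return left + right


# Quadratic basis functions on a uniform knot vector, evaluated at one point.
knots = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
degree = 2
values = [bspline_basis(i, degree, 2.5, knots) for i in range(len(knots) - degree - 1)]
```

Each call fans out into two lower-degree calls, so naive evaluation repeats work as the degree grows, which is the overhead the abstract points to.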
65. 【2604.21984】Soft Anisotropic Diagrams for Differentiable Image Representation
Link: https://arxiv.org/abs/2604.21984
Authors: Laki Iinbor, Zhiyang Dou, Wojciech Matusik
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: image representation parameterized, introduce Soft Anisotropic, Soft Anisotropic Diagrams, differentiable image representation, soft anisotropic additively
Comments:
Abstract:We introduce Soft Anisotropic Diagrams (SAD), an explicit and differentiable image representation parameterized by a set of adaptive sites in the image plane. In SAD, each site specifies an anisotropic metric and an additively weighted distance score, and we compute pixel colors as a softmax blend over a small per-pixel top-K subset of sites. We induce a soft anisotropic additively weighted Voronoi partition (i.e., an Apollonius diagram) with learnable per-site temperatures, preserving informative gradients while allowing clear, content-aligned boundaries and explicit ownership. Such a formulation enables efficient rendering by maintaining a per-query top-K map that approximates nearest neighbors under the same shading score, allowing GPU-friendly, fixed-size local computation. We update this list using our top-K propagation scheme inspired by jump flooding, augmented with stochastic injection to provide probabilistic global coverage. Training follows a GPU-first pipeline with gradient-weighted initialization, Adam optimization, and adaptive budget control through densification and pruning. Across standard benchmarks, SAD consistently outperforms Image-GS and Instant-NGP at matched bitrate. On Kodak, SAD reaches 46.0 dB PSNR with 2.2 s encoding time (vs. 28 s for Image-GS), and delivers 4-19 times end-to-end training speedups over state-of-the-art baselines. We demonstrate the effectiveness of SAD by showcasing the seamless integration with differentiable pipelines for forward and inverse problems, efficiency of fast random access, and compact storage.
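The pixel shading rule described in the abstract above (an anisotropic, additively weighted distance score per site, blended by softmax over the top-K sites) might look roughly like the sketch below. The exact score formula, the dictionary layout, and the brute-force top-K selection are assumptions made for illustration; the paper's jump-flooding-style propagation and GPU pipeline are not reproduced here.

```python
import math


def site_score(pixel, site):
    """Shading score of one site at a pixel; higher means more influence."""
    dx = pixel[0] - site["pos"][0]
    dy = pixel[1] - site["pos"][1]
    a, b, c = site["metric"]  # symmetric 2x2 metric [[a, b], [b, c]]
    # Anisotropic distance sqrt(d^T M d); clamp guards against rounding below zero.
    dist = math.sqrt(max(0.0, a * dx * dx + 2.0 * b * dx * dy + c * dy * dy))
    # Assumed form: additive weight shifts the distance, temperature sets sharpness.
    return -(dist - site["weight"]) / site["temperature"]


def shade_pixel(pixel, sites, k):
    """Softmax blend of the colors of the k best-scoring sites (brute force)."""
    scored = sorted(
        ((site_score(pixel, s), s) for s in sites),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    best = scored[0][0]
    # Subtract the best score before exponentiating for numerical stability.
    weights = [math.exp(score - best) for score, _ in scored]
    total = sum(weights)
    return tuple(
        sum(w * s["color"][ch] for w, (_, s) in zip(weights, scored)) / total
        for ch in range(3)
    )


# Hypothetical two-site example: a red site and a blue site, isotropic metrics.
sites = [
    {"pos": (0.0, 0.0), "metric": (1.0, 0.0, 1.0), "weight": 0.0,
     "temperature": 0.1, "color": (1.0, 0.0, 0.0)},
    {"pos": (10.0, 0.0), "metric": (1.0, 0.0, 1.0), "weight": 0.0,
     "temperature": 0.1, "color": (0.0, 0.0, 1.0)},
]

midpoint = shade_pixel((5.0, 0.0), sites, k=2)  # equidistant from both sites
on_site = shade_pixel((0.0, 0.0), sites, k=2)   # sits on the red site
```

A lower temperature makes the transition between neighbouring sites sharper, which is one way to read the abstract's claim of clear boundaries with informative gradients.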
66. 【2604.21982】Forecasting Solar Energy Using a Single Image
Link: https://arxiv.org/abs/2604.21982
Authors: Jeremy Klotz, Shree K. Nayar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: cities on rooftops, panel, increasingly deployed, deployed in cities, irradiance
Comments: 22 pages, 15 figures. Project page: [this https URL](https://cave.cs.columbia.edu/projects/categories/project?cid=Physics-Based%20Vision&pid=Forecasting%20Solar%20Energy%20Using%20a%20Single%20Image)
Abstract:Solar panels are increasingly deployed in cities on rooftops, walls, and urban infrastructure. Although the panel costs have fallen in recent years, the soft costs of installing them have not. These soft costs include assessing the illumination (irradiance) of a panel, which is typically performed using a 3D model that fails to capture small nearby structures that impact the irradiance. Our approach uses a single image taken at the panel's location to forecast its irradiance at any time in the future. We use visual cues in the image to find the camera's orientation and the portion of the sky visible to the panel in order to forecast the irradiance due to the sun and the sky. In addition, we show that the irradiance due to reflections from nearby buildings varies smoothly over time and can be forecasted from the image. This approach enables assessing the solar energy potential of any surface and forecasting the temporal variation of a panel's irradiance. We validate our approach using real irradiance measurements in urban canyons. We show that our approach often yields more accurate irradiance forecasts compared to conventional irradiance-based transposition methods and 3D model-based simulations. We also show that a single spherical image can be used to find the best fixed orientation of a panel. Finally, we present Solaris, a device to capture the image seen by a panel in a variety of urban settings.
67. 【2604.21936】An Artifact-based Agent Framework for Adaptive and Reproducible Medical Image Processing
Link: https://arxiv.org/abs/2604.21936
Authors: Lianrui Zuo, Yihao Liu, Gaurav Rudravaram, Karthik Ramadass, Aravind R. Krishnan, Michael D. Phillips, Yelena G. Bodien, Mayur B. Patel, Paula Trujillo, Yency Forero Martinez, Stephen A. Deppen, Eric L. Grogan, Fabien Maldonado, Kevin McGann, Hudson M. Holmes, Laurie E. Cutting, Yuankai Huo, Bennett A. Landman
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Keywords: controlled benchmark evaluation, Medical imaging research, imaging research, research is increasingly, increasingly shifting
Comments:
Abstract:Medical imaging research is increasingly shifting from controlled benchmark evaluation toward real-world clinical deployment. In such settings, applying analytical methods extends beyond model design to require dataset-aware workflow configuration and provenance tracking. Two requirements therefore become central: adaptability, the ability to configure workflows according to dataset-specific conditions and evolving analytical goals; and reproducibility, the guarantee that all transformations and decisions are explicitly recorded and re-executable. Here, we present an artifact-based agent framework that introduces a semantic layer to augment medical image processing. The framework formalizes intermediate and final outputs through an artifact contract, enabling structured interrogation of workflow state and goal-conditioned assembly of configurations from a modular rule library. Execution is delegated to a workflow executor to preserve deterministic computational graph construction and provenance tracking, while the agent operates locally to comply with most privacy constraints. We evaluate the framework on real-world clinical CT and MRI cohorts, demonstrating adaptive configuration synthesis, deterministic reproducibility across repeated executions, and artifact-grounded semantic querying. These results show that adaptive workflow configuration can be achieved without compromising reproducibility in heterogeneous clinical environments.
68. 【2604.22579】Useful nonrobust features are ubiquitous in biomedical images
Link: https://arxiv.org/abs/2604.22579
Authors: Coenraad Mouton, Randle Rabe, Niklas C. Koser, Nicolai Krekiehn, Christopher Hansen, Jan-Bernd Hövener, Claus-C. Glüer
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: small adversarial perturbations, predictive input patterns, features impact test, impact test performance, adversarial perturbations
Comments: Accepted at The IEEE International Symposium on Biomedical Imaging (ISBI), 2026
Abstract:We study whether deep networks for medical imaging learn useful nonrobust features - predictive input patterns that are not human interpretable and highly susceptible to small adversarial perturbations - and how these features impact test performance. We show that models trained only on nonrobust features achieve well above chance accuracy across five MedMNIST classification tasks, confirming their predictive value in-distribution. Conversely, adversarially trained models that primarily rely on robust features sacrifice in-distribution accuracy but yield markedly better performance under controlled distribution shifts (MedMNIST-C). Overall, nonrobust features boost standard accuracy yet degrade out-of-distribution performance, revealing a practical robustness-accuracy trade-off in medical imaging classification tasks that should be tailored to the requirements of the deployment setting.
69. 【2604.22557】Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?
Link: https://arxiv.org/abs/2604.22557
Authors: Anam Hashmi, Mayug Maniparambil, Julia Dietlmeier, Kathleen M. Curran, Noel E. O'Connor
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: transformed computer vision, diverse downstream tasks, accelerated cardiac MRI, cardiac MRI reconstruction, cardiac MRI
Comments: Accepted to CVPRW 2026
Abstract:The emergence of large-scale pretrained foundation models has transformed computer vision, enabling strong performance across diverse downstream tasks. However, their potential for physics-based inverse problems, such as accelerated cardiac MRI reconstruction, remains largely underexplored. In this work, we investigate whether natural-domain foundation models can serve as effective image priors for accelerated cardiac MRI reconstruction, and compare the performance obtained against domain-specific counterparts such as BiomedCLIP. We propose an unrolled reconstruction framework that incorporates pretrained, frozen visual encoders, such as CLIP, DINOv2, and BiomedCLIP, within each cascade to guide the reconstruction process. Through extensive experiments, we show that while task-specific state-of-the-art reconstruction models such as E2E-VarNet achieve superior performance in standard in-distribution settings, foundation-model-based approaches remain competitive. More importantly, in challenging cross-domain scenarios, where models are trained on cardiac MRI and evaluated on anatomically distinct knee and brain datasets, foundation models exhibit improved robustness, particularly under high acceleration factors and limited low-frequency sampling. We further observe that natural-image-pretrained models, such as CLIP, learn highly transferable structural representations, while domain-specific pretraining (BiomedCLIP) provides modest additional gains in more ill-posed regimes. Overall, our results suggest that pretrained foundation models offer a promising source of transferable priors, enabling improved robustness and generalization in accelerated MRI reconstruction.
70. 【2604.22492】MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models
Link: https://arxiv.org/abs/2604.22492
Authors: Yunquan Chen, Haoyu Chen
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Understanding social dominance, Large Language Models, Understanding social, critical for neuroscience
Comments: 8 pages, 2 figures. Submitted to conference
Abstract:Understanding social dominance in animal behavior is critical for neuroscience and behavioral studies. In this work, we explore the capability of Multimodal Large Language Models (MLLMs) to analyze raw behavioral video of mice and predict their dominance hierarchy. We introduce MTT-Bench, a novel benchmark comprising annotated videos of pairwise mouse interactions for Mouse Tube Test analysis. Building on existing MLLM architectures, we fine-tune these models to perform zero-shot inference on unseen behavioral sequences, predicting social dominance without explicit labels during testing. Our framework demonstrates promising results, showing high agreement with tube test rankings. This work opens a new direction for applying foundation models to ethology and social behavior analysis, without the need to design domain-specific models.
71. 【2604.22351】Thermal background reduction for mid-infrared imaging by low-rank background and sparse point-source modelling
Link: https://arxiv.org/abs/2604.22351
Authors: R.A.R. Moens, A.G.M. Pietrow, B. Brandl, R. Van de Plas
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
Keywords: faces critical challenges, time-variable background noise, quantifying sources due, ground faces critical, faces critical
Comments:
Abstract:Mid-infrared astronomy from the ground faces critical challenges in accurately detecting and quantifying sources due to the dominant spatially and time-variable background noise. Moreover, chopping and nodding, the traditional methods for dealing with these background issues, will not be technically feasible on the next generation of extremely large telescopes. This limitation requires the development of novel computational methods for a robust background reduction. We present and evaluate a novel method named LOw-RAnk Background ELimination (LORABEL) to improve the sensitivity of mid-infrared astronomical observations, without the need for classical telescope nodding, source masking, or other overheads in observing time. We applied a low-rank background-reduction strategy to (1) data taken on the ground with the VISIR with synthetically injected sources, and (2) airborne data from SOFIA. We compared the performance of our new method to classical chopping and nodding techniques, and analysed the effect on source photometry and detection precision for different observational scenarios. In regimes with a low signal-to-noise ratio (S/N $5$) in the ground-based VISIR data, LORABEL reduces variation in the photometric error with respect to chopping differences alone and even the classical chop-nod sequence, at the cost of introducing a bias. Secondly, we demonstrate that LORABEL increases detection precision in comparison to traditional background-reduction methods. For the SOFIA dataset, we achieve a $20-100$ fold decrease in mean background flux with respect to the traditional chop-nod method while preserving most of the source flux. Our findings suggest that LORABEL is applicable to a wider range of instrumental observation, that is, both ground-based and airborne, and it is a suitable tool in the context of faint-source detection.
72. 【2604.22338】Selective Depthwise Separable Convolution for Lightweight Joint Source-Channel Coding in Wireless Image Transmission
Link: https://arxiv.org/abs/2604.22338
Authors: Ming Ye, Kui Cai, Cunhua Pan, Zhen Mei, Wanting Yang, Chunguo Li
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Depthwise separable convolutional, based joint source-channel, joint source-channel coding, reduce computational complexity, Depthwise separable
Comments: 5 pages, 6 figures, journal
Abstract:Depthwise separable convolutional (DSConv) layers have been successfully applied to deep learning (DL)-based joint source-channel coding (JSCC) schemes to reduce computational complexity. However, a systematic investigation of the layerwise and ratio-wise replacement of standard convolutional (Conv) layers with DSConv layers in JSCC systems for wireless image transmission remains largely unexplored. In this letter, we propose a configurable lightweight JSCC framework that incorporates a selective replacement strategy, enabling flexible substitution of standard Conv layers with DSConv layers at various layer positions and replacement ratios. By adjusting the proportion of layers replaced, we achieve different model compression levels and analyze their impact on reconstruction performance. Furthermore, we investigate how replacements at different encoder and decoder depths influence reconstruction quality under a fixed replacement ratio. Our results show that Conv-to-DSConv replacement at intermediate layers achieves a favorable complexity-performance trade-off, revealing layer-wise redundancy in DL-based JSCC systems. Extensive experiments further demonstrate that the proposed framework achieves substantial parameter reduction with only slight performance degradation, enabling flexible complexity-performance trade-offs for resource-constrained edge devices.
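The parameter saving behind the selective Conv-to-DSConv replacement in the abstract above follows from simple counting: a standard k×k convolution has k·k·C_in·C_out weights, while a depthwise separable one has k·k·C_in (depthwise) plus C_in·C_out (pointwise). The sketch below works through that arithmetic; the three-layer stack and the function names are hypothetical, bias terms are ignored, and none of the numbers come from the paper.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dsconv_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


def total_params(layers, replaced):
    """Total weights when the layer indices in `replaced` use DSConv."""
    total = 0
    for idx, (c_in, c_out, k) in enumerate(layers):
        if idx in replaced:
            total += dsconv_params(c_in, c_out, k)
        else:
            total += conv_params(c_in, c_out, k)
    return total


# Hypothetical encoder: (in_channels, out_channels, kernel_size) per layer.
layers = [(3, 64, 3), (64, 128, 3), (128, 256, 3)]

baseline = total_params(layers, replaced=set())       # all standard Conv
middle_only = total_params(layers, replaced={1})      # replace only the middle layer
everything = total_params(layers, replaced={0, 1, 2}) # replace every layer
```

Varying the `replaced` set mirrors the two axes the abstract studies: how many layers are swapped (the replacement ratio) and where in the network they sit (the layer position).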
73. 【2604.22212】Multimodal Diffusion to Mutually Enhance Polarized Light and Low Resolution EBSD Data
Link: https://arxiv.org/abs/2604.22212
Authors: Harry Dong, Timofey Efimov, Megna Shah, Jeff Simmons, Sean Donegan, Marc De Graef, Yuejie Chi
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: electron back-scattered diffraction, data collection process, EBSD data collection, electron back-scattered, back-scattered diffraction
Comments:
Abstract:In spite of the utility of 3-D electron back-scattered diffraction (EBSD) microscopy, the data collection process can be time-consuming with serial-sectioning. Hence, it is natural to look at other modalities, such as polarized light (PL) data, to accelerate EBSD data collection, supplemented with shared information. Complementarily, features in chaotic PL data could even be enriched with a handful of EBSD measurements. To inherently learn the complex dynamics between EBSD and PL to solve these inverse problems, we use an unconditional multimodal diffusion model, motivated by progress in diffusion models for inverse problems. Although trained solely on synthetic data once, our model has strong generalizable capabilities on real data which can be low-resolution, noisy, corrupted, and misregistered. With inference-time scaling, we show gains in performance on a variety of objectives including grain boundary prediction, super-resolution, and denoising. With our model, we demonstrate that there is little difference from full resolution performance with only 25% (1/4 the resolution) of EBSD data and corrupted PL data.
74. 【2604.21960】Conditional Diffusion Posterior Alignment for Sparse-View CT Reconstruction
Link: https://arxiv.org/abs/2604.21960
Authors: Luis Barba, Johannes Kirschner, Benjamin Bejar
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Computed Tomography, industrial applications, widely used imaging, imaging modality, modality in medical
Comments:
Abstract:Computed Tomography (CT) is a widely used imaging modality in medical and industrial applications. To limit radiation exposure and measurement time, there is a growing interest in sparse-view CT, where the number of projection views is significantly reduced. Deep neural networks have shown great promise in improving reconstruction quality in sparse-view CT, especially generative diffusion models. However, these methods struggle to scale to large 3D volumes due to several reasons: (i) the high memory and computational requirements of 3D models, (ii) the lack of large 3D training datasets, and (iii) the inconsistencies across slices when using 2D models independently on each slice. We overcome these limitations and scale diffusion-based sparse-view CT reconstruction to large 3D volumes by combining conditional diffusion with explicit data consistency. We propose Conditional Diffusion Posterior Alignment (CDPA) to enable scalable 3D sparse-view CT reconstruction. A 2D U-Net diffusion model is conditioned on an initial 3D reconstruction to improve inter-slice consistency, combined with data-consistency alignment to match measured projections. Experiments on synthetic and real Cone Beam CT (CBCT) data show state-of-the-art performance, with ablations that confirm the synergistic effects of the proposed pipeline. Finally, we show that the same principles also strengthen fast denoising U-Nets, yielding near-diffusion quality at a fraction of the computational cost.

