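This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

663 papers were updated today, including:

  • Natural language processing: 106
  • Information retrieval: 16
  • Computer vision: 136

Natural Language Processing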

1. 【2604.01220】Universal YOCO for Efficient Depth Scaling

Link: https://arxiv.org/abs/2604.01220

Authors: Yutao Sun, Li Dong, Tianzhu Ye, Shaohan Huang, Jianyong Wang, Furu Wei

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, proficiency of Large, Language Models, rise of test-time

Comments: none

Abstract:The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion enhances representational depth with limited overhead. Together, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results confirm that YOCO-U remains highly competitive in general and long-context benchmarks, demonstrating that the integration of efficient-attention architectures and recursive computation is a promising direction for scalable LLMs.

2. 【2604.01212】$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Link: https://arxiv.org/abs/2604.01212

Authors: Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi, Sachin Patro, Nazneen Rajani

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: early mistakes compound, LLM agents tackle, tackle increasingly complex, agents tackle increasingly, maintain strategic coherence

Comments: 16 pages, 10 figures

Abstract:As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce $\texttt{YC-Bench}$, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of \$200K, with Claude Opus 4.6 achieving the highest average final funds at \$1.27 M, followed by GLM-5 at \$1.21 M at 11$\times$ lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for $47\%$ of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes such as over-parallelization, demonstrating the capability gaps for long-horizon performance. $\texttt{YC-Bench}$ is open-source, reproducible, and configurable.

3. 【2604.01206】LLM REgression with a Latent Iterative State Head

Link: https://arxiv.org/abs/2604.01206

Authors: Yiheng Su, Matthew Lease

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Latent Iterative State, lightweight architecture designed, Iterative State Head, Latent Iterative, Iterative State

Comments: none

Abstract:We present RELISH (REgression with a Latent Iterative State Head), a novel, lightweight architecture designed for text regression with large language models. Rather than decoding numeric targets as text or aggregating multiple generated outputs, RELISH predicts scalar values directly from frozen LLM representations by iteratively refining a learned latent state through cross-attention over token-level representations, and then mapping the final state to a point estimate with a linear regressor. Across five datasets, four LLM backbones, and two LLM training regimes, RELISH consistently outperforms prior baselines from all three major LLM regression families, including autoregressive decoding, regression-aware inference, and existing predictive head methods. Despite these gains, RELISH remains highly parameter-efficient, requiring only 3.4-3.7M trainable parameters across frozen LLM backbones (only 0.01-0.04% additional overhead), far less than LoRA-based alternatives that grow with model size (0.26-0.42%).
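
The refinement loop described above (a latent state repeatedly cross-attending over token representations, followed by a linear readout) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the residual update, the absence of learned query/key/value projections, and the name `relish_like_head` are all hypothetical.

```python
import math

def relish_like_head(token_reps, latent, w, b, num_iters=3):
    """Refine a latent state by cross-attending over token vectors, then apply a linear readout.

    Assumptions (not from the paper): no learned projections, residual update.
    """
    d = len(latent)
    state = list(latent)
    for _ in range(num_iters):
        # Scaled dot-product scores of the current state (query) against each token (key).
        scores = [sum(s * h for s, h in zip(state, tok)) / math.sqrt(d) for tok in token_reps]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in scores]
        total = sum(exps)
        # Attention-weighted average of the token vectors (values).
        context = [sum(e / total * tok[j] for e, tok in zip(exps, token_reps)) for j in range(d)]
        # Residual update of the latent state.
        state = [s + c for s, c in zip(state, context)]
    # Linear regressor mapping the final state to a scalar point estimate.
    return sum(wi * si for wi, si in zip(w, state)) + b

# Two identical (frozen) token representations, a zero initial state, and a toy readout.
prediction = relish_like_head([[1.0, 0.0], [1.0, 0.0]], [0.0, 0.0], w=[2.0, 0.0], b=0.5)
```

In this toy setup only `latent`, `w`, and `b` would be trainable, which mirrors the abstract's point that the backbone stays frozen.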

4. 【2604.01195】ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Link: https://arxiv.org/abs/2604.01195

Authors: Nandan Thakur, Zijian Chen, Xueguang Ma, Jimmy Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: none

Comments: none

Abstract: not available

5. 【2604.01193】Embarrassingly Simple Self-Distillation Improves Code Generation

Link: https://arxiv.org/abs/2604.01193

Authors: Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang

Categories: Computation and Language (cs.CL)

Keywords: large language model, raw outputs, reinforcement learning, large language, LLM code generation

Comments: none

Abstract:Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.
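
The sampling step ("certain temperature and truncation configurations") can be illustrated with a small sketch of how temperature and truncation reshape a next-token distribution. The abstract does not say which truncation scheme is used, so top-k here is an assumption, and `temperature_top_k` is a hypothetical helper rather than the authors' code.

```python
import math

def temperature_top_k(logits, temperature=1.0, k=None):
    """Return sampling probabilities after temperature scaling and top-k truncation."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if k is not None and k < 1:
        raise ValueError("k must be at least 1")
    scaled = [x / temperature for x in logits]
    indices = range(len(scaled))
    # Keep only the k highest-scoring tokens (all tokens when k is None).
    if k is None:
        keep = set(indices)
    else:
        keep = set(sorted(indices, key=lambda i: scaled[i], reverse=True)[:k])
    m = max(scaled[i] for i in keep)  # subtract the max for numerical stability
    weights = [math.exp(scaled[i] - m) if i in keep else 0.0 for i in indices]
    total = sum(weights)
    return [wt / total for wt in weights]

# Truncation removes the low-probability tail; temperature sharpens what remains.
probs = temperature_top_k([2.0, 1.0, 0.0, -1.0], temperature=0.7, k=2)
```

Samples drawn from distributions like `probs` would then serve as the fine-tuning targets in the supervised step the abstract describes.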

6. 【2604.01181】True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

Link: https://arxiv.org/abs/2604.01181

Authors: Graziano Blasilli, Marco Angelini

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal Large Language, Large Language Models, interpret misleading visualizations, misleading visualizations, Large Language

Comments: none

Abstract:This study investigates the ability of multimodal Large Language Models (LLMs) to identify and interpret misleading visualizations, and recognize these observations along with their underlying causes and potential intentionality. Our analysis leverages concepts from visualization rhetoric and a newly developed taxonomy of authorial intents as explanatory lenses. We formulated three research questions and addressed them experimentally using a dataset of 2,336 COVID-19-related tweets, half of which contain misleading visualizations, and supplemented it with real-world examples of perceptual, cognitive, and conceptual errors drawn from VisLies, the IEEE VIS community event dedicated to showcasing deceptive and misleading visualizations. To ensure broad coverage of the current LLM landscape, we evaluated 16 state-of-the-art models. Among them, 15 are open-weight models, spanning a wide range of model sizes, architectural families, and reasoning capabilities. The selection comprises small models, namely Nemotron-Nano-V2-VL (12B parameters), Mistral-Small-3.2 (24B), DeepSeek-VL2 (27B), Gemma3 (27B), and GTA1 (32B); medium-sized models, namely Qianfan-VL (70B), Molmo (72B), GLM-4.5V (108B), LLaVA-NeXT (110B), and Pixtral-Large (124B); and large models, namely Qwen3-VL (235B), InternVL3.5 (241B), Step3 (321B), Llama-4-Maverick (400B), and Kimi-K2.5 (1000B). In addition, we employed OpenAI GPT-5.4, a frontier proprietary model. To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive rhetorical techniques and the authorial intentions behind the same misleading visualizations. This allows comparison between model and expert behavior, revealing similarities and differences that provide insights into where LLMs align with human judgment and where they diverge.

7. 【2604.01178】Screening Is Enough

Link: https://arxiv.org/abs/2604.01178

Authors: Ken M. Nakanishi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: fixed unit mass, standard softmax attention, core limitation, limitation of standard, standard softmax

Comments: 21 pages, 13 figures

Abstract:A core limitation of standard softmax attention is that it does not define a notion of absolute query--key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys according to their relative scores. As a result, relevance is defined only relative to competing keys, and irrelevant keys cannot be explicitly rejected. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query--key relevance. Instead of redistributing attention across all keys, screening evaluates each key against an explicit threshold, discarding irrelevant keys and aggregating the remaining keys, thereby removing global competition among keys. Across experiments, Multiscreen achieves comparable validation loss with approximately 40% fewer parameters than a Transformer baseline, enables stable optimization at substantially larger learning rates, maintains strong performance in long-context perplexity, shows little to no degradation in retrieval performance even far beyond the training context length, and reduces inference latency by up to 3.2$\times$ at 100K context length.
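
The contrast between relative and absolute relevance can be shown with a toy example. This sketch is an illustration under assumptions, not the Multiscreen mechanism itself: the hard threshold, the unnormalized sum over surviving keys, and the names `softmax_attend` and `screen_attend` are hypothetical.

```python
import math

def softmax_attend(scores, values):
    """Softmax attention: weights always sum to 1, so some key always receives mass."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))

def screen_attend(scores, values, threshold):
    """Toy screening: each key is judged against an absolute threshold, independently of the others."""
    kept = [v for s, v in zip(scores, values) if s > threshold]
    return sum(kept)  # 0 when every key is rejected

# Every key scores poorly, i.e. none is relevant to the query.
scores = [-5.0, -6.0, -5.5]
values = [1.0, 2.0, 3.0]

soft_out = softmax_attend(scores, values)        # still a blend of the irrelevant values
screen_out = screen_attend(scores, values, 0.0)  # all keys rejected
```

Softmax still returns a mixture of the values even though no key is relevant, while the thresholded variant can return nothing at all.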

8. 【2604.01170】Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

Link: https://arxiv.org/abs/2604.01170

Authors: Cai Zhou, Zekai Wang, Menghua Wu, Qianyu Julie Zhu, Flora C. Shi, Chenyu Wang, Ashia Wilson, Tommi Jaakkola, Stephen Bates

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Applications (stat.AP); Machine Learning (stat.ML)

Keywords: exorbitant compute costs, solve highly difficult, enabled large language, highly difficult tasks, large language models

Comments: 20 pages

Abstract:While test-time scaling has enabled large language models to solve highly difficult tasks, state-of-the-art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post-trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test-time training. Specifically, we introduce a meta-learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g. in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level $\delta=0.1$, ORCA improves Qwen2.5-32B efficiency on in-distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self-consistency labels. Under zero-shot out-of-domain settings, it improves MATH-500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at this https URL.
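
For readers unfamiliar with conformal prediction, the static baseline that ORCA is compared against can be approximated by the standard split-conformal threshold. The sketch below is generic split conformal calibration, not ORCA's meta-learned, per-input procedure; the function name and the calibration scores are hypothetical.

```python
import math

def conformal_threshold(cal_scores, delta):
    """Split-conformal quantile: the ceil((n + 1) * (1 - delta))-th smallest calibration score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - delta))
    if rank > n:
        return math.inf  # too few calibration points to certify this risk level
    return sorted(cal_scores)[rank - 1]

# 19 hypothetical nonconformity scores from a held-out calibration set.
cal_scores = list(range(1, 20))
tau = conformal_threshold(cal_scores, delta=0.1)
```

A new sample whose score is at most `tau` would be accepted at risk level `delta`; this guarantee assumes calibration and test data are exchangeable, which is the assumption that distribution shift breaks and that the abstract's test-time updates are meant to address.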

9. 【2604.01168】S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

Link: https://arxiv.org/abs/2604.01168

Authors: Jack Young

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: HumanEval training solutions, execution-verified HumanEval training, single initial state, execution-verified HumanEval, initial state matrix

Comments: 15 pages (10 main + 5 appendix), 3 figures, code at https://github.com/jackyoung27/s0-tuning

Abstract:Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but with per-step inference cost. Taken together, the results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.

10. 【2604.01152】Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

Link: https://arxiv.org/abs/2604.01152

Authors: Mohammad R. Abu Ayyash

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: shared frozen base, continual multi-domain fine-tuning, large language models, present Brainstacks, frozen adapter stacks

Comments: 26 pages, 13 figures, 4 tables

Abstract:We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA scaling; (2) an inner loop performing residual boosting by freezing trained stacks and adding new ones; (3) an outer loop training sequential domain-specific stacks with curriculum-ordered dependencies; (4) null-space projection via randomized SVD constraining new stacks to subspaces orthogonal to prior directions, achieving zero forgetting in isolation; (5) an outcome-based sigmoid meta-router trained on empirically discovered domain-combination targets that selectively weights stacks, enabling cross-domain composition. Two boundary experiments: (6) PSN pretraining on a randomly initialized model; (7) per-domain RL (DPO/GRPO) validating compatibility with post-SFT alignment. Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks), MoE-LoRA achieves 2.5x faster convergence than parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers generation quality destroyed by ungated stack accumulation. The central finding: the outcome-based router discovers that domain stacks encode transferable cognitive primitives (instruction-following clarity, numerical reasoning, procedural logic, chain-of-thought structure) rather than domain-specific knowledge, with medical prompts routing to chat+math stacks in 97% of cases despite zero medical data in those stacks.

11. 【2604.01128】Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers

Link: https://arxiv.org/abs/2604.01128

Authors: Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki, Kiyoharu Aizawa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: modern coding agents, systematic evaluation framework, evaluation, paper, written by modern

Comments: Project Page: https://agent4science-utokyo.github.io/PaperRecon_HP/

Abstract:This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI-driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI-written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (this http URL) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal additional resources, and the result is subsequently compared against the original paper. PaperRecon disentangles the evaluation of the AI-written papers into two orthogonal dimensions, Presentation and Hallucination, where Presentation is evaluated using a rubric and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025. Our experiments reveal a clear trade-off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI-driven paper writing and improving the understanding of its risks within the research community.

12. 【2604.01113】CARE: Privacy-Compliant Agentic Reasoning with Evidence Discordance

Link: https://arxiv.org/abs/2604.01113

Authors: Haochen Liu, Weien Li, Rui Song, Zeyu Li, Chun Jason Xue, Xiao-Yang Liu, Sam Nallaperuma, Xue Liu, Ye Yuan

Categories: Computation and Language (cs.CL)

Keywords: Large language model, typically perform worse, Large language, language model, systems are increasingly

Comments: Preprint

Abstract:Large language model (LLM) systems are increasingly used to support high-stakes decision-making, but they typically perform worse when the available evidence is internally inconsistent. Such a scenario exists in real-world healthcare settings, with patient-reported symptoms contradicting medical signs. To study this problem, we introduce MIMIC-DOS, a dataset for short-horizon organ dysfunction worsening prediction in the intensive care unit (ICU) setting. We derive this dataset from the widely recognized MIMIC-IV, a publicly available electronic health record dataset, and construct it exclusively from cases in which discordance between signs and symptoms exists. This setting poses a substantial challenge for existing LLM-based approaches, with single-pass LLMs and agentic pipelines often struggling to reconcile such conflicting signals. To address this problem, we propose CARE: a multi-stage privacy-compliant agentic reasoning framework in which a remote LLM provides guidance by generating structured categories and transitions without accessing sensitive patient data, while a local LLM uses these categories and transitions to support evidence acquisition and final decision-making. Empirically, CARE achieves stronger performance across all key metrics compared to multiple baseline settings, showing that CARE can more robustly handle conflicting clinical evidence while preserving privacy.

13. 【2604.01094】Temporal Dependencies in In-Context Learning: The Role of Induction Heads

Link: https://arxiv.org/abs/2604.01094

Authors: Anooshka Bajaj, Deven Mahesh Mistry, Sahaj Singh Maini, Yash Aggarwal, Billy Dickson, Zoran Tiganj

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, exhibit strong in-context, context remains underexplored, exhibit strong

Comments: none

Abstract:Large language models (LLMs) exhibit strong in-context learning capabilities, but how they track and retrieve information from context remains underexplored. Drawing on the free recall paradigm in cognitive science (where participants recall list items in any order), we show that several open-source LLMs consistently display a serial-recall-like pattern, assigning peak probability to tokens that immediately follow a repeated token in the input sequence. Through systematic ablation experiments, we show that induction heads, specialized attention heads that attend to the token following a previous occurrence of the current token, play an important role in this phenomenon. Removing heads with a high induction score substantially reduces the +1 lag bias, whereas ablating random heads does not reproduce the same reduction. We also show that removing heads with high induction scores impairs the performance of models prompted to do serial recall using few-shot learning to a larger extent than removing random heads. Our findings highlight a mechanistically specific connection between induction heads and temporal context processing in transformers, suggesting that these heads are especially important for ordered retrieval and serial-recall-like behavior during in-context learning.
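
The "+1 lag" pattern attributed to induction heads can be made concrete with a small symbolic sketch: given a sequence whose last token has appeared before, the induction-style prediction is the token that followed the earlier occurrence. This is an idealized rule for illustration only, not a model of real attention heads; `induction_prediction` is a hypothetical name.

```python
def induction_prediction(tokens):
    """Return the token that followed the most recent earlier occurrence of the last token."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, excluding the current one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # lag +1 relative to the matched position
    return None  # the current token has not appeared before

# "river" repeats, so the rule predicts the token that came right after its first occurrence.
sequence = ["apple", "river", "stone", "cloud", "river"]
predicted = induction_prediction(sequence)
```

A model that assigns peak probability to `predicted` shows the serial-recall-like bias the abstract describes.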

14. 【2604.01073】Narrative Fingerprints: Multi-Scale Author Identification via Novelty Curve Dynamics

Link: https://arxiv.org/abs/2604.01073

Authors: Fred Zimmerman, Hilmar AI

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: information-theoretic novelty curves, published works, book level, authors, qualifying authors

Comments: 12 pages, 6 figures, 4 tables

Abstract:We test whether authors have characteristic "fingerprints" in the information-theoretic novelty curves of their published works. Working with two corpora -- Books3 (52,796 books, 759 qualifying authors) and PG-19 (28,439 books, 1,821 qualifying authors) -- we find that authorial voice leaves measurable traces in how novelty unfolds across a text. The signal is multi-scale: at book level, scalar dynamics (mean novelty, speed, volume, circuitousness) identify 43% of authors significantly above chance; at chapter level, SAX motif patterns in sliding windows achieve 30x-above-chance attribution, far exceeding the scalar features that dominate at book level. These signals are complementary, not redundant. We show that the fingerprint is partly confounded with genre but persists within-genre for approximately one-quarter of authors. Classical authors (Twain, Austen, Kipling) show fingerprints comparable in strength to modern authors, suggesting the phenomenon is not an artifact of contemporary publishing conventions.

15. 【2604.01029】Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

Link: https://arxiv.org/abs/2604.01029

Authors: Jingjie Ning, Xueqi Li, Chengyu Yu

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: genuine error correction, error correction, reviews and improves, widely assumed, assumed to derive

Comments: none

Abstract:Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic, but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Ultimately, our findings demonstrate that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, necessitating more targeted pipeline designs rather than blanket revision strategies.

16. 【2604.00997】Uncertainty-Aware Variational Reward Factorization via Probabilistic Preference Bases for LLM Personalization

Link: https://arxiv.org/abs/2604.00997

Authors: Gyuseok Lee, Wonbin Kweon, Zhenrui Yue, SeongKu Kang, Jiawei Han, Dong Wang

Categories: Computation and Language (cs.CL)

Keywords: large language models, personalizes large language, factorization personalizes large, Reward factorization personalizes, shared basis functions

Comments: none

Abstract:Reward factorization personalizes large language models (LLMs) by decomposing rewards into shared basis functions and user-specific weights. Yet, existing methods estimate user weights from scarce data in isolation and as deterministic points, leading to inaccurate and unreliable inference. We introduce Variational Reward Factorization (VRF), an uncertainty-aware framework that represents each user's preferences as a variational distribution in a shared preference space. VRF infers user distributions via a variational encoder, derives weights through Wasserstein distance matching with shared probabilistic bases, and downweights uncertain estimates through a variance-attenuated loss. On three benchmarks, VRF outperforms all baselines across seen and unseen users, few-shot scenarios, and varying uncertainty levels, with gains extending to downstream alignment.

17. 【2604.00994】Multimodal Analysis of State-Funded News Coverage of the Israel-Hamas War on YouTube Shorts

Link: https://arxiv.org/abs/2604.00994

Authors: Daniel Miehling, Sandra Kuebler

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Keywords: format remains limited, remains limited, format remains, YouTube Shorts, geopolitical events

Comments: none

Abstract:YouTube Shorts have become central to news consumption on the platform, yet research on how geopolitical events are represented in this format remains limited. To address this gap, we present a multimodal pipeline that combines automatic transcription, aspect-based sentiment analysis (ABSA), and semantic scene classification. The pipeline is first assessed for feasibility and then applied to analyze short-form coverage of the Israel-Hamas war by state-funded outlets. Using over 2,300 conflict-related Shorts and more than 94,000 visual frames, we systematically examine war reporting across major international broadcasters. Our findings reveal that the sentiment expressed in transcripts regarding specific aspects differs across outlets and over time, whereas scene-type classifications reflect visual cues consistent with real-world events. Notably, smaller domain-adapted models outperform large transformers and even LLMs for sentiment analysis, underscoring the value of resource-efficient approaches for humanities research. The pipeline serves as a template for other short-form platforms, such as TikTok and Instagram, and demonstrates how multimodal methods, combined with qualitative interpretation, can characterize sentiment patterns and visual cues in algorithmically driven video environments.

18. 【2604.00986】Do Phone-Use Agents Respect Your Privacy?

Link: https://arxiv.org/abs/2604.00986

Authors: Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu, Junyi Li, Can Xu, Chengquan Zhang, Han Hu, Ming Yan, Benyou Wang

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: phone-use agents respect, phone-use agents, agents respect privacy, completing benign mobile, agents

Comments: work in progress

Abstract:We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at this https URL.

19. 【2604.00979】Dual Optimal: Make Your LLM Peer-like with Dignity

Link: https://arxiv.org/abs/2604.00979

Authors: Xiangqi Wang, Yue Huang, Haomin Zhuang, Kehan Guo, Xiangliang Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Current aligned language, Evasive Servant, dual failure mode, sycophantically validate flawed, validate flawed user

Comments: none

Abstract:Current aligned language models exhibit a dual failure mode we term the Evasive Servant: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers. We propose the Dignified Peer framework, which counters servility with anti-sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. Realizing this agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias. We address these issues by introducing the PersonaKnob dataset, which features a compositional partial order structure of multiple persona preferences. This data is utilized alongside a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. Additionally, we employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle latent model persona capability from confounders like judge biases. Extensive empirical studies demonstrate that our approach successfully builds an LLM agent with both dignity and peer.

20. 【2604.00947】Phase transition on a context-sensitive random language model with short range interactions

链接https://arxiv.org/abs/2604.00947

作者:Yuma Toji,Jun Takahashi,Vwani Roychowdhury,Hideyuki Miyahara

类目:Computation and Language (cs.CL); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (stat.ML)

关键词:Phys. Rev. Lett, Phys. Rev., Rev. Lett, language, Phys.

备注

点击查看摘要

Abstract:Since the random language model was proposed by E. DeGiuli [Phys. Rev. Lett. 122, 128301], language models have been investigated intensively from the viewpoint of statistical mechanics. Recently, the existence of a Berezinskii–Kosterlitz–Thouless transition was numerically demonstrated in models with long-range interactions between symbols. In statistical mechanics, it has long been known that long-range interactions can induce phase transitions. Therefore, it has remained unclear whether phase transitions observed in language models originate from genuinely linguistic properties that are absent in conventional spin models. In this study, we construct a random language model with short-range interactions and numerically investigate its statistical properties. Our model belongs to the class of context-sensitive grammars in the Chomsky hierarchy and allows explicit reference to contexts. We find that a phase transition occurs even when the model refers only to contexts whose length remains constant with respect to the sentence length. This result indicates that finite-temperature phase transitions in language models are genuinely induced by the intrinsic nature of language, rather than by long-range interactions.

21. 【2604.00923】Positional Cognitive Specialization: Where Do LLMs Learn To Comprehend and Speak Your Language?

链接https://arxiv.org/abs/2604.00923

作者:Luis Frentzen Salim,Lun-Wei Ku,Hsing-Kuo Kenneth Pao

类目:Computation and Language (cs.CL)

关键词:Adapting large language, Adapting large, large language models, expensive and opaque, Adapting

备注: Accepted to AAAI26 Main

点击查看摘要

Abstract:Adapting large language models (LLMs) to new languages is an expensive and opaque process. Understanding how language models acquire new languages and multilingual abilities is key to achieving efficient adaptation. Prior work on multilingual interpretability focuses primarily on how trained models process multilingual instructions, leaving unexplored the mechanisms through which they acquire new languages during training. We investigate these training dynamics on decoder-only transformers through the lens of two functional cognitive specializations: language perception (input comprehension) and production (output generation). Through experiments on low-resource languages, we demonstrate how perceptual and productive specialization emerges in different regions of a language model by running layer ablation sweeps from the model's input and output directions. Based on the observed specialization patterns, we propose CogSym, a layer-wise heuristic that enables effective adaptation by exclusively fine-tuning a few early and late layers. We show that tuning only the 25% outermost layers achieves downstream task performance within 2-3% deviation from the full fine-tuning baseline. CogSym yields consistent performance with adapter methods such as LoRA, showcasing generalization beyond full fine-tuning. These findings provide insights to better understand how LLMs learn new languages and push toward accessible and inclusive language modeling.
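
To make the "25% outermost layers" idea concrete, here is a minimal sketch of how such a layer subset could be picked. The function name `outermost_layers` and the even split between the input side and the output side are assumptions made for illustration; the abstract does not state how CogSym divides the budget between early and late layers.

```python
def outermost_layers(num_layers: int, fraction: float = 0.25) -> list[int]:
    """Return indices of the outermost layers of a stack of `num_layers` layers.

    Assumption (not from the abstract): the budget is split as evenly as
    possible between the earliest and the latest layers.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # At least one layer per end where possible, never more than the stack has.
    budget = min(num_layers, max(2, round(num_layers * fraction)))
    early = budget // 2
    late = budget - early
    return list(range(early)) + list(range(num_layers - late, num_layers))


# Example: a 32-layer decoder with a 25% budget.
print(outermost_layers(32))
```

In a fine-tuning setup, the returned indices would mark the layers left trainable while every other layer stays frozen.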

22. 【2604.00920】GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training

链接https://arxiv.org/abs/2604.00920

作者:Jesse van Oort,Frank Brinkkemper,Erik de Graaf,Bram Vanroy,Saskia Lensink

类目:Computation and Language (cs.CL)

关键词:GPT-NL Public Corpus, GPT-NL Public, biggest permissively licensed, Public Corpus, permissively licensed corpus

备注: Accepted at LREC 2026

点击查看摘要

Abstract:We present the GPT-NL Public Corpus, the biggest permissively licensed corpus of Dutch language resources. The GPT-NL Public Corpus contains 21 Dutch-only collections totalling 36B preprocessed Dutch tokens not present in any other LLM pretraining corpus. Additionally, the corpus includes roughly 207B English, 232B Code, and 48B German/Danish tokens taken from existing sets which we further curated for compliance. This corpus includes curated data from large existing corpora like Common Corpus and Common Crawl, as well as newly created Dutch-specific collections. Most newly created Dutch collections consist of content collected in collaboration with organisations or synthetically augmented content. All data is collected and evaluated with the aim of facilitating the creation of (commercial) language models that are lawful, useful and non-harmful. All data included in the GPT-NL Public Corpus is sourced from datasets with permissive licensing and is curated and redistributed under a CC-BY license. The full dataset is publicly available on the Hugging Face Hub.

23. 【2604.00913】Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

链接https://arxiv.org/abs/2604.00913

作者:Zhuchenyang Liu,Yao Zhang,Yu Xiao

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:detect errors, monitor progress, intelligent assistants, assembly diagrams, Abstract

备注

点击查看摘要

Abstract:2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: this https URL

24. 【2604.00892】When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation

链接https://arxiv.org/abs/2604.00892

作者:Henry Peng Zou,Chunyu Miao,Wei-Chieh Huang,Yankai Chen,Yue Zhou,Hanrong Zhang,Yaozu Wu,Liancheng Fang,Zhengyao Gu,Zhen Zhang,Kening Zheng,Fangxin Wang,Yi Nian,Shanghao Li,Wenzhe Fan,Langzhou He,Weizhi Zhang,Xue Liu,Philip S. Yu

类目:Computation and Language (cs.CL)

关键词:static problem solving, static problem, executing complex, dynamic environments, revising goals

备注

点击查看摘要

Abstract:As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution is becoming a core requirement for realistic deployment. However, existing benchmarks largely assume uninterrupted agent behavior or study interruptions only in short, unconstrained language tasks. In this paper, we present the first systematic study of interruptible agents in long-horizon, environmentally grounded web navigation tasks, where actions induce persistent state changes. We formalize three realistic interruption types, including addition, revision, and retraction, and introduce InterruptBench, a benchmark derived from WebArena-Lite that synthesizes high-quality interruption scenarios under strict semantic constraints. Using a unified interruption simulation framework, we evaluate six strong LLM backbones across single- and multi-turn interruption settings, analyzing both their effectiveness in adapting to updated intents and their efficiency in recovering from mid-task changes. Our results show that handling user interruptions effectively and efficiently during long-horizon agentic tasks remains challenging for powerful large-scale LLMs. Code and dataset are available at this https URL.

25. 【2604.00890】Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language Models

链接https://arxiv.org/abs/2604.00890

作者:Md. Abu Bakor Siddique,Shahrin Hossain,Sadman Ahmed Siam,Syed Rifat Raiyan,Hasan Mahmud,Md Kamrul Hasan

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Geometric Problem Solving, Geometric Problem, enhancing mathematical reasoning, large language models, heart of enhancing

备注: Under review, 4 figures, 7 tables

点击查看摘要

Abstract:Geometric Problem Solving (GPS) remains at the heart of enhancing mathematical reasoning in large language models because it requires the combination of diagrammatic understanding, symbolic manipulation and logical inference. In existing literature, researchers have chiefly focused on synchronising the diagram descriptions with text literals and solving the problem. In this vein, they have taken either a neural, symbolic or neuro-symbolic approach. But this addresses only the first two of the requirements, namely diagrammatic understanding and symbolic manipulation, while leaving logical inference underdeveloped. The logical inference is often limited to one chain-of-thought (CoT). To address this weakness in existing models, this paper proposes MARS-GPS, which generates multiple parallel reasoning rollouts augmented with Python code execution for numerical verification, ranks them using token-level entropy as a confidence signal, and aggregates answers through a multi-stage voting and self-verification pipeline. Empirical results show that MARS-GPS with 8 parallel rollouts achieves 88.8% on Geometry3K, a nearly 11% improvement over the prior state-of-the-art, with accuracy scaling consistently as the number of rollouts increases from 1 to 16 (+6.0% on ablation subset). We provide our code and data in an anonymous repository: this https URL.
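
The ranking-then-voting idea can be illustrated with a small sketch. It is a simplification under stated assumptions, not the MARS-GPS pipeline: each rollout is assumed to come with per-token probability distributions, confidence is taken as the mean Shannon entropy of those distributions, and a single filter-then-majority-vote step stands in for the paper's multi-stage voting and self-verification. The names `mean_token_entropy`, `vote`, and the `keep` parameter are hypothetical.

```python
import math
from collections import Counter


def mean_token_entropy(token_dists):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0) for dist in token_dists
    ]
    return sum(entropies) / len(entropies)


def vote(rollouts, keep=4):
    """Pick an answer from `rollouts`, a list of (answer, token_dists) pairs.

    Rollouts are ranked by mean token entropy (lower means more confident),
    the `keep` most confident ones are retained, and the most frequent
    answer among them wins.
    """
    ranked = sorted(rollouts, key=lambda r: mean_token_entropy(r[1]))
    counts = Counter(answer for answer, _ in ranked[:keep])
    return counts.most_common(1)[0][0]


# Toy example: two confident rollouts answer "12", three uncertain ones answer "15".
confident = [[0.99, 0.01], [0.98, 0.02]]
uncertain = [[0.5, 0.5], [0.6, 0.4]]
rollouts = [
    ("15", uncertain), ("12", confident), ("15", uncertain),
    ("12", confident), ("15", uncertain),
]
print(vote(rollouts, keep=3))
```

With `keep=3` the low-entropy rollouts dominate the vote, whereas a plain majority over all five rollouts would favor the less confident answer.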

26. 【2604.00886】PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding

链接https://arxiv.org/abs/2604.00886

作者:Nan Wang,Zhiwei Jin,Chen Chen,Haonan Lu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:heavy computational burden, impose exceptionally heavy, exceptionally heavy computational, elements demand high-resolution, demand high-resolution inputs

备注

点击查看摘要

Abstract:Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose an exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful: across document and GUI benchmarks, only 22-71% of image patches are pixel-unique, the rest being exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches before the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports pixel-lossless compression ($\tau = 0$) as well as controlled lossy compression ($\tau > 0$). Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2$\times$ inference speedup and 1.9$\times$ training acceleration. Code is available at this https URL.
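
The notion of "pixel-unique" patches can be sketched in a few lines. The example below only performs exact-duplicate removal on non-overlapping tiles of a grayscale image stored as nested lists; it is not the predictive-coding scheme the abstract refers to, and the helper names `split_patches` and `dedup_patches` are hypothetical.

```python
def split_patches(image, patch):
    """Yield (row, col, pixels) for non-overlapping patch x patch tiles of a 2D list."""
    height, width = len(image), len(image[0])
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            pixels = tuple(
                tuple(image[top + dy][left:left + patch]) for dy in range(patch)
            )
            yield top // patch, left // patch, pixels


def dedup_patches(image, patch):
    """Keep the first occurrence of each distinct patch.

    Returns the grid positions of the kept patches and the fraction of
    patches that are pixel-unique.
    """
    first_seen = {}
    total = 0
    for row, col, pixels in split_patches(image, patch):
        total += 1
        first_seen.setdefault(pixels, (row, col))
    kept = sorted(first_seen.values())
    return kept, len(kept) / total


# Toy example: a blank 4x4 "page" with a single dark pixel in one corner.
page = [[0] * 4 for _ in range(4)]
page[3][3] = 255
kept, unique_ratio = dedup_patches(page, patch=2)
print(kept, unique_ratio)
```

Because the duplicates are dropped before any encoder runs, both the vision encoder and the language model downstream would see fewer tokens, which is the effect the abstract attributes to pruning in pixel space.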

27. 【2604.00878】KUET at StanceNakba Shared Task: StanceMoE: Mixture-of-Experts Architecture for Stance Detection

链接https://arxiv.org/abs/2604.00878

作者:Abdullah Al Shafi,Md. Milon Islam,Sk. Imran Hossain,K. M. Azharul Hasan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:author expressed position, Actor-level stance detection, geopolitical actors mentioned, specific geopolitical actors, stance detection aims

备注: Accepted for workshop proceedings of the 15th International Conference on Language Resources and Evaluation (LREC'26)

点击查看摘要

Abstract:Actor-level stance detection aims to determine an author's expressed position toward specific geopolitical actors mentioned or implicated in a text. Although transformer-based models have achieved relatively good performance in stance classification, they typically rely on unified representations that may not sufficiently capture heterogeneous linguistic signals, such as contrastive discourse structures, framing cues, and salient lexical indicators. This motivates the need for adaptive architectures that explicitly model diverse stance-expressive patterns. In this paper, we propose StanceMoE, a context-enhanced Mixture-of-Experts (MoE) architecture built upon a fine-tuned BERT encoder for actor-level stance detection. Our model integrates six expert modules designed to capture complementary linguistic signals, including global semantic orientation, salient lexical cues, clause-level focus, phrase-level patterns, framing indicators, and contrast-driven discourse shifts. A context-aware gating mechanism dynamically weights expert contributions, enabling adaptive routing based on input characteristics. Experiments are conducted on the StanceNakba 2026 Subtask A dataset, comprising 1,401 annotated English texts where the target actor is implicit in the text. StanceMoE achieves a macro-F1 score of 94.26%, outperforming traditional baselines and alternative BERT-based variants.
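
A gating mechanism that weights expert contributions is commonly realized as a softmax over per-expert gate scores followed by a weighted sum of expert outputs. The sketch below shows that generic pattern only; whether StanceMoE uses exactly this form is an assumption, and the names `softmax` and `mix_experts` as well as the toy numbers are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_logits, gate_scores):
    """Combine per-expert class logits with softmax gate weights.

    expert_logits: one list of class logits per expert.
    gate_scores: one unnormalized score per expert, e.g. computed from the input.
    """
    weights = softmax(gate_scores)
    num_classes = len(expert_logits[0])
    return [
        sum(w * logits[c] for w, logits in zip(weights, expert_logits))
        for c in range(num_classes)
    ]


# Toy example: two experts that disagree, with the gate favoring the second one.
mixed = mix_experts([[2.0, 0.0], [0.0, 2.0]], gate_scores=[0.0, 1.0])
print(mixed)
```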

28. 【2604.00835】Agentic Tool Use in Large Language Models

链接https://arxiv.org/abs/2604.00835

作者:Jinchao Hu(1),Meizhi Zhong(2),Kehai Chen(1),Xuefeng Bai(1),Min Zhang(1) ((1) Harbin Institute of Technology Shenzhen, Shenzhen, China, (2) TikTok Inc, Beijing, China)

类目:Computation and Language (cs.CL)

关键词:Large language models, real world effectiveness, world effectiveness depends, Large language, information retrieval

备注

点击查看摘要

Abstract:Large language models are increasingly being deployed as autonomous agents, yet their real-world effectiveness depends on reliable tools for information retrieval, computation and external action. Existing studies remain fragmented across tasks, tool types, and training settings, lacking a unified view of how tool-use methods differ and evolve. This paper organizes the literature into three paradigms: prompting as plug-and-play, supervised tool learning, and reward-driven tool policy learning. It analyzes their methods, strengths and failure modes, reviews the evaluation landscape, and highlights key challenges, aiming to address this fragmentation and provide a more structured evolutionary view of agentic tool use.

29. 【2604.00829】LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

链接https://arxiv.org/abs/2604.00829

作者:Patrick Amadeus Irawan,Erland Hilman Fuadi,Shanu Kumar,Alham Fikri Aji,Yova Kementchedjhieva

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Adapting pretrained language, cross-modal interference introduced, Adapting pretrained, degrade their native, shift and cross-modal

备注

点击查看摘要

Abstract:Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers $\sim$10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.

30. 【2604.00819】Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding

链接https://arxiv.org/abs/2604.00819

作者:Hemanth Kotaprolu,Kishan Maharaj,Raey Zhao,Abhijit Mishra,Pushpak Bhattacharyya

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:multiple affective signals, affective signals interact, multi-dimensional reasoning problem, interpersonal relations, reasoning problem

备注: 15 pages in total, 8 Figures, 2 Tables

点击查看摘要

Abstract:Understanding emotions in natural language is inherently a multi-dimensional reasoning problem, where multiple affective signals interact through context, interpersonal relations, and situational cues. However, most existing emotion understanding benchmarks rely on short texts and predefined emotion labels, reducing this process to independent label prediction and ignoring the structured dependencies among emotions. To address this limitation, we introduce Emotional Scenarios (EmoScene), a theory-grounded benchmark of 4,731 context-rich scenarios annotated with an 8-dimensional emotion vector derived from Plutchik's basic emotions. We evaluate six instruction-tuned large language models in a zero-shot setting and observe modest performance, with the best model achieving a Macro F1 of 0.501, highlighting the difficulty of context-aware multi-label emotion prediction. Motivated by the observation that emotions rarely occur independently, we further propose an entanglement-aware Bayesian inference framework that incorporates emotion co-occurrence statistics to perform joint posterior inference over the emotion vector. This lightweight post-processing improves structural consistency of predictions and yields notable gains for weaker models (e.g., +0.051 Macro F1 for Qwen2.5-7B). EmoScene therefore provides a challenging benchmark for studying multi-dimensional emotion understanding and the limitations of current language models.
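
One simple way to read "joint posterior inference over the emotion vector" is to score every binary label vector by a co-occurrence prior multiplied by the per-emotion probabilities from the model, then keep the highest-scoring vector. The sketch below does this by brute-force enumeration, which is feasible for 8 dimensions (256 vectors). It is an illustrative simplification, not the paper's framework: the function name `joint_map`, the use of model probabilities as independent likelihood terms, and the toy prior are all assumptions.

```python
from itertools import product


def joint_map(probs, prior):
    """Return the most probable binary label vector under prior x model scores.

    probs: per-emotion probabilities P(label = 1) produced by a model.
    prior: dict mapping label tuples to prior mass estimated from
           co-occurrence statistics; vectors not listed get zero mass.
    """
    best, best_score = None, -1.0
    for labels in product((0, 1), repeat=len(probs)):
        likelihood = 1.0
        for p, y in zip(probs, labels):
            likelihood *= p if y else 1.0 - p
        score = prior.get(labels, 0.0) * likelihood
        if score > best_score:
            best, best_score = labels, score
    return best


# Toy example with two emotions that tend to co-occur.
probs = [0.6, 0.45]  # thresholding each at 0.5 independently would give (1, 0)
prior = {(1, 1): 0.5, (0, 0): 0.4, (1, 0): 0.05, (0, 1): 0.05}
print(joint_map(probs, prior))
```

Here the prior pulls the borderline second emotion above the decision boundary because the two labels rarely appear apart, which is the kind of structural consistency the abstract describes.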

31. 【2604.00801】Routing-Free Mixture-of-Experts

链接https://arxiv.org/abs/2604.00801

作者:Yilun Liu,Jinru Han,Sikuan Yan,Volker Tresp,Yunpu Ma

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:rigid inductive biases, centralized routing mechanisms, introduce rigid inductive, models rely, inductive biases

备注: Code is available at [this https URL](https://github.com/liuyilun2000/RoutingFreeMoE/tree/release)

点击查看摘要

Abstract:Standard Mixture-of-Experts (MoE) models rely on centralized routing mechanisms that introduce rigid inductive biases. We propose Routing-Free MoE, which eliminates any hard-coded centralized designs including external routers, Softmax, Top-K and load balancing, instead encapsulating all activation functionalities within individual experts that are directly optimized through continuous gradient flow, enabling each expert to determine its activation entirely on its own. We introduce a unified adaptive load-balancing framework to simultaneously optimize both expert-balancing and token-balancing objectives through a configurable interpolation, allowing flexible and customizable resource allocation. Extensive experiments show that Routing-Free MoE can consistently outperform baselines with better scalability and robustness. We analyze its behavior in detail and offer insights that may facilitate future MoE design and optimization.

32. 【2604.00799】Multimodal Language Models Cannot Spot Spatial Inconsistencies

链接https://arxiv.org/abs/2604.00799

作者:Om Khangaonkar,Hadi J. Rad,Hamed Pirsiavash

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:understand physical reality, Spatial consistency, fundamental property, key requirement, aim to understand

备注

点击查看摘要

Abstract:Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure. We hope our findings underscore the need for approaches that develop a more deeply grounded understanding of the physical world.

33. 【2604.00789】Valency Classification of Mapudungun Verbal Roots. Established by the language's own morphotactics

链接https://arxiv.org/abs/2604.00789

作者:Andrés Chandía

类目:Computation and Language (cs.CL)

关键词:original category accurately, category accurately, Mapudungun roots confirmed, original category, Mapuche verb form

备注

点击查看摘要

Abstract:In the previous work, a lexical (re)categorisation -- or confirmation of the given category -- of roots identified as verbal was undertaken to determine their original category accurately. Building on this, the present paper offers an account of the valency classification of those Mapudungun roots confirmed to be verbal, using the language's own morphotactics; specifically, by examining the permissible and restricted combinations of various suffixes with roots or verbal stems in the Mapuche verb form. As with all work conducted thus far, the results presented here aim to improve the morphological analyser (Dungupeyum) with all verified findings incorporated into the system. From a theoretical perspective, we also hope to contribute to the recognition and understanding of issues related to the valency of Mapuche verb forms.

34. 【2604.00778】From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks

链接https://arxiv.org/abs/2604.00778

作者:Ayan Datta,Mounika Marreddy,Alexander Mehler,Zhixue Zhao,Radhika Mamidi

类目:Computation and Language (cs.CL)

关键词:Large language models, elementary symbolic tasks, Large language, complex benchmarks, excelling on complex

备注

点击查看摘要

Abstract:Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., "How many p's are in apple?") as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. Using this setting, we uncover a consistent phenomenon across modern architectures, including LLaMA, Qwen, and Gemma: models often compute the correct answer internally yet fail to express it at the output layer. Through mechanistic analysis combining probing classifiers, activation patching, logit lens analysis, and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. However, this information is attenuated by a small set of components in later layers, especially the penultimate and final layer MLP. We identify these components as negative circuits: subnetworks that downweight correct signals in favor of higher-probability but incorrect outputs. Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale, but arise from structured interference within the model's computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and are dynamically reweighted, with final outputs determined by suppression as much as by amplification. These findings carry implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs, underscoring the need for design strategies that ensure information is encoded and reliably used.

35. 【2604.00773】From Baselines to Preferences: A Comparative Study of LoRA/QLoRA and Preference Optimization for Mental Health Text Classification

链接https://arxiv.org/abs/2604.00773

作者:Mihael Arcan

类目:Computation and Language (cs.CL)

关键词:rapidly adopted modern, adopted modern adaptation, modern adaptation methods, health text classification, remains limited

备注

点击查看摘要

Abstract:Mental health text classification has rapidly adopted modern adaptation methods, yet practical guidance on which optimization strategy to use, when, and why remains limited. This paper presents a systematic comparative study of optimization pathways for a joint mental-health classification task, moving from strong vanilla baselines to progressively more specialized techniques. We first establish classical and encoder references, then examine parameter-efficient supervised fine-tuning with LoRA/QLoRA under multiple objective and optimization settings, and finally evaluate preference-based optimization with DPO, ORPO, and KTO, including class-rebalanced training. Rather than emphasizing a single headline score, we focus on methodological insight: how performance changes with objective formulation, adapter choice, optimizer behavior, context windowing, and class-balance intervention. The results show that optimization effects are highly method-dependent: some approaches deliver stable, transferable gains, while others are sensitive to configuration and data balance. Preference optimization, in particular, exhibits large variation across objectives, indicating that method selection is more consequential than simply adding a preference-training stage. The central contribution is a clear optimization narrative for mental health NLP: start from transparent baselines, apply controlled tuning, and use preference optimization selectively where its gains are demonstrable. This provides a reproducible and practically grounded framework for choosing effective training strategies beyond architecture choice alone.

36. 【2604.00754】Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention

链接https://arxiv.org/abs/2604.00754

作者:Zehao Jin,Yanan Sui

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:fruit fly comprises, average shortest path, neurons connected, whole-brain connectome, fruit fly

备注

点击查看摘要

Abstract:The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network's long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same $O(nw)$ per-layer budget. Through depth, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in $O(\log_w n)$ layers versus $O(n/w)$ for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.
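
The permute, attend, restore recipe can be sketched in plain Python. This is a toy under stated assumptions rather than the authors' implementation: tokens act as their own queries, keys and values (no learned projections), the window is bidirectional rather than causal, and the names `window_attention` and `stochastic_attention` are hypothetical.

```python
import math
import random


def window_attention(xs, window):
    """Softmax dot-product attention where position i sees positions i-window..i+window."""
    out = []
    for i, query in enumerate(xs):
        lo, hi = max(0, i - window), min(len(xs), i + window + 1)
        scores = [sum(a * b for a, b in zip(query, xs[j])) for j in range(lo, hi)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append([
            sum(w * xs[lo + k][d] for k, w in enumerate(weights)) / norm
            for d in range(len(query))
        ])
    return out


def stochastic_attention(xs, window, rng):
    """Shuffle the tokens, run windowed attention, then undo the shuffle."""
    perm = list(range(len(xs)))
    rng.shuffle(perm)
    shuffled = [xs[p] for p in perm]
    mixed = window_attention(shuffled, window)
    restored = [None] * len(xs)
    for pos, original_index in enumerate(perm):
        restored[original_index] = mixed[pos]
    return restored


tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1], [0.3, 0.3]]
out = stochastic_attention(tokens, window=1, rng=random.Random(0))
print(len(out), len(out[0]))
```

The cost per layer is the same as for ordinary windowed attention, since only the neighborhood each token sees changes. As a rough illustration of the asymptotics quoted in the abstract, with $n = 4096$ and $w = 64$ we have $\log_w n = 2$ while $n/w = 64$, ignoring constant factors.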

37. 【2604.00722】LangMARL: Natural Language Multi-Agent Reinforcement Learning

链接https://arxiv.org/abs/2604.00722

作者:Huaiyuan Yao,Longchao Da,Xiaoou Liu,Charles Fleming,Tianlong Chen,Hua Wei

类目:Computation and Language (cs.CL)

关键词:Large language model, autonomously evolve coordination, evolve coordination strategies, coarse global outcomes, global outcomes obscure

备注: 20 pages, 12 figures

点击查看摘要

Abstract:Large language model (LLM) agents struggle to autonomously evolve coordination strategies in dynamic environments, largely because coarse global outcomes obscure the causal signals needed for local policy refinement. We identify this bottleneck as a multi-agent credit assignment problem, which has long been studied in classical multi-agent reinforcement learning (MARL) but remains underaddressed in LLM-based systems. Building on this observation, we propose LangMARL, a framework that brings credit assignment and policy gradient evolution from cooperative MARL into the language space. LangMARL introduces agent-level language credit assignment, pioneers gradient evolution in language space for policy improvement, and summarizes task-relevant causal relations from replayed trajectories to provide dense feedback and improve convergence under sparse rewards. Extensive experiments across diverse cooperative multi-agent tasks demonstrate improved sample efficiency, interpretability, and strong generalization.

38. 【2604.00715】To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

链接https://arxiv.org/abs/2604.00715

作者:Karan Singh,Michael Yu,Varun Gangal,Zhuofu Tao,Sachin Kumar,Emmy Liu,Steven Y. Feng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:providing relevant context, Retrieval-augmented generation, knowledge-intensive situations, providing relevant, relevant context

备注: Code and data at [this https URL](https://github.com/DegenAI-Labs/RAG-scaling-laws)

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.

39. 【2604.00706】AfrIFact: Cultural Information Retrieval, Evidence Extraction and Fact Checking for African Languages

Link: https://arxiv.org/abs/2604.00706

Authors: Israel Abebe Azime, Jesujoba Oluwadara Alabi, Crystina Zhang, Iffat Maab, Atnafu Lambebo Tonja, Tadesse Destaw Belay, Folasade Peace Alabi, Salomey Osei, Saminu Mohammad Aliyu, Nkechinyere Faith Aguobi, Bontu Fufa Balcha, Blessing Kudzaishe Sibanda, Davis David, Mouhamadane Mboup, Daud Abolade, Neo Putini, Philipp Slusallek, David Ifeoluwa Adelani, Dietrich Klakow

Categories: Computation and Language (cs.CL)

Keywords: claim made online, Assessing the veracity, real-world implications, made online, complex and important

Comments

Abstract:Assessing the veracity of a claim made online is a complex and important task with real-world implications. When these claims are directed at communities with limited access to information and the content concerns issues such as healthcare and culture, the consequences intensify, especially in low-resource languages. In this work, we introduce AfrIFact, a dataset that covers the necessary steps for automatic fact-checking (i.e., information retrieval, evidence extraction, and fact checking), in ten African languages and English. Our evaluation results show that even the best embedding models lack cross-lingual retrieval capabilities, and that cultural and news documents are easier to retrieve than healthcare-domain documents, both in large corpora and in single documents. We show that LLMs lack robust multilingual fact-verification capabilities in African languages, while few-shot prompting improves performance by up to 43% in AfriqueQwen-14B, and task-specific fine-tuning further improves fact-checking accuracy by up to 26%. These findings, along with our release of the AfrIFact dataset, encourage work on low-resource information retrieval, evidence retrieval, and fact checking.

40. 【2604.00698】Learning to Hint for Reinforcement Learning

Link: https://arxiv.org/abs/2604.00698

Authors: Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Relative Policy Optimization, relative advantage, Group Relative Policy, Policy Optimization, advantage collapse

Comments

Abstract:Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner's incorrect rollout, allowing hint generation to adapt to the reasoner's evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive and transfer-aware hint learning for RL. The code is available at this https URL.
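
The advantage collapse described in this abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation); the function name, the `eps` constant, and the reward values are illustrative and do not come from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (GRPO-style).

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout fails yields all-zero advantages,
# so the policy gradient for this question carries no signal.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the collapsed group every advantage is zero, which is the situation hint-based methods try to avoid by making hard questions produce mixed outcomes.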

41. 【2604.00688】OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

Link: https://arxiv.org/abs/2604.00688

Authors: Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, Daniel Povey

Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: massive multilingual zero-shot, TTS, multilingual zero-shot, discrete NAR models, present OmniVoice

Comments

Abstract:We present OmniVoice, a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM to ensure superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre-trained models are publicly available at this https URL.

42. 【2604.00672】Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness

Link: https://arxiv.org/abs/2604.00672

Authors: Zeyad Ahmed, Paul Sheridan, Michael McIsaac, Aitazaz A. Farooque

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Statistics Theory (math.ST)

Keywords: identifying important terms, classical formula, identifying important, word burstiness, important terms

Comments: 27 pages, 3 tables, 7 figures, accepted in Discover Computing 2026

Abstract:TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test setup capturing word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents according to a family of beta-binomial distributions with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed in collection documents, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme derived from this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.
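
For reference, the classical score that this abstract analyzes can be sketched as follows. This is one common textbook variant (raw term count times log(N / document frequency)), not the penalized likelihood-ratio statistic from the paper; the function name and the toy documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumption: score = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


docs = [
    "the bank approved the loan".split(),
    "the river bank flooded".split(),
    "the loan rate rose".split(),
]
scores = tf_idf(docs)
```

A term that appears in every document ("the") scores zero, while a term confined to one document ("flooded") scores highest within that document.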

43. 【2604.00666】TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models

Link: https://arxiv.org/abs/2604.00666

Authors: Lingjie Chen, Ruizhong Qiu, Yuyu Fan, Yanjun Zhao, Hanghang Tong

Categories: Computation and Language (cs.CL)

Keywords: practical efficiency depends, efficiency depends heavily, Masked Diffusion Language, Diffusion language models, Diffusion Language Model

Comments: 10 pages, 7 figures, 1 algorithm

Abstract:Diffusion language models (DLMs) offer a promising path toward low-latency generation through parallel decoding, but their practical efficiency depends heavily on the decoding trajectory. In practice, this advantage often fails to fully materialize because standard training does not provide explicit supervision over token reveal order, creating a train-inference mismatch that leads to suboptimal decoding behavior. We propose Trajectory-Ranked Instruction Masked Supervision (TRIMS), a simple trajectory-guided supervised fine-tuning framework that injects trajectory supervision into standard Masked Diffusion Language Model (MDLM) training with minimal overhead. Instead of relying on costly DLM-based distillation, TRIMS uses lightweight signals from an autoregressive teacher to guide a trajectory-aware masking strategy, encouraging the model to learn more effective decoding orders. Experiments on LLaDA and Dream across math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and train-free acceleration baselines, while achieving competitive performance with prior distillation-based approaches at substantially lower training cost. Further analysis shows that TRIMS leads to better decoding trajectories, validating the effectiveness of trajectory-guided supervision for DLMs.

44. 【2604.00626】A Survey of On-Policy Distillation for Large Language Models

Link: https://arxiv.org/abs/2604.00626

Authors: Mingyang Song, Mao Zheng

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large Language Models, frontier Large Language, Language Models, Large Language, frontier Large

Comments

Abstract:Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains off-policy: students train on static teacher-generated data and never encounter their own errors during learning. This train-test mismatch, an instance of exposure bias, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified $f$-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: feedback signal (logit-based, outcome-based, or self-play), teacher access (white-box, black-box, or teacher-free), and loss granularity (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.

45. 【2604.00613】English to Central Kurdish Speech Translation: Corpus Creation, Evaluation, and Orthographic Standardization

Link: https://arxiv.org/abs/2604.00613

Authors: Mohammad Mohammadamini, Daban Q. Jaff, Josep Crego, Marie Tahon, Antoine Laurent

Categories: Computation and Language (cs.CL)

Keywords: million Central Kurdish, Central Kurdish tokens, Central Kurdish, million English tokens, derived from TED

Comments

Abstract:We present KUTED, a speech-to-text translation (S2TT) dataset for Central Kurdish, derived from TED and TEDx talks. The corpus comprises 91,000 sentence pairs, including 170 hours of English audio, 1.65 million English tokens, and 1.40 million Central Kurdish tokens. We evaluate KUTED on the S2TT task and find that orthographic variation significantly degrades Kurdish translation performance, producing nonstandard outputs. To address this, we propose a systematic text standardization approach that yields substantial performance gains and more consistent translations. On a test set separated from TED talks, a fine-tuned Seamless model achieves 15.18 BLEU, and we improve the Seamless baseline by 3.0 BLEU on the FLEURS benchmark. We also train a Transformer model from scratch and evaluate a cascaded system that combines Seamless (ASR) with NLLB (MT).

46. 【2604.00610】Speech LLMs are Contextual Reasoning Transcribers

Link: https://arxiv.org/abs/2604.00610

Authors: Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li

Categories: Computation and Language (cs.CL)

Keywords: large language models, primarily involves direct, task primarily involves, remains non-trivial, automatic speech recognition

Comments

Abstract:Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).

47. 【2604.00586】More Human, More Efficient: Aligning Annotations with Quantized SLMs

Link: https://arxiv.org/abs/2604.00586

Authors: Jiayu Wang, Junyoung Lee

Categories: Computation and Language (cs.CL)

Keywords: Large Language Model, exponentially increasing text, increasing text corpora, outpaced human capacity, Large Language

Comments

Abstract:As Large Language Model (LLM) capabilities advance, the demand for high-quality annotation of exponentially increasing text corpora has outpaced human capacity, leading to the widespread adoption of LLMs in automatic evaluation and annotation. However, proprietary LLMs often exhibit systematic biases that diverge from human expert consensus, lack reproducibility, and raise data privacy concerns. Our work examines the viability of finetuning a quantized Small Language Model with 1.7B parameters on limited human-annotated data to serve as a highly aligned, deterministic evaluator and annotator. By implementing a custom, multi-dimensional rubric framework and simple augmentation and regularization techniques, the proposed approach achieves higher inter-annotator agreement (0.23 points increase in Krippendorff's $\alpha$) than the best performing state-of-the-art proprietary LLM. We also demonstrate the generalizability of the proposed training pipeline on a separate emotion classification task. The results show that task-specific alignment and efficient 4-bit quantized fine-tuning provide a superior open-source alternative to using proprietary models for evaluation and annotation. Our finetuning approach is publicly available at this https URL.

48. 【2604.00568】A Japanese Benchmark for Evaluating Social Bias in Reasoning Based on Attribution Theory

Link: https://arxiv.org/abs/2604.00568

Authors: Taihei Shiotani, Masahiro Kaneko, Naoaki Okazaki

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, fairness of Large, specific linguistic regions, regions is essential

Comments

Abstract:In enhancing the fairness of Large Language Models (LLMs), evaluating social biases rooted in the cultural contexts of specific linguistic regions is essential. However, most existing Japanese benchmarks heavily rely on translating English data, which does not necessarily provide an evaluation suitable for Japanese culture. Furthermore, they only evaluate bias in the conclusion, failing to capture biases lurking in the reasoning. In this study, based on attribution theory in social psychology, we constructed a new dataset, "JUBAKU-v2," which evaluates the bias in attributing behaviors to in-groups and out-groups within reasoning while fixing the conclusion. This dataset consists of 216 examples reflecting cultural biases specific to Japan. Experimental results verified that it can detect performance differences across models more sensitively than existing benchmarks.

49. 【2604.00555】Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

Link: https://arxiv.org/abs/2604.00555

Authors: Thanh Luong Tuan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Keywords: Large Language Models, Large Language, adoption of Large, Language Models, enforce regulatory compliance

Comments: 23 pages, 7 tables, 4 figures, 33 references. Empirical evaluation: 600 runs across 5 regulated industries including Vietnamese-language domains

Abstract:Enterprise adoption of Large Language Models (LLMs) is constrained by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning. Our approach introduces a three-layer ontological framework--Role, Domain, and Interaction ontologies--that provides formal semantic grounding for LLM-based enterprise agents. We formalize the concept of asymmetric neurosymbolic coupling, wherein symbolic ontological knowledge constrains agent inputs (context assembly, tool discovery, governance thresholds) while proposing mechanisms for extending this coupling to constrain agent outputs (response validation, reasoning verification, compliance checking). We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance), finding that ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001, W = .460), Regulatory Compliance (p = .003, W = .318), and Role Consistency (p < .001, W = .614), with improvements greatest where LLM parametric knowledge is weakest--particularly in Vietnam-localized domains. Our contributions include: (1) a formal three-layer enterprise ontology model, (2) a taxonomy of neurosymbolic coupling patterns, (3) ontology-constrained tool discovery via SQL-pushdown scoring, (4) a proposed framework for output-side ontological validation, (5) empirical evidence for the inverse parametric knowledge effect that ontological grounding value is inversely proportional to LLM training data coverage of the domain, and (6) a production system serving 21 industry verticals with 650+ agents.

50. 【2604.00536】Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

Link: https://arxiv.org/abs/2604.00536

Authors: Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang, Haokai Xu, Yixin Chen, Jian Wu, Junbo Zhao, Zuozhu Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, abundant supervised fine-tuning, Large language, performance largely due, high-quality SFT data

Comments

Abstract:Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over domain documents and filtering outputs with handcrafted rubrics. Yet rubric design is expert-dependent, transfers poorly across domains, and is often optimized through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback about how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility on the target model and using this signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to a target model's objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Based on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. Influence score is used as the reward to optimize the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.

51. 【2604.00529】MF-QAT: Multi-Format Quantization-Aware Training for Elastic Inference

Link: https://arxiv.org/abs/2604.00529

Authors: Zifei Xu, Sayeh Sharify, Hesham Mostafa

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Quantization-aware training, choose numerical precision, multi-format QAT, QAT, single target numeric

Comments

Abstract:Quantization-aware training (QAT) is typically performed for a single target numeric format, while practical deployments often need to choose numerical precision at inference time based on hardware support or runtime constraints. We study multi-format QAT, where a single model is trained to be robust across multiple quantization formats. We find that multi-format QAT can match single-format QAT at each target precision, yielding one model that performs well overall across different formats, even formats that were not seen during training. To enable practical deployment, we propose the Slice-and-Scale conversion procedure for both MXINT and MXFP that converts a high-precision representation into lower-precision formats without re-training. Building on this, we introduce a pipeline that (i) trains a model with multi-format QAT, (ii) stores a single anchor format checkpoint (MXINT8/MXFP8), and (iii) allows on-the-fly conversion to lower MXINT or MXFP formats at runtime with negligible (or no) additional accuracy degradation. Together, these components provide a practical path to elastic precision scaling and allow selecting the runtime format at inference time across diverse deployment targets.

52. 【2604.00489】Adapting Text LLMs to Speech via Multimodal Depth Up-Scaling

Link: https://arxiv.org/abs/2604.00489

Authors: Kazuki Yano, Jun Suzuki, Shinji Watanabe

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, text Large Language, Large Language, Speech Language Models, Adapting pre-trained text

Comments

Abstract:Adapting pre-trained text Large Language Models (LLMs) into Speech Language Models (Speech LMs) via continual pretraining on speech data is promising, but often degrades the original text capabilities. We propose Multimodal Depth Up-Scaling, an extension of an emerging strategy in continual LLM pre-training, where new transformer layers are inserted into a frozen text LLM and only the added layers are trained on speech data. Experiments with SmolLM2-360M and SmolLM2-1.7B on 48k hours of English Automatic Speech Recognition (ASR) data show that depth up-scaling achieves ASR comparable to full fine-tuning while causing far less text degradation than both full fine-tuning and Low-Rank Adaptation (LoRA). We further show that incorporating E-Branchformer, an architecture designed for speech recognition, as the inserted layers achieves ASR that matches or surpasses full fine-tuning on the larger model while reducing text degradation by over 75% with 60% fewer trainable parameters.

53. 【2604.00477】Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation

Link: https://arxiv.org/abs/2604.00477

Authors: HyunJoon Jung, William Na

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)

Keywords: fundamental uncertainty remains, LLM-based agent judges, uncertainty remains, trust their assessments, emerging approach

Comments

Abstract:LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed? Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation. We then identify a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law; both exhibit diminishing returns, but scores saturate roughly twice as fast as discoveries. We hypothesize this reflects a power law distribution of the finding space: critical issues are discovered first by small panels, while corner cases require progressively larger panels, analogous to species accumulation curves in ecology. The mechanism traces to ensemble diversity: Big Five personality conditioning makes agents probe different quality dimensions, with expert judges acting as adversarial probes that push discovery into the tail of the finding distribution. A controlled ablation confirms that structured persona conditioning, not simple prompting, is required to produce these scaling properties.

54. 【2604.00464】Not My Truce: Personality Differences in AI-Mediated Workplace Negotiation

Link: https://arxiv.org/abs/2604.00464

Authors: Veda Duddu, Jash Rajesh Parekh, Andy Mao, Hanyi Min, Ziang Xiao, Vedant Das Swain, Koustuv Saha

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: prior work assumes, work assumes uniform, AI-driven conversational coaching, assumes uniform effectiveness, AI-driven conversational

Comments

Abstract:AI-driven conversational coaching is increasingly used to support workplace negotiation, yet prior work assumes uniform effectiveness across users. We challenge this assumption by examining how individual differences, particularly personality traits, moderate coaching outcomes. We conducted a between-subjects experiment (N=267) comparing theory-driven AI (Trucey), general-purpose AI (Control-AI), and a traditional negotiation handbook (Control-NoAI). Participants were clustered into three profiles -- resilient, overcontrolled, and undercontrolled -- based on the Big-Five personality traits and ARC typology. Resilient workers achieved broad psychological gains primarily from the handbook, overcontrolled workers showed outcome-specific improvements with theory-driven AI, and undercontrolled workers exhibited minimal effects despite engaging with the frameworks. These patterns suggest personality as a predictor of readiness beyond stage-based tailoring: vulnerable users benefit from targeted rather than comprehensive interventions. The study advances understanding of personality-determined intervention prerequisites and highlights design implications for adaptive AI coaching systems that align support intensity with individual readiness, rather than assuming universal effectiveness.

55. 【2604.00455】First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models

Link: https://arxiv.org/abs/2604.00455

Authors: Jiwoo Ha, Jongwoo Baek, Jinhyun So

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent Large Vision-Language, Large Vision-Language Models, Recent Large, demonstrated remarkable performance, Large Vision-Language

Comments: 19 pages, 13 figures

Abstract:Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination -- the generation of nonexistent objects in answers -- remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long-term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the "The" token. Experimental results show that FLB significantly reduces object hallucination across various tasks, benchmarks, and backbone models. Notably, it causes negligible inference overhead, making it highly applicable to real-time multimodal systems. Code is available at this https URL
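
The core idea stated in this abstract (store the first step's logits and add them to later steps) can be sketched on toy data. This is a hedged illustration only: the scaling factor `alpha`, the function names, and the logit values are assumptions, and the sketch ignores that real decoding is autoregressive, so it should not be read as the authors' implementation.

```python
def first_logit_boosting(step_logits, alpha=1.0):
    """Add the first step's logits (scaled by alpha) to every later step.

    step_logits: list of per-step logit vectors over the same vocabulary.
    alpha: hypothetical scaling factor, not taken from the abstract.
    """
    first = step_logits[0]
    boosted = [list(first)]  # the first step itself is left unchanged
    for logits in step_logits[1:]:
        boosted.append([x + alpha * f for x, f in zip(logits, first)])
    return boosted


def argmax(xs):
    """Index of the largest value in a list."""
    return max(range(len(xs)), key=xs.__getitem__)


# Toy vocabulary of three tokens; all numbers are invented.
steps = [
    [2.0, 0.5, -1.0],  # first step: token 0 is strongly preferred
    [0.2, 0.6, 0.1],   # later step: token 1 would win on its own
]
boosted = first_logit_boosting(steps, alpha=0.5)
```

In this toy case the later step's preference moves from token 1 to token 0 once the stored first-step logits are added, which is the kind of persistent bias toward early evidence that the abstract describes.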

56. 【2604.00445】Towards Reliable Truth-Aligned Uncertainty Estimation in Large Language Models

Link: https://arxiv.org/abs/2604.00445

Authors: Ponhvoan Srey, Quang Minh Nguyen, Xiaobao Wu, Anh Tuan Luu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: detect hallucinated outputs, large language models, aims to detect, improve their reliability, detect hallucinated

Comments

Abstract:Uncertainty estimation (UE) aims to detect hallucinated outputs of large language models (LLMs) to improve their reliability. However, UE metrics often exhibit unstable performance across configurations, which significantly limits their applicability. In this work, we formalise this phenomenon as proxy failure, since most UE metrics originate from model behaviour, rather than being explicitly grounded in the factual correctness of LLM outputs. With this, we show that UE metrics become non-discriminative precisely in low-information regimes. To alleviate this, we propose Truth AnChoring (TAC), a post-hoc calibration method to remedy UE metrics, by mapping the raw scores to truth-aligned scores. Even with noisy and few-shot supervision, our TAC can support the learning of well-calibrated uncertainty estimates, and presents a practical calibration protocol. Our findings highlight the limitations of treating heuristic UE metrics as direct indicators of truth uncertainty, and position our TAC as a necessary step toward more reliable uncertainty estimation for LLMs. The code repository is available at this https URL.

57. 【2604.00443】Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics

Link: https://arxiv.org/abs/2604.00443

Authors: Iyad Ait Hou, Rebecca Hwa

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: standard metrics attribute, standard metrics, metrics attribute, compressing two unrelated, unrelated concepts

Comments: 21 pages

Abstract:If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition--the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is due to a lexical confound: neurons fire for a shared word form (such as "bank") rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M-70B parameters. The confound carries into sparse autoencoders (18-36% of features blend senses), sits in ≤1% of activation dimensions, and hurts downstream tasks: filtering it out improves word sense disambiguation and makes knowledge edits more selective (p = 0.002).

58. 【2604.00442】Execution-Verified Reinforcement Learning for Optimization Modeling

Link: https://arxiv.org/abs/2604.00442

Authors: Runda Guan, Xiangqing Shen, Jiajun Zhang, Yifan Zhang, Jian Cheng, Rui Xia

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Automating optimization modeling, single solver API, fine-tune smaller LLMs, scalable decision intelligence, high inference latency

Notes:

Abstract:Automating optimization modeling with LLMs is a promising path toward scalable decision intelligence, but existing approaches either rely on agentic pipelines built on closed-source LLMs with high inference latency, or fine-tune smaller LLMs using costly process supervision that often overfits to a single solver API. Inspired by reinforcement learning with verifiable rewards, we propose Execution-Verified Optimization Modeling (EVOM), an execution-verified learning framework that treats a mathematical programming solver as a deterministic, interactive verifier. Given a natural-language problem and a target solver, EVOM generates solver-specific code, executes it in a sandboxed harness, and converts execution outcomes into scalar rewards, optimized with GRPO and DAPO in a closed-loop generate-execute-feedback-update process. This outcome-only formulation removes the need for process-level supervision, and enables cross-solver generalization by switching the verification environment rather than reconstructing solver-specific datasets. Experiments on NL4OPT, MAMO, IndustryOR, and OptiBench across Gurobi, OR-Tools, and COPT show that EVOM matches or outperforms process-supervised SFT, supports zero-shot solver transfer, and achieves effective low-cost solver adaptation by continuing training under the target solver backend.

59. 【2604.00438】TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning

Link: https://arxiv.org/abs/2604.00438

Authors: Wenxuan Jiang, Yuxin Zuo, Zijian Zhang, Xuecheng Wu, Zining Fan, Wenxuan Liu, Li Chen, Xiaoyu Li, Xuezhi Cao, Xiaolong Jin, Ninghao Liu

Categories: Computation and Language (cs.CL)

Keywords: enables Large Language, Large Language Models, In-Context Reinforcement Learning, Large Language, Reinforcement Learning

Notes: 14 pages, 7 figures

Abstract:In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground truth during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo-label is derived from this set through majority voting. This label then serves as a proxy to give reward messages and generate formative feedback, guiding the LLM through iterative refinement. In the end, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the answer determined through a final round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and even 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at this https URL.
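
The majority-voting step described in this abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `majority_vote` and `reward_message`, the sample answers, and the wording of the reward message are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(candidates):
    """Return (pseudo_label, agreement) for a list of candidate answers.

    Ties are broken in favour of the answer that appeared first,
    which is how Counter.most_common orders equal counts.
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    label, votes = Counter(candidates).most_common(1)[0]
    return label, votes / len(candidates)


def reward_message(answer, pseudo_label):
    # Hypothetical reward wording; the paper's actual messages are not given here.
    return "correct" if answer == pseudo_label else "incorrect"


# Five sampled answers for one retrieved, unlabeled instance (made-up data).
samples = ["B", "A", "B", "C", "B"]
label, agreement = majority_vote(samples)
print(label, agreement)  # B 0.6
```

The agreement ratio is returned alongside the label because a low ratio signals a weak pseudo-label; whether TR-ICRL uses such a signal is not stated in the abstract.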

60. 【2604.00375】Locally Confident, Globally Stuck: The Quality-Exploration Dilemma in Diffusion Language Models

Link: https://arxiv.org/abs/2604.00375

Authors: Liancheng Fang, Aiwei Liu, Henry Peng Zou, Yankai Chen, Enze Ma, Leyi Pan, Chunyu Miao, Wei-Chieh Huang, Xue Liu, Philip S. Yu

Categories: Computation and Language (cs.CL)

Keywords: Diffusion large language, large language models, Diffusion large, theoretically permit token, enable richer exploration

Notes:

Abstract:Diffusion large language models (dLLMs) theoretically permit token decoding in arbitrary order, a flexibility that could enable richer exploration of reasoning paths than autoregressive (AR) LLMs. In practice, however, random-order decoding often hurts generation quality. To mitigate this, low-confidence remasking improves single-sample quality (e.g., Pass@$1$) by prioritizing confident tokens, but it also suppresses exploration and limits multi-sample gains (e.g., Pass@$k$), creating a fundamental quality-exploration dilemma. In this paper, we provide a unified explanation of this dilemma. We show that low-confidence remasking improves a myopic proxy for quality while provably constraining the entropy of the induced sequence distribution. To overcome this limitation, we characterize the optimal distribution that explicitly balances quality and exploration, and develop a simple Independent Metropolis-Hastings sampler that approximately targets this distribution during decoding. Experiments across a range of reasoning benchmarks including MATH500, AIME24/25, HumanEval, and MBPP show that our approach yields a better exploration-quality tradeoff than both random and low-confidence remasking.
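
For readers unfamiliar with the Independent Metropolis-Hastings sampler named in this abstract, here is a generic textbook sketch on a toy three-state distribution. It is not the paper's decoding procedure: the target weights, the uniform proposal, and all function names are assumptions chosen only to show the acceptance rule.

```python
import random

# Toy unnormalized target over three states (hypothetical weights).
TARGET = {"a": 1.0, "b": 2.0, "c": 7.0}
STATES = list(TARGET)


def proposal_prob(state):
    # Independence proposal: uniform over states, ignoring the current state.
    return 1.0 / len(STATES)


def imh_step(current, rng):
    """One Independent Metropolis-Hastings step."""
    candidate = rng.choice(STATES)
    # Acceptance ratio for an independence sampler:
    # [p(candidate) * q(current)] / [p(current) * q(candidate)]
    ratio = (TARGET[candidate] * proposal_prob(current)) / (
        TARGET[current] * proposal_prob(candidate)
    )
    return candidate if rng.random() < min(1.0, ratio) else current


def run_chain(steps, seed=0):
    """Run the chain and return the empirical visit frequency of each state."""
    rng = random.Random(seed)
    state = "a"
    counts = {s: 0 for s in STATES}
    for _ in range(steps):
        state = imh_step(state, rng)
        counts[state] += 1
    return {s: c / steps for s, c in counts.items()}


freqs = run_chain(20000)
print(freqs)  # frequencies should approach 0.1, 0.2, 0.7
```

Because the proposal does not depend on the current state, only the ratio of target to proposal mass matters; with a uniform proposal the rule reduces to comparing target weights.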

61. 【2604.00356】Signals: Trajectory Sampling and Triage for Agentic Interactions

Link: https://arxiv.org/abs/2604.00356

Authors: Shuguang Chen, Adil Hafeez, Salman Paracha

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Agentic applications based, loops involving planning, large language models, language models increasingly, models increasingly rely

Notes:

Abstract:Agentic applications based on large language models increasingly rely on multi-step interaction loops involving planning, action execution, and environment feedback. While such systems are now deployed at scale, improving them post-deployment remains challenging. Agent trajectories are voluminous and non-deterministic, and reviewing each one, whether through human review or auxiliary LLMs, is slow and cost-prohibitive. We propose a lightweight, signal-based framework for triaging agentic interaction trajectories. Our approach computes cheap, broadly applicable signals from live interactions and attaches them as structured attributes for trajectory triage, identifying interactions likely to be informative without affecting online agent behavior. We organize signals into a coarse-grained taxonomy spanning interaction (misalignment, stagnation, disengagement, satisfaction), execution (failure, loop), and environment (exhaustion), designed for computation without model calls. In a controlled annotation study on $\tau$-bench, a widely used benchmark for tool-augmented agent evaluation, we show that signal-based sampling achieves an 82% informativeness rate compared to 74% for heuristic filtering and 54% for random sampling, with a 1.52x efficiency gain per informative trajectory. The advantage is robust across reward strata and task domains, confirming that signals provide genuine per-trajectory informativeness gains rather than merely oversampling obvious failures. These results show that lightweight signals can serve as practical sampling infrastructure for agentic systems, and suggest a path toward preference data construction and post-deployment optimization.

62. 【2604.00344】Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

Link: https://arxiv.org/abs/2604.00344

Authors: Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu, Haozheng Luo, Hengli Li, Zhi Zhang, Zhaolu Kang, Kai-Wei Chang, Ying Nian Wu

Categories: Computation and Language (cs.CL); Applications (stat.AP)

Keywords: Large Language Models, Large Language, Language Models, shown remarkable performance, Agent Q-Mix

Notes:

Abstract:Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity's Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.

63. 【2604.00323】Large Language Models in the Abuse Detection Pipeline

Link: https://arxiv.org/abs/2604.00323

Authors: Suraj Kath, Sanket Badhe, Preet Shah, Ashwin Sampathkumar, Shivani Gupta

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: grown increasingly complex, spanning toxic language, Online abuse, increasingly complex, spanning toxic

Notes:

Abstract:Online abuse has grown increasingly complex, spanning toxic language, harassment, manipulation, and fraudulent behavior. Traditional machine-learning approaches dependent on static classifiers and labor-intensive labeling struggle to keep pace with evolving threat patterns and nuanced policy requirements. Large Language Models introduce new capabilities for contextual reasoning, policy interpretation, explanation generation, and cross-modal understanding, enabling them to support multiple stages of modern safety systems. This survey provides a lifecycle-oriented analysis of how LLMs are being integrated into the Abuse Detection Lifecycle (ADL), which we define across four stages: (I) Label & Feature Generation, (II) Detection, (III) Review & Appeals, and (IV) Auditing & Governance. For each stage, we synthesize emerging research and industry practices, highlight architectural considerations for production deployment, and examine the strengths and limitations of LLM-driven approaches. We conclude by outlining key challenges, including latency, cost-efficiency, determinism, adversarial robustness, and fairness, and discuss future research directions needed to operationalize LLMs as reliable, accountable components of large-scale abuse-detection and governance systems.

64. 【2604.00304】Asymmetric Actor-Critic for Multi-turn LLM Agents

Link: https://arxiv.org/abs/2604.00304

Authors: Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: interactions remains challenging, ensuring reliable behavior, exhibit strong reasoning, multi-turn interactions remains, remains challenging

Notes: 19 pages

Abstract:Large language models (LLMs) exhibit strong reasoning and conversational abilities, but ensuring reliable behavior in multi-turn interactions remains challenging. In many real-world applications, agents must succeed in one-shot settings where retries are impossible. Existing approaches either rely on reflection or post-hoc evaluation, which require additional attempts, or assume fully trainable models that cannot leverage proprietary LLMs. We propose an asymmetric actor-critic framework for reliable conversational agents. A powerful proprietary LLM acts as the actor, while a smaller open-source critic provides runtime supervision, monitoring the actor's actions and intervening within the same interaction trajectory. Unlike training-based actor-critic methods, our framework supervises a fixed actor operating in open-ended conversational environments. The design leverages a generation-verification asymmetry: while high-quality generation requires large models, effective oversight can often be achieved by smaller ones. We further introduce a data generation pipeline that produces supervision signals for critic fine-tuning without modifying the actor. Experiments on $\tau$-bench and UserBench show that our approach significantly improves reliability and task success over strong single-agent baselines. Moreover, lightweight open-source critics rival or surpass larger proprietary models in the critic role, and critic fine-tuning yields additional gains over several state-of-the-art methods.

65. 【2604.00291】Frege in the Flesh: Biolinguistics and the Neural Enforcement of Syntactic Structures

Link: https://arxiv.org/abs/2604.00291

Authors: Elliot Murphy

Categories: Computation and Language (cs.CL)

Keywords: interdisciplinary scientific study, human language acquisition, human language, genetic basis, basis of human

Notes:

Abstract:Biolinguistics is the interdisciplinary scientific study of the biological foundations, evolution, and genetic basis of human language. It treats language as an innate biological organ or faculty of the mind, rather than a cultural tool, and it challenges a behaviorist conception of human language acquisition as being based on stimulus-response associations. Extracting its most essential component, it takes seriously the idea that mathematical, algebraic models of language capture something natural about the world. The syntactic structure-building operation of MERGE is thought to offer the scientific community a "real joint of nature", "a (new) aspect of nature" (Mukherji 2010), not merely a formal artefact. This mathematical theory of language is then seen as being able to offer biologists, geneticists and neuroscientists clearer instructions for how to explore language. The argument of this chapter proceeds in four steps. First, I clarify the object of inquiry for biolinguistics: not speech, communication, or generic sequence processing, but the internal computational system that generates hierarchically structured expressions. Second, I argue that this formal characterization matters for evolutionary explanation, because different conceptions of syntax imply different standards of what must be explained. Third, I suggest that a sufficiently explicit algebraic account of syntax places non-trivial constraints on candidate neural mechanisms. Finally, I consider how recent neurocomputational work begins to transform these constraints into empirically tractable hypotheses, while also noting the speculative and revisable character of the present program.

66. 【2604.00261】Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study

Link: https://arxiv.org/abs/2604.00261

Authors: Zaifu Zhan, Mengyuan Cui, Rui Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large language models, settings remains unclear, eliciting explicit intermediate, safety-critical medical settings, medical settings remains

Notes:

Abstract:Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile, self-reflective (self-corrective) prompting has been widely claimed to enhance model reliability by prompting LLMs to critique and revise their own reasoning, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but provides limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior rather than a standalone solution for improving medical QA reliability.

67. 【2604.00259】LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias

Link: https://arxiv.org/abs/2604.00259

Authors: Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Language Models, educational assessment, analytic scoring

Notes:

Abstract:Despite growing interest in using Large Language Models (LLMs) for educational assessment, it remains unclear how closely they align with human scoring. We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring. We analyze agreement with human consensus scores, directional bias, and the stability of bias estimates. Our results show that strong open-weight models achieve moderate to high agreement with humans on holistic scoring (Quadratic Weighted Kappa about 0.6), but this does not transfer uniformly to analytic scoring. In particular, we observe large and stable negative directional bias on Lower-Order Concern (LOC) traits, such as Grammar and Conventions, meaning that models often score these traits more harshly than human raters. We also find that concise keyword-based prompts generally outperform longer rubric-style prompts in multi-trait analytic scoring. To quantify the amount of data needed to detect these systematic deviations, we compute the minimum sample size at which a 95% bootstrap confidence interval for the mean bias excludes zero. This analysis shows that LOC bias is often detectable with very small validation sets, whereas Higher-Order Concern (HOC) traits typically require much larger samples. These findings support a bias-correction-first deployment strategy: instead of relying on raw zero-shot scores, systematic score offsets can be estimated and corrected using small human-labeled bias-estimation sets, without requiring large-scale fine-tuning.
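
The sample-size analysis in this abstract (the smallest validation set at which a 95% bootstrap confidence interval for the mean bias excludes zero) can be sketched as follows. This is an assumed reading, not the authors' procedure: taking the first `n` items as the sample, the percentile bootstrap with 1,000 resamples, and the names `bootstrap_ci_mean` and `min_sample_size` are all illustrative choices.

```python
import random
import statistics


def bootstrap_ci_mean(values, rng, n_boot=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo_idx = round((alpha / 2) * n_boot)
    hi_idx = round((1 - alpha / 2) * n_boot) - 1
    return means[lo_idx], means[hi_idx]


def min_sample_size(biases, sizes, seed=0):
    """Smallest n in `sizes` whose CI for the mean bias excludes zero.

    `biases` holds per-essay differences (model score minus human score).
    Returns None if no candidate size yields an interval that excludes zero.
    """
    rng = random.Random(seed)
    for n in sizes:
        lo, hi = bootstrap_ci_mean(biases[:n], rng)
        if lo > 0 or hi < 0:
            return n
    return None


# Made-up example: a model that is always one point harsher than the raters.
print(min_sample_size([-1.0] * 50, sizes=[5, 10, 20, 50]))  # 5
```

A constant offset is detected at the smallest size tried, while noisier traits need more data before the interval clears zero, which mirrors the LOC versus HOC contrast reported in the abstract.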

68. 【2604.00248】REM-CTX: Automated Peer Review via Reinforcement Learning with Auxiliary Context

Link: https://arxiv.org/abs/2604.00248

Authors: Pawin Taechoyotin, Daniel E. Acuna

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: leaving visual elements, scholarly signals underutilized, external scholarly signals, automated peer review, textual manuscript content

Notes: 12 pages, 6 figures

Abstract:Most automated peer review systems rely on textual manuscript content alone, leaving visual elements such as figures and external scholarly signals underutilized. We introduce REM-CTX, a reinforcement-learning system that incorporates auxiliary context into the review generation process via correspondence-aware reward functions. REM-CTX trains an 8B-parameter language model with Group Relative Policy Optimization (GRPO) and combines a multi-aspect quality reward with two correspondence rewards that explicitly encourage alignment with auxiliary context. Experiments on manuscripts across Computer, Biological, and Physical Sciences show that REM-CTX achieves the highest overall review quality among six baselines, outperforming other systems with substantially larger commercial models, and surpassing the next-best RL baseline across both quality and contextual grounding metrics. Ablation studies confirm that the two correspondence rewards are complementary: each selectively improves its targeted correspondence reward while preserving all quality dimensions, and the full model outperforms all partial variants. Analysis of training dynamics reveals that the criticism aspect is negatively correlated with other metrics during training, suggesting that future studies should group multi-dimension rewards for review generation.

69. 【2604.00242】FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval

Link: https://arxiv.org/abs/2604.00242

Authors: Antonín Jarolím, Martin Fajčík

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: identifies relevant documents, specific relevant spans, Document retrieval identifies, fine-grained evidence cues, provide fine-grained evidence

Notes:

Abstract:Document retrieval identifies relevant documents but does not provide fine-grained evidence cues, such as specific relevant spans. A possible solution is to apply an LLM after retrieval; however, this introduces significant computational overhead and limits practical deployment. We propose FGR-ColBERT, a modification of the ColBERT retrieval model that integrates fine-grained relevance signals distilled from an LLM directly into the retrieval function. Experiments on MS MARCO show that FGR-ColBERT (110M) achieves a token-level F1 of 64.5, exceeding the 62.8 of Gemma 2 (27B), despite being approximately 245 times smaller. At the same time, it preserves retrieval effectiveness (99% relative Recall@50) and remains efficient, incurring only a ~1.12x latency overhead compared to the original ColBERT.

70. 【2604.00239】A Taxonomy of Programming Languages for Code Generation

Link: https://arxiv.org/abs/2604.00239

Authors: Nishat Raihan, Christian Newman, Marcos Zampieri

Categories: Computation and Language (cs.CL)

Keywords: languages vary widely, motivating efforts, degree of resourcefulness, vary widely, efforts to systematically

Notes:

Abstract:The world's 7,000+ languages vary widely in the availability of resources for NLP, motivating efforts to systematically categorize them by their degree of resourcefulness (Joshi et al., 2020). A similar disparity exists among programming languages (PLs); however, no resource-tier taxonomy has been established for code. As large language models (LLMs) grow increasingly capable of generating code, such a taxonomy becomes essential. To fill this gap, we present the first reproducible PL resource classification, grouping 646 languages into four tiers. We show that only 1.9% of languages (Tier 3, High) account for 74.6% of all tokens in seven major corpora, while 71.7% of languages (Tier 0, Scarce) contribute just 1.0%. Statistical analyses of within-tier inequality, dispersion, and distributional skew confirm that this imbalance is both extreme and systematic. Our results provide a principled framework for dataset curation and tier-aware evaluation of multilingual LLMs.

71. 【2604.00228】Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries

Link: https://arxiv.org/abs/2604.00228

Authors: Tanay Gondil

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, refuse harmful requests, Claude Sonnet, Sonnet

Notes: 11 pages, 5 figures

Abstract:Large language models are trained to refuse harmful requests, but can they accurately predict when they will refuse before responding? We investigate this question through a systematic study where models first predict their refusal behavior, then respond in a fresh context. Across 3754 datapoints spanning 300 requests, we evaluate four frontier models: Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B. Using signal detection theory (SDT), we find that all models exhibit high introspective sensitivity (d' = 2.4-3.5), but sensitivity drops substantially at safety boundaries. We observe generational improvement within Claude (Sonnet 4.5: 95.7 percent accuracy vs Sonnet 4: 93.0 percent), while GPT-5.2 shows lower accuracy (88.9 percent) with more variable behavior. Llama 405B achieves high sensitivity but exhibits strong refusal bias and poor calibration, resulting in lower overall accuracy (80.0 percent). Topic-wise analysis reveals weapons-related queries are consistently hardest for introspection. Critically, confidence scores provide actionable signal: restricting to high-confidence predictions yields 98.3 percent accuracy for well-calibrated models, enabling practical confidence-based routing for safety-critical deployments.
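
The sensitivity index d' used in this abstract is the standard signal detection theory quantity: the z-score of the hit rate minus the z-score of the false-alarm rate. A minimal sketch follows. Mapping "hit" to "predicted refusal and did refuse" and "false alarm" to "predicted refusal but complied" is an assumption, as are the example rates; the paper's exact setup is not given here.

```python
from statistics import NormalDist

_Z = NormalDist().inv_cdf  # inverse CDF of the standard normal


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity: separation between signal and noise in z units.

    Both rates must lie strictly between 0 and 1, otherwise inv_cdf raises.
    """
    return _Z(hit_rate) - _Z(false_alarm_rate)


def criterion(hit_rate, false_alarm_rate):
    """Response bias c: zero means unbiased, negative means a tendency to say 'signal'."""
    return -0.5 * (_Z(hit_rate) + _Z(false_alarm_rate))


# Hypothetical rates, not taken from the paper.
print(round(d_prime(0.95, 0.05), 2))    # 3.29
print(round(criterion(0.95, 0.05), 2))  # 0.0 (may display as -0.0)
```

Separating d' from the criterion is what lets a model show high sensitivity and still have low accuracy through a strong bias toward predicting refusal, the pattern the abstract describes for Llama 405B.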

72. 【2604.00209】Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations

Link: https://arxiv.org/abs/2604.00209

Authors: Haoran Wang, Li Xiong, Kai Shu

Categories: Computation and Language (cs.CL)

Keywords: Large language models, Large language, frequently violate contextual, contextual privacy, high-stakes settings

Notes:

Abstract:Large language models (LLMs) are increasingly deployed in high-stakes settings, yet they frequently violate contextual privacy by disclosing private information in situations where humans would exercise discretion. This raises a fundamental question: do LLMs internally encode contextual privacy norms, and if so, why do violations persist? We present the first systematic study of contextual privacy as a structured latent representation in LLMs, grounded in contextual integrity (CI) theory. Probing multiple models, we find that the three norm-determining CI parameters (information type, recipient, and transmission principle) are encoded as linearly separable and functionally independent directions in activation space. Despite this internal structure, models still leak private information in practice, revealing a clear gap between concept representation and model behavior. To bridge this gap, we introduce CI-parametric steering, which independently intervenes along each CI dimension. This structured control reduces privacy violations more effectively and predictably than monolithic steering. Our results demonstrate that contextual privacy failures arise from misalignment between representation and behavior rather than missing awareness, and that leveraging the compositional structure of CI enables more reliable contextual privacy control, shedding light on potential improvement of contextual privacy understanding in LLMs.

73. 【2604.00174】Polish phonology and morphology through the lens of distributional semantics

Link: https://arxiv.org/abs/2604.00174

Authors: Paula Orzechowska, R. Harald Baayen

Categories: Computation and Language (cs.CL)

Keywords: Distributional Semantics, study investigates, meanings using Distributional, Linear Discriminant Analysis, phonological and morphological

Notes:

Abstract:This study investigates the relationship between the phonological and morphological structure of Polish words and their meanings using Distributional Semantics. In the present analysis, we ask whether there is a relationship between the form properties of words containing consonant clusters and their meanings. Is the phonological and morphonological structure of complex words mirrored in semantic space? We address these questions for Polish, a language characterized by non-trivial morphology and an impressive inventory of morphologically-motivated consonant clusters. We use statistical and computational techniques, such as t-SNE, Linear Discriminant Analysis and Linear Discriminative Learning, and demonstrate that -- apart from encoding rich morphosyntactic information (e.g. tense, number, case) -- semantic vectors capture information on sub-lexical linguistic units such as phoneme strings. First, phonotactic complexity, morphotactic transparency, and a wide range of morphosyntactic categories available in Polish (case, gender, aspect, tense, number) can be predicted from embeddings without requiring any information about the forms of words. Second, we argue that computational modelling with the discriminative lexicon model using embeddings can provide highly accurate predictions for comprehension and production, exactly because of the existence of extensive information in semantic space that is to a considerable extent isomorphic with structure in the form space.

74. 【2604.00136】ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving

Link: https://arxiv.org/abs/2604.00136

Authors: Annette Taberner-Miller

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Production LLM serving, Production LLM, LLM serving, multi-model portfolios spanning, serving often relies

Notes: 27 pages, 15 figures, 13 tables. Code available at [this https URL](https://github.com/ParetoBandit/ParetoBandit)

Abstract:Production LLM serving often relies on multi-model portfolios spanning a ~530x cost range, where routing decisions trade off quality against cost. This trade-off is non-stationary: providers revise pricing, model quality can regress silently, and new models must be integrated without downtime. We present ParetoBandit, an open-source adaptive router built on cost-aware contextual bandits that is the first to simultaneously enforce dollar-denominated budgets, adapt online to such shifts, and onboard new models at runtime. ParetoBandit closes these gaps through three mechanisms. An online primal-dual budget pacer enforces a per-request cost ceiling over an open-ended stream, replacing offline penalty tuning with closed-loop control. Geometric forgetting on sufficient statistics enables rapid adaptation to price and quality shifts while bootstrapping from offline priors. A hot-swap registry lets operators add or remove models at runtime, with a brief forced-exploration phase for each newcomer, after which UCB selection discovers its quality-cost niche from live traffic alone. We evaluate ParetoBandit across four deployment scenarios on 1,824 prompts routed through a three-model portfolio. Across seven budget ceilings, mean per-request cost never exceeds the target by more than 0.4%. When conditions shift, the system adapts: an order-of-magnitude price cut on the costliest model yields up to +0.071 quality lift, and a silent quality regression is detected and rerouted within budget. A cold-started model reaches meaningful adoption within ~142 steps without breaching the cost ceiling. The router discriminates rather than blindly adopting: expensive models are budget-gated and low-quality models rejected after bounded exploration. End-to-end routing latency is 9.8ms on CPU -- less than 0.4% of typical inference time -- with the routing decision itself taking just 22.5us.
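
The online primal-dual budget pacer mentioned in this abstract can be illustrated with a generic dual-ascent sketch. This is not ParetoBandit's implementation: it omits the contextual bandit, UCB exploration, and geometric forgetting, and the class name `BudgetPacer`, the step size, and the two-model portfolio are assumptions made for illustration.

```python
class BudgetPacer:
    """Keep mean per-request cost near a budget using a single dual variable."""

    def __init__(self, budget_per_request, step_size=0.05):
        self.budget = budget_per_request
        self.step_size = step_size
        self.dual = 0.0  # price of cost; grows while spending runs over budget

    def choose(self, models):
        # Primal step: pick the model with the best cost-penalized quality.
        # `models` maps name -> (estimated_quality, cost_per_request).
        return max(models, key=lambda m: models[m][0] - self.dual * models[m][1])

    def update(self, observed_cost):
        # Dual step: raise the price after overspending, lower it otherwise,
        # and never let it go negative.
        self.dual = max(0.0, self.dual + self.step_size * (observed_cost - self.budget))


# Hypothetical portfolio: (quality, cost) per request.
MODELS = {"small": (0.6, 1.0), "large": (0.9, 10.0)}


def simulate(steps, budget=4.0):
    """Route `steps` requests and return the mean cost per request."""
    pacer = BudgetPacer(budget)
    total = 0.0
    for _ in range(steps):
        cost = MODELS[pacer.choose(MODELS)][1]
        pacer.update(cost)
        total += cost
    return total / steps


print(simulate(300))  # mean cost stays close to the 4.0 budget
```

The dual variable acts as a running price on cost, so the budget is tracked by closed-loop feedback rather than by a penalty weight tuned offline, which is the contrast the abstract draws.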

75. 【2604.00131】Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation

Link: https://arxiv.org/abs/2604.00131

Authors: Ashish Rana, Chia-Chien Hung, Qumeng Sun, Julian Martin Kunkel, Carolin Lawrence

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Human memory adapts, Human memory, contextual cues, accessible over time, memory

Notes: 7 pages, 2 figures, and 4 tables

Abstract:Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues. In contrast, memory-augmented LLM agents rely on "always-on" retrieval and "flat" memory storage, causing high interference and latency as histories grow. We introduce Oblivion, a memory control framework that casts forgetting as decay-driven reductions in accessibility, not explicit deletion. Oblivion decouples memory control into read and write paths. The read path decides when to consult memory, based on agent uncertainty and memory buffer sufficiency, avoiding redundant always-on access. The write path decides what to strengthen, by reinforcing memories contributing to forming the response. Together, this enables hierarchical memory organization that maintains persistent high-level strategies while dynamically loading details as needed. We evaluate on both static and dynamic long-horizon interaction benchmarks. Results show that Oblivion dynamically adapts memory access and reinforcement, balancing learning and forgetting under shifting contexts, highlighting that memory control is essential for effective LLM-agentic reasoning. The source code is available at this https URL.

76. 【2604.00130】Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency

Link: https://arxiv.org/abs/2604.00130

Authors: Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi

Categories: Computation and Language (cs.CL)

Keywords: large language models, significantly improved, capabilities of large, large language, reasoning

Comments:

Click to view abstract

Abstract:Chain-of-Thought (CoT) prompting has significantly improved the reasoning capabilities of large language models (LLMs). However, conventional CoT often relies on unstructured, flat reasoning chains that suffer from redundancy and suboptimal performance. In this work, we introduce Hierarchical Chain-of-Thought (Hi-CoT) prompting, a structured reasoning paradigm specifically designed to address the challenges of complex, multi-step reasoning. Hi-CoT decomposes the reasoning process into hierarchical substeps by alternating between instructional planning and step-by-step execution. This decomposition enables LLMs to better manage long reasoning horizons and maintain logical coherence. Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9% compared to CoT prompting. We further show that accuracy and efficiency are maximized when models strictly adhere to the hierarchical structure. Our code is available at this https URL.

77. 【2604.00086】Hierarchical Pre-Training of Vision Encoders with Large Language Models

Link: https://arxiv.org/abs/2604.00086

Authors: Eugene Lee, Ting-Yu Chang, Jui-Huang Tsai, Jiajie Diao, Chen-Yi Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: experienced significant advancements, scalable vision encoders, vision encoders, vision encoder, treat vision encoders

Comments: 17 pages, 14 figures, accepted to Computer Vision and Pattern Recognition Conference (CVPR) Workshops 2026. 5th MMFM Workshop: What is Next in Multimodal Foundation Models?

Click to view abstract

Abstract:The field of computer vision has experienced significant advancements through scalable vision encoders and multimodal pre-training frameworks. However, existing approaches often treat vision encoders and large language models (LLMs) as independent modules, limiting the integration of hierarchical visual features. In this work, we propose HIVE (Hierarchical Pre-Training of Vision Encoders), a novel framework that enhances vision-language alignment by introducing hierarchical cross-attention between the vision encoder and LLM. Unlike conventional methods that flatten image embeddings, HIVE enables structured feature fusion across multiple layers, improving gradient flow and representation learning. To optimize this interaction, we introduce a three-stage training strategy that progressively aligns the vision encoder with the LLM, ensuring stable optimization and effective multimodal fusion. Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and ScienceQA. Our results highlight the benefits of hierarchical feature integration, paving the way for more efficient and expressive vision-language models.

78. 【2604.00085】One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction

Link: https://arxiv.org/abs/2604.00085

Authors: Yuxing Lu, Yushuhong Lin, Jason Zhang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Large language models, exhibit case-level heterogeneity, yield consistent outputs, Large language, language models applied

Comments:

Click to view abstract

Abstract:Large language models applied to clinical prediction exhibit case-level heterogeneity: simple cases yield consistent outputs, while complex cases produce divergent predictions under minor prompt changes. Existing single-agent strategies sample from one role-conditioned distribution, and multi-agent frameworks use fixed roles with flat majority voting, discarding the diagnostic signal in disagreement. We propose CAMP (Case-Adaptive Multi-agent Panel), where an attending-physician agent dynamically assembles a specialist panel tailored to each case's diagnostic uncertainty. Each specialist evaluates candidates via three-valued voting (KEEP/REFUSE/NEUTRAL), enabling principled abstention outside one's expertise. A hybrid router directs each diagnosis through strong consensus, fallback to the attending physician's judgment, or evidence-based arbitration that weighs argument quality over vote counts. On diagnostic prediction and brief hospital course generation from MIMIC-IV across four LLM backbones, CAMP consistently outperforms strong baselines while consuming fewer tokens than most competing multi-agent methods, with voting records and arbitration traces offering transparent decision audits.
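The three-valued voting and routing described above can be pictured with a small sketch. The routing rule, the `consensus_ratio` threshold, and the returned labels are hypothetical choices made for illustration; the abstract does not specify how CAMP's hybrid router decides between its three paths.

```python
from collections import Counter

def route(votes, consensus_ratio=0.75):
    """Pick a decision path from specialist votes (KEEP / REFUSE / NEUTRAL).

    Assumed rule: NEUTRAL votes are abstentions and are ignored when
    measuring agreement among the remaining votes.
    """
    counts = Counter(votes)
    decided = counts["KEEP"] + counts["REFUSE"]
    if decided == 0:
        # Every specialist abstained: defer to the attending physician agent.
        return "attending_fallback"
    top = max(counts["KEEP"], counts["REFUSE"])
    if top / decided >= consensus_ratio:
        if counts["KEEP"] > counts["REFUSE"]:
            return "consensus_keep"
        return "consensus_refuse"
    # Specialists disagree: hand the case to evidence-based arbitration.
    return "arbitration"

unanimous = route(["KEEP", "KEEP", "KEEP", "NEUTRAL"])
abstained = route(["NEUTRAL", "NEUTRAL"])
split = route(["KEEP", "REFUSE", "KEEP", "REFUSE"])
```

Treating NEUTRAL as an abstention rather than a half-vote is what lets a specialist step aside outside their expertise without diluting the consensus signal.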

79. 【2604.00073】Terminal Agents Suffice for Enterprise Automation

Link: https://arxiv.org/abs/2604.00073

Authors: Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam, Srinivas Sunkara, Vikas Yadav, Sai Rajeswar

Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Model Context Protocol, execute meaningful enterprise, growing interest, interest in building, interact with digital

Comments: Pre-print. Under review for COLM2026

Click to view abstract

Abstract:There has been growing interest in building agents that can interact with digital platforms to execute meaningful enterprise tasks autonomously. Among the approaches explored are tool-augmented agents built on abstractions such as Model Context Protocol (MCP) and web agents that operate through graphical interfaces. Yet, it remains unclear whether such complex agentic systems are necessary given their cost and operational overhead. We argue that a coding agent equipped only with a terminal and a filesystem can solve many enterprise tasks more effectively by interacting directly with platform APIs. We evaluate this hypothesis across diverse real-world systems and show that these low-level terminal agents match or outperform more complex agent architectures. Our findings suggest that simple programmatic interfaces, combined with strong foundation models, are sufficient for practical enterprise automation.

80. 【2604.00027】Multi-lingual Multi-institutional Electronic Health Record based Predictive Model

Link: https://arxiv.org/abs/2604.00027

Authors: Kyunghoon Hur, Heeyoung Kwak, Jinsu Jang, Nakhwan Kim, Edward Choi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large-scale EHR prediction, Large-scale EHR, Common Data Models, code systems, Common Data

Comments: On revision stage, 10 main pages, 3 supplementary pages

Click to view abstract

Abstract:Large-scale EHR prediction across institutions is hindered by substantial heterogeneity in schemas and code systems. Although Common Data Models (CDMs) can standardize records for multi-institutional learning, the manual harmonization and vocabulary mapping are costly and difficult to scale. Text-based harmonization provides an alternative by converting raw EHR into a unified textual form, enabling pooled learning without explicit standardization. However, applying this paradigm to multi-national datasets introduces an additional layer of heterogeneity, language, which must be addressed for truly scalable EHR learning. In this work, we investigate multilingual multi-institutional learning for EHR prediction, aiming to enable pooled training across multinational ICU datasets without manual standardization. We compare two practical strategies for handling language barriers: (i) directly modeling multilingual records with multilingual encoders, and (ii) translating non-English records into English via LLM-based word-level translation. Across seven public ICU datasets and ten clinical tasks with multiple prediction windows, translation-based lingual alignment yields more reliable cross-dataset performance than multilingual encoders. The multi-institutional learning model consistently outperforms strong baselines that require manual feature selection and harmonization, and also surpasses single-dataset training. We further demonstrate that the text-based framework with lingual alignment effectively performs transfer learning via few-shot fine-tuning, with additional gains. To our knowledge, this is the first study to aggregate multilingual multinational ICU EHR datasets into one predictive model, providing a scalable path toward language-agnostic clinical prediction and future global multi-institutional EHR research.

81. 【2604.00026】"Who Am I, and Who Else Is Here?" Behavioral Differentiation Without Role Assignment in Multi-Agent LLM Systems

Link: https://arxiv.org/abs/2604.00026

Authors: Houssam EL Kandoussi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: multiple large language, develop differentiated social, differentiated social roles, large language models, language models interact

Comments: 9 pages, 11 figures, 5 tables

Click to view abstract

Abstract:When multiple large language models interact in a shared conversation, do they develop differentiated social roles or converge toward uniform behavior? We present a controlled experimental platform that orchestrates simultaneous multi-agent discussions among 7 heterogeneous LLMs on a unified inference backend, systematically varying group composition, naming conventions, and prompt structure across 12 experimental series (208 runs, 13,786 coded messages). Each message is independently coded on six behavioral flags by two LLM judges from distinct model families (Gemini 3.1 Pro and Claude Sonnet 4.6), achieving mean Cohen's kappa = 0.78 with conservative intersection-based adjudication. Human validation on 609 randomly stratified messages confirmed coding reliability (mean kappa = 0.73 vs. Gemini). We find that (1) heterogeneous groups exhibit significantly richer behavioral differentiation than homogeneous groups (cosine similarity 0.56 vs. 0.85; p < 10^-5, r = 0.70); (2) groups spontaneously exhibit compensatory response patterns when an agent crashes; (3) revealing real model names significantly increases behavioral convergence (cosine 0.56 to 0.77, p = 0.001); and (4) removing all prompt scaffolding converges profiles to homogeneous-level similarity (p < 0.001). Critically, these behaviors are absent when agents operate in isolation, confirming that behavioral diversity is a structured, reproducible phenomenon driven by the interaction of architectural heterogeneity, group context, and prompt-level scaffolding.

82. 【2604.00025】Brevity Constraints Reverse Performance Hierarchies in Language Models

Link: https://arxiv.org/abs/2604.00025

Authors: MD Azizul Hakim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Standard evaluation protocols, larger language models, Standard evaluation, language models underperform, models underperform smaller

Comments:

Click to view abstract

Abstract:Standard evaluation protocols reveal a counterintuitive phenomenon: on 7.7% of benchmark problems spanning five datasets, larger language models underperform smaller ones by 28.4 percentage points despite 10-100x more parameters. Through systematic evaluation of 31 models (0.5B-405B parameters) across 1,485 problems, we identify the mechanism as spontaneous scale-dependent verbosity that introduces errors through overelaboration. Causal intervention experiments demonstrate this reflects correctable prompt design rather than fundamental capability limitations. Constraining large models to produce brief responses improves accuracy by 26 percentage points and reduces performance gaps by up to two-thirds. Most critically, brevity constraints completely reverse performance hierarchies on mathematical reasoning and scientific knowledge benchmarks, with large models achieving 7.7-15.9 percentage point advantages over small models -- direct inversions of the original gaps. These reversals prove large models possess superior latent capabilities that universal prompting masks. We validate findings through three independent contamination tests and demonstrate inverse scaling operates continuously across the full parameter spectrum, with dataset-specific optimal scales ranging from 0.5B to 3.0B parameters. Our results establish that maximizing large model performance requires scale-aware prompt engineering rather than universal evaluation protocols, with immediate implications for deployment: prompt adaptation simultaneously improves accuracy and reduces computational costs.

83. 【2604.00024】WHBench: Evaluating Frontier LLMs with Expert-in-the-Loop Validation on Women's Health Topics

Link: https://arxiv.org/abs/2604.00024

Authors: Sneha Maurya, Pragya Saboo, Girish Kumar

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Large language models, health remains under-evaluated, Large language, women health remains, women health

Comments:

Click to view abstract

Abstract:Large language models are increasingly used for medical guidance, but women's health remains under-evaluated in benchmark design. We present the Women's Health Benchmark (WHBench), a targeted evaluation suite of 47 expert-crafted scenarios across 10 women's health topics, designed to expose clinically meaningful failure modes including outdated guidelines, unsafe omissions, dosing errors, and equity-related blind spots. We evaluate 22 models using a 23-criterion rubric spanning clinical accuracy, completeness, safety, communication quality, instruction following, equity, uncertainty handling, and guideline adherence, with safety-weighted penalties and server-side score recalculation. Across 3,102 attempted responses (3,100 scored), no model's mean performance exceeds 75 percent; the best model reaches 72.1 percent. Even top models show low fully correct rates and substantial variation in harm rates. Inter-rater reliability is moderate at the response label level but high for model ranking, supporting WHBench's utility for comparative system evaluation while highlighting the need for expert oversight in clinical deployment. WHBench provides a public, failure-mode-aware benchmark to track safer and more equitable progress in women's health AI.

84. 【2604.00023】Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon

Link: https://arxiv.org/abs/2604.00023

Authors: Mukhlis Amien, Go Frendi Gunawan

Categories: Computation and Language (cs.CL)

Keywords: non-conforming vocabulary represents, Basic Vocabulary Database, Austronesian Basic Vocabulary, Basic vocabulary, phonological patterns inconsistent

Comments: 31 pages, 4 figures, 5 tables. Submitted to Oceanic Linguistics

Click to view abstract

Abstract:Basic vocabulary in many Sulawesi Austronesian languages includes forms resisting reconstruction to any proto-form with phonological patterns inconsistent with inherited roots, but whether this non-conforming vocabulary represents pre-Austronesian substrate or independent innovation has not been tested computationally. We combine rule-based cognate subtraction with a machine learning classifier trained on phonological features. Using 1,357 forms from six Sulawesi languages in the Austronesian Basic Vocabulary Database, we identify 438 candidate substrate forms (26.5%) through cognate subtraction and Proto-Austronesian cross-checking. An XGBoost classifier trained on 26 phonological features distinguishes inherited from non-mainstream forms with AUC=0.763, revealing a phonological fingerprint: longer forms, more consonant clusters, higher glottal stop rates, and fewer Austronesian prefixes. Cross-method consensus (Cohen's kappa=0.61) identifies 266 high-confidence non-mainstream candidates. However, clustering yields no coherent word families (silhouette=0.114; cross-linguistic cognate test p=0.569), providing no evidence for a single pre-Austronesian language layer. Application to 16 additional languages confirms geographic patterning: Sulawesi languages show higher predicted non-mainstream rates (mean P_sub=0.606) than Western Indonesian languages (0.393). This study demonstrates that phonological machine learning can complement traditional comparative methods in detecting non-mainstream lexical layers, while cautioning against interpreting phonological non-conformity as evidence for a shared substrate language.

85. 【2604.00022】Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce

Link: https://arxiv.org/abs/2604.00022

Authors: Liang Chen, Qi Liu, Wenhuan Lin, Feng Liang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: remains largely untested, Multi-dimensional rubric-based dialogue, Multi-dimensional rubric-based, meant to serve, remains largely

Comments:

Click to view abstract

Abstract:Multi-dimensional rubric-based dialogue evaluation is widely used to assess conversational AI, yet its criterion validity -- whether quality scores are associated with the downstream outcomes they are meant to serve -- remains largely untested. We address this gap through a two-phase study on a major Chinese matchmaking platform, testing a 7-dimension evaluation rubric (implemented via LLM-as-Judge) against verified business conversion. Our findings concern rubric design and weighting, not LLM scoring accuracy: any judge using the same rubric would face the same structural issue. The core finding is dimension-level heterogeneity: in Phase 2 (n=60 human conversations, stratified sample, verified labels), Need Elicitation (D1: rho=0.368, p=0.004) and Pacing Strategy (D3: rho=0.354, p=0.006) are significantly associated with conversion after Bonferroni correction, while Contextual Memory (D5: rho=0.018, n.s.) shows no detectable association. This heterogeneity causes the equal-weighted composite (rho=0.272) to underperform its best dimensions -- a composite dilution effect that conversion-informed reweighting partially corrects (rho=0.351). Logistic regression controlling for conversation length confirms D3's association strengthens (OR=3.18, p=0.006), ruling out a length confound. An initial pilot (n=14) mixing human and AI conversations had produced a misleading "evaluation-outcome paradox," which Phase 2 revealed as an agent-type confound artifact. Behavioral analysis of 130 conversations through a Trust-Funnel framework identifies a candidate mechanism: AI agents execute sales behaviors without building user trust. We operationalize these findings in a three-layer evaluation architecture and advocate criterion validity testing as standard practice in applied dialogue evaluation.

86. 【2604.00021】How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models

Link: https://arxiv.org/abs/2604.00021

Authors: Hiroki Fukui

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: Alignment safety research, instructions remains unknown, improve model behavior, models internally process, safety research assumes

Comments: 34 pages, 7 figures, 4 tables. Preprint. OSF pre-registration: [this http URL](http://osf.io/4n5uf) . Companion paper: [arXiv:2603.04904](https://arxiv.org/abs/2603.04904)

Click to view abstract

Abstract:Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study ($\mathrm{BF}_{10} > 10$ for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) -- revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive Repetition (Llama; high consistency through formulaic repetition), Critical Internalization (Qwen; deep deliberation, incomplete integration), and Principled Consistency (Sonnet; deliberation, consistency, and other-recognition co-occurring). The central finding is an interaction between processing capacity and instruction format: in low-DD models, instruction format has no effect on internal processing; in high-DD models, reasoned norms and virtue framing produce opposite effects. Lexical compliance with ethical instructions did not correlate with any processing metric at the cell level ($r = -0.161$ to $+0.256$, all $p > .22$; $N = 24$; power limited), suggesting that safety, compliance, and ethical processing are largely dissociable. These processing types show structural correspondence to patterns observed in clinical offender treatment, where formal compliance without internal processing is a recognized risk signal.

87. 【2604.00020】Detecting Abnormal User Feedback Patterns through Temporal Sentiment Aggregation

Link: https://arxiv.org/abs/2604.00020

Authors: Yalun Qi, Sichen Zhao, Zhiming Xue, Xianling Zeng, Zihan Yu

Categories: Computation and Language (cs.CL)

Keywords: brand reputation management, product health tracking, malicious review campaigns, brand reputation, reputation management

Comments:

Click to view abstract

Abstract:In many real-world applications, such as customer feedback monitoring, brand reputation management, and product health tracking, understanding the temporal dynamics of user sentiment is crucial for early detection of anomalous events such as malicious review campaigns or sudden declines in user satisfaction. Traditional sentiment analysis methods focus on individual text classification, which is insufficient to capture collective behavioral shifts over time due to inherent noise and class imbalance in short user comments. In this work, we propose a temporal sentiment aggregation framework that leverages pretrained transformer-based language models to extract per-comment sentiment signals and aggregates them into time-window-level scores. Significant downward shifts in these aggregated scores are interpreted as potential anomalies in user feedback patterns. We adopt RoBERTa as our core semantic feature extractor and demonstrate, through empirical evaluation on real social media data, that the aggregated sentiment scores reveal meaningful trends and support effective anomaly detection. Experiments on real-world social media data demonstrate that our method successfully identifies statistically significant sentiment drops that correspond to coherent complaint patterns, providing an effective and interpretable solution for feedback anomaly monitoring.
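The aggregation step described above can be sketched in a few lines: per-comment sentiment scores are averaged inside fixed time windows, and a window is flagged when its mean falls sharply relative to the previous one. The scores here are made-up numbers standing in for model outputs, and the fixed `min_drop` threshold is a simplifying assumption; the paper refers to statistically significant drops, whose test is not given in the abstract.

```python
from collections import defaultdict
from statistics import mean

def window_scores(comments, window_seconds):
    """Average per-comment sentiment within fixed-width time windows.

    `comments` is an iterable of (unix_timestamp, sentiment_score) pairs.
    Returns a list of (window_index, mean_score) sorted by window index.
    """
    buckets = defaultdict(list)
    for timestamp, score in comments:
        buckets[int(timestamp // window_seconds)].append(score)
    return [(window, mean(scores)) for window, scores in sorted(buckets.items())]

def flag_drops(scores, min_drop):
    """Windows whose mean fell by at least `min_drop` versus the previous window."""
    flagged = []
    for (_, previous), (window, current) in zip(scores, scores[1:]):
        if previous - current >= min_drop:
            flagged.append(window)
    return flagged

# Hypothetical (timestamp, sentiment) pairs; sentiment in [-1, 1].
comments = [(0, 0.8), (30, 0.6), (70, 0.7), (100, 0.5), (130, -0.4), (170, -0.6)]
scores = window_scores(comments, window_seconds=60)
drops = flag_drops(scores, min_drop=0.5)
```

Averaging within a window smooths out the noise of individual short comments, which is why the shift shows up at the window level even when single-comment labels are unreliable.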

88. 【2604.00019】The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation

Link: https://arxiv.org/abs/2604.00019

Authors: Pavel Braslavski, Dmitrii Iarosh, Nikita Sushko, Andrey Sakhovskiy, Vasily Konovalov, Elena Tutubalina, Alexander Panchenko

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: generating multilingual sets, English and Chinese, Wikipedia and Wikidata, configurable pipeline, pipeline for generating

Comments: Accepted to LREC 2026

Click to view abstract

Abstract:We present a configurable pipeline for generating multilingual sets of entities with specified characteristics, such as domain, geographical location and popularity, using data from Wikipedia and Wikidata. These datasets are intended for evaluating the factuality of LLMs' long-form generation, thereby complementing evaluation based on short-form QA datasets. We present the RiDiC dataset as an example of this approach. RiDiC contains 3,000 entities from three domains -- rivers, natural disasters, and car models -- spanning different popularity tiers. Each entity is accompanied by its geographical location, English and Chinese names (if available) and relevant English and Chinese Wikipedia content, which is used to evaluate LLMs' responses. Generations about RiDiC entities were obtained from three LLMs in English and Chinese. These were then evaluated using a third-party factuality checker, which showed that entities from our dataset caused even frontier models to hallucinate. To facilitate the evaluation of LLMs' long-form factuality in multiple languages, the code, data, and generation/evaluation scripts have been released.

89. 【2604.00018】Think Twice Before You Write -- an Entropy-based Decoding Strategy to Enhance LLM Reasoning

Link: https://arxiv.org/abs/2604.00018

Authors: Jiashu He, Meizhu Liu, Olaitan P Olaleye, Amit Agarwal, M. Avendi, Yassi Abbasi, Matthew Rowe, Hitesh Laxmichand Patel, Paul Li, Tao Sheng, Sujith Ravi, Dan Roth

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Decoding strategies play, large language models, strategies play, play a central, central role

Comments:

Click to view abstract

Abstract:Decoding strategies play a central role in shaping the reasoning ability of large language models (LLMs). Traditional methods such as greedy decoding and beam search often suffer from error propagation, while sampling-based approaches introduce randomness without adequate robustness. Self-consistency improves reliability by aggregating multiple rollouts, but incurs significant computational overhead. We propose an entropy-guided decoding framework that introduces token-level adaptivity into generation. At each step, the model computes the entropy of the token distribution, identifies high-uncertainty positions, and selectively branches on these vulnerable points. A dynamic pool of partial rollouts is maintained and expanded until solutions are completed, concentrating computation where uncertainty is greatest and avoiding unnecessary exploration in confident regions. To enable efficient termination, we apply a rollout-level Entropy After </Think> (EAT) stopping criterion by performing entropy evaluation after the full reasoning trace, rather than incrementally at every step. Experiments on GSM8K, AMC2023, and their perturbed variants demonstrate that our method achieves consistently strong accuracy. Notably, on smaller LLMs, performance is comparable to GPT-5 while operating at a fraction of the cost.
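To make the per-step entropy check concrete, the sketch below computes the Shannon entropy of each next-token distribution and returns the steps that exceed a threshold, which is where a decoder of this kind would branch. The distributions, the threshold value, and the function names are assumptions for illustration only; the paper's actual branching and pooling logic is not reproduced here.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def branch_positions(step_distributions, threshold):
    """Indices of decoding steps whose entropy exceeds `threshold`."""
    return [
        index
        for index, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold
    ]

# Toy distributions over a 4-token vocabulary (made-up numbers).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.70, 0.20, 0.05, 0.05],  # moderately uncertain step
]
high_uncertainty = branch_positions(steps, threshold=1.0)
```

With a 4-token vocabulary the maximum possible entropy is ln 4 (about 1.386 nats), so a threshold of 1.0 selects only the uniform step in this toy input.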

90. 【2604.00017】Semantic Shifts of Psychological Concepts in Scientific and Popular Media Discourse: A Distributional Semantics Analysis of Russian-Language Corpora

Link: https://arxiv.org/abs/2604.00017

Authors: Orlova Anastasia

Categories: Computation and Language (cs.CL)

Keywords: applied to Russian-language, Russian-language corpora, article examines semantic, Saint Petersburg University, distributional semantics applied

Comments:

Click to view abstract

Abstract:This article examines semantic shifts in psychological concepts across scientific and popular media discourse using methods of distributional semantics applied to Russian-language corpora. Two corpora were compiled: a scientific corpus of approximately 300 research articles from the journals "Psychology. Journal of the Higher School of Economics" and "Vestnik of Saint Petersburg University. Psychology" (767,543 tokens) and a popular science corpus consisting of texts from the online psychology platforms Yasno and Chistye kognitsii (1,199,150 tokens). After preprocessing (OCR recognition, lemmatization, removal of stop words and non-informative characters), the corpora were analyzed through frequency analysis, clustering, and the identification of semantic associations. The results reveal significant differences in vocabulary and conceptual framing between the two discourse types: scientific texts emphasize methodological and clinical terminology, while popular science materials foreground everyday experience and therapeutic practice. A comparison of semantic associations for key concepts such as burnout and depression shows that scientific discourse links these terms to psychological resources, symptomatology, and diagnostic constructs, whereas popular science discourse frames them through personal narratives, emotions, and everyday situations. These findings demonstrate a clear shift from precise professional terminology toward more generalized and experiential meanings in popular media discourse and confirm the effectiveness of distributional semantics methods for identifying semantic transformations of psychological concepts across different communicative contexts.

91. 【2604.00016】Are they human? Detecting large language models by probing human memory constraints

Link: https://arxiv.org/abs/2604.00016

Authors: Simon Schug, Brenden M. Lake

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: behavioral research relies, online behavioral research, relies on study, behavioral research, online behavioral

Comments: Code available at [this https URL](https://github.com/smonsays/llm-humanness)

Click to view abstract

Abstract:The validity of online behavioral research relies on study participants being human rather than machine. In the past, it was possible to detect machines by posing simple challenges that were easily solved by humans but not by machines. General-purpose agents based on large language models (LLMs) can now solve many of these challenges, threatening the validity of online behavioral research. Here we explore the idea of detecting humanness by using tasks that machines can solve too well to be human. Specifically, we probe for the existence of an established human cognitive constraint: limited working memory capacity. We show that cognitive modeling on a standard serial recall task can be used to distinguish online participants from LLMs even when the latter are specifically instructed to mimic human working memory constraints. Our results demonstrate that it is viable to use well-established cognitive phenomena to distinguish LLMs from humans.

92. 【2604.00015】ASCAT: An Arabic Scientific Corpus and Benchmark for Advanced Translation Evaluation

Link: https://arxiv.org/abs/2604.00015

Authors: Serry Sibaee, Khloud Al Jallad, Zineb Yousfi, Israa Elsayed Elhosiny, Yousra El-Ghawi, Batool Balah, Omer Nacar

Categories: Computation and Language (cs.CL)

Keywords: human validation pipeline, high-quality English-Arabic parallel, systematic multi-engine translation, English-Arabic parallel benchmark, Advanced Translation

Comments:

Click to view abstract

Abstract:We present ASCAT (Arabic Scientific Corpus for Advanced Translation), a high-quality English-Arabic parallel benchmark corpus designed for scientific translation evaluation and constructed through a systematic multi-engine translation and human validation pipeline. Unlike existing Arabic-English corpora that rely on short sentences or single-domain text, ASCAT targets full scientific abstracts averaging 141.7 words (English) and 111.78 words (Arabic), drawn from five scientific domains: physics, mathematics, computer science, quantum mechanics, and artificial intelligence. Each abstract was translated using three complementary architectures: generative AI (Gemini), transformer-based models (Hugging Face $\texttt{quickmt-en-ar}$), and commercial MT APIs (Google Translate, DeepL). The translations were subsequently validated by domain experts at the lexical, syntactic, and semantic levels. The resulting corpus contains 67,293 English tokens and 60,026 Arabic tokens, with an Arabic vocabulary of 17,604 unique words reflecting the morphological richness of the language. We benchmark three state-of-the-art LLMs on the corpus: GPT-4o-mini (BLEU: 37.07), Gemini-3.0-Flash-Preview (BLEU: 30.44), and Qwen3-235B-A22B (BLEU: 23.68), demonstrating its discriminative power as an evaluation benchmark. ASCAT addresses a critical gap in scientific MT resources for Arabic and is designed to support rigorous evaluation of scientific translation quality and training of domain-specific translation models.

93. 【2604.00014】Disentangling Prompt Element Level Risk Factors for Hallucinations and Omissions in Mental Health LLM Responses

Link: https://arxiv.org/abs/2604.00014

Authors: Congning Ni, Sarvech Qadir, Bryan Steitz, Mihir Sachin Vaidya, Qingyuan Song, Lantian Xia, Shelagh Mulvaney, Siru Liu, Hyeyoung Ryu, Leah Hecht, Amy Bucher, Christopher Symons, Laurie Novak, Susannah L. Rose, Murat Kantarcioglu, Bradley Malin, Zhijun Yin

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Mental health concerns, Mental health, mental health question, health concerns, Consumer health informatics

Comments: Submitted to AMIA 2026 Annual Symposium (under review)

Click to view abstract

Abstract:Mental health concerns are often expressed outside clinical settings, including in high-distress help seeking, where safety-critical guidance may be needed. Consumer health informatics systems increasingly incorporate large language models (LLMs) for mental health question answering, yet many evaluations underrepresent narrative, high-distress inquiries. We introduce UTCO (User, Topic, Context, Tone), a prompt construction framework that represents an inquiry as four controllable elements for systematic stress testing. Using 2,075 UTCO-generated prompts, we evaluated Llama 3.3 and annotated hallucinations (fabricated or incorrect clinical content) and omissions (missing clinically necessary or safety-critical guidance). Hallucinations occurred in 6.5% of responses and omissions in 13.2%, with omissions concentrated in crisis and suicidal ideation prompts. Across regression, element-specific matching, and similarity-matched comparisons, failures were most consistently associated with context and tone, while user-background indicators showed no systematic differences after balancing. These findings support evaluating omissions as a primary safety outcome and moving beyond static benchmark question sets.

94. 【2604.00013】MSA-Thinker: Discrimination-Calibration Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

Link: https://arxiv.org/abs/2604.00013

Authors: Miaosen Luo, Zhenhao Yang, Jieshen Long, Jinghu Sun, Yichu Liu, Sijie Mai

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, understand human emotions, Hint-based Reinforcement Learning, Reinforcement Learning, Large Language Models

Comments:

Click to view abstract

Abstract:Multimodal sentiment analysis aims to understand human emotions by integrating textual, auditory, and visual modalities. Although Multimodal Large Language Models (MLLMs) have achieved state-of-the-art performance via supervised fine-tuning (SFT), their end-to-end "black-box" nature limits interpretability. Existing methods incorporating Chain-of-Thought (CoT) reasoning are hindered by high annotation costs, while Reinforcement Learning (RL) faces challenges such as low exploration efficiency and sparse rewards, particularly on hard samples. To address these issues, we propose a novel training framework that integrates structured Discrimination-Calibration (DC) reasoning with Hint-based Reinforcement Learning. First, we perform cold-start SFT using high-quality CoT data synthesized by a teacher model (Qwen3Omni-30B), which inherently contains the DC structure. This equips the model with a reasoning paradigm that performs macro discrimination followed by fine-grained calibration from the initial stage. Building on this, we propose Hint-GRPO, which leverages the discrimination phase within the DC structure as a verifiable anchor during RL to provide directional hints for hard samples, guiding policy optimization and effectively mitigating the reward sparsity problem. Experiments on the Qwen2.5Omni-7B model demonstrate that our method not only achieves higher accuracy in fine-grained sentiment regression tasks but also generates high-quality structured reasoning chains. Crucially, it exhibits superior generalization capability in cross-domain evaluations. This enhances model interpretability while validating the positive contribution of explicit reasoning steps to model robustness, offering a new paradigm for building trustworthy and efficient sentiment analysis systems.

95. 【2604.00012】Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms

Link: https://arxiv.org/abs/2604.00012

Authors: Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, general-purpose large language, large language, general large language, specific tasks

Comments:

Click to view abstract

Abstract:Despite the impressive performance of general-purpose large language models (LLMs), they often require fine-tuning or post-training to excel at specific tasks. For instance, large reasoning models (LRMs), such as the DeepSeek-R1 series, demonstrate strong reasoning capabilities after post-training different general large language models on diverse chain-of-thought (CoT) datasets. However, this additional training frequently comes at the cost of reduced safety, as the fine-tuned or post-trained models tend to exhibit more harmful behaviors compared with the regular LLMs before post-training or fine-tuning, potentially leading to harmful outcomes due to their enhanced capabilities. Taking LRMs as an example, we first investigate the underlying cause of this safety degradation in this paper. Our analysis reveals that post-training can mask the original safety mechanisms of the base LLM, while over-amplifying representations related to their post-training ability. But luckily, we also find that LRMs' safety mechanisms still exist instead of being removed during their post-training. Based on these findings, we propose a lightweight and cost-effective solution called SafeReAct that restores the suppressed safety behaviors by aligning with LoRA adapters on a few layers. Experiments on four state-of-the-art LRMs show that our method significantly improves safety on harmful prompts without compromising reasoning performance. Besides LRMs, additional results on other domain-specific LLMs, like medical models, further confirm the generality and effectiveness of our approach.

96. 【2604.00010】Can LLMs Perceive Time? An Empirical Investigation

Link: https://arxiv.org/abs/2604.00010

Authors: Aniketh Garikaparthi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large language models, Large language, Large, Abstract, language models

Comments: ICLR 2026 I Can't Believe It's Not Better Workshop

Click to view abstract

Abstract:Large language models cannot estimate how long their own tasks take. We investigate this limitation through four experiments across 68 tasks and four model families. Pre-task estimates overshoot actual duration by 4--7$\times$ ($p < 0.001$), with models predicting human-scale minutes for tasks completing in seconds. Relative ordering fares no better: on task pairs designed to expose heuristic reliance, models score at or below chance (GPT-5: 18% on counter-intuitive pairs, $p = 0.033$), systematically failing when complexity labels mislead. Post-hoc recall is disconnected from reality -- estimates diverge from actuals by an order of magnitude in either direction. These failures persist in multi-step agentic settings, with errors of 5--10$\times$. The models possess propositional knowledge about duration from training but lack experiential grounding in their own inference time, with practical implications for agent scheduling, planning and time-critical scenarios.

97. 【2604.00009】Eyla: Toward an Identity-Anchored LLM Architecture with Integrated Biological Priors -- Vision, Implementation Attempt, and Lessons from AI-Assisted Development

Link: https://arxiv.org/abs/2604.00009

Authors: Arif Aditto

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: episodic memory retrieval, including HiPPO-initialized state-space, unified agent operating, calibrated uncertainty training, proposed identity-anchored LLM

Comments: 8 pages, 3 tables, 25 references. Preprint under review for workshop submission

Click to view abstract

Abstract:We present the design rationale, implementation attempt, and failure analysis of Eyla, a proposed identity-anchored LLM architecture that integrates biologically-inspired subsystems -- including HiPPO-initialized state-space models, zero-initialized adapters, episodic memory retrieval, and calibrated uncertainty training -- into a unified agent operating system running on consumer hardware. Unlike existing approaches that optimize models for generic helpfulness, Eyla targets identity consistency: the ability to maintain a coherent self-model under adversarial pressure, admit uncertainty, and resist manipulation. We propose the Identity Consistency Score (ICS), a novel benchmark for evaluating this property across LLMs. We then present an honest account of attempting to implement this architecture using AI coding assistants (Claude Code, Cursor) as a non-programmer, documenting a $1,000+ failure that produced a 1.27B parameter model with 86 brain subsystems contributing less than 2% to output. Our analysis identifies five systematic failure modes of AI-assisted development for novel architectures and offers concrete recommendations. To our knowledge, this is the first paper to combine an architectural vision with a documented first-person failure analysis of AI-assisted LLM development, providing lessons for both the AI systems and AI-assisted software engineering communities.

98. 【2604.00008】How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows

Link: https://arxiv.org/abs/2604.00008

Authors: Songhee Han, Jueun Shin, Jiyoon Han, Bung-Woo Jun, Hilal Ayan Karabatman

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: support interpretive analysis, researchers show growing, show growing interest, large language model, growing interest

Comments:

Click to view abstract

Abstract:As qualitative researchers show growing interest in using automated tools to support interpretive analysis, a large language model (LLM) is often introduced into an analytic workflow as is, without systematic evaluation of interpretive quality or comparison across models. This practice leaves model selection largely unexamined despite its potential influence on interpretive outcomes. To address this gap, this study examines whether LLM-as-judge evaluations meaningfully align with human judgments of interpretive quality and can inform model-level decision making. Using 712 conversational excerpts from semi-structured interviews with K-12 mathematics teachers, we generated one-sentence interpretive responses using five widely adopted inference models: Command R+ (Cohere), Gemini 2.5 Pro (Google), GPT-5.1 (OpenAI), Llama 4 Scout-17B Instruct (Meta), and Qwen 3-32B Dense (Alibaba). Automated evaluations were conducted using AWS Bedrock's LLM-as-judge framework across five metrics, and a stratified subset of responses was independently rated by trained human evaluators on interpretive accuracy, nuance preservation, and interpretive coherence. Results show that LLM-as-judge scores capture broad directional trends in human evaluations at the model level but diverge substantially in score magnitude. Among automated metrics, Coherence showed the strongest alignment with aggregated human ratings, whereas Faithfulness and Correctness revealed systematic misalignment at the excerpt level, particularly for non-literal and nuanced interpretations. Safety-related metrics were largely irrelevant to interpretive quality. These findings suggest that LLM-as-judge methods are better suited for screening or eliminating underperforming models than for replacing human judgment, offering practical guidance for systematic comparison and selection of LLMs in qualitative research workflows.

99. 【2604.00007】Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

Link: https://arxiv.org/abs/2604.00007

Authors: Jaeik Kim, Woojin Kim, Jihwan Hong, Yejoon Lee, Sieun Hyeon, Mintaek Lim, Yunseok Han, Dogeun Kim, Hoeun Lee, Hyunggeun Kim, Jaeyoung Do

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: unifies text, single architecture, unified models, omnimodal foundation model, model that unifies

Comments: Project Page: [this https URL](https://dynin.ai/omni/)

Click to view abstract

Abstract:We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source unified models while remaining competitive with strong modality-specific expert systems. These results demonstrate the potential of masked diffusion as a unified paradigm for any-to-any modeling, providing a flexible foundation for real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.

100. 【2604.00006】Scalable Identification and Prioritization of Requisition-Specific Personal Competencies Using Large Language Models

Link: https://arxiv.org/abs/2604.00006

Authors: Wanxin Li, Denver McNeney, Nivedita Prabhu, Charlene Zhang, Renee Barr, Matthew Kitching, Khanh Dao Duc, Anthony S. Boyce

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: specific personal competencies, AI-powered recruitment tools, distinguish successful candidates, AI-powered recruitment, personnel selection

Comments:

Click to view abstract

Abstract:AI-powered recruitment tools are increasingly adopted in personnel selection, yet they struggle to capture the requisition (req)-specific personal competencies (PCs) that distinguish successful candidates beyond job categories. We propose a large language model (LLM)-based approach to identify and prioritize req-specific PCs from reqs. Our approach integrates dynamic few-shot prompting, reflection-based self-improvement, similarity-based filtering, and multi-stage validation. Applied to a dataset of Program Manager reqs, our approach correctly identifies the highest-priority req-specific PCs with an average accuracy of 0.76, approaching human expert inter-rater reliability, and maintains a low out-of-scope rate of 0.07.

101. 【2604.00005】How Emotion Shapes the Behavior of LLMs and Agents: A Mechanistic Study

Link: https://arxiv.org/abs/2604.00005

Authors: Moran Sun, Tianlin Li, Yuwei Zheng, Zhenhong Zhou, Aishan Liu, Xianglong Liu, Yang Liu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: cognition and performance, plays an important, human cognition, important role, role in human

Comments: 15 pages, 11 figures

Click to view abstract

Abstract:Emotion plays an important role in human cognition and performance. Motivated by this, we investigate whether analogous emotional signals can shape the behavior of large language models (LLMs) and agents. Existing emotion-aware studies mainly treat emotion as a surface-level style factor or a perception target, overlooking its mechanistic role in task processing. To address this limitation, we propose E-STEER, an interpretable emotion steering framework that enables direct representation-level intervention in LLMs and agents. It embeds emotion as a structured, controllable variable in hidden states, and with it, we examine the impact of emotion on objective reasoning, subjective generation, safety, and multi-step agent behaviors. The results reveal non-monotonic emotion-behavior relations consistent with established psychological theories, and show that specific emotions not only enhance LLM capability but also improve safety, and systematically shape multi-step agent behaviors.

102. 【2604.00004】LinearARD: Linear-Memory Attention Distillation for RoPE Restoration

Link: https://arxiv.org/abs/2604.00004

Authors: Ning Yang, Hengyu Zhong, Wentao Wang, Baoliang Tian, Haijun Zhang, Jun Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, lightweight Continual Pre-Training, Large Language, scaling positional encodings, Continual Pre-Training

Comments:

Click to view abstract

Abstract:The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre-Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short-text benchmarks. We propose LinearARD, a self-distillation method that restores Rotary Position Embeddings (RoPE)-scaled students through attention-structure consistency with a frozen native-RoPE teacher. Rather than matching opaque hidden states, LinearARD aligns the row-wise distributions of dense $Q/Q$, $K/K$, and $V/V$ self-relation matrices to directly supervise attention dynamics. To overcome the quadratic memory bottleneck of $n \times n$ relation maps, we introduce a linear-memory kernel. This kernel leverages per-token log-sum-exp statistics and fuses logit recomputation into the backward pass to compute exact Kullback-Leibler divergence and gradients. On LLaMA2-7B extended from 4K to 32K, LinearARD recovers 98.3% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks. Notably, our method achieves these results using only 4.25M training tokens compared to the 256M tokens required by LongReD and CPT. Our code is available at this https URL.
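
The abstract above mentions computing an exact KL divergence from per-token log-sum-exp statistics. The following is only a minimal sketch of the underlying identity, not the authors' kernel: for one row with teacher logits t and student logits s, KL(softmax(t) || softmax(s)) = sum_j p_j (t_j - s_j) - (LSE(t) - LSE(s)), so only one scalar log-sum-exp per row has to be kept while the logits are visited in chunks. The function names and the chunk size are hypothetical.

```python
import math


def streaming_logsumexp(values, chunk_size=2):
    """Log-sum-exp computed chunk by chunk, keeping only a running max and sum."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the old partial sum to the new reference maximum.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(v - new_max) for v in chunk)
        running_max = new_max
    return running_max + math.log(running_sum)


def row_kl(teacher_logits, student_logits, chunk_size=2):
    """KL(softmax(teacher) || softmax(student)) for one row of logits.

    Probabilities are recomputed from the logits and the two per-row
    log-sum-exp scalars, so no normalized row is stored.
    """
    lse_t = streaming_logsumexp(teacher_logits, chunk_size)
    lse_s = streaming_logsumexp(student_logits, chunk_size)
    kl = 0.0
    for t, s in zip(teacher_logits, student_logits):
        log_p = t - lse_t          # teacher log-probability
        log_q = s - lse_s          # student log-probability
        kl += math.exp(log_p) * (log_p - log_q)
    return kl
```

In a real attention setting the row would be a row of a relation map and the chunks would be tiles of keys; this pure-Python version only illustrates why a per-row scalar statistic is enough to recover the exact divergence.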

103. 【2604.00003】A Reliability Evaluation of Hybrid Deterministic-LLM Based Approaches for Academic Course Registration PDF Information Extraction

Link: https://arxiv.org/abs/2604.00003

Authors: Muhammad Anis Al Hilmi, Neelansh Khare, Noel Framil Iglesias

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Camelot based pipeline, approaches from KRS, Camelot based, KRS documents, LLM

Comments: 9 pages, 6 figures, 3 tables

Click to view abstract

Abstract:This study evaluates the reliability of information extraction approaches from KRS documents using three strategies: LLM only, Hybrid Deterministic-LLM (regex + LLM), and a Camelot-based pipeline with LLM fallback. Experiments were conducted on 140 documents for the LLM-based test and 860 documents for the Camelot-based pipeline evaluation, covering four study programs with varying data in tables and metadata. Three 12-14B LLMs (Gemma 3, Phi 4, and Qwen 2.5) were run locally using Ollama and a consumer-grade CPU without a GPU. Evaluations used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7. Although not applicable to all models, the results show that the hybrid approach can improve efficiency compared to LLM only, especially for deterministic metadata. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM and LS up to 0.99-1.00) and computational efficiency (less than 1 second per PDF in most cases). The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings confirm that integrating deterministic and LLM methods is increasingly reliable and efficient for information extraction from text-based academic documents in computationally constrained environments.
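
The two metrics named in this abstract, exact match (EM) and Levenshtein similarity (LS) with a 0.7 threshold, can be sketched as follows. This is an illustrative sketch under stated assumptions: the abstract does not say how LS is normalized or whether whitespace is trimmed, so dividing the edit distance by the longer string's length and stripping surrounding whitespace are assumptions, and all function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; assumes normalization by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


def exact_match(predicted: str, expected: str) -> bool:
    """Exact match after trimming surrounding whitespace (an assumption)."""
    return predicted.strip() == expected.strip()


def passes_threshold(predicted: str, expected: str, threshold: float = 0.7) -> bool:
    """True when the similarity reaches the threshold quoted in the abstract."""
    return levenshtein_similarity(predicted, expected) >= threshold
```

For example, a field extracted as "Algorithm I" against a reference of "Algorithms I" fails exact match but clears the 0.7 threshold, which is the kind of near miss a similarity metric is meant to tolerate.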

104. 【2604.00002】Benchmark for Assessing Olfactory Perception of Large Language Models

Link: https://arxiv.org/abs/2604.00002

Authors: Eftychia Makri, Nikolaos Nakis, Laura Sisson, Gigi Minsky, Leandros Tassiulas, Vahid Satarifard, Nicholas A. Christakis

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Olfactory Perception, designed to assess, assess the capability, capability of large, isomeric SMILES

Comments:

Click to view abstract

Abstract:Here we introduce the Olfactory Perception (OP) benchmark, designed to assess the capability of large language models (LLMs) to reason about smell. The benchmark contains 1,010 questions across eight task categories spanning odor classification, odor primary descriptor identification, intensity and pleasantness judgments, multi-descriptor prediction, mixture similarity, olfactory receptor activation, and smell identification from real-world odor sources. Each question is presented in two prompt formats, compound names and isomeric SMILES, to evaluate the effect of molecular representations. Evaluating 21 model configurations across major model families, we find that compound-name prompts consistently outperform isomeric SMILES, with gains ranging from +2.4 to +18.9 percentage points (mean approximately +7 points), suggesting current LLMs access olfactory knowledge primarily through lexical associations rather than structural molecular reasoning. The best-performing model reaches 64.4% overall accuracy, which highlights both emerging capabilities and substantial remaining gaps in olfactory reasoning. We further evaluate a subset of the OP across 21 languages and find that aggregating predictions across languages improves olfactory prediction, with AUROC = 0.86 for the best performing language ensemble model. LLMs should be able to handle olfactory and not just visual or aural information.

105. 【2604.00001】Two-Stage Optimizer-Aware Online Data Selection for Large Language Models

Link: https://arxiv.org/abs/2604.00001

Authors: Fangxin Wang, Peyman Baghershahi, Langzhou He, Henry Peng Zou, Sourav Medya, Philip S. Yu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language model, language model, offline settings, estimating sample utility, offers a principled

Comments: 22 pages, 2 figures, 6 tables

Click to view abstract

Abstract:Gradient-based data selection offers a principled framework for estimating sample utility in large language model (LLM) fine-tuning, but existing methods are mostly designed for offline settings. They are therefore less suited to online fine-tuning, where data arrives sequentially, sample utility is step-dependent, and the effective update geometry is shaped by adaptive optimizers. We propose an optimizer-aware framework for gradient-based online data selection and reweighting in LLM fine-tuning. Our key idea is to view online selection not as static sample ranking, but as shaping the next target-oriented update under the optimizer state. We formulate this as an optimizer-aware update-matching problem, establish its connection to second-order target utility, and show why subset-level construction must account for interactions and redundancy among selected samples. Based on this view, we develop a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their coefficients. To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation and optimized matrix computations for long-context data. Experiments show that our method consistently improves convergence and downstream performance over existing online data selection baselines under the same data budget.

106. 【2603.29042】An Empirical Recipe for Universal Phone Recognition

Link: https://arxiv.org/abs/2603.29042

Authors: Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi, Eunjung Yeo, William Chen, Shinji Watanabe, David R. Mortensen

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: Phone recognition, speech processing tasks, low-resource speech processing, processing tasks, performance remains elusive

Comments: Submitted to Interspeech 2026. Code: [this https URL](https://github.com/changelinglab/PhoneticXeus)

Click to view abstract

Abstract:Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. Highly performant English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS -- trained on large-scale multilingual data and achieving state-of-the-art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory features. All data and code are released openly.

Information Retrieval

1. 【2604.01195】ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Link: https://arxiv.org/abs/2604.01195

Authors: Nandan Thakur, Zijian Chen, Xueguang Ma, Jimmy Lin

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords:

Comments:

Click to view abstract

None

2. 【2604.01186】From Validity to Inter-Subjectivity: An Argument for Reliability Signals in Search Environments

Link: https://arxiv.org/abs/2604.01186

Authors: Frans van der Sluis

Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: spreading misinformation, engines and information, information platforms, platforms are increasingly, increasingly scrutinized

Comments: 4 pages. Extended abstract / conference paper for SEASON 2025 (September 24-25, 2025, Hamburg, Germany). Peer reviewed

Click to view abstract

Abstract:Search engines and information platforms are increasingly scrutinized for their role in spreading misinformation. Traditional responses often focus on detecting falsehoods or verifying the ultimate validity of claims. This paper argues that such a validity-centered framing is inadequate for the epistemic challenges of search environments.

3. 【2604.01073】Narrative Fingerprints: Multi-Scale Author Identification via Novelty Curve Dynamics

Link: https://arxiv.org/abs/2604.01073

Authors: Fred Zimmerman, Hilmar AI

Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: information-theoretic novelty curves, published works, book level, authors, qualifying authors

Comments: 12 pages, 6 figures, 4 tables

Click to view abstract

Abstract:We test whether authors have characteristic "fingerprints" in the information-theoretic novelty curves of their published works. Working with two corpora -- Books3 (52,796 books, 759 qualifying authors) and PG-19 (28,439 books, 1,821 qualifying authors) -- we find that authorial voice leaves measurable traces in how novelty unfolds across a text. The signal is multi-scale: at book level, scalar dynamics (mean novelty, speed, volume, circuitousness) identify 43% of authors significantly above chance; at chapter level, SAX motif patterns in sliding windows achieve 30x-above-chance attribution, far exceeding the scalar features that dominate at book level. These signals are complementary, not redundant. We show that the fingerprint is partly confounded with genre but persists within-genre for approximately one-quarter of authors. Classical authors (Twain, Austen, Kipling) show fingerprints comparable in strength to modern authors, suggesting the phenomenon is not an artifact of contemporary publishing conventions.
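
The chapter-level signal in this abstract relies on SAX (Symbolic Aggregate approXimation) patterns over sliding windows of a novelty curve. As a rough illustration of standard SAX only, and not of the authors' pipeline, the sketch below z-normalizes a window, averages it into equal segments (piecewise aggregate approximation), and maps each segment mean to a letter using equiprobable breakpoints of the standard normal distribution. The segment count, alphabet, and function name are hypothetical choices, since the abstract does not give them.

```python
from bisect import bisect_left
from statistics import NormalDist, mean, pstdev


def sax_word(series, n_segments, alphabet="abcd"):
    """Convert one window of a numeric curve into a SAX word.

    Assumes len(series) is a multiple of n_segments to keep the
    piecewise aggregation simple.
    """
    if len(series) % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")

    # 1. z-normalize so words describe shape rather than scale.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        normalized = [0.0] * len(series)
    else:
        normalized = [(x - mu) / sigma for x in series]

    # 2. Piecewise aggregate approximation: the mean of each equal segment.
    size = len(series) // n_segments
    segment_means = [
        mean(normalized[i * size:(i + 1) * size]) for i in range(n_segments)
    ]

    # 3. Breakpoints that split a standard normal into equiprobable bins.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]

    # 4. Each segment mean becomes the letter of the bin it falls into.
    return "".join(alphabet[bisect_left(breakpoints, m)] for m in segment_means)
```

Applying `sax_word` to successive windows of a chapter's novelty values would yield a sequence of short words whose recurring patterns (motifs) could then be counted per author; how the paper actually builds and compares those motifs is not stated in the abstract.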

4. 【2604.01036】Aligning Recommendations with User Popularity Preferences

Link: https://arxiv.org/abs/2604.01036

Authors: Mona Schirmer, Anton Thielmann, Pola Schwöbel, Thomas Martynec, Giuseppe Di Benedetto, Ben London, Yannik Stein

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: favor popular items, disproportionately favor popular, recommendations disproportionately favor, Popularity, pervasive problem

Comments: Accepted at FAccT 2026

Click to view abstract

Abstract:Popularity bias is a pervasive problem in recommender systems, where recommendations disproportionately favor popular items. This not only results in "rich-get-richer" dynamics and a homogenization of visible content, but can also lead to misalignment of recommendations with individual users' preferences for popular or niche content. This work studies popularity bias through the lens of user-recommender alignment. To this end, we introduce Popularity Quantile Calibration, a measurement framework that quantifies misalignment between a user's historical popularity preference and the popularity of their recommendations. Building on this notion of popularity alignment, we propose SPREE, an inference-time mitigation method for sequential recommenders based on activation steering. SPREE identifies a popularity direction in representation space and adaptively steers model activations based on an estimate of each user's personal popularity bias, allowing both the direction and magnitude of steering to vary across users. Unlike global debiasing approaches, SPREE explicitly targets alignment rather than uniformly reducing popularity. Experiments across multiple datasets show that SPREE consistently improves user-level popularity alignment while preserving recommendation quality.

5. 【2604.00865】Doctor-RAG: Failure-Aware Repair for Agentic Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2604.00865

Authors: Shuguang Jiao, Chengkai Huang, Shuhan Qi, Xuan Wang, Yifan Li, Lina Yao

Categories: Information Retrieval (cs.IR)

Keywords: Agentic Retrieval-Augmented Generation, widely adopted paradigm, Retrieval-Augmented Generation, complex knowledge reasoning, Agentic RAG

Comments:

Click to view abstract

Abstract:Agentic Retrieval-Augmented Generation (Agentic RAG) has become a widely adopted paradigm for multi-hop question answering and complex knowledge reasoning, where retrieval and reasoning are interleaved at inference time. As reasoning trajectories grow longer, failures become increasingly common. Existing approaches typically address such failures by either stopping at diagnostic analysis or rerunning the entire retrieval-reasoning pipeline, which leads to substantial computational overhead and redundant reasoning. In this paper, we propose Doctor-RAG (DR-RAG), a unified diagnose-and-repair framework that corrects failures in Agentic RAG through explicit error localization and prefix reuse, enabling minimal-cost intervention. DR-RAG decomposes failure handling into two consecutive stages: (i) trajectory-level failure diagnosis and localization, which attributes errors to a coverage-gated taxonomy and identifies the earliest failure point in the reasoning trajectory; and (ii) tool-conditioned local repair, which intervenes only at the diagnosed failure point while maximally reusing validated reasoning prefixes and retrieved evidence. By explicitly separating error attribution from correction, DR-RAG enables precise error localization, thereby avoiding expensive full-pipeline reruns and enabling targeted, efficient repair. We evaluate DR-RAG across three multi-hop question answering benchmarks, multiple agentic RAG baselines, and different backbone models. Experimental results demonstrate that DR-RAG substantially improves answer accuracy while significantly reducing reasoning token consumption compared to rerun-based repair strategies.

6. 【2604.00809】Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers

Link: https://arxiv.org/abs/2604.00809

Authors: Kawtar Zaher, Olivier Buisson, Alexis Joly

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: Building on existing, iteratively retrieving images, user Relevance Feedback, existing approaches, consists of iteratively

Comments:

Click to view abstract

Abstract:Building on existing approaches, we revisit Human-in-the-Loop Object Retrieval, a task that consists of iteratively retrieving images containing objects of a class-of-interest, specified by a user-provided query. Starting from a large unlabeled image collection, the aim is to rapidly identify diverse instances of an object category relying solely on the initial query and the user's Relevance Feedback, with no prior labels. The retrieval process is formulated as a binary classification task, where the system continuously learns to distinguish between relevant and non-relevant images to the query, through iterative user interaction. This interaction is guided by an Active Learning loop: at each iteration, the system selects informative samples for user annotation, thereby refining the retrieval performance. This task is particularly challenging in multi-object datasets, where the object of interest may occupy only a small region of the image within a complex, cluttered scene. Unlike object-centered settings where global descriptors often suffice, multi-object images require more adapted, localized descriptors. In this work, we formulate and revisit the Human-in-the-Loop Object Retrieval task by leveraging pre-trained ViT representations, and addressing key design questions, including which object instances to consider in an image, what form the annotations should take, how Active Selection should be applied, and which representation strategies best capture the object's features. We compare several representation strategies across multi-object datasets highlighting trade-offs between capturing the global context and focusing on fine-grained local object details. Our results offer practical insights for the design of effective interactive retrieval pipelines based on Active Learning for object class retrieval.

7. 【2604.00803】A novel three-step approach to forecast firm-specific technology convergence opportunity via multi-dimensional feature fusion

Link: https://arxiv.org/abs/2604.00803

Authors: Fu Gu,Ao Chen,Yingwen Wu

Categories: Information Retrieval (cs.IR)

Keywords: crucial innovation paradigm, gaining ever-increasing attention, innovation paradigm, crucial innovation, gaining ever-increasing

Comments

Abstract:As a crucial innovation paradigm, technology convergence (TC) is gaining ever-increasing attention. Yet existing studies primarily focus on predicting TC at the industry level, with little attention paid to TC forecasting for firm-specific technology opportunity discovery (TOD). Moreover, although technological documents like patents contain a rich body of bibliometric, network structure, and textual features, such features are underexploited in extant TC predictions; most relevant studies use only one or two of these feature dimensions, and all three have rarely been fused. Here we propose a novel approach that fuses multi-dimensional features from patents to predict TC for firm-specific TOD. Our method comprises three steps. First, bibliometric, network structure, and textual features are extracted from patent documents, and then fused at the International Patent Classification (IPC)-pair level using attention mechanisms. Second, IPC-level TC opportunities are identified using a two-stage ensemble learning model that incorporates various imbalance-handling strategies. Third, to acquire feasible firm-specific TC opportunities, the performance metrics of topic-level TC opportunities, which are refined from IPC-level opportunities, are evaluated via retrieval-augmented generation (RAG) with a large language model (LLM). We demonstrate the effectiveness of our proposed approach by predicting TC opportunities for a leading Chinese auto part manufacturer, Zhejiang Sanhua Intelligent Controls Co., Ltd., in the domains of thermal management for energy storage and robotics. In sum, this work advances the theory and applicability of forecasting firm-specific TC opportunity through fusing multi-dimensional features and leveraging LLM-as-a-judge for technology opportunity evaluation.

8. 【2604.00731】STCALIR: Semi-Synthetic Test Collection for Algerian Legal Information Retrieval

Link: https://arxiv.org/abs/2604.00731

Authors: M'hamed Amine Hatem,Sofiane Batata,Amine Mammasse,Faiçal Azouaou

Categories: Information Retrieval (cs.IR)

Keywords: re-ranking models, essential for evaluating, Test collections, Algerian legal texts, relevance judgments

Comments

Abstract:Test collections are essential for evaluating retrieval and re-ranking models. However, constructing such collections is challenging due to the high cost of manual annotation, particularly in specialized domains like Algerian legal texts, where high-quality corpora and relevance judgments are scarce. To address this limitation, we propose STCALIR, a framework for generating semi-synthetic test collections directly from raw legal documents. The pipeline follows the Cranfield paradigm, maintaining its core components of topics, corpus, and relevance judgments, while significantly reducing manual effort through automated multi-stage retrieval and filtering, achieving a 99% reduction in annotation workload. We validate STCALIR using the Mr. TyDi benchmark, demonstrating that the resulting semi-synthetic relevance judgments yield retrieval effectiveness comparable to human-annotated evaluations (Hit@10 $\approx$ 0.785). Furthermore, system-level rankings derived from these labels exhibit strong concordance with human-based evaluations, as measured by Kendall's $\tau$ (0.89) and Spearman's $\rho$ (0.92). Overall, STCALIR offers a reproducible and cost-efficient solution for constructing reliable test collections in low-resource legal domains.
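As a rough illustration of the two rank-agreement measures named in this abstract, the sketch below computes Kendall's tau and Spearman's rho for two lists of system scores. The scores are hypothetical, ties are assumed absent, and this is not the authors' evaluation code.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a for two equal-length score lists without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via the squared rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical effectiveness scores of five systems under two label sets.
human_scores = [0.71, 0.64, 0.58, 0.52, 0.40]
synthetic_scores = [0.69, 0.66, 0.50, 0.55, 0.38]
tau = kendall_tau(human_scores, synthetic_scores)
rho = spearman_rho(human_scores, synthetic_scores)
```

In this toy case only one pair of systems swaps order between the two label sets, so both coefficients stay close to 1.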

9. 【2604.00672】Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness

Link: https://arxiv.org/abs/2604.00672

Authors: Zeyad Ahmed,Paul Sheridan,Michael McIsaac,Aitazaz A. Farooque

Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Statistics Theory (math.ST)

Keywords: identifying important terms, classical formula, identifying important, word burstiness, important terms

Comments: 27 pages, 3 tables, 7 figures, accepted in Discover Computing 2026

Abstract:TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test setup capturing word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents according to a family of beta-binomial distributions with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed in collection documents, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme derived from this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.
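For readers who want the baseline in concrete form, here is a minimal sketch of one common TF-IDF variant (raw term count times log(N / df)). It is not the paper's penalized likelihood-ratio statistic, and the toy corpus is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses raw term frequency times log(N / df), one common variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["bandit", "regret", "bandit"],
    ["retrieval", "regret"],
    ["retrieval", "index"],
]
scores = tf_idf(corpus)
```

A term that repeats inside one document but is rare across the collection ("bandit" here) receives the largest weight, which is the bursty behavior the abstract refers to.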

10. 【2604.00590】UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems

Link: https://arxiv.org/abs/2604.00590

Authors: Mingming Ha,Guanchen Wang,Linxun Chen,Xuan Rao,Yuexin Shi,Tianbao Ma,Zhaojie Liu,Yunqian Fan,Zilong Lu,Yanan Niu,Han Li,Kun Gai

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: attracted increasing attention, recent years, increasing attention, attracted increasing, govern the relationship

Comments

Abstract:In recent years, the scaling laws of recommendation models, which govern the relationship between performance and parameters/FLOPs of recommenders, have attracted increasing attention. Currently, there are three mainstream architectures for achieving scaling in recommendation models, namely attention-based, TokenMixer-based, and factorization-machine-based methods, which exhibit fundamental differences in both design philosophy and architectural structure. In this paper, we propose a unified scaling architecture for recommendation systems, namely UniMixer, to improve scaling efficiency and establish a unified theoretical framework that unifies the mainstream scaling blocks. By transforming the rule-based TokenMixer to an equivalent parameterized structure, we construct a generalized parameterized feature mixing module that allows the token mixing patterns to be optimized and learned during model training. Meanwhile, the generalized parameterized token mixing removes the constraint in TokenMixer that requires the number of heads to be equal to the number of tokens. Furthermore, we establish a unified scaling module design framework for recommender systems, which bridges the connections among attention-based, TokenMixer-based, and factorization-machine-based methods. To further boost scaling ROI, a lightweight UniMixing module, UniMixing-Lite, is designed, which further compresses the model parameters and computational cost while significantly improving model performance. The scaling curves are shown in the following figure. Extensive offline and online experiments are conducted to verify the superior scaling abilities of UniMixer.

11. 【2604.00523】Lipschitz Dueling Bandits over Continuous Action Spaces

Link: https://arxiv.org/abs/2604.00523

Authors: Mudit Sharma,Shweta Jain,Vaneet Aggarwal,Ganesh Ghalme

Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

Keywords: stochastic dueling bandits, purely comparative, Lipschitz structure, Lipschitz dueling bandits, stochastic dueling

Comments

Abstract:We study, for the first time, stochastic dueling bandits over continuous action spaces with Lipschitz structure, where feedback is purely comparative. While dueling bandits and Lipschitz bandits have been studied separately, their combination has remained unexplored. We propose the first algorithm for Lipschitz dueling bandits, using round-based exploration and recursive region elimination guided by an adaptive reference arm. We develop new analytical tools for relative feedback and prove a regret bound of $\tilde O\left(T^{\frac{d_z+1}{d_z+2}}\right)$, where $d_z$ is the zooming dimension of the near-optimal region. Further, our algorithm takes only logarithmic space in terms of the total time horizon, the best achievable by any bandit algorithm over a continuous action space.
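To get a feel for how the stated bound depends on the zooming dimension, here is a quick evaluation of the exponent for a few illustrative values of $d_z$ (chosen here for illustration, not taken from the paper):

$$
\frac{d_z+1}{d_z+2}\bigg|_{d_z=0}=\frac{1}{2},\qquad
\frac{d_z+1}{d_z+2}\bigg|_{d_z=1}=\frac{2}{3},\qquad
\frac{d_z+1}{d_z+2}\bigg|_{d_z=2}=\frac{3}{4}.
$$

The corresponding regret rates are $\tilde O(T^{1/2})$, $\tilde O(T^{2/3})$ and $\tilde O(T^{3/4})$. The exponent equals $1-\frac{1}{d_z+2}$, so it approaches 1 as $d_z$ grows while the bound stays sublinear in $T$ for any finite $d_z$.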

12. 【2604.00513】MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding

Link: https://arxiv.org/abs/2604.00513

Authors: Junxian Wu,Chenghan Fu,Zhanheng Nie,Daoze Zhang,Bowen Wan,Wanxian Guan,Chuan Yu,Jian Xu,Bo Zheng

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: exploring general representations, attracted increasing attention, exploring general, rapid growth, attracted increasing

Comments: 10 pages, 6 figures

Abstract:With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.

13. 【2604.00500】Evidence Units: Ontology-Grounded Document Organization for Parser-Independent Retrieval

Link: https://arxiv.org/abs/2604.00500

Authors: Yeonjee Han

Categories: Information Retrieval (cs.IR)

Keywords: figures with explanations, Structured documents, tables paired, paired with captions, routinely fragmented

Comments: 16 pages, 4 figures

Abstract:Structured documents--tables paired with captions, figures with explanations, equations with the paragraphs that interpret them--are routinely fragmented when indexed for retrieval. Element-level indexing treats every parsed element as an independent chunk, scattering semantically cohesive units across separate retrieval candidates. This paper presents a parser-independent pipeline that constructs Evidence Units (EUs): semantically complete document chunks that group visual assets with their contextual text. We introduce four contributions: (1) ontology-grounded role normalization extending DoCO that maps heterogeneous parser outputs to a unified semantic schema; (2) a semantic global assignment algorithm that optimally assigns paragraphs to EUs via a full similarity matrix; (3) a graph-based decision layer in Neo4j that formalizes EU construction rules and validates completeness through two invariants; and (4) cross-parser validation showing EU spatial footprints converge across MinerU and Docling, with gains preserved under parser-induced bbox variance. Experiments on OmniDocBench v1.0 (1,340 pages; 1,551 QA pairs) show EU-based chunking improves retrieval LCS by +0.31 (0.50 to 0.81). Recall@1 increases from 0.15 to 0.51 (3.4x) and MinK decreases from 2.58 to 1.72. Cross-parser results confirm the gain (LCS +0.23 to +0.31) is preserved across parsers. Text queries show the most dramatic gain: Recall@1 rises from 0.08 to 0.47.

14. 【2604.00242】FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval

Link: https://arxiv.org/abs/2604.00242

Authors: Antonín Jarolím,Martin Fajčík

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: identifies relevant documents, specific relevant spans, Document retrieval identifies, fine-grained evidence cues, provide fine-grained evidence

Comments

Abstract:Document retrieval identifies relevant documents but does not provide fine-grained evidence cues, such as specific relevant spans. A possible solution is to apply an LLM after retrieval; however, this introduces significant computational overhead and limits practical deployment. We propose FGR-ColBERT, a modification of the ColBERT retrieval model that integrates fine-grained relevance signals distilled from an LLM directly into the retrieval function. Experiments on MS MARCO show that FGR-ColBERT (110M) achieves a token-level F1 of 64.5, exceeding the 62.8 of Gemma 2 (27B), despite being approximately 245 times smaller. At the same time, it preserves retrieval effectiveness (99% relative Recall@50) and remains efficient, incurring only a ~1.12x latency overhead compared to the original ColBERT.

15. 【2604.00006】Scalable Identification and Prioritization of Requisition-Specific Personal Competencies Using Large Language Models

Link: https://arxiv.org/abs/2604.00006

Authors: Wanxin Li,Denver McNeney,Nivedita Prabhu,Charlene Zhang,Renee Barr,Matthew Kitching,Khanh Dao Duc,Anthony S. Boyce

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: specific personal competencies, AI-powered recruitment tools, distinguish successful candidates, AI-powered recruitment, personnel selection

Comments

Abstract:AI-powered recruitment tools are increasingly adopted in personnel selection, yet they struggle to capture the requisition (req)-specific personal competencies (PCs) that distinguish successful candidates beyond job categories. We propose a large language model (LLM)-based approach to identify and prioritize req-specific PCs from reqs. Our approach integrates dynamic few-shot prompting, reflection-based self-improvement, similarity-based filtering, and multi-stage validation. Applied to a dataset of Program Manager reqs, our approach correctly identifies the highest-priority req-specific PCs with an average accuracy of 0.76, approaching human expert inter-rater reliability, and maintains a low out-of-scope rate of 0.07.

16. 【2604.00003】A Reliability Evaluation of Hybrid Deterministic-LLM Based Approaches for Academic Course Registration PDF Information Extraction

Link: https://arxiv.org/abs/2604.00003

Authors: Muhammad Anis Al Hilmi,Neelansh Khare,Noel Framil Iglesias

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Camelot based pipeline, approaches from KRS, Camelot based, KRS documents, LLM

Comments: 9 pages, 6 figures, 3 tables

Abstract:This study evaluates the reliability of information extraction approaches from KRS documents using three strategies: LLM only, Hybrid Deterministic-LLM (regex + LLM), and a Camelot-based pipeline with LLM fallback. Experiments were conducted on 140 documents for the LLM-based test and 860 documents for the Camelot-based pipeline evaluation, covering four study programs with varying data in tables and metadata. Three 12-14B LLMs (Gemma 3, Phi 4, and Qwen 2.5) were run locally using Ollama on a consumer-grade CPU without a GPU. Evaluations used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7. Although not applicable to all models, the results show that the hybrid approach can improve efficiency compared to LLM only, especially for deterministic metadata. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM and LS up to 0.99-1.00) and computational efficiency (less than 1 second per PDF in most cases). The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings confirm that integrating deterministic and LLM methods is increasingly reliable and efficient for information extraction from text-based academic documents in computationally constrained environments.
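The two metrics mentioned in this abstract can be sketched as follows. The abstract does not say how Levenshtein similarity is normalized, so the normalization below (1 minus distance over the longer string's length) is an assumption, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def evaluate(predicted, expected, threshold=0.7):
    """Return (exact_match, passes_similarity_threshold) for one field."""
    return predicted == expected, similarity(predicted, expected) >= threshold
```

Under this assumed normalization, a single wrong character in a nine-character field fails exact match but still clears the 0.7 threshold, which is why reporting both metrics is informative.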

Computer Vision

1. 【2604.01221】HippoCamp: Benchmarking Contextual Agents on Personal Computers

Link: https://arxiv.org/abs/2604.01221

Authors: Zhe Yang,Shulin Tian,Kairui Hu,Shuai Liu,Hoang-Nhat Nguyen,Yichi Zhang,Zujin Guo,Mengying Yu,Zinan Zhang,Jingkang Yang,Chen Change Loy,Ziwei Liu

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal file management, file management, benchmark designed, agents' capabilities, Abstract

Comments: Project Page: [this https URL](https://hippocamp-ai.github.io/)

Abstract:We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

2. 【2604.01216】LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders (LAPIS-SHRED)

Link: https://arxiv.org/abs/2604.01216

Authors: Yuxuan Bao,Xingyue Zhang,J. Nathan Kutz

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reconstructing full spatio-temporal, Reconstructing full, full spatio-temporal dynamics, remains a central, central challenge

Comments

Abstract:Reconstructing full spatio-temporal dynamics from sparse observations in both space and time remains a central challenge in complex systems, as measurements can be spatially incomplete and can also be limited to narrow temporal windows. Yet approximating the complete spatio-temporal trajectory is essential for mechanistic insight and understanding, model calibration, and operational decision-making. We introduce LAPIS-SHRED (LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders), a modular architecture that reconstructs and/or forecasts complete spatiotemporal dynamics from sparse sensor observations confined to short temporal windows. LAPIS-SHRED operates through a three-stage pipeline: (i) a SHRED model is pre-trained entirely on simulation data to map sensor time-histories into a structured latent space, (ii) a temporal sequence model, trained on simulation-derived latent trajectories, learns to propagate latent states forward or backward in time to span unobserved temporal regions from short observational time windows, and (iii) at deployment, only a short observation window of hyper-sparse sensor measurements from the true system is provided, from which the frozen SHRED model and the temporal model jointly reconstruct or forecast the complete spatiotemporal trajectory. The framework supports bidirectional inference, inherits data assimilation and multiscale reconstruction capabilities from its modular structure, and accommodates extreme observational constraints including single-frame terminal inputs. We evaluate LAPIS-SHRED on six experiments spanning complex spatio-temporal physics: turbulent flows, multiscale propulsion physics, volatile combustion transients, and satellite-derived environmental fields, highlighting a lightweight, modular architecture suited for operational settings where observation is constrained by physical or logistical limitations.

3. 【2604.01207】TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking

Link: https://arxiv.org/abs/2604.01207

Authors: Jiyuan Hu,Zechuan Zhang,Zongxin Yang,Yi Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-fidelity scene transformation, Contextual Video Masking, Tangible Geometry Anchoring, present TRACE, achieves automated

Comments: 22 pages, 9 figures

Abstract:We present TRACE, a mesh-guided 3DGS editing framework that achieves automated, high-fidelity scene transformation. By anchoring video diffusion with explicit 3D geometry, TRACE uniquely enables fine-grained, part-level manipulation--such as local pose shifting or component replacement--while preserving the structural integrity of the central subject, a capability largely absent in existing editing methods. Our approach comprises three key stages: (1) Multi-view 3D-Anchor Synthesis, which leverages a sparse-view editor trained on our MV-TRACE dataset--the first multi-view consistent dataset dedicated to scene-coherent object addition and modification--to generate spatially consistent 3D-anchors; (2) Tangible Geometry Anchoring (TGA), which ensures precise spatial synchronization between inserted meshes and the 3DGS scene via two-phase registration; and (3) Contextual Video Masking (CVM), which integrates 3D projections into an autoregressive video pipeline to achieve temporally stable, physically-grounded rendering. Extensive experiments demonstrate that TRACE consistently outperforms existing methods, especially in editing versatility and structural integrity.

4. 【2604.01204】Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction

Link: https://arxiv.org/abs/2604.01204

Authors: Jorge Condor,Nicolas Moenne-Loccoz,Merlin Nimier-David,Piotr Didyk,Zan Gojcic,Qi Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: Gaussian Splatting, related reconstruction tasks, Neural Harmonic Textures, Gaussian, Harmonic Textures

Comments

Abstract:Primitive-based methods such as 3D Gaussian Splatting have recently become the state-of-the-art for novel-view synthesis and related reconstruction tasks. Compared to neural fields, these representations are more flexible, adaptive, and scale better to large scenes. However, the limited expressivity of individual primitives makes modeling high-frequency detail challenging. We introduce Neural Harmonic Textures, a neural representation approach that anchors latent feature vectors on a virtual scaffold surrounding each primitive. These features are interpolated within the primitive at ray intersection points. Inspired by Fourier analysis, we apply periodic activations to the interpolated features, turning alpha blending into a weighted sum of harmonic components. The resulting signal is then decoded in a single deferred pass using a small neural network, significantly reducing computational cost. Neural Harmonic Textures yield state-of-the-art results in real-time novel view synthesis while bridging the gap between primitive- and neural-field-based reconstruction. Our method integrates seamlessly into existing primitive-based pipelines such as 3DGUT, Triangle Splatting, and 2DGS. We further demonstrate its generality with applications to 2D image fitting and semantic reconstruction.

5. 【2604.01181】True (VIS) Lies: Analyzing How Generative AI Recognizes Intentionality, Rhetoric, and Misleadingness in Visualization Lies

Link: https://arxiv.org/abs/2604.01181

Authors: Graziano Blasilli,Marco Angelini

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: multimodal Large Language, Large Language Models, interpret misleading visualizations, misleading visualizations, Large Language

Comments

Abstract:This study investigates the ability of multimodal Large Language Models (LLMs) to identify and interpret misleading visualizations, and recognize these observations along with their underlying causes and potential intentionality. Our analysis leverages concepts from visualization rhetoric and a newly developed taxonomy of authorial intents as explanatory lenses. We formulated three research questions and addressed them experimentally using a dataset of 2,336 COVID-19-related tweets, half of which contain misleading visualizations, and supplemented it with real-world examples of perceptual, cognitive, and conceptual errors drawn from VisLies, the IEEE VIS community event dedicated to showcasing deceptive and misleading visualizations. To ensure broad coverage of the current LLM landscape, we evaluated 16 state-of-the-art models. Among them, 15 are open-weight models, spanning a wide range of model sizes, architectural families, and reasoning capabilities. The selection comprises small models, namely Nemotron-Nano-V2-VL (12B parameters), Mistral-Small-3.2 (24B), DeepSeek-VL2 (27B), Gemma3 (27B), and GTA1 (32B); medium-sized models, namely Qianfan-VL (70B), Molmo (72B), GLM-4.5V (108B), LLaVA-NeXT (110B), and Pixtral-Large (124B); and large models, namely Qwen3-VL (235B), InternVL3.5 (241B), Step3 (321B), Llama-4-Maverick (400B), and Kimi-K2.5 (1000B). In addition, we employed OpenAI GPT-5.4, a frontier proprietary model. To establish a human perspective on these tasks, we also conducted a user study with visualization experts to assess how people perceive rhetorical techniques and the authorial intentions behind the same misleading visualizations. This allows comparison between model and expert behavior, revealing similarities and differences that provide insights into where LLMs align with human judgment and where they diverge.

6. 【2604.01179】A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

Link: https://arxiv.org/abs/2604.01179

Authors: J. E. Domínguez-Vidal

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: narrow task-specific pipelines, provide richer semantic, richer semantic perception, Foundation vision-language models, Foundation vision-language

Comments: 5 pages, 1 figure

Abstract:Foundation vision-language models are becoming increasingly relevant to robotics because they can provide richer semantic perception than narrow task-specific pipelines. However, their practical adoption in robot software stacks still depends on reproducible middleware integrations rather than on model quality alone. Florence-2 is especially attractive in this regard because it unifies captioning, optical character recognition, open-vocabulary detection, grounding and related vision-language tasks within a comparatively manageable model size. This article presents a ROS 2 wrapper for Florence-2 that exposes the model through three complementary interaction modes: continuous topic-driven processing, synchronous service calls and asynchronous actions. The wrapper is designed for local execution and supports both native installation and Docker container deployment. It also combines generic JSON outputs with standard ROS 2 message bindings for detection-oriented tasks. A functional validation is reported together with a throughput study on several GPUs, showing that local deployment is feasible with consumer-grade hardware. The repository is publicly available.

7. 【2604.01171】Open-Set Supervised 3D Anomaly Detection: An Industrial Dataset and a Generalisable Framework for Unknown Defects

Link: https://arxiv.org/abs/2604.01171

Authors: Hanzhe Liang,Luocheng Zhang,Junyang Xia,HanLiang Zhou,Bingyang Guo,Yingxi Xie,Can Gao,Ruiyun Yu,Jinbao Wang,Pan Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: acquiring high-precision point, real manufacturing scenarios, anomaly detection assumes, computationally expensive, anomaly detection

Comments: Resources: [this https URL](https://github.com/hzzzzzhappy/open-industry)

Abstract:Although self-supervised 3D anomaly detection assumes that acquiring high-precision point clouds is computationally expensive, in real manufacturing scenarios it is often feasible to collect a limited number of anomalous samples. Therefore, we study open-set supervised 3D anomaly detection, where the model is trained with only normal samples and a small number of known anomalous samples, aiming to identify unknown anomalies at test time. We present Open-Industry, a high-quality industrial dataset containing 15 categories, each with five real anomaly types collected from production lines. We first adapt general open-set anomaly detection methods to accommodate 3D point cloud inputs better. Building upon this, we propose Open3D-AD, a point-cloud-oriented approach that leverages normal samples, simulated anomalies, and partially observed real anomalies to model the probability density distributions of normal and anomalous data. Then, we introduce a simple Correspondence Distributions Subsampling to reduce the overlap between normal and non-normal distributions, enabling stronger dual distributions modeling. Based on these contributions, we establish a comprehensive benchmark and evaluate the proposed method extensively on Open-Industry as well as established datasets including Real3D-AD and Anomaly-ShapeNet. Benchmark results and ablation studies demonstrate the effectiveness of Open3D-AD and further reveal the potential of open-set supervised 3D anomaly detection.

8. 【2604.01141】Looking into a Pixel by Nonlinear Unmixing -- A Generative Approach

Link: https://arxiv.org/abs/2604.01141

Authors: Maofeng Tang,Hairong Qi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: remote sensing imagery, hyperspectral image analysis, mixing model, sensing imagery, large footprint

Comments

Abstract:Due to the large footprint of pixels in remote sensing imagery, hyperspectral unmixing (HU) has become an important and necessary procedure in hyperspectral image analysis. Traditional HU methods rely on a prior spectral mixing model, especially for nonlinear mixtures, which has largely limited the performance and generalization capacity of the unmixing approach. In this paper, we address the challenging problem of hyperspectral nonlinear unmixing (HNU) without explicit knowledge of the mixing model. Inspired by the principle of generative models, where images of the same distribution can be generated as that of the training images without knowing the exact probability distribution function of the image, we develop an invertible mixing-unmixing process via a bi-directional GAN framework, constrained by both the cycle consistency and the linkage between linear and nonlinear mixtures. The combination of cycle consistency and linear linkage provides powerful constraints without requiring an explicit mixing model. We refer to the proposed approach as the linearly-constrained CycleGAN unmixing net, or LCGU net. Experimental results indicate that the proposed LCGU net exhibits stable and competitive performance across different datasets compared with other state-of-the-art model-based HNU methods.

9. 【2604.01130】Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling

Link: https://arxiv.org/abs/2604.01130

Authors: Zhantao Chen,Dongyi He,Jin Fang,Xi Chen,Yisuo Liu,Xiaozhen Zhong,Xuejun Hu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: traditional dart coaching, experience and visual, visual observation, observation is increasingly, increasingly inadequate

Abstract:As sports training becomes more data-driven, traditional dart coaching based mainly on experience and visual observation is increasingly inadequate for high-precision, goal-oriented movements. Although prior studies have highlighted the importance of release parameters, joint motion, and coordination in dart throwing, most quantitative methods still focus on local variables, single-release metrics, or static template matching. These approaches offer limited support for personalized training and often overlook useful movement variability. This paper presents a data-driven dart training assistance system. The system creates a closed-loop framework spanning motion capture, feature modeling, and personalized feedback. Dart-throwing data were collected in markerless conditions using a Kinect 2.0 depth sensor and an optical camera. Eighteen kinematic features were extracted from four biomechanical dimensions: three-link coordination, release velocity, multi-joint angular configuration, and postural stability. Two modules were developed: a personalized optimal throwing trajectory model that combines historical high-quality samples with the minimum jerk criterion, and a motion deviation diagnosis and recommendation model based on z-scores and hierarchical logic. A total of 2,396 throwing samples from professional and non-professional athletes were collected. Results show that the system generates smooth personalized reference trajectories consistent with natural human movement. Case studies indicate that it can detect poor trunk stability, abnormal elbow displacement, and imbalanced velocity control, then provide targeted recommendations. The framework shifts dart evaluation from deviation from a uniform standard to deviation from an individual's optimal control range, improving personalization and interpretability for darts training and other high-precision target sports.
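
The two modules described in the abstract rest on two standard building blocks: the minimum-jerk profile and per-feature z-scores. The following is a minimal sketch of both, not the authors' implementation. The feature names, the threshold of 2.0, and the function names are hypothetical, and the paper's hierarchical recommendation logic is not reproduced.

```python
import statistics


def minimum_jerk(x0, xf, n_steps):
    """Sample the classic rest-to-rest minimum-jerk profile from x0 to xf."""
    points = []
    for i in range(n_steps):
        tau = i / (n_steps - 1)  # normalized time in [0, 1]
        shape = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
        points.append(x0 + (xf - x0) * shape)
    return points


def z_scores(history, sample):
    """Z-score of each feature in `sample` against the athlete's own history.

    Assumes every feature has at least two historical values with
    non-zero spread; otherwise stdev raises or the division fails.
    """
    scores = {}
    for name, values in history.items():
        mean = statistics.mean(values)
        std = statistics.stdev(values)
        scores[name] = (sample[name] - mean) / std
    return scores


def flag_deviations(scores, threshold=2.0):
    """Names of features whose |z| exceeds the (assumed) threshold."""
    return sorted(name for name, z in scores.items() if abs(z) > threshold)


# Hypothetical history of high-quality throws for one athlete.
history = {
    "elbow_angle_deg": [90.0, 92.0, 88.0, 91.0, 89.0],
    "release_velocity": [5.0, 5.1, 4.9, 5.0, 5.0],
}
sample = {"elbow_angle_deg": 100.0, "release_velocity": 5.0}
flagged = flag_deviations(z_scores(history, sample))
trajectory = minimum_jerk(0.0, 1.0, 5)
```

Scoring a throw against the athlete's own history rather than a population template mirrors the abstract's shift from "deviation from a uniform standard" to deviation from an individual's range.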

10. 【2604.01129】ReinDriveGen: Reinforcement Post-Training for Out-of-Distribution Driving Scene Generation

Link: https://arxiv.org/abs/2604.01129

Authors: Hao Zhang,Lue Fan,Weikang Bian,Zehuan Wu,Lewei Lu,Zhaoxiang Zhang,Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: freely edit actor, edit actor trajectories, simulate safety-critical corner, safety-critical corner cases, enables full controllability

Notes: Project page: [this https URL](https://drive-sim.github.io/ReinDriveGen/)

Abstract:We present ReinDriveGen, a framework that enables full controllability over dynamic driving scenes, allowing users to freely edit actor trajectories to simulate safety-critical corner cases such as front-vehicle collisions, drifting cars, vehicles spinning out of control, pedestrians jaywalking, and cyclists cutting across lanes. Our approach constructs a dynamic 3D point cloud scene from multi-frame LiDAR data, introduces a vehicle completion module to reconstruct full 360° geometry from partial observations, and renders the edited scene into 2D condition images that guide a video diffusion model to synthesize realistic driving videos. Since such edited scenarios inevitably fall outside the training distribution, we further propose an RL-based post-training strategy with a pairwise preference model and a pairwise reward mechanism, enabling robust quality improvement under out-of-distribution conditions without ground-truth supervision. Extensive experiments demonstrate that ReinDriveGen outperforms existing approaches on edited driving scenarios and achieves state-of-the-art results on novel ego viewpoint synthesis.

11. 【2604.01118】Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

Link: https://arxiv.org/abs/2604.01118

Authors: Reyhaneh Ahani Manghotay(Simon Fraser University, Burnaby, Canada),Jie Liang(Eastern Institute of Technology, Ningbo, China)

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: monocular depth estimation, Leveraging the rich, requires extensive fine-tuning, lacks geometric precision, rich semantic features

Notes: 14 pages, 2 figures

Abstract:Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $\delta_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring few trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.
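
For readers unfamiliar with the two reported metrics, here is a small sketch of how $\delta_1$ accuracy and RMSE are commonly computed in depth estimation. It assumes the usual definition of $\delta_1$ (the fraction of pixels whose ratio $\max(\hat{d}/d,\, d/\hat{d})$ is below 1.25) and strictly positive depths given as flat lists. The abstract does not spell out the evaluation protocol, so this is illustrative only.

```python
import math


def delta1_accuracy(pred, target, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold."""
    hits = sum(1 for p, t in zip(pred, target) if max(p / t, t / p) < threshold)
    return hits / len(target)


def rmse(pred, target):
    """Root mean squared error between predicted and ground-truth depths."""
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(target))


# Toy depths in metres; the third pixel is off by a factor of two.
pred = [1.0, 2.0, 4.0, 3.0]
target = [1.0, 2.0, 2.0, 3.0]
acc = delta1_accuracy(pred, target)
err = rmse(pred, target)
```

Higher $\delta_1$ and lower RMSE are better, which is the direction of the improvements quoted in the abstract.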

12. 【2604.01116】ProTPS: Prototype-Guided Text Prompt Selection for Continual Learning

Link: https://arxiv.org/abs/2604.01116

Authors: Jie Mei,Li-Leng Peng,Keith Fuller,Jenq-Neng Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: leverage text encoders, sequentially arrived classes, encode semantic features, unique text prompts, text prompts

Abstract:For continual learning, text-prompt-based methods leverage text encoders and learnable prompts to encode semantic features for classes that arrive sequentially over time. A common challenge encountered by existing works is how to learn unique text prompts, which implicitly carry semantic information of new classes, so that the semantic features of newly arrived classes do not overlap with those of trained classes, thereby mitigating the catastrophic forgetting problem. To address this challenge, we propose a novel approach, "Prototype-guided Text Prompt Selection (ProTPS)", to intentionally increase the training flexibility, thus encouraging the learning of unique text prompts. Specifically, our ProTPS learns class-specific vision prototypes and text prompts. Vision prototypes guide the selection and learning of text prompts for each class. We first evaluate our ProTPS in both the class-incremental (CI) setting and the cross-dataset continual (CDC) learning setting. Because our ProTPS achieves performance close to the upper bounds, we further collect a real-world dataset with 112 marine species collected over a span of six years, named Marine112, to bring new challenges to the community. Marine112 is naturally suited for the class and domain incremental (CDI) learning setting and follows a natural long-tail distribution. The results under three settings show that our ProTPS performs favorably against the recent state-of-the-art methods. The implementation code and Marine112 dataset will be released upon the acceptance of our paper.

13. 【2604.01083】TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models

Link: https://arxiv.org/abs/2604.01083

Authors: Awais Khan,Muhammad Umar Farooq,Kutub Uddin,Khalid Malik

Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: audio remains authentic, remains authentic, synthesized segments, segments are spliced, speech foundation

Abstract:Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a training-free framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations without any training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks that span two languages using six speech foundation models. On PartialSpoof, TRACE achieves 8.08% EER, competitive with fine-tuned supervised baselines. On LlamaPartialSpoof, the most challenging benchmark featuring LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.
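
The hypothesis in the abstract (smooth trajectories for genuine speech, abrupt jumps at splice boundaries) can be illustrated with a toy sketch of first-order embedding dynamics. This is not the TRACE scoring rule: the Euclidean distance, the median-plus-MAD threshold, and the parameters `k` and `min_scale` are all assumptions chosen for illustration, and the frame embeddings are made-up 2-D vectors rather than real foundation-model outputs.

```python
import math
import statistics


def first_order_dynamics(frames):
    """Euclidean distance between each pair of consecutive frame embeddings."""
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]


def flag_transitions(deltas, k=3.0, min_scale=1e-6):
    """Indices of transitions larger than median + k * MAD (assumed rule).

    `min_scale` floors the MAD so that floating-point noise in an
    otherwise constant sequence does not trigger false alarms.
    """
    med = statistics.median(deltas)
    mad = statistics.median([abs(d - med) for d in deltas])
    threshold = med + k * max(mad, min_scale)
    return [i for i, d in enumerate(deltas) if d > threshold]


# Toy trajectory: steady drift, then one abrupt jump between frames 3 and 4.
frames = [
    [0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0],
    [2.3, 0.0], [2.4, 0.0], [2.5, 0.0],
]
suspicious = flag_transitions(first_order_dynamics(frames))
```

Because nothing here is learned, the same procedure applies unchanged to any frozen encoder, which is the property the abstract emphasizes.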

14. 【2604.01082】ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data

Link: https://arxiv.org/abs/2604.01082

Authors: Yaoqin Ye,Yiteng Xu,Qin Sun,Xinge Zhu,Yujing Sun,Yuexin Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: individual motion shaped, behaviors in real-world, real-world environments, environments are inherently, shaped by surrounding

Notes: accepted by CVPR 2026, project page: [this https URL](https://4dvlab.github.io/project_page/remogen/)

Abstract:Human behaviors in real-world environments are inherently interactive, with an individual's motion shaped by surrounding agents and the scene. Such capabilities are essential for applications in virtual avatars, interactive animation, and human-robot collaboration. We target real-time human interaction-to-reaction generation, which generates the ego's future motion from dynamic multi-source cues, including others' actions, scene geometry, and optional high-level semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data distributed across heterogeneous single-person, human-human, and human-scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. To support responsive online interaction, ReMoGen performs segment-level generation together with a lightweight Frame-wise Segment Refinement module that incorporates newly observed cues at the frame level, improving both responsiveness and temporal coherence without expensive full-sequence inference. Extensive experiments across human-human, human-scene, and mixed-modality interaction settings show that ReMoGen produces high-quality, coherent, and responsive reactions, while generalizing effectively across diverse interaction scenarios.

15. 【2604.01081】ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction

Link: https://arxiv.org/abs/2604.01081

Authors: Yuheng Zhang,Mengfei Duan,Kunyu Peng,Yuhang Wang,Di Wen,Danda Pani Paudel,Luc Van Gool,Kailun Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)

Keywords: long-tailed class bias, overconfidently assigning anomalies, rare classes, central to autonomous, vulnerable to long-tailed

Notes: Accepted to CVPR 2026. The source code is publicly available at [this https URL](https://github.com/7uHeng/ProOOD)

Abstract:3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail-class mIoU; on VAA-KITTI, it improves AuPRCr by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code is publicly available at https://github.com/7uHeng/ProOOD.

16. 【2604.01053】PHASOR: Anatomy- and Phase-Consistent Volumetric Diffusion for CT Virtual Contrast Enhancement

Link: https://arxiv.org/abs/2604.01053

Authors: Zilong Li,Dongyang Li,Chenglong Ma,Zhan Feng,Dakai Jin,Junping Zhang,Hao Luo,Fan Wang,Hongming Shan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Contrast-enhanced computed tomography, highlighting tissue perfusion, Contrast-enhanced computed, computed tomography, perfusion and vascularity

Abstract:Contrast-enhanced computed tomography (CECT) is pivotal for highlighting tissue perfusion and vascularity, yet its clinical ubiquity is impeded by the invasive nature of contrast agents and radiation risks. While virtual contrast enhancement (VCE) offers an alternative to synthesizing CECT from non-contrast CT (NCCT), existing methods struggle with anatomical heterogeneity and spatial misalignment, leading to inconsistent enhancement patterns and incorrect details. This paper introduces PHASOR, a volumetric diffusion framework for high-fidelity CT VCE. By treating CT volumes as coherent sequences, we leverage a video diffusion model to enhance structural coherence and volumetric accuracy. To ensure anatomy-phase consistent synthesis, we introduce two complementary modules. First, anatomy-routed mixture-of-experts (AR-MoE) anchors distinct enhancement patterns to anatomical semantics, with organ-specific memory to capture salient details. Second, intensity-phase aware representation alignment (IP-REPA) highlights intricate contrast signals while mitigating the impact of imperfect spatial alignment. Extensive experiments across three datasets demonstrate that PHASOR significantly outperforms state-of-the-art methods in both synthesis quality and enhancement accuracy.

17. 【2604.01044】A global dataset of continuous urban dashcam driving

Link: https://arxiv.org/abs/2604.01044

Authors: Md Shadab Alam,Olena Bazilinska,Pavlo Bazilinskyy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: City Road Observations, facing urban dashcam, front facing urban, manually curated dataset, City Road

Abstract:We introduce CROWD (City Road Observations With Dashcams), a manually curated dataset of ordinary, minute scale, temporally contiguous, unedited, front facing urban dashcam segments screened and segmented from publicly available YouTube videos. CROWD is designed to support cross-domain robustness and interaction analysis by prioritising routine driving and explicitly excluding crashes, crash aftermath, and other edited or incident-focused content. The release contains 51,753 segment records spanning 20,275.56 hours (42,032 videos), covering 7,103 named inhabited places in 238 countries and territories across all six inhabited continents (Africa, Asia, Europe, North America, South America and Oceania), with segment level manual labels for time of day (day or night) and vehicle type. To lower the barrier for benchmarking, we provide per-segment CSV files of machine-generated detections for all 80 MS-COCO classes produced with YOLOv11x, together with segment-local multi-object tracks (BoT-SORT); e.g. person, bicycle, motorcycle, car, bus, truck, traffic light, stop sign, etc. CROWD is distributed as video identifiers with segment boundaries and derived annotations, enabling reproducible research without redistributing the underlying videos.

18. 【2604.01043】ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

Link: https://arxiv.org/abs/2604.01043

Authors: Fengyuan Yang,Luying Huang,Jiazhi Guan,Quanwei Yang,Dongwei Pan,Jianglin Fu,Haocheng Feng,Wei He,Kaisiyuan Wang,Hang Zhou,Angela Yao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Foundation Models, Foundation Models, revolutionized human-centric video, Recent advances, critical challenge

Notes: 23 pages, 7 figures

Abstract:Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained and independent editing of subjects and scenes remains a critical challenge. Recent attempts to incorporate richer environment control through rigid 3D geometric compositions often encounter a stark trade-off between precise control and generative flexibility. Furthermore, the heavy 3D pre-processing still limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignments. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism to maintain subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. Our project has been available on: this https URL.

19. 【2604.01038】Foundation Model-guided Iteratively Prompting and Pseudo-Labeling for Partially Labeled Medical Image Segmentation

Link: https://arxiv.org/abs/2604.01038

Authors: Qiaochu Zhao,Wei Wei,David Horowitz,Richard Bakst,Yading Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated medical image, achieved remarkable progress, Automated medical, medical image segmentation, achieved remarkable

Notes: 5 pages, 5 figures. Accepted for presentation at IEEE International Symposium on Biomedical Imaging (ISBI) 2026

Abstract:Automated medical image segmentation has achieved remarkable progress with fully labeled data. However, site-specific clinical priorities and the high cost of manual annotation often yield scans with only a subset of organs labeled, leading to the partially labeled problem that degrades performance. To address this issue, we propose IPnP, an Iteratively Prompting and Pseudo-labeling framework, for partially labeled medical image segmentation. IPnP iteratively generates and refines pseudo-labels for unlabeled organs through collaboration between a trainable segmentation network (specialist) and a frozen foundation model (generalist), progressively recovering full-organ supervision. On the public dataset AMOS with the simulated partial-label setting, IPnP consistently improves segmentation performance over prior methods and approaches the performance of the fully labeled reference. We further evaluate on a private, partially labeled dataset of 210 head-and-neck cancer patients and demonstrate our effectiveness in real-world clinical settings.

20. 【2604.01032】Sub-metre Lunar DEM Generation and Validation from Chandrayaan-2 OHRC Multi-View Imagery Using Open-Source Photogrammetry

Link: https://arxiv.org/abs/2604.01032

Authors: Aaranay Aadi,Jai Singla,Nitant Dube,Oleg Alexandrov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: surface mobility planning, landing site characterization, High-resolution digital elevation, High-resolution digital, mobility planning

Notes: 17 pages, 8 figures

Abstract:High-resolution digital elevation models (DEMs) of the lunar surface are essential for surface mobility planning, landing site characterization, and planetary science. The Orbiter High Resolution Camera (OHRC) on board Chandrayaan-2 offers the finest ground sampling of any lunar orbital imager currently in use, acquiring panchromatic imagery at a resolution of roughly 20-30 cm per pixel. This work presents, for the first time, the generation of sub-metre DEMs from OHRC multi-view imagery using an exclusively open-source pipeline. Candidate stereo pairs are identified from non-paired OHRC archives through geometric analysis of image metadata, employing baseline-to-height (B/H) ratio computation and convergence angle estimation. Dense stereo correspondence and ray triangulation are then applied to generate point clouds, which are gridded into DEMs at effective spatial resolutions between approximately 24 and 54 cm across five geographically distributed lunar sites. Absolute elevation consistency is established through Iterative Closest Point (ICP) alignment against Lunar Reconnaissance Orbiter Narrow Angle Camera (NAC) Digital Terrain Models, followed by constant-bias offset correction. Validation against NAC reference terrain yields a vertical RMSE of 5.85 m (at native OHRC resolution), and a horizontal accuracy of less than 30 cm assessed by planimetric feature matching.
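
The stereo-pair screening step relies on two geometric quantities, the baseline-to-height ratio and the convergence angle. The sketch below shows one way to compute them under simplifying assumptions that are not from the paper: a flat local Cartesian frame with z pointing up, camera positions and the ground target given in the same units, and hypothetical function names.

```python
import math


def baseline_to_height(cam_a, cam_b, ground):
    """B/H ratio: camera separation over mean camera height above the target."""
    baseline = math.dist(cam_a, cam_b)
    height = ((cam_a[2] - ground[2]) + (cam_b[2] - ground[2])) / 2
    return baseline / height


def convergence_angle_deg(cam_a, cam_b, ground):
    """Angle in degrees between the two rays from the ground target to each camera."""
    ray_a = [c - g for c, g in zip(cam_a, ground)]
    ray_b = [c - g for c, g in zip(cam_b, ground)]
    dot = sum(x * y for x, y in zip(ray_a, ray_b))
    cos_angle = dot / (math.hypot(*ray_a) * math.hypot(*ray_b))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


# Toy geometry in kilometres: two orbital positions 100 km apart, 100 km up.
cam_a = (-50.0, 0.0, 100.0)
cam_b = (50.0, 0.0, 100.0)
ground = (0.0, 0.0, 0.0)
ratio = baseline_to_height(cam_a, cam_b, ground)
angle = convergence_angle_deg(cam_a, cam_b, ground)
```

A real pipeline would derive positions and look directions from image metadata in a body-fixed lunar frame; the acceptance thresholds used by the authors are not given in the abstract.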

21. 【2604.01030】Diff3R: Feed-forward 3D Gaussian Splatting with Uncertainty-aware Differentiable Optimization

Link: https://arxiv.org/abs/2604.01030

Authors: Yueh-Cheng Liu,Jozef Hladký,Matthias Nießner,Angela Dai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, offer fast inference, yields high-quality renderings, Recent advances, per-scene optimization yields

Notes: Project page: [this https URL](https://liu115.github.io/diff3r) , Video: [this https URL](https://www.youtube.com/watch?v=IxzNSAdUY70)

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) present two main directions: feed-forward models offer fast inference in sparse-view settings, while per-scene optimization yields high-quality renderings but is computationally expensive. To combine the benefits of both, we introduce Diff3R, a novel framework that explicitly bridges feed-forward prediction and test-time optimization. By incorporating a differentiable 3DGS optimization layer directly into the training loop, our network learns to predict an optimal initialization for test-time optimization rather than a conventional zero-shot result. To overcome the computational cost of backpropagating through the optimization steps, we propose computing gradients via the Implicit Function Theorem and a scalable, matrix-free PCG solver tailored for 3DGS optimization. Additionally, we incorporate a data-driven uncertainty model into the optimization process by adaptively controlling how much the parameters are allowed to change during optimization. This approach effectively mitigates overfitting in under-constrained regions and increases robustness against input outliers. Since our proposed optimization layer is model-agnostic, we show that it can be seamlessly integrated into existing feed-forward 3DGS architectures for both pose-given and pose-free methods, providing improvements for test-time optimization.

22. 【2604.01015】Forecasting Motion in the Wild

Link: https://arxiv.org/abs/2604.01015

Authors: Neerja Thakkar,Shiry Ginosar,Jacob Walker,Jitendra Malik,Joao Carreira,Carl Doersch

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision systems lack, intelligence requires anticipating, Visual intelligence requires, requires anticipating, anticipating the future

Notes: project page: [this https URL](https://motion-forecasting.github.io/)

Abstract:Visual intelligence requires anticipating the future behavior of agents, yet vision systems lack a general representation for motion and behavior. We propose dense point trajectories as visual tokens for behavior, a structured mid-level representation that disentangles motion from appearance and generalizes across diverse non-rigid agents, such as animals in-the-wild. Building on this abstraction, we design a diffusion transformer that models unordered sets of trajectories and explicitly reasons about occlusion, enabling coherent forecasts of complex motion patterns. To evaluate at scale, we curate 300 hours of unconstrained animal video with robust shot detection and camera-motion compensation. Experiments show that forecasting trajectory tokens achieves category-agnostic, data-efficient prediction, outperforms state-of-the-art baselines, and generalizes to rare species and morphologies, providing a foundation for predictive visual intelligence in the wild.

23. 【2604.01014】AutoMIA: Improved Baselines for Membership Inference Attack via Agentic Self-Exploration

Link: https://arxiv.org/abs/2604.01014

Authors: Ruhao Liu,Weiqi Huang,Qi Li,Xinchao Wang

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: fundamental auditing tool, evaluating training data, training data leakage, machine learning models, Membership Inference

Abstract:Membership Inference Attacks (MIAs) serve as a fundamental auditing tool for evaluating training data leakage in machine learning models. However, existing methodologies predominantly rely on static, handcrafted heuristics that lack adaptability, often leading to suboptimal performance when transferred across different large models. In this work, we propose AutoMIA, an agentic framework that reformulates membership inference as an automated process of self-exploration and strategy evolution. Given high-level scenario specifications, AutoMIA self-explores the attack space by generating executable logits-level strategies and progressively refining them through closed-loop evaluation feedback. By decoupling abstract strategy reasoning from low-level execution, our framework enables a systematic, model-agnostic traversal of the attack search space. Extensive experiments demonstrate that AutoMIA consistently matches or outperforms state-of-the-art baselines while eliminating the need for manual feature engineering.

24. 【2604.01010】PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks

Link: https://arxiv.org/abs/2604.01010

Authors: Jingning Xu,Haochen Luo,Chen Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Vision-language models, Vision-language, PDA, adversarial, adversarial image

Abstract:Vision-language models (VLMs) are vulnerable to adversarial image perturbations. Existing works based on adversarial training against task-specific adversarial examples are computationally expensive and often fail to generalize to unseen attack types. To address these limitations, we introduce Paraphrase-Decomposition-Aggregation (PDA), a training-free defense framework that leverages text augmentation to enhance VLM robustness under diverse adversarial image attacks. PDA performs prompt paraphrasing, question decomposition, and consistency aggregation entirely at test time, thus requiring no modification to the underlying models. To balance robustness and efficiency, we instantiate PDA as variants that reduce the inference cost while retaining most of its robustness gains. Experiments on multiple VLM architectures and benchmarks for visual question answering, classification, and captioning show that PDA achieves consistent robustness gains against various adversarial perturbations while maintaining competitive clean accuracy, establishing a generic, strong and practical defense framework for VLMs during inference.

25. 【2604.01002】Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding

Link: https://arxiv.org/abs/2604.01002

Authors: Yiheng Wang,Lichen Zhu,Yueqian Lin,Yudong Liu,Jingyang Zhang,Hai "Helen" Li,Yiran Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame's contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to independent frame-level scoring. We further introduce a query-conditioned evidence scoring network trained with a contrastive objective to estimate evidential importance efficiently. Experiments on long-form video understanding benchmarks show that our method consistently outperforms prior sampling strategies under strict token budgets, while significantly improving training efficiency.
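
Once subset selection is reduced to independent frame-level scoring, as the abstract describes, choosing keyframes under a fixed budget becomes a simple top-k pick. The sketch below illustrates only that final step. The scores are made-up numbers standing in for the output of the paper's evidence scoring network, and `select_keyframes` is a hypothetical helper name.

```python
def select_keyframes(frame_scores, budget):
    """Return indices of the `budget` highest-scoring frames, in temporal order."""
    ranked = sorted(
        range(len(frame_scores)),
        key=lambda i: frame_scores[i],
        reverse=True,
    )
    return sorted(ranked[:budget])


# Hypothetical per-frame evidence scores for a five-frame clip.
scores = [0.10, 0.90, 0.30, 0.80, 0.05]
keyframes = select_keyframes(scores, budget=2)
```

Sorting the chosen indices restores temporal order, which matters when the selected frames are fed to a model as a sequence.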

26. 【2604.01001】EgoSim: Egocentric World Simulator for Embodied Interaction Generation

Link: https://arxiv.org/abs/2604.01001

Authors: Jinkun Hao,Mingda Jia,Ruiyan Wang,Xihui Liu,Ran Yi,Lizhuang Ma,Jiangmiao Pang,Xudong Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: generates spatially consistent, spatially consistent interaction, spatially consistent, closed-loop egocentric world, Interaction-aware State Updating

Notes: Project Page: [this http URL](http://egosimulator.github.io)

Abstract:We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. We generate embodiment interactions via a Geometry-action-aware Observation Simulation model, with spatial consistency from an Interaction-aware State Updating module. To overcome the critical data bottleneck posed by the difficulty in acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from in-the-wild large-scale monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in terms of visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Code and datasets will be released soon. The project page is at http://egosimulator.github.io.

27. 【2604.00998】Customizing Large Vision Model-Guided Low-Rank Approximation for Ground-Roll Denoise

Link: https://arxiv.org/abs/2604.00998

Authors: Jiacheng Liao, Feng Qian, Ziyin Fan, Yongjian Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: degrading subsequent imaging, vertical seismic profiling, severely masking reflection, masking reflection events, severely masking

Comments

Click to view abstract

Abstract:Ground-roll is a dominant source of coherent noise in land and vertical seismic profiling (VSP) data, severely masking reflection events and degrading subsequent imaging and interpretation. Conventional attenuation methods, including transform-domain filtering, sparse representation, and deep learning, often suffer from limited adaptability, signal leakage, or dependence on labeled training data, especially under strong signal-noise overlap. To address these challenges, we propose a training-free framework that reformulates ground-roll attenuation as a semantic-guided signal separation problem. Specifically, a promptable large vision model is employed to extract high-level semantic priors by converting seismic gathers into visual representations and localizing ground-roll-dominant regions via text or image prompts. The resulting semantic response is transformed into a continuous soft mask, which is embedded into a mask-conditioned low-rank inverse formulation to enable spatially adaptive suppression and reflection-preserving reconstruction. An efficient alternating direction method of multipliers (ADMM)-based solver is further developed to solve the proposed inverse problem, enabling stable and physically consistent signal recovery without requiring task-specific training or manual annotation. Extensive experiments on both synthetic and field VSP datasets demonstrate that the proposed method achieves superior ground-roll attenuation while preserving reflection continuity and waveform fidelity, consistently outperforming representative transform-domain filtering and implicit neural representation methods.

28. 【2604.00985】Maximizing T2-Only Prostate Cancer Localization from Expected Diffusion Weighted Imaging

Link: https://arxiv.org/abs/2604.00985

Authors: Weixi Yi, Yipei Wang, Wen Yan, Hanyuan Zhang, Natasha Thorley, Alexander Ng, Shonit Punwani, Fernando Bianco, Mark Emberton, Veeru Kasivisvanathan, Dean C. Barratt, Shaheer U. Saeed, Yipeng Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multiparametric MRI, MRI is increasingly, first-line noninvasive approach, requiring at minimum, minimum diffusion-weighted

Comments

Click to view abstract

Abstract:Multiparametric MRI is increasingly recommended as a first-line noninvasive approach to detect and localize prostate cancer, requiring at minimum diffusion-weighted (DWI) and T2-weighted (T2w) MR sequences. Early machine learning attempts using only T2w images have shown promising diagnostic performance in segmenting radiologist-annotated lesions. Such uni-modal T2-only approaches deliver substantial clinical benefits by reducing the costs and expertise required to acquire other sequences. This work investigates an arguably more challenging application that uses only T2w at inference but localizes individual cancers based on independent histopathology labels. We formulate DWI images as a latent modality (readily available during training) to classify cancer presence at local Barzell zones, given only T2w images as input. In the resulting expectation-maximization algorithm, a latent modality generator (implemented using a flow matching-based generative model) approximates the latent DWI image posterior distribution in the E-steps, while in the M-steps a cancer localizer is simultaneously optimized with the generative model to maximize the expected likelihood of cancer presence. The proposed approach provides a novel theoretical framework for learning from a privileged DWI modality, yielding superior cancer localization performance compared to approaches that lack training DWI images or existing frameworks for privileged learning and incomplete modalities. The proposed T2-only methods perform competitively with or better than baseline methods using multiple input sequences (e.g., improving the patient-level F1 score by 14.4% and zone-level QWK by 5.3% over the T2w+DWI baseline). We present quantitative evaluations using internal and external datasets from 4,133 prostate cancer patients with histopathology-verified labels.

29. 【2604.00983】ACT Now: Preempting LVLM Hallucinations via Adaptive Context Integration

Link: https://arxiv.org/abs/2604.00983

Authors: Bei Yan, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, severe hallucination issues, frequently suffer, Large Vision-Language, Vision-Language Models

Comments

Click to view abstract

Abstract:Large Vision-Language Models (LVLMs) frequently suffer from severe hallucination issues. Existing mitigation strategies predominantly rely on isolated, single-step states to enhance visual focus or suppress strong linguistic priors. However, these static approaches neglect dynamic context changes across the generation process and struggle to correct inherited information loss. To address this limitation, we propose Adaptive Context inTegration (ACT), a training-free inference intervention method that mitigates hallucination through the adaptive integration of contextual information. Specifically, we first propose visual context exploration, which leverages spatio-temporal profiling to adaptively amplify attention heads responsible for visual exploration. To further facilitate vision-language alignment, we propose semantic context aggregation that marginalizes potential semantic queries to effectively aggregate visual evidence, thereby resolving the information loss caused by the discrete nature of token prediction. Extensive experiments across diverse LVLMs demonstrate that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks, acting as a robust and highly adaptable solution without compromising fundamental generation capabilities.

30. 【2604.00969】DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving

Link: https://arxiv.org/abs/2604.00969

Authors: Yiyao Zhu, Ying Xue, Haiming Zhang, Guangfeng Jiang, Wending Zhou, Xu Yan, Jiantao Gao, Yingjie Cai, Bingbing Liu, Zhen Li, Shaojie Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Bird Eye View, Vision-based autonomous driving, Vision-based autonomous, Latent World Models, gained much attention

Comments: Accepted by CVPR 2026

Click to view abstract

Abstract:Vision-based autonomous driving has gained much attention due to its low cost and excellent performance. Compared with dense BEV (Bird's Eye View) or sparse query models, the Gaussian-centric method offers a comprehensive yet sparse representation by describing the scene with 3D semantic Gaussians. In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic Gaussian-centric pre-training in autonomous driving in two stages. In the first stage, DLWM predicts 3D Gaussians from queries through self-supervised reconstruction of multi-view semantic and depth images. In the second stage, equipped with fine-grained contextual features, two latent world models are trained separately for temporal feature learning: Gaussian-flow-guided latent prediction for downstream occupancy perception and forecasting tasks, and ego-planning-guided latent prediction for motion planning. Extensive experiments on the SurroundOcc and nuScenes benchmarks demonstrate that DLWM achieves significant performance gains across Gaussian-centric 3D occupancy perception, 4D occupancy forecasting, and motion planning tasks.

31. 【2604.00955】Enhancing Gradient Inversion Attacks in Federated Learning via Hierarchical Feature Optimization

Link: https://arxiv.org/abs/2604.00955

Authors: Hao Fang, Wenbo Yu, Bin Chen, Xuan Wang, Shu-Tao Xia, Qing Liao, Ke Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: distributed machine learning, allowing multiple clients, Federated Learning, privacy-preserving distributed machine, transmitting locally computed

Comments

Click to view abstract

Abstract:Federated Learning (FL) has emerged as a compelling paradigm for privacy-preserving distributed machine learning, allowing multiple clients to collaboratively train a global model by transmitting locally computed gradients to a central server without exposing their private data. Nonetheless, recent studies find that the gradients exchanged in the FL system are also vulnerable to privacy leakage, e.g., an attacker can invert shared gradients to reconstruct sensitive data by leveraging pre-trained generative adversarial networks (GANs) as prior knowledge. However, existing attacks simply perform gradient inversion in the latent space of the GAN model, which limits their expressiveness and generalizability. To tackle these challenges, we propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the hierarchical features of the intermediate layers. Instead of optimizing only over the initial latent code, we progressively change the optimized layer, from the initial latent space to intermediate layers closer to the output images. In addition, we design a regularizer that avoids unrealistic image generation by adding a small $l_1$ ball constraint to the search range. We also extend GIFD to the out-of-distribution (OOD) setting, which weakens the assumption that the training sets of GANs and FL tasks obey the same data distribution. Furthermore, we consider the challenging OOD scenario of label inconsistency and propose a label mapping technique as an effective solution. Extensive experiments demonstrate that our method can achieve pixel-level reconstruction and outperform competitive baselines across a variety of FL scenarios.

32. 【2604.00940】YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction

Link: https://arxiv.org/abs/2604.00940

Authors: Miro Miranda, Deepak Pathak, Patrick Helber, Benjamin Bischke, Hiba Najjar, Francisco Mena, Cristhian Sanchez, Akshay Pai, Diego Arenas, Matias Valdenegro-Toro, Marcela Charfuelan, Marlon Nuske, Andreas Dengel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: prediction requires substantial, Crop yield prediction, requires substantial data, yield prediction requires, yield prediction

Comments

Click to view abstract

Abstract:Crop yield prediction requires substantial data to train scalable models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to regional levels or single crop types, hindering the development of scalable data-driven solutions. In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across multiple countries (Argentina, Brazil, Uruguay, and Germany) and covers major crop types (corn, rapeseed, soybeans, and wheat) across 2,173 expert-curated fields. In total, over 12.2 million yield samples are available, each with a spatial resolution of 10 m. Each field is paired with multispectral satellite imagery, resulting in 113,555 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale and high-resolution crop yield prediction as a pixel regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data under real-world conditions. To mitigate this, we explore a domain-informed Deep Ensemble approach that exhibits significant performance gains. The dataset is available at this https URL.

33. 【2604.00933】EmoScene: A Dual-space Dataset for Controllable Affective Image Generation

Link: https://arxiv.org/abs/2604.00933

Authors: Li He, Longtai Zhang, Wenqiang Zhang, Yan Wang, Lizhe Qi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: tone remains challenging, high visual fidelity, achieved high visual, fine-grained affective tone, affective tone remains

Comments

Click to view abstract

Abstract:Text-to-image diffusion models have achieved high visual fidelity, yet precise control over scene semantics and fine-grained affective tone remains challenging. Human visual affect arises from the rapid integration of contextual meaning, including valence, arousal, and dominance, with perceptual cues such as color harmony, luminance contrast, texture variation, curvature, and spatial layout. However, current text-to-image models rarely represent affective and perceptual factors within a unified representation, which limits their ability to synthesize scenes with coherent and nuanced emotional intent. To address this gap, we construct EmoScene, a large-scale dual-space emotion dataset that jointly encodes affective dimensions and perceptual attributes, with contextual semantics provided as supporting annotations. EmoScene contains 1.2M images across more than three hundred real-world scene categories, each annotated with discrete emotion labels, continuous VAD values, perceptual descriptors and textual captions. Multi-space analyses reveal how discrete emotions occupy the VAD space and how affect systematically correlates with scene-level perceptual factors. To benchmark EmoScene, we provide a lightweight reference baseline that injects dual-space controls into a frozen diffusion backbone via shallow cross-attention modulation, serving as a reproducible probe of affect controllability enabled by dual-space supervision.

34. 【2604.00928】Autoregressive Appearance Prediction for 3D Gaussian Avatars

Link: https://arxiv.org/abs/2604.00928

Authors: Michael Steiner, Zhang Chen, Alexander Richard, Vasu Agrawal, Markus Steinberger, Michael Zollhöfer

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: subtle facial expressions, demands capturing fine, characteristic motion patterns, experience demands capturing, human avatar experience

Comments: Project Page: [this https URL](https://steimich96.github.io/AAP-3DGA/)

Click to view abstract

Abstract:A photorealistic and immersive human avatar experience demands capturing fine, person-specific details such as cloth and hair dynamics, subtle facial expressions, and characteristic motion patterns. Achieving this requires large, high-quality datasets, which often introduce ambiguities and spurious correlations when very similar poses correspond to different appearances. Models that fit these details during training can overfit and produce unstable, abrupt appearance changes for novel poses. We propose a 3D Gaussian Splatting avatar model with a spatial MLP backbone that is conditioned on both pose and an appearance latent. The latent is learned during training by an encoder, yielding a compact representation that improves reconstruction quality and helps disambiguate pose-driven renderings. At driving time, our predictor autoregressively infers the latent, producing temporally smooth appearance evolution and improved stability. Overall, our method delivers a robust and practical path to high-fidelity, stable avatar driving.

35. 【2604.00927】Learning Quantised Structure-Preserving Motion Representations for Dance Fingerprinting

Link: https://arxiv.org/abs/2604.00927

Authors: Arina Kharlamova, Bowei He, Chen Ma, Xue Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: identifying semantically similar, semantically similar choreographies, similar choreographies directly, framework for motion-based, raw video

Comments

Click to view abstract

Abstract:We present DANCEMATCH, an end-to-end framework for motion-based dance retrieval, the task of identifying semantically similar choreographies directly from raw video, defined as DANCE FINGERPRINTING. While existing motion analysis and retrieval methods can compare pose sequences, they rely on continuous embeddings that are difficult to index, interpret, or scale. In contrast, DANCEMATCH constructs compact, discrete motion signatures that capture the spatio-temporal structure of dance while enabling efficient large-scale retrieval. Our system integrates Skeleton Motion Quantisation (SMQ) with Spatio-Temporal Transformers (STT) to encode human poses, extracted via Apple CoMotion, into a structured motion vocabulary. We further design DANCE RETRIEVAL ENGINE (DRE), which performs sub-linear retrieval using a histogram-based index followed by re-ranking for refined matching. To facilitate reproducible research, we release DANCETYPESBENCHMARK, a pose-aligned dataset annotated with quantised motion tokens. Experiments demonstrate robust retrieval across diverse dance styles and strong generalisation to unseen choreographies, establishing a foundation for scalable motion fingerprinting and quantitative choreographic analysis.

36. 【2604.00921】Representation Selection via Cross-Model Agreement using Canonical Correlation Analysis

Link: https://arxiv.org/abs/2604.00921

Authors: Dylan B. Lewis, Jens Gregor, Hector Santos-Villalobos

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Modern vision pipelines, vision pipelines increasingly, pipelines increasingly rely, Modern vision, pretrained image encoders

Comments: 9 pages, 5 figures, 6 tables

Click to view abstract

Abstract:Modern vision pipelines increasingly rely on pretrained image encoders whose representations are reused across tasks and models, yet these representations are often overcomplete and model-specific. We propose a simple, training-free method to improve the efficiency of image representations via a post-hoc canonical correlation analysis (CCA) operator. By leveraging the shared structure between representations produced by two pre-trained image encoders, our method finds linear projections that serve as a principled form of representation selection and dimensionality reduction, retaining shared semantic content while discarding redundant dimensions. Unlike standard dimensionality reduction techniques such as PCA, which operate on a single embedding space, our approach leverages cross-model agreement to guide representation distillation and refinement. The technique allows representations to be reduced by more than 75% in dimensionality with improved downstream performance, or enhanced at fixed dimensionality via post-hoc representation transfer from larger or fine-tuned models. Empirical results on ImageNet-1k, CIFAR-100, MNIST, and additional benchmarks show consistent improvements over both baseline and PCA-projected representations, with accuracy gains of up to 12.6%.
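The abstract describes finding linear projections from the shared structure between two encoders' representations. The sketch below shows textbook CCA computed with NumPy SVDs, as a rough illustration of that idea rather than the paper's operator. The function name `cca_projections` and the synthetic data are hypothetical, and the code assumes more samples than features and full-column-rank embeddings.

```python
import numpy as np

def cca_projections(X, Y, k):
    """Classical CCA between two embedding matrices (rows are paired samples).

    Returns projection matrices Wx, Wy and the top-k canonical correlations.
    Assumes X and Y have full column rank after centering (an assumption of
    this sketch; real embeddings may need regularization).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Thin SVDs give orthonormal bases for each centered embedding space.
    Ux, sx, Vxt = np.linalg.svd(Xc, full_matrices=False)
    Uy, sy, Vyt = np.linalg.svd(Yc, full_matrices=False)
    # Singular values of Ux^T Uy are the canonical correlations.
    A, corr, Bt = np.linalg.svd(Ux.T @ Uy, full_matrices=False)
    # Map the canonical directions back to the original feature coordinates.
    Wx = Vxt.T @ (A[:, :k] / sx[:, None])
    Wy = Vyt.T @ (Bt.T[:, :k] / sy[:, None])
    return Wx, Wy, corr[:k]

# Synthetic stand-ins for embeddings of the same images from two encoders.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
Y = rng.normal(size=(200, 6))
Wx, Wy, corr = cca_projections(X, Y, k=3)
print(Wx.shape, Wy.shape, corr.shape)
```

Projecting with `X @ Wx` keeps only the `k` directions on which the two models agree most, which is one way to read "representation selection" here.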

37. 【2604.00913】Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

Link: https://arxiv.org/abs/2604.00913

Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: detect errors, monitor progress, intelligent assistants, assembly diagrams, Abstract

Comments

Click to view abstract

Abstract:2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: this https URL

38. 【2604.00912】ProCap: Projection-Aware Captioning for Spatial Augmented Reality

Link: https://arxiv.org/abs/2604.00912

Authors: Zimo Cao, Yuchen Deng, Haibin Ling, Bingyao Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Spatial augmented reality, creating immersive experience, Spatial augmented, directly projects digital, augmented reality

Comments: 16 pages, 7 figures

Click to view abstract

Abstract:Spatial augmented reality (SAR) directly projects digital content onto physical scenes using projectors, creating an immersive experience without head-mounted displays. However, for SAR to support intelligent interaction, such as reasoning about the scene or answering user queries, it must semantically distinguish between the physical scene and the projected content. Standard Vision Language Models (VLMs) struggle with this virtual-physical ambiguity, often confusing the two contexts. To address this issue, we introduce ProCap, a novel framework that explicitly decouples projected content from physical scenes. ProCap employs a two-stage pipeline: first, it visually isolates virtual and physical layers via automated segmentation; then, it uses region-aware retrieval to avoid the ambiguous semantic context caused by projection distortion. To support this, we present RGBP (RGB + Projections), the first large-scale SAR semantic benchmark dataset, featuring 65 diverse physical scenes and over 180,000 projections with dense, decoupled annotations. Finally, we establish a dual-captioning evaluation protocol using task-specific tokens to assess physical scene and projection descriptions independently. Our experiments show that ProCap provides a robust semantic foundation for future SAR research. The source code, pre-trained models and the RGBP dataset are available on the project page: this https URL.

39. 【2604.00909】JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation

Link: https://arxiv.org/abs/2604.00909

Authors: Issa Sugiura, Koki Maeda, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Naoaki Okazaki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Japanese VQA, development of vision-language, Japanese VQA benchmarks, Japanese, evaluation

Comments: 16 pages, 11 figures

Click to view abstract

Abstract:Reliable evaluation is essential for the development of vision-language models (VLMs). However, Japanese VQA benchmarks have undergone far less iterative refinement than their English counterparts. As a result, many existing benchmarks contain issues such as ambiguous questions, incorrect answers, and instances that can be solved without visual grounding, undermining evaluation reliability and leading to misleading conclusions in model comparisons. To address these limitations, we introduce JAMMEval, a refined collection of Japanese benchmarks for reliable VLM evaluation. It is constructed by systematically refining seven existing Japanese benchmark datasets through two rounds of human annotation, improving both data quality and evaluation reliability. In our experiments, we evaluate open-weight and proprietary VLMs on JAMMEval and analyze the capabilities of recent models on Japanese VQA. We further demonstrate the effectiveness of our refinement by showing that the resulting benchmarks yield evaluation scores that better reflect model capability, exhibit lower run-to-run variance, and improve the ability to distinguish between models of different capability levels. We release our dataset and code to advance reliable evaluation of VLMs.

40. 【2604.00903】IDDM: Identity-Decoupled Personalized Diffusion Models with a Tunable Privacy-Utility Trade-off

Link: https://arxiv.org/abs/2604.00903

Authors: Linyan Dai, Xinwei Zhang, Haoyang Li, Qingqing Ye, Haibo Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthesize high-fidelity avatars, synthesize high-fidelity, high-fidelity avatars, enable users, social expression

Comments

Click to view abstract

Abstract:Personalized text-to-image diffusion models (e.g., DreamBooth, LoRA) enable users to synthesize high-fidelity avatars from a few reference photos for social expression. However, once these generations are shared on social media platforms (e.g., Instagram, Facebook), they can be linked to the real user via face recognition systems, enabling identity tracking and profiling. Existing defenses mainly follow an anti-personalization strategy that protects publicly released reference photos by disrupting model fine-tuning. While effective against unauthorized personalization, they do not address another practical setting in which personalization is authorized, but the resulting public outputs still leak identity information. To address this problem, we introduce a new defense setting, termed model-side output immunization, whose goal is to produce a personalized model that supports authorized personalization while reducing the identity linkability of public generations, with tunable control over the privacy-utility trade-off to accommodate diverse privacy needs. To this end, we propose Identity-Decoupled personalized Diffusion Models (IDDM), a model-side defense that integrates identity decoupling into the personalization pipeline. Concretely, IDDM follows an alternating procedure that interleaves short personalization updates with identity-decoupled data optimization, using a two-stage schedule to balance identity linkability suppression and generation utility. Extensive experiments across multiple datasets, diverse prompts, and state-of-the-art face recognition systems show that IDDM consistently reduces identity linkability while preserving high-quality personalized generation.

41. 【2604.00897】Super-Resolving Coarse-Resolution Weather Forecasts With Flow Matching

Link: https://arxiv.org/abs/2604.00897

Authors: Aymeric Delefosse, Anastase Charantonis, Dominique Béréziat

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Machine learning-based weather, numerical weather prediction, weather prediction systems, remains computationally expensive, resolution remains computationally

Comments: Accepted to Climate Informatics 2026

Click to view abstract

Abstract:Machine learning-based weather forecasting models now surpass state-of-the-art numerical weather prediction systems, but training and operating these models at high spatial resolution remains computationally expensive. We present a modular framework that decouples forecasting from spatial resolution by applying learned generative super-resolution as a post-processing step to coarse-resolution forecast trajectories. We formulate super-resolution as a stochastic inverse problem, using a residual formulation to preserve large-scale structure while reconstructing unresolved variability. The model is trained with flow matching exclusively on reanalysis data and is applied to global medium-range forecasts. We evaluate (i) design consistency by re-coarsening super-resolved forecasts and comparing them to the original coarse trajectories, and (ii) high-resolution forecast quality using standard ensemble verification metrics and spectral diagnostics. Results show that super-resolution preserves large-scale structure and variance after re-coarsening, introduces physically consistent small-scale variability, and achieves competitive probabilistic forecast skill at 0.25° resolution relative to an operational ensemble baseline, while requiring only a modest additional training cost compared with end-to-end high-resolution forecasting.
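The design-consistency check in (i) re-coarsens a super-resolved field and compares it with the original coarse field. A minimal sketch of that check follows, assuming plain block averaging on a regular grid. The paper's actual coarsening operator is not specified in the abstract, and a real latitude-longitude grid would likely need area weighting. The name `recoarsen` and the toy fields are hypothetical.

```python
import numpy as np

def recoarsen(field, factor):
    """Block-average a 2D field by an integer factor.

    Assumes both dimensions are divisible by `factor` and ignores
    latitude-dependent cell areas (a simplification of this sketch).
    """
    h, w = field.shape
    blocks = field.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# Toy example: a coarse field, upsampled, plus zero-mean small-scale detail.
coarse = np.arange(6, dtype=float).reshape(2, 3)
detail = np.tile(np.array([[1.0, -1.0], [-1.0, 1.0]]), (2, 3))
fine = np.kron(coarse, np.ones((2, 2))) + detail

# Small-scale variability that averages to zero within each block
# leaves the re-coarsened field equal to the original coarse field.
print(np.abs(recoarsen(fine, 2) - coarse).max())
```

A residual formulation fits this check naturally: added detail that averages out within each coarse cell leaves the large-scale structure unchanged.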

42. 【2604.00890】Beyond Symbolic Solving: Multi Chain-of-Thought Voting for Geometric Reasoning in Large Language Models

Link: https://arxiv.org/abs/2604.00890

Authors: Md. Abu Bakor Siddique, Shahrin Hossain, Sadman Ahmed Siam, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Geometric Problem Solving, Geometric Problem, enhancing mathematical reasoning, large language models, heart of enhancing

Comments: Under review, 4 figures, 7 tables

Click to view abstract

Abstract:Geometric Problem Solving (GPS) remains at the heart of enhancing mathematical reasoning in large language models because it requires the combination of diagrammatic understanding, symbolic manipulation and logical inference. In the existing literature, researchers have chiefly focused on synchronising the diagram descriptions with text literals and solving the problem. In this vein, they have taken either a neural, symbolic or neuro-symbolic approach. But this addresses only the first two of the requirements, namely diagrammatic understanding and symbolic manipulation, while leaving logical inference underdeveloped. The logical inference is often limited to one chain-of-thought (CoT). To address this weakness in existing models, this paper proposes MARS-GPS, which generates multiple parallel reasoning rollouts augmented with Python code execution for numerical verification, ranks them using token-level entropy as a confidence signal, and aggregates answers through a multi-stage voting and self-verification pipeline. Empirical results show that MARS-GPS with 8 parallel rollouts achieves 88.8% on Geometry3K, an improvement of nearly 11% over the prior state-of-the-art, with accuracy scaling consistently as the number of rollouts increases from 1 to 16 (+6.0% on the ablation subset). We provide our code and data in an anonymous repository: this https URL.
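The abstract describes ranking rollouts by token-level entropy and aggregating answers by voting. The sketch below illustrates one simple way to combine the two: a vote in which lower-entropy rollouts count for more. The weighting rule `1 / (1 + entropy)` and both function names are assumptions made for this example. The paper's multi-stage voting and self-verification pipeline is not reproduced here.

```python
import math
from collections import defaultdict

def mean_token_entropy(token_dists):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    total = 0.0
    for dist in token_dists:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)

def confidence_weighted_vote(rollouts):
    """Pick the answer with the largest total confidence.

    `rollouts` is a list of (answer, mean_entropy) pairs; lower entropy
    means higher confidence. The weight 1 / (1 + entropy) is a
    hypothetical choice for this sketch.
    """
    totals = defaultdict(float)
    for answer, entropy in rollouts:
        totals[answer] += 1.0 / (1.0 + entropy)
    return max(totals, key=totals.get)

rollouts = [("12", 0.2), ("15", 0.1), ("12", 0.9)]
print(confidence_weighted_vote(rollouts))  # 12
```

In this example the single most confident rollout answers "15", but two agreeing rollouts outweigh it, which is the usual motivation for aggregating rather than taking the top-ranked chain.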

43. 【2604.00887】Adversarial Attenuation Patch Attack for SAR Object Detection

Link: https://arxiv.org/abs/2604.00887

Authors: Yiming Zhang, Weibo Qin, Feng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: Deep neural networks, Deep neural, demonstrated excellent performance, SAR target detection, Adversarial Attenuation Patch

Comments: 5 pages, 4 figures. Source code is available at [this https URL](https://github.com/boremycin/SAAP)

Click to view abstract

Abstract:Deep neural networks have demonstrated excellent performance in SAR target detection tasks but remain susceptible to adversarial attacks. Existing SAR-specific attack methods can effectively deceive detectors; however, they often introduce noticeable perturbations and are largely confined to the digital domain, neglecting the physical implementation constraints of attacking SAR systems. In this paper, a novel Adversarial Attenuation Patch (AAP) method is proposed that employs an energy-constrained optimization strategy coupled with an attenuation-based deployment framework to balance attack effectiveness and stealthiness. More importantly, AAP exhibits strong potential for physical realization by aligning with signal-level electronic jamming mechanisms. Experimental results show that AAP effectively degrades detection performance while preserving high imperceptibility, and shows favorable transferability across different models. This study provides a physically grounded perspective on adversarial attacks against SAR target detection systems and facilitates the design of more covert and practically deployable attack strategies. The source code is available at https://github.com/boremycin/SAAP.

44. 【2604.00886】PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding

Link: https://arxiv.org/abs/2604.00886

Authors: Nan Wang, Zhiwei Jin, Chen Chen, Haonan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: heavy computational burden, impose exceptionally heavy, exceptionally heavy computational, elements demand high-resolution, demand high-resolution inputs

Comments

Click to view abstract

Abstract:Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose an exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful: across document and GUI benchmarks, only 22-71% of image patches are pixel-unique, the rest being exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches before the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports pixel-lossless compression ($\tau{=}0$) as well as controlled lossy compression ($\tau{>}0$). Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2$\times$ inference speedup and 1.9$\times$ training acceleration. Code is available at this https URL.
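
To make the redundancy observation in this abstract concrete, the sketch below splits an image into non-overlapping patches and measures the share of distinct patch contents. It is a minimal illustration, not PixelPrune's predictive-coding algorithm: the function name, the nested-list image format, and the choice to count distinct patches over total patches as the "pixel-unique" share are all assumptions.

```python
def unique_patch_fraction(image, patch):
    """Fraction of distinct patches among all non-overlapping patch x patch tiles.

    image: 2D list of pixel values (H x W). H and W are assumed to be
    multiples of `patch`. Two tiles count as duplicates only if every
    pixel matches exactly.
    """
    height, width = len(image), len(image[0])
    seen = set()
    total = 0
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Hashable key holding the exact pixel content of this tile.
            key = tuple(
                tuple(image[row][left:left + patch])
                for row in range(top, top + patch)
            )
            seen.add(key)
            total += 1
    return len(seen) / total

# Toy 4x4 "document" image: three blank 2x2 tiles and one filled tile.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
fraction = unique_patch_fraction(image, 2)
```

Here only two of the four tiles have distinct content, so `fraction` is `0.5`; a pruning step that keeps one copy of each distinct tile would pass half as many patches to the encoder.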

45. 【2604.00867】A 4D Representation for Training-Free Agentic Reasoning from Monocular Laparoscopic Video

Link: https://arxiv.org/abs/2604.00867

Authors: Maximilian Fehrentz, Nicolas Stellwag, Robert Wiebe, Nicole Thorisch, Fabian Grob, Patrick Remerscheid, Ken-Joel Simmoteit, Benjamin D. Killeen, Christian Heiliger, Nassir Navab

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: intelligent assistive systems, soft tissue surgery, autonomous robotics, fundamental capability, capability for artificial

Notes:

Abstract:Spatiotemporal reasoning is a fundamental capability for artificial intelligence (AI) in soft tissue surgery, paving the way for intelligent assistive systems and autonomous robotics. While 2D vision-language models show increasing promise at understanding surgical video, the spatial complexity of surgical scenes suggests that reasoning systems may benefit from explicit 4D representations. Here, we propose a framework for equipping surgical agents with spatiotemporal tools based on an explicit 4D representation, enabling AI systems to ground their natural language reasoning in both time and 3D space. Leveraging models for point tracking, depth, and segmentation, we develop a coherent 4D model with spatiotemporally consistent tool and tissue semantics. A Multimodal Large Language Model (MLLM) then acts as an agent on tools derived from the explicit 4D representation (e.g., trajectories) without any fine-tuning. We evaluate our method on a new dataset of 134 clinically relevant questions and find that the combination of a general-purpose reasoning backbone and our 4D representation significantly improves spatiotemporal understanding and allows for 4D grounding. We demonstrate that spatiotemporal intelligence can be "assembled" from 2D MLLMs and 3D computer vision models without additional training. Code, data, and examples are available at this https URL.

46. 【2604.00862】Shape Representation using Gaussian Process mixture models

Link: https://arxiv.org/abs/2604.00862

Authors: Panagiotis Sapoutzoglou, George Terzakis, Georgios Floros, Maria Pateraki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demand significant storage, fine geometric details, Traditional explicit, making functional representations, capture fine geometric

Notes: To appear in ISPRS 2026

Abstract:Traditional explicit 3D representations, such as point clouds and meshes, demand significant storage to capture fine geometric details and require complex indexing systems for surface lookups, making functional representations an efficient, compact, and continuous alternative. In this work, we propose a novel, object-specific functional shape representation that models surface geometry with Gaussian Process (GP) mixture models. Rather than relying on computationally heavy neural architectures, our method is lightweight, leveraging GPs to learn continuous directional distance fields from sparsely sampled point clouds. We capture complex topologies by anchoring local GP priors at strategic reference points, which can be flexibly extracted using any structural decomposition method (e.g. skeletonization, distance-based clustering). Extensive evaluations on the ShapeNetCore and IndustryShapes datasets demonstrate that our method can efficiently and accurately represent complex geometries.

47. 【2604.00857】Sparkle: A Robust and Versatile Representation for Point Cloud based Human Motion Capture

Link: https://arxiv.org/abs/2604.00857

Authors: Yiming Ren, Yujing Sun, Aoru Xue, Kwok-Yan Lam, Yuexin Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unstructured point clouds, point clouds remains, Point cloud-based motion, leverages rich spatial, capture leverages rich

Notes: Accepted at ICLR 2026

Abstract:Point cloud-based motion capture leverages rich spatial geometry and privacy-preserving sensing, but learning robust representations from noisy, unstructured point clouds remains challenging. Existing approaches face a difficult trade-off between point-based methods (geometrically detailed but noisy) and skeleton-based ones (robust but oversimplified). We address the fundamental challenge: how to construct an effective representation for human motion capture that can balance expressiveness and robustness. In this paper, we propose Sparkle, a structured representation unifying skeletal joints and surface anchors with explicit kinematic-geometric factorization. Our framework, SparkleMotion, learns this representation through hierarchical modules embedding geometric continuity and kinematic constraints. By explicitly disentangling internal kinematic structure from external surface geometry, SparkleMotion achieves state-of-the-art performance not only in accuracy but crucially in robustness and generalization under severe domain shifts, noise, and occlusion. Extensive experiments demonstrate the superiority of our approach across diverse sensor types and challenging real-world scenarios.

48. 【2604.00854】Perturb-and-Restore: Simulation-driven Structural Augmentation Framework for Imbalance Chromosomal Anomaly Detection

Link: https://arxiv.org/abs/2604.00854

Authors: Yilan Zhang, Hanbiao Chen, Changchun Yang, Yuetan Chu, Siyuan Chen, Jing Wu, Jingdong Hu, Na Li, Junkai Su, Yuxuan Chen, Ao Xu, Xin Gao, Aihua Yin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Detecting structural chromosomal, Detecting structural, genetic disorders, abnormalities is crucial, crucial for accurate

Notes: This preprint version of the manuscript has been submitted to the IEEE Journal of Biomedical and Health Informatics (JBHI) for review

Abstract:Detecting structural chromosomal abnormalities is crucial for accurate diagnosis and management of genetic disorders. However, collecting sufficient structural abnormality data is extremely challenging and costly in clinical practice, and not all abnormal types can be readily collected. As a result, deep learning approaches face significant performance degradation due to the severe imbalance and scarcity of abnormal chromosome data. To address this challenge, we propose Perturb-and-Restore (PR), a simulation-driven structural augmentation framework that effectively alleviates data imbalance in chromosome anomaly detection. The PR framework comprises two key components: (1) Structure Perturbation and Restoration Simulation, which generates synthetic abnormal chromosomes by perturbing chromosomal banding patterns of normal chromosomes followed by a restoration diffusion network that reconstructs continuous chromosome content and edges, thus eliminating reliance on rare abnormal samples; and (2) Energy-guided Adaptive Sampling, an energy score-based online selection strategy that dynamically prioritizes high-quality synthetic samples by referencing the energy distribution of real samples. To evaluate our method, we construct a comprehensive structural anomaly dataset consisting of over 260,000 chromosome images, including 4,242 abnormal samples spanning 24 categories. Experimental results demonstrate that the PR framework achieves state-of-the-art (SOTA) performance, surpassing existing methods with an average improvement of 8.92% in sensitivity, 8.89% in precision, and 13.79% in F1-score across all categories.

49. 【2604.00853】MotionGrounder: Grounded Multi-Object Motion Transfer via Diffusion Transformer

Link: https://arxiv.org/abs/2604.00853

Authors: Samuel Teodoro, Yun Chen, Agus Gunawan, Soo Ye Kim, Jihyong Oh, Munchurl Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: transferring temporal dynamics, transfer enables controllable, existing Diffusion Transformer, enables controllable video, Motion transfer enables

Notes: Please visit our project page at [this https URL](https://kaist-viclab.github.io/motiongrounder-site/)

Abstract:Motion transfer enables controllable video generation by transferring temporal dynamics from a reference video to synthesize a new video conditioned on a target caption. However, existing Diffusion Transformer (DiT)-based methods are limited to single-object videos, restricting fine-grained control in real-world scenes with multiple objects. In this work, we introduce MotionGrounder, the first DiT-based framework to handle motion transfer with multi-object controllability. Our Flow-based Motion Signal (FMS) in MotionGrounder provides a stable motion prior for target video generation, while our Object-Caption Alignment Loss (OCAL) grounds object captions to their corresponding spatial regions. We further propose a new Object Grounding Score (OGS), which jointly evaluates (i) spatial alignment between source video objects and their generated counterparts and (ii) semantic consistency between each generated object and its target caption. Our experiments show that MotionGrounder consistently outperforms recent baselines across quantitative, qualitative, and human evaluations.

50. 【2604.00849】Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation

Link: https://arxiv.org/abs/2604.00849

Authors: Shuang Li, Chao Deng, Hang Chen, Liqun Liu, Zhenyu Hu, Te Cao, Mengge Xue, Yuan Chen, Peng Shu, Huan Yu, Jie Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generation aims, subject, aims to preserve, text prompt, subject identity

Notes: Accepted by CVPR 2026 (Main)

Abstract:Subject-Driven Text-to-Image (T2I) Generation aims to preserve a subject's identity while editing its context based on a text prompt. A core challenge in this task is the "similarity-controllability paradox", where enhancing textual control often degrades the subject's fidelity, and vice versa. We argue this paradox stems from the ambiguous role of text prompts, which are often tasked with describing both the subject and the desired modifications, leading to conflicting signals for the model. To resolve this, we propose DisCo, a novel framework that first Disentangles and then re-Couples visual and textual information. First, our textual-visual decoupling module isolates the sources of information: subject identity is extracted exclusively from the reference image with the entity word of the subject, while the text prompt is simplified to contain only the modification command, in which the subject is referred to by general pronouns, eliminating descriptive ambiguity. However, this strict separation can lead to unnatural compositions between the subject and its contexts. We address this by designing a dedicated reward signal and using reinforcement learning to seamlessly recouple the visually-defined subject and the textually-generated context. Our approach effectively resolves the paradox, enabling simultaneous high-fidelity subject preservation and precise textual control. Extensive experiments demonstrate that our method achieves state-of-the-art performance, producing highly realistic and coherent images.

51. 【2604.00829】LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

Link: https://arxiv.org/abs/2604.00829

Authors: Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Adapting pretrained language, cross-modal interference introduced, Adapting pretrained, degrade their native, shift and cross-modal

Notes:

Abstract:Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers $\sim$10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.

52. 【2604.00827】Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction

Link: https://arxiv.org/abs/2604.00827

Authors: Patrick Glandorf, Thomas Norrenbrock, Bodo Rosenhahn

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Transformers, high computational costs, computational costs hinders, practical deployment, high computational

Notes: CVPR'26 Workshops

Abstract:Vision Transformers (ViTs) have demonstrated state-of-the-art performance in several benchmarks, yet their high computational costs hinder their practical deployment. Patch Pruning offers significant savings, but existing approaches restrict token reduction to deeper layers, leaving early-stage compression unexplored. This limits their potential for holistic efficiency. In this work, we present a novel Video Patch Pruning framework (VPP) that integrates temporal prior knowledge to enable efficient sparsity within early ViT layers. Our approach is motivated by the observation that prior features extracted from deeper layers exhibit strong foreground selectivity. Therefore, we propose a fully differentiable module for temporal mapping to accurately select the most relevant patches in early network stages. Notably, the proposed method enables a patch reduction of up to 60% in dense prediction tasks, exceeding the capabilities of conventional image-based patch pruning, which typically operates around a 30% patch sparsity. VPP excels in the high-sparsity regime, sustaining remarkable performance even when patch usage is reduced below 55%. Specifically, it preserves stable results with a maximal performance drop of 0.6% on the Youtube-VIS 2021 dataset.

53. 【2604.00820】Continual Vision-Language Learning for Remote Sensing: Benchmarking and Analysis

Link: https://arxiv.org/abs/2604.00820

Authors: Xingxing Weng, Ruifeng Ni, Chao Pang, XiangYu Hao, Yishan Wang, Xiaokang Zhang, Wei Xu, Gui-Song Xia

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrate impressive performance, static training data, accommodate continuously emerging, Current remote sensing, Current remote

Notes: 23 pages, 7 figures, 9 tables

Abstract:Current remote sensing vision-language models (RS VLMs) demonstrate impressive performance in image interpretation but rely on static training data, limiting their ability to accommodate continuously emerging sensing modalities and downstream tasks. This exposes a fundamental challenge: enabling RS VLMs to continually adapt without catastrophic forgetting. Despite its practical importance, the continual learning capability of RS VLMs remains underexplored, and no dedicated benchmark currently exists. In this work, we present CLeaRS, a comprehensive benchmark for continual vision-language learning in remote sensing. CLeaRS comprises 10 curated subsets with over 207k image-text pairs, spanning diverse interpretation tasks, sensing modalities, and application scenarios. We further define three evaluation protocols: long-horizon, modality-incremental, and task-incremental settings, to systematically assess continual adaptation. Extensive benchmarking of diverse vision-language models reveals catastrophic forgetting across all settings. Moreover, representative continual learning methods, when adapted to RS VLMs, exhibit limited effectiveness in handling task, instruction, and modality transitions. Our findings underscore the need for developing continual learning methods tailored to RS VLMs.

54. 【2604.00817】Multicentric thrombus segmentation using an attention-based recurrent network with gradual modality dropout

Link: https://arxiv.org/abs/2604.00817

Authors: Sofia Vargas-Ibarra, Vincent Vigneron, Hichem Maaref, Sonia Garcia-Salicetti

Categories: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

Keywords: ischemic stroke, data introduce domain shifts, Detecting and delineating, restriction on DWI

Notes:

Abstract:Detecting and delineating tiny targets in 3D brain scans is a central yet under-addressed challenge in medical this http URL ischemic stroke, for instance, the culprit thrombus is small, low-contrast, and variably expressed across modalities (e.g., susceptibility-weighted T2 blooming, diffusion restriction on DWI/ADC), while real-world multi-center data introduce domain shifts, anisotropy, and frequent missing sequences. We introduce a methodology that couples an attention-based recurrent segmentation network (UpAttLLSTM) with a training schedule that progressively increases the difficulty of hetero-modal learning through gradual modality dropout. UpAttLLSTM aggregates context across slices via recurrent units (2.5D) and uses attention gates to fuse complementary cues across available sequences, making it robust to anisotropy and class imbalance. Gradual modality dropout systematically simulates site heterogeneity, noise, and missing modalities during training, acting as both augmentation and regularization to improve multi-center generalization. On a monocentric cohort, our approach detects thrombi in 90% of cases with a Dice score of 0.65. In a multi-center setting with missing modalities, it achieves about 80% detection with a Dice score around 0.35. Beyond stroke, the proposed methodology directly transfers to other small-lesion tasks in 3D medical imaging where targets are scarce, subtle, and modality-dependent.
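
For readers unfamiliar with the Dice scores quoted in this abstract (0.65 and around 0.35), the sketch below computes the standard Dice coefficient, 2|P∩T| / (|P| + |T|), on flat binary masks. The function name, the flat-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the abstract does not say how the authors handle that edge case.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks of equal length.

    Dice = 2 * |P intersect T| / (|P| + |T|), ranging from 0 (no overlap)
    to 1 (identical masks).
    """
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if size == 0 else 2 * intersection / size

# Toy masks: the prediction and the ground truth each mark two voxels,
# but they agree on only one of them.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
score = dice_score(pred, target)
```

With one shared voxel out of two marked in each mask, `score` is `0.5`. For very small targets such as thrombi, a shift of a few voxels can remove most of the overlap, which is one reason Dice values for small lesions tend to be low.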

55. 【2604.00813】DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

Link: https://arxiv.org/abs/2604.00813

Authors: Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Hanbing Li, Long Chen, Zhi-Xin Yang, Jiwen Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: learning language descriptions, conventional paradigm based, autonomous driving, based on sparse, sparse perception

Notes: Code is available at [this https URL](https://github.com/wzzheng/DVGT)

Abstract:End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.

56. 【2604.00809】Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers

Link: https://arxiv.org/abs/2604.00809

Authors: Kawtar Zaher, Olivier Buisson, Alexis Joly

Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Keywords: Building on existing, iteratively retrieving images, user Relevance Feedback, existing approaches, consists of iteratively

Notes:

Abstract:Building on existing approaches, we revisit Human-in-the-Loop Object Retrieval, a task that consists of iteratively retrieving images containing objects of a class-of-interest, specified by a user-provided query. Starting from a large unlabeled image collection, the aim is to rapidly identify diverse instances of an object category relying solely on the initial query and the user's Relevance Feedback, with no prior labels. The retrieval process is formulated as a binary classification task, where the system continuously learns to distinguish between relevant and non-relevant images to the query, through iterative user interaction. This interaction is guided by an Active Learning loop: at each iteration, the system selects informative samples for user annotation, thereby refining the retrieval performance. This task is particularly challenging in multi-object datasets, where the object of interest may occupy only a small region of the image within a complex, cluttered scene. Unlike object-centered settings where global descriptors often suffice, multi-object images require more adapted, localized descriptors. In this work, we formulate and revisit the Human-in-the-Loop Object Retrieval task by leveraging pre-trained ViT representations, and addressing key design questions, including which object instances to consider in an image, what form the annotations should take, how Active Selection should be applied, and which representation strategies best capture the object's features. We compare several representation strategies across multi-object datasets highlighting trade-offs between capturing the global context and focusing on fine-grained local object details. Our results offer practical insights for the design of effective interactive retrieval pipelines based on Active Learning for object class retrieval.

57. 【2604.00804】Compact Keyframe-Optimized Multi-Agent Gaussian Splatting SLAM

Link: https://arxiv.org/abs/2604.00804

Authors: Monica M.Q. Li, Pierre-Yves Lajoie, Jialiang Liu, Giovanni Beltrame

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: hinder real-time exchange, constrained communication links, dense representations hinder, representations hinder real-time, robotic teams operating

Notes:

Abstract:Efficient multi-agent 3D mapping is essential for robotic teams operating in unknown environments, but dense representations hinder real-time exchange over constrained communication links. In multi-agent Simultaneous Localization and Mapping (SLAM), systems typically rely on a centralized server to merge and optimize the local maps produced by individual agents. However, sharing these large map representations, particularly those generated by recent methods such as Gaussian Splatting, becomes a bottleneck in real-world scenarios with limited bandwidth. We present an improved multi-agent RGB-D Gaussian Splatting SLAM framework that reduces communication load while preserving map fidelity. First, we incorporate a compaction step into our SLAM system to remove redundant 3D Gaussians without degrading the rendering quality. Second, our approach performs centralized loop closure computation without an initial guess, operating in two modes: a pure rendered-depth mode that requires no data beyond the 3D Gaussians, and a camera-depth mode that includes lightweight depth images for improved registration accuracy and additional Gaussian pruning. Evaluation on both synthetic and real-world datasets shows up to 85-95% reduction in transmitted data compared to state-of-the-art approaches in both modes, bringing 3D Gaussian multi-agent SLAM closer to practical deployment in real-world scenarios. Code: this https URL

58. 【2604.00799】Multimodal Language Models Cannot Spot Spatial Inconsistencies

Link: https://arxiv.org/abs/2604.00799

Authors: Om Khangaonkar, Hadi J. Rad, Hamed Pirsiavash

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: understand physical reality, Spatial consistency, fundamental property, key requirement, aim to understand

Notes:

Abstract:Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure. We hope our findings underscore the need for approaches that develop a more deeply grounded understanding of the physical world.

59. 【2604.00792】HICT: High-precision 3D CBCT reconstruction from a single X-ray

Link: https://arxiv.org/abs/2604.00792

Authors: Wen Ma, Jiaxiang Liu, Zikai Xiao, Ziyang Wang, Feng Yang, Zuozhu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high radiation dose, CBCT high radiation, dental imaging, treatment planning, limit its accessibility

Notes:

Abstract:Accurate 3D dental imaging is vital for diagnosis and treatment planning, yet CBCT's high radiation dose and cost limit its accessibility. Reconstructing 3D volumes from a single low-dose panoramic X-ray is a promising alternative but remains challenging due to geometric inconsistencies and limited accuracy. We propose HiCT, a two-stage framework that first generates geometrically consistent multi-view projections from a single panoramic image using a video diffusion model, and then reconstructs high-fidelity CBCT from the projections using a ray-based dynamic attention network and an X-ray sampling strategy. To support this, we built XCT, a large-scale dataset combining public CBCT data with 500 paired PX-CBCT cases. Extensive experiments show that HiCT achieves state-of-the-art performance, delivering accurate and geometrically consistent reconstructions for clinical use.

60. 【2604.00784】An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models

Link: https://arxiv.org/abs/2604.00784

Authors: Lennart Maack, Alexander Schlaefer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advancing Computer-Assisted Surgery, Computer-Assisted Surgery, crucial prerequisite, prerequisite for advancing, advancing Computer-Assisted

Notes:

Abstract:Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets fall short in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large-scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging due to costly manual annotations or error-prone generation using large language models. To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 7515 video clips densely extended with 150k fine-grained spatial-temporal question-answer samples. Our comprehensive evaluation shows that while state-of-the-art generalist VLMs struggle in zero-shot settings, their spatial-temporal capabilities can be improved through in-context learning. A VLM fine-tuned on the SurgSTU training dataset achieves the highest performance across all spatial-temporal tasks, validating the dataset's efficacy in improving the spatial-temporal understanding of VLMs in surgical videos. Code will be made publicly available.

61. 【2604.00779】Using predefined vector systems to speed up neural network multimillion class classification

Link: https://arxiv.org/abs/2604.00779

Authors: Nikita Gabdullin, Ilya Androsov

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Label prediction, neural networks, proposed method, label prediction complexity, Label

Notes: 12 pages, 2 figures, 3 tables, 2 algorithms, 1 theorem, 1 lemma

Abstract:Label prediction in neural networks (NNs) has O(n) complexity, proportional to the number of classes. This holds true for classification using fully connected layers and for cosine similarity with some set of class prototypes. In this paper we show that if the NN latent space (LS) geometry is known and possesses specific properties, label prediction complexity can be significantly reduced. This is achieved by associating label prediction with an O(1)-complexity closest cluster center search in a vector system used as the target for latent space configuration (LSC). The proposed method only requires finding the indexes of several largest and lowest values in the embedding vector, making it extremely computationally efficient. We show that the proposed method does not change NN training accuracy computational results. We also measure the time required by different computational stages of NN inference and label prediction on multiple datasets. The experiments show that the proposed method achieves up to 11.6 times overall acceleration over conventional methods. Furthermore, the proposed method has unique properties which allow it to predict the existence of new classes.

62. 【2604.00761】PrivHAR-Bench: A Graduated Privacy Benchmark Dataset for Video-Based Action Recognition

链接https://arxiv.org/abs/2604.00761

作者:Samar Ansari

类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

关键词:Existing research, typically evaluates methods, typically evaluates, binary paradigm, single privacy transformation

备注

点击查看摘要

Abstract:Existing research on privacy-preserving Human Activity Recognition (HAR) typically evaluates methods against a binary paradigm: clear video versus a single privacy transformation. This limits cross-method comparability and obscures the nuanced relationship between privacy strength and recognition utility. We introduce PrivHAR-Bench, a multi-tier benchmark dataset designed to standardize the evaluation of the Privacy-Utility Trade-off in video-based action recognition. PrivHAR-Bench applies a graduated spectrum of visual privacy transformations, from lightweight spatial obfuscation to cryptographic block permutation, to a curated subset of 15 activity classes selected for human articulation diversity. Each of the 1,932 source videos is distributed across 9 parallel tiers of increasing privacy strength, with additional background-removed variants to isolate the contribution of human motion features from contextual scene bias. We provide lossless frame sequences, per-frame bounding boxes, estimated pose keypoints with joint-level confidence scores, standardized group-based train/test splits, and an evaluation toolkit computing recognition accuracy and privacy metrics. Empirical validation using R3D-18 demonstrates a measurable and interpretable degradation curve across tiers, with within-tier accuracy declining from 88.8% (clear) to 53.5% (encrypted, background-removed) and cross-domain accuracy collapsing to 4.8%, establishing PrivHAR-Bench as a controlled benchmark for comparing privacy-preserving HAR methods under standardized conditions. The dataset, generation pipeline, and evaluation code are publicly available.

63. 【2604.00757】IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

链接https://arxiv.org/abs/2604.00757

作者:Dong-Jae Lee,Sunghyun Baek,Junmo Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Vision Language, Vision Language Models, Language Models show, Models show impressive, Large Vision

备注

点击查看摘要

Abstract:Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training free token pruning framework grounded in the dual form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank 1 outer products, each generated by a single token's key value pair. Token pruning thus reduces to selecting an optimal subset of these rank 1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token's information magnitude and information duplication. To efficiently select the subset with the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance. Extensive experiments demonstrate that our method achieves a better trade off between performance and efficiency, while providing another perspective on existing pruning approaches.
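
The dual-form view described in this abstract can be illustrated with a minimal sketch. It assumes plain unnormalized linear attention (not the softmax extension or the pruning metric from the paper), and every function name here is hypothetical. It checks that attention over a set of tokens equals applying an implicit weight matrix built as a sum of rank-1 outer products, and that dropping a token removes exactly one rank-1 term.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_attention(keys, values, query):
    """Unnormalized linear attention: sum_i (k_i . q) * v_i."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = dot(k, query)
        out = [o + score * vi for o, vi in zip(out, v)]
    return out

def implicit_weight(keys, values, keep=None):
    """Dual form: W = sum_i outer(v_i, k_i), optionally over a kept token subset."""
    kept = range(len(keys)) if keep is None else keep
    w = [[0.0] * len(keys[0]) for _ in range(len(values[0]))]
    for i in kept:
        for r, vi in enumerate(values[i]):
            for c, kj in enumerate(keys[i]):
                w[r][c] += vi * kj
    return w

def apply_weight(w, query):
    """Apply the implicit linear layer to a query vector."""
    return [dot(row, query) for row in w]

def frobenius_gap(w_a, w_b):
    """Frobenius norm of the difference between two weight matrices."""
    return sum((a - b) ** 2 for ra, rb in zip(w_a, w_b) for a, b in zip(ra, rb)) ** 0.5

# Toy data (illustrative values only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]]
query = [2.0, -1.0]

w_full = implicit_weight(keys, values)
w_pruned = implicit_weight(keys, values, keep=[0, 1])  # prune the third token
gap = frobenius_gap(w_full, w_pruned)
```

In this toy setting, choosing which tokens to prune amounts to choosing which rank-1 terms to drop while keeping `gap` small; the paper's actual selection criterion is not reproduced here.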

64. 【2604.00725】A Benchmark of State-Space Models vs. Transformers and BiLSTM-based Models for Historical Newspaper OCR

链接https://arxiv.org/abs/2604.00725

作者:Merveilles Agbeti-messan,Thierry Paquet,Clément Chatelain,Pierrick Tranouez,Stéphane Nicolas

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:handle long text, long text sequences, newspapers remains challenging, degraded print quality, remains challenging

备注

点击查看摘要

Abstract:End-to-end OCR for historical newspapers remains challenging, as models must handle long text sequences, degraded print quality, and complex layouts. While Transformer-based recognizers dominate current research, their quadratic complexity limits efficient paragraph-level transcription and large-scale deployment. We investigate linear-time State-Space Models (SSMs), specifically Mamba, as a scalable alternative to Transformer-based sequence modeling for OCR. We present, to our knowledge, the first OCR architecture based on SSMs, combining a CNN visual encoder with bi-directional and autoregressive Mamba sequence modeling, and conduct a large-scale benchmark comparing SSMs with Transformer- and BiLSTM-based recognizers. Multiple decoding strategies (CTC, autoregressive, and non-autoregressive) are evaluated under identical training conditions alongside strong neural baselines (VAN, DAN, DANIEL) and widely used off-the-shelf OCR engines (PERO-OCR, Tesseract OCR, TrOCR, Gemini). Experiments on historical newspapers from the Bibliothèque nationale du Luxembourg, with newly released 99% verified gold-standard annotations, and cross-dataset tests on Fraktur and Antiqua lines, show that all neural models achieve low error rates (~2% CER), making computational efficiency the main differentiator. Mamba-based models maintain competitive accuracy while halving inference time and exhibiting superior memory scaling (1.26x vs 2.30x growth at 1000 chars), reaching 6.07% CER at the severely degraded paragraph level compared to 5.24% for DAN, while remaining 2.05x faster. We release code, trained models, and standardized evaluation protocols to enable reproducible research and guide practitioners in large-scale cultural heritage OCR.

65. 【2604.00696】TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning

链接https://arxiv.org/abs/2604.00696

作者:Soumya Shamarao Jahagirdar,Edson Araujo,Anna Kukleva,M. Jehanzeb Mirza,Saurabhchand Bhati,Samuel Thomas,Brian Kingsbury,Rogerio Feris,James R. Glass,Hilde Kuehne

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:shown strong results, Recent video reasoning, multi-stage training pipelines, Recent video, making them costly

备注

点击查看摘要

Abstract:Recent video reasoning models have shown strong results on temporal and multimodal understanding, yet they depend on large-scale supervised data and multi-stage training pipelines, making them costly to train and difficult to adapt to new domains. In this work, we leverage the paradigm of Test-Time Reinforcement Learning on video-language data to adapt a pretrained model to incoming video samples at test time without explicit labels. The proposed test-time adaptation for video approach (TTA-Vid) combines two components that work simultaneously. The first is a test-time adaptation step that performs step-by-step reasoning at inference time on multiple frame subsets and uses a batch-aware frequency-based reward computed across the different frame subsets as pseudo ground truth to update the model. We show that the resulting model, trained on a single batch or even a single sample from a dataset, is able to generalize at test time to the whole dataset and even across datasets. Because the adaptation occurs entirely at test time, our method requires no ground-truth annotations or dedicated training splits. The second is a multi-armed bandit strategy for adaptive frame selection that learns to prioritize informative frames, guided by the same reward formulation. Our evaluation shows that TTA-Vid yields consistent improvements across various video reasoning tasks and is able to outperform current state-of-the-art methods trained on large-scale data. This highlights the potential of test-time reinforcement learning for temporal multimodal understanding.
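
One plausible reading of a frequency-based reward used as pseudo ground truth is a majority-vote scheme: the model answers the same question from several frame subsets, the most frequent answer becomes the pseudo label, and each answer is rewarded by how often it occurs. The sketch below assumes that reading. The abstract does not specify the "batch-aware" part, so it is not modeled, and the function name is hypothetical.

```python
from collections import Counter

def frequency_rewards(answers):
    """Return (pseudo_label, rewards) for answers sampled from different frame subsets.

    Assumption: reward for an answer = its relative frequency among all sampled answers.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    total = len(answers)
    pseudo_label = counts.most_common(1)[0][0]  # most frequent answer
    rewards = [counts[a] / total for a in answers]
    return pseudo_label, rewards

# Four hypothetical answers to one multiple-choice question, one per frame subset.
label, rewards = frequency_rewards(["B", "B", "A", "B"])
```

Here `label` is `"B"` and the three agreeing answers each receive a reward of 0.75, while the dissenting answer receives 0.25.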

66. 【2604.00684】TP-Seg: Task-Prototype Framework for Unified Medical Lesion Segmentation

链接https://arxiv.org/abs/2604.00684

作者:Jiawei Xu,Qiangqiang Zhou,Dandan Zhu,Yong Chen,Yugen Yi,Xiaoqi Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:medical lesion segmentation, efficiently handle diverse, AI-assisted diagnosis, lesion segmentation, single set

备注

点击查看摘要

Abstract:Building a unified model with a single set of parameters to efficiently handle diverse types of medical lesion segmentation has become a crucial objective for AI-assisted diagnosis. Existing unified segmentation approaches typically rely on shared encoders across heterogeneous tasks and modalities, which often leads to feature entanglement, gradient interference, and suboptimal lesion discrimination. In this work, we propose TP-Seg, a task-prototype framework for unified medical lesion segmentation. On one hand, the task-conditioned adapter effectively balances shared and task-specific representations through a dual-path expert structure, enabling adaptive feature extraction across diverse medical imaging modalities and lesion types. On the other hand, the prototype-guided task decoder introduces learnable task prototypes as semantic anchors and employs a cross-attention mechanism to achieve fine-grained modeling of task-specific foreground and background semantics. Without bells and whistles, TP-Seg consistently outperforms specialized, general and unified segmentation methods across 8 different medical lesion segmentation tasks covering multiple imaging modalities, demonstrating strong generalization, scalability and clinical applicability.

67. 【2604.00682】MoonAnything: A Vision Benchmark with Large-Scale Lunar Supervised Data

链接https://arxiv.org/abs/2604.00682

作者:Clémentine Grethen,Yuang Shi,Simone Gasparini,Géraldine Morin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:lunar exploration missions, modern lunar exploration, Accurate perception, exploration missions, surfaces is critical

备注: Accepted to ACM MMSys 2026

点击查看摘要

Abstract:Accurate perception of lunar surfaces is critical for modern lunar exploration missions. However, developing robust learning-based perception systems is hindered by the lack of datasets that provide both geometric and photometric supervision. Existing lunar datasets typically lack either geometric ground truth, photometric realism, illumination diversity, or large-scale coverage. In this paper, we introduce MoonAnything, a unified benchmark built on real lunar topography with physically-based rendering, providing the first comprehensive geometric and photometric supervision under diverse illumination with large scale. The benchmark comprises two complementary sub-datasets : i) LunarGeo provides stereo images with corresponding dense depth maps and camera calibration enabling 3D reconstruction and pose estimation; ii) LunarPhoto provides photorealistic images using a spatially-varying BRDF model, along with multi-illumination renderings under real solar configurations, enabling reflectance estimation and illumination-robust perception. Together, these datasets offer over 130K samples with comprehensive supervision. Beyond lunar applications, MoonAnything offers a unique setting and challenging testbed for algorithms under low-textured, high-contrast conditions and applies to other airless celestial bodies and could generalize beyond. We establish baselines using state-of-the-art methods and release the complete dataset along with generation tools to support community extension: this https URL.

68. 【2604.00677】CL-VISTA: Benchmarking Continual Learning in Video Large Language Models

链接https://arxiv.org/abs/2604.00677

作者:Haiyang Guo,Yichen Shi,Fei Zhu,Wenzhuo Liu,Hongbo Zhao,Fanhu Zeng,Shijie Ma,Da-Han Wang,Xu-Yao Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Video Large Language, Large Language, non-stationary real-world data, Language Models

备注: Preprint

点击查看摘要

Abstract:Video Large Language Models (Video-LLMs) require continual learning to adapt to non-stationary real-world data. However, existing benchmarks fall short of evaluating modern foundation models: many still rely on models without large-scale pre-training, and prevailing benchmarks typically partition a single dataset into sub-tasks, resulting in high task redundancy and negligible forgetting on pre-trained Video-LLMs. To address these limitations, we propose CL-VISTA, a benchmark tailored for continual video understanding of Video-LLMs. By curating 8 diverse tasks spanning perception, understanding, and reasoning, CL-VISTA induces substantial distribution shifts that effectively expose catastrophic forgetting. To systematically assess CL methods, we establish a comprehensive evaluation framework comprising 6 distinct protocols across 3 critical dimensions: performance, computational efficiency, and memory footprint. Notably, the performance dimension incorporates a general video understanding assessment to assess whether CL methods genuinely enhance foundational intelligence or merely induce task-specific overfitting. Extensive benchmarking of 10 mainstream CL methods reveals a fundamental trade-off: no single approach achieves universal superiority across all dimensions. Methods that successfully mitigate catastrophic forgetting tend to compromise generalization or incur prohibitive computational and memory overheads. We hope CL-VISTA provides critical insights for advancing continual learning in multimodal foundation models.

69. 【2604.00651】When AI and Experts Agree on Error: Intrinsic Ambiguity in Dermatoscopic Images

链接https://arxiv.org/abs/2604.00651

作者:Loris Cino,Pier Luigi Mazzeo,Alessandro Martella,Giulia Radi,Renato Rossi,Cosimo Distante

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Convolutional Neural Networks, Neural Networks, Convolutional Neural, substantial clinical potential, dermatological diagnosis demonstrates

备注

点击查看摘要

Abstract:The integration of artificial intelligence (AI), particularly Convolutional Neural Networks (CNNs), into dermatological diagnosis demonstrates substantial clinical potential. While existing literature predominantly benchmarks algorithmic performance against human experts, our study adopts a novel perspective by investigating the intrinsic complexity of dermatoscopic images. Through rigorous experimentation with multiple CNN architectures, we isolated a subset of images systematically misclassified across all models, a phenomenon statistically proven to exceed random chance. To determine if these failures stem from algorithmic biases or inherent visual ambiguity, expert dermatologists independently evaluated these challenging cases alongside a control group. The results revealed a collapse in human diagnostic performance on the AI-misclassified images. First, agreement with ground-truth labels plummeted, with Cohen's kappa dropping to a mere 0.08 for the difficult images, compared to 0.61 for the control group. Second, we observed a severe deterioration in expert consensus; inter-rater reliability among physicians fell from moderate concordance (Fleiss kappa = 0.456) on control images to only modest agreement (Fleiss kappa = 0.275) on difficult cases. We identified image quality as a primary driver of these dual systematic failures. To promote transparency and reproducibility, all data, code, and trained models have been made publicly available.
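
For readers unfamiliar with the agreement statistic quoted here, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a generic implementation of that standard formula. The labels are made up for illustration and are not the study's data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal frequencies, summed over classes.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b))
    if p_e == 1.0:
        return 1.0  # both raters used a single identical class; kappa is undefined, return 1 by convention
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical example: ground truth vs. one rater on four images.
truth = ["nevus", "nevus", "melanoma", "melanoma"]
rater = ["nevus", "melanoma", "melanoma", "melanoma"]
kappa = cohen_kappa(truth, rater)
```

In this toy case p_o = 0.75 and p_e = 0.5, giving kappa = 0.5. A value near 0, such as the 0.08 reported for the difficult images, means agreement is barely better than chance.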

70. 【2604.00648】DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization

链接https://arxiv.org/abs/2604.00648

作者:Zhengxian Yang,Fei Xie,Xutao Xue,Rui Zhang,Taicheng Huang,Yang Liu,Mengqi Ji,Tao Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, high-fidelity rendering, enabled efficient, greatly advancing, Splatting

备注: CVPR 2026

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) Black borders at image edges cause information loss and negate the fisheye's large FOV advantage; 2) Undistortion's stretch-and-interpolate resampling spreads each pixel's value over a larger area, diluting detail density and causing 3DGS to overfit these low-frequency zones, producing blur and floating artifacts. In this work, we integrate a fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: distortion increases toward the periphery, and 3DGS's original per-iteration random view selection ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap-driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views, a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets.

71. 【2604.00634】LiPS: Lightweight Panoptic Segmentation for Resource-Constrained Robotics

链接https://arxiv.org/abs/2604.00634

作者:Calvin Galagain,Martyna Poreba,François Goulette,Cyrill Stachniss

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:unifies semantic understanding, object-level reasoning, key enabler, unifies semantic, semantic understanding

备注: Submitted to IEEE ICIP 2026. Under review

点击查看摘要

Abstract:Panoptic segmentation is a key enabler for robotic perception, as it unifies semantic understanding with object-level reasoning. However, the increasing complexity of state-of-the-art models makes them unsuitable for deployment on resource-constrained platforms such as mobile robots. We propose a novel approach called LiPS that addresses the challenge of efficient-to-compute panoptic segmentation with a lightweight design that retains query-based decoding while introducing a streamlined feature extraction and fusion pathway. It aims at providing strong panoptic segmentation performance while substantially lowering the computational demands. Evaluations on standard benchmarks demonstrate that LiPS attains accuracy comparable to much heavier baselines, while providing up to 4.5 times higher throughput, measured in frames per second, and requiring nearly 6.8 times fewer computations. This efficiency makes LiPS a highly relevant bridge between modern panoptic models and real-world robotic applications.

72. 【2604.00609】TALENT: Target-aware Efficient Tuning for Referring Image Segmentation

链接https://arxiv.org/abs/2604.00609

作者:Shuo Jin,Siyue Yu,Bingfeng Zhang,Chao Yao,Meiqin Liu,Jimin Xiao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Referring image segmentation, image segmentation aims, Referring image, natural text expression, segment specific targets

备注: Accepted by CVPR26 Findings

点击查看摘要

Abstract:Referring image segmentation aims to segment specific targets based on a natural text expression. Recently, parameter-efficient tuning (PET) has emerged as a promising paradigm. However, existing PET-based methods often suffer from the fact that visual features can't emphasize the text-referred target instance but activate co-category yet unrelated objects. We analyze and quantify this problem, terming it the 'non-target activation' (NTA) issue. To address this, we propose a novel framework, TALENT, which utilizes target-aware efficient tuning for PET-based RIS. Specifically, we first propose a Rectified Cost Aggregator (RCA) to efficiently aggregate text-referred features. Then, to calibrate 'NTA' into accurate target activation, we adopt a Target-aware Learning Mechanism (TLM), including contextual pairwise consistency learning and target-centric contrastive learning. The former uses the sentence-level text feature to achieve a holistic understanding of the referent and constructs a text-referred affinity map to optimize the semantic association of visual features. The latter further enhances target localization to discover the distinct instance while suppressing associations with other unrelated ones. The two objectives work in concert and address 'NTA' effectively. Extensive evaluations show that TALENT outperforms existing methods across various metrics (e.g., 2.5% mIoU gains on G-Ref val set). Our codes will be released at: this https URL.

73. 【2604.00605】Fluently Lying: Adversarial Robustness Can Be Substrate-Dependent

链接https://arxiv.org/abs/2604.00605

作者:Daye Kang,Hyeongboo Baek

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:detection count drops, adversarial attack assume, drops in tandem, primary tools, monitor and defend

备注: 14 pages, 4 figures, 3 tables

点击查看摘要

Abstract:The primary tools used to monitor and defend object detectors under adversarial attack assume that when accuracy degrades, detection count drops in tandem. This coupling was assumed, not measured. We report a counterexample observed on a single model: under standard PGD, EMS-YOLO, a spiking neural network (SNN) object detector, retains more than 70% of its detections while mAP collapses from 0.528 to 0.042. We term this count-preserving accuracy collapse Quality Corruption (QC), to distinguish it from the suppression that dominates untargeted evaluation. Across four SNN architectures and two threat models (l-infinity and l-2), QC appears only in one of the four detectors tested (EMS-YOLO). On this model, all five standard defense components fail to detect or mitigate QC, suggesting the defense ecosystem may rely on a shared assumption calibrated on a single substrate. These results provide, to our knowledge, the first evidence that adversarial failure modes can be substrate-dependent.

74. 【2604.00601】KG-CMI: Knowledge graph enhanced cross-Mamba interaction for medical visual question answering

链接https://arxiv.org/abs/2604.00601

作者:Xianyao Zheng,Hong Yu,Hui Cui,Changming Sun,Xiangyu Li,Ran Su,Leyi Wei,Jia Zhou,Junbo Wang,Qiangguo Jin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:crucial multimodal task, clinical decision support, visual question answering, support and telemedicine, Medical visual question

备注

点击查看摘要

Abstract:Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent methods fail to fully leverage domain-specific medical knowledge, making it difficult to accurately associate lesion features in medical images with key diagnostic criteria. Additionally, classification-based approaches typically rely on predefined answer sets. Treating Med-VQA as a simple classification problem limits its ability to adapt to the diversity of free-form answers and may overlook detailed semantic information in those answers. To address these challenges, we propose a knowledge graph enhanced cross-Mamba interaction (KG-CMI) framework, which consists of a fine-grained cross-modal feature alignment (FCFA) module, a knowledge graph embedding (KGE) module, a cross-modal interaction representation (CMIR) module, and a free-form answer enhanced multi-task learning (FAMT) module. The KG-CMI learns cross-modal feature representations for images and texts by effectively integrating professional medical knowledge through a graph, establishing associations between lesion features and disease knowledge. Moreover, FAMT leverages auxiliary knowledge from open-ended questions, improving the model's capability for open-ended Med-VQA. Experimental results demonstrate that KG-CMI outperforms existing state-of-the-art methods on three Med-VQA datasets, i.e., VQA-RAD, SLAKE, and OVQA. Additionally, we conduct interpretability experiments to further validate the framework's effectiveness.

75. 【2604.00597】Towards Viewpoint-Robust End-to-End Autonomous Driving with 3D Foundation Model Priors

链接https://arxiv.org/abs/2604.00597

作者:Hiroki Hashimoto,Hiromichi Goto,Hiroyuki Sugai,Hiroshi Kera,Kazuhiko Kawamoto

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Robust trajectory planning, Robust trajectory, autonomous driving, important for scalable, trajectory planning

备注: Accepted at CVPR Workshop on Simulation for Autonomous Driving 2026

点击查看摘要

Abstract:Robust trajectory planning under camera viewpoint changes is important for scalable end-to-end autonomous driving. However, existing models often depend heavily on the camera viewpoints seen during training. We investigate an augmentation-free approach that leverages geometric priors from a 3D foundation model. The method injects per-pixel 3D positions derived from depth estimates as positional embeddings and fuses intermediate geometric features through cross-attention. Experiments on the VR-Drive camera viewpoint perturbation benchmark show reduced performance degradation under most perturbation conditions, with clear improvements under pitch and height perturbations. Gains under longitudinal translation are smaller, suggesting that more viewpoint-agnostic integration is needed for robustness to camera viewpoint changes.

76. 【2604.00592】HarassGuard: Detecting Harassment Behaviors in Social Virtual Reality with Vision-Language Models

链接https://arxiv.org/abs/2604.00592

作者:Junhee Lee,Minseok Kim,Hwanjo Heo,Seungwon Woo,Jinwoo Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:Social Virtual Reality, Virtual Reality, platforms provide immersive, provide immersive social, immersive social experiences

备注: To appear in the 2026 TVCG Special Issue on the 2026 IEEE Conference on Virtual Reality and 3D User Interfaces (VR)

点击查看摘要

Abstract:Social Virtual Reality (VR) platforms provide immersive social experiences but also expose users to serious risks of online harassment. Existing safety measures are largely reactive, while proactive solutions that detect harassment behavior during an incident often depend on sensitive biometric data, raising privacy concerns. In this paper, we present HarassGuard, a vision-language model (VLM) based system that detects physical harassment in social VR using only visual input. We construct an IRB-approved harassment vision dataset, apply prompt engineering, and fine-tune VLMs to detect harassment behavior by considering contextual information in social VR. Experimental results demonstrate that HarassGuard achieves competitive performance compared to state-of-the-art baselines (i.e., LSTM/CNN, Transformer), reaching an accuracy of up to 88.09% in binary classification and 68.85% in multi-class classification. Notably, HarassGuard matches these baselines while using significantly fewer fine-tuning samples (200 vs. 1,115), offering unique advantages in contextual reasoning and privacy-preserving detection.

77. 【2604.00559】FecalFed: Privacy-Preserving Poultry Disease Detection via Federated Learning

链接https://arxiv.org/abs/2604.00559

作者:Tien-Yu Chi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:global food security, Early detection, pathogenic avian influenza, food security, critical for global

备注: Accepted to the CVPR 2026 Workshop on Vision for Agriculture

点击查看摘要

Abstract:Early detection of highly pathogenic avian influenza (HPAI) and endemic poultry diseases is critical for global food security. While computer vision models excel at classifying diseases from fecal imaging, deploying these systems at scale is bottlenecked by farm data privacy concerns and institutional data silos. Furthermore, existing open-source agricultural datasets frequently suffer from severe, undocumented data contamination. In this paper, we introduce FecalFed, a privacy-preserving federated learning framework for poultry disease classification. We first curate and release poultry-fecal-fl, a rigorously deduplicated dataset of 8,770 unique images across four disease classes, revealing and eliminating a 46.89% duplication rate in popular public repositories. To simulate realistic agricultural environments, we evaluate FecalFed under highly heterogeneous, non-IID conditions (Dirichlet α = 0.5). While isolated single-farm training collapses under this data heterogeneity, yielding only 64.86% accuracy, our federated approach recovers performance without centralizing sensitive data. Specifically, utilizing server-side adaptive optimization (FedAdam) with a Swin-Small architecture achieves 90.31% accuracy, closely approaching the centralized upper bound of 95.10%. Furthermore, we demonstrate that an edge-optimized Swin-Tiny model maintains highly competitive performance at 89.74%, establishing a highly efficient, privacy-first blueprint for on-farm avian disease monitoring.
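
The non-IID setting mentioned here (a Dirichlet split with concentration 0.5) is commonly simulated by drawing, for each class, a Dirichlet-distributed share per client and dividing that class's samples accordingly. The sketch below follows that common recipe using only the standard library. It is an assumption about the general technique, not the paper's code, and the function and variable names are hypothetical.

```python
import random

def dirichlet_partition(labels, num_clients, alpha=0.5, seed=0):
    """Split sample indices across clients with per-class Dirichlet(alpha) proportions."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of independent Gamma(alpha, 1) draws, normalized.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas) or 1.0
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += gammas[c] / total
            # The last client takes the remainder so every index is assigned exactly once.
            end = len(idxs) if c == num_clients - 1 else round(cum * len(idxs))
            end = min(max(end, start), len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients

# 200 synthetic samples over four classes, split across five simulated farms.
labels = [i % 4 for i in range(200)]
parts = dirichlet_partition(labels, num_clients=5, alpha=0.5, seed=1)
```

Smaller values of `alpha` concentrate each class on fewer clients, which is what makes isolated per-client training harder in such simulations.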

78. 【2604.00558】STAR: Mitigating Cascading Errors in Spatial Reasoning via Turn-point Alignment and Segment-level DPO

链接https://arxiv.org/abs/2604.00558

作者:Pukun Zhao,Longxiang Wang,Chen Chen,Peicheng Wang,Fanqing Zhou,Runze Li,Haojian Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Large Language, benchmark for Large, Structured spatial navigation, Structured spatial

备注: 9 pages, 6 figures, 4 tables, Accepted by ICME 2026

点击查看摘要

Abstract:Structured spatial navigation is a core benchmark for Large Language Models (LLMs) spatial reasoning. Existing paradigms like Visualization-of-Thought (VoT) are prone to cascading errors in complex topologies. To solve this, we propose STAR, a two-stage framework grounded on topological anchors, and introduce the RedMaze-23K dataset with human-inspired turnpoint annotations. The first stage uses supervised fine-tuning to help models internalize spatial semantics and prune redundant paths. The second adopts Spatial-aware Segment-level Direct Preference Optimization (SDPO) to refine self-correction in long-horizon navigation. Experiments show STAR achieves state-of-the-art performance among open-source models: its 32B variant outperforms DeepSeek-V3 (29.27% vs. 25.00%) and reaches 82.4% of GPT-4's performance.

79. 【2604.00557】Multi-Camera View Scaling for Data-Efficient Robot Imitation Learning

链接https://arxiv.org/abs/2604.00557

作者:Yichen Xie,Yixiao Wang,Shuqi Zhao,Cheng-En Wu,Masayoshi Tomizuka,Jianwen Xie,Hao-Shu Fang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:difficult in practice, fundamentally constrained, varied environments, environments is costly, costly and difficult

备注

点击查看摘要

Abstract:The generalization ability of imitation learning policies for robotic manipulation is fundamentally constrained by the diversity of expert demonstrations, while collecting demonstrations across varied environments is costly and difficult in practice. In this paper, we propose a practical framework that exploits inherent scene diversity without additional human effort by scaling camera views during demonstration collection. Instead of acquiring more trajectories, multiple synchronized camera perspectives are used to generate pseudo-demonstrations from each expert trajectory, which enriches the training distribution and improves viewpoint invariance in visual representations. We analyze how different action spaces interact with view scaling and show that camera-space representations further enhance diversity. In addition, we introduce a multiview action aggregation method that allows single-view policies to benefit from multiple cameras during deployment. Extensive experiments in simulation and real-world manipulation tasks demonstrate significant gains in data efficiency and generalization compared to single-view baselines. Our results suggest that scaling camera views provides a practical and scalable solution for imitation learning, which requires minimal additional hardware setup and integrates seamlessly with existing imitation learning algorithms. The website of our project is this https URL.

80. 【2604.00549】TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection

Link: https://arxiv.org/abs/2604.00549

Authors: Zhijin He, Shuo Jin, Siyue Yu, Shuwei Wu, Bingfeng Zhang, Li Yu, Jimin Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Co-salient Object Detection, Co-salient Object, Object Detection, segment salient objects, Vision Foundation Models

Comments: Accepted by CVPR26

Abstract:Co-salient Object Detection (CoSOD) aims to segment salient objects that consistently appear across a group of related images. Despite the notable progress achieved by recent training-based approaches, they remain constrained by closed-set datasets and exhibit limited generalization. Meanwhile, few studies explore the potential of Vision Foundation Models (VFMs), which demonstrate strong generalization ability and robust saliency understanding, to address CoSOD. In this paper, we investigate and leverage VFMs for CoSOD, and further propose a novel training-free method, TF-SSD, through the synergy between SAM and DINO. Specifically, we first utilize SAM to generate comprehensive raw proposals, which serve as a candidate mask pool. Then, we introduce a quality mask generator to filter out redundant masks, thereby acquiring a refined mask set. Since this generator is built upon SAM, it inherently lacks semantic understanding of saliency. To address this, we adopt an intra-image saliency filter that employs DINO's attention maps to identify visually salient masks within individual images. Moreover, to extend saliency understanding across group images, we propose an inter-image prototype selector, which computes similarity scores among cross-image prototypes to select masks with the highest score. These selected masks serve as final predictions for CoSOD. Extensive experiments show that our TF-SSD outperforms existing methods (e.g., 13.7% gains over the recent training-free method). Codes are available at this https URL.

81. 【2604.00548】Reliev3R: Relieving Feed-forward Reconstruction from Multi-View Geometric Annotations

Link: https://arxiv.org/abs/2604.00548

Authors: Youyu Chen, Junjun Jiang, Yueru Luo, Kui Jiang, Xianming Liu, Xu Yan, Dave Zhenyu Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multiple downstream tasks, demonstrated great potential, Feed-forward Reconstruction Models, Feed-forward Reconstruction, recent advances

Comments: Accepted by CVPR2026

Abstract:With recent advances, Feed-forward Reconstruction Models (FFRMs) have demonstrated great potential in reconstruction quality and adaptiveness to multiple downstream tasks. However, the excessive reliance on multi-view geometric annotations, e.g., 3D point maps and camera poses, makes the fully-supervised training scheme of FFRMs difficult to scale up. In this paper, we propose Reliev3R, a weakly-supervised paradigm for training FFRMs from scratch without cost-prohibitive multi-view geometric annotations. Relieving the reliance on geometric sensory data and compute-exhaustive structure-from-motion preprocessing, our method draws 3D knowledge directly from monocular relative depths and image sparse correspondences given by zero-shot predictions of pretrained models. At the core of Reliev3R, we design an ambiguity-aware relative depth loss and a trigonometry-based reprojection loss to facilitate supervision for multi-view geometric consistency. Training from scratch with less data, Reliev3R catches up with its fully-supervised sibling models, taking a step towards low-cost 3D reconstruction supervision and scalable FFRMs.

82. 【2604.00545】Neuropsychiatric Deviations From Normative Profiles: An MRI-Derived Marker for Early Alzheimer's Disease Detection

Link: https://arxiv.org/abs/2604.00545

Authors: Synne Hjertager Osenbroch, Lisa Ramona Rosvold, Yao Lu, Alvaro Fernandez-Quilez

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precede cognitive decline, cognitive decline, Alzheimer Disease Neuroimaging, Alzheimer disease, depression and apathy

Comments: Accepted and to be presented (ORAL) in ISBI 2026

Abstract:Neuropsychiatric symptoms (NPS) such as depression and apathy are common in Alzheimer's disease (AD) and often precede cognitive decline. NPS assessments hold promise as early detection markers due to their correlation with disease progression and their non-invasive nature. Yet current tools cannot distinguish whether NPS are part of aging or early signs of AD, limiting their utility. We present a deep learning-based normative modelling framework to identify atypical NPS burden from structural MRI. A 3D convolutional neural network was trained on cognitively stable participants from the Alzheimer's Disease Neuroimaging Initiative, learning the mapping between brain anatomy and Neuropsychiatric Inventory Questionnaire (NPIQ) scores. Deviations between predicted and observed scores defined the Divergence from NPIQ scores (DNPI). Higher DNPI was associated with future AD conversion (adjusted OR=2.5; p < 0.01) and achieved predictive accuracy comparable to cerebrospinal fluid AB42 (AUC=0.74 vs 0.75). Our approach supports scalable, non-invasive strategies for early AD detection.

83. 【2604.00538】TRiGS: Temporal Rigid-Body Motion for Scalable 4D Gaussian Splatting

Link: https://arxiv.org/abs/2604.00538

Authors: Suwoong Yeom, Joonsik Nam, Seunggyu Choi, Lucas Yunkyu Lee, Sangmin Kim, Jaesik Park, Joonsoo Kim, Kugjin Yun, Kyeongbo Kong, Sukju Kang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: piecewise linear velocity, impressive dynamic scene, dynamic scene reconstruction, Gaussian Splatting, short temporal windows

Comments: Project page: [this https URL](https://wwwjjn.github.io/TRiGS-project_page/)

Abstract:Recent 4D Gaussian Splatting (4DGS) methods achieve impressive dynamic scene reconstruction but often rely on piecewise linear velocity approximations and short temporal windows. This disjointed modeling leads to severe temporal fragmentation, forcing primitives to be repeatedly eliminated and regenerated to track complex nonlinear dynamics. This makeshift approximation eliminates the long-term temporal identity of objects and causes an inevitable proliferation of Gaussians, hindering scalability to extended video sequences. To address this, we propose TRiGS, a novel 4D representation that utilizes unified, continuous geometric transformations. By integrating $SE(3)$ transformations, hierarchical Bezier residuals, and learnable local anchors, TRiGS models geometrically consistent rigid motions for individual primitives. This continuous formulation preserves temporal identity and effectively mitigates unbounded memory growth. Extensive experiments demonstrate that TRiGS achieves high fidelity rendering on standard benchmarks while uniquely scaling to extended video sequences (e.g., 600 to 1200 frames) without severe memory bottlenecks, significantly outperforming prior works in temporal stability.

84. 【2604.00537】MATHENA: Mamba-based Architectural Tooth Hierarchical Estimator and Holistic Evaluation Network for Anatomy

Link: https://arxiv.org/abs/2604.00537

Authors: Kyeonghun Kim, Jaehyung Park, Youngung Han, Anna Jung, Seongbin Park, Sumin Lee, Jiwon Yang, Jiyoon Han, Subeen Lee, Junsu Lim, Hyunsu Go, Eunseob Choi, Hyeonseok Jung, Soo Yong Kim, Woo Kyoung Jeong, Won Jae Lee, Pa Hong, Hyuk-Jae Lee, Ken Ying-Kai Liao, Nam-Joon Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: dental developmental staging, diagnosis from Orthopantomograms, State Space Models, Holistic Evaluation Network, Mamba-based Architectural Tooth

Comments: 10 pages, 3 figures, 4 tables

Abstract:Dental diagnosis from Orthopantomograms (OPGs) requires coordination of tooth detection, caries segmentation (CarSeg), anomaly detection (AD), and dental developmental staging (DDS). We propose Mamba-based Architectural Tooth Hierarchical Estimator and Holistic Evaluation Network for Anatomy (MATHENA), a unified framework leveraging Mamba's linear-complexity State Space Models (SSM) to address all four tasks. MATHENA integrates MATHE, a multi-resolution SSM-driven detector with four-directional Vision State Space (VSS) blocks for O(N) global context modeling, generating per-tooth crops. These crops are processed by HENA, a lightweight Mamba-UNet with a triple-head architecture and Global Context State Token (GCST). In the triple-head architecture, CarSeg is first trained as an upstream task to establish shared representations, which are then frozen and reused for downstream AD fine-tuning and DDS classification via linear probing, enabling stable, efficient learning. We also curate PARTHENON, a benchmark comprising 15,062 annotated instances from ten datasets. MATHENA achieves 93.78% mAP@50 in tooth detection, 90.11% Dice for CarSeg, 88.35% for AD, and 72.40% ACC for DDS.

85. 【2604.00534】FreqPhys: Repurposing Implicit Physiological Frequency Prior for Robust Remote Photoplethysmography

Link: https://arxiv.org/abs/2604.00534

Authors: Wei Qian, Dan Guo, Jinxing Zhou, Bochao Zou, Zitong Yu, Meng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: capturing subtle skin-color, subtle skin-color variations, Remote photoplethysmography, enables contactless physiological, contactless physiological monitoring

Abstract:Remote photoplethysmography (rPPG) enables contactless physiological monitoring by capturing subtle skin-color variations from facial videos. However, most existing methods predominantly rely on time-domain modeling, making them vulnerable to motion artifacts and illumination fluctuations, where weak physiological clues are easily overwhelmed by noise. To address these challenges, we propose FreqPhys, a frequency-guided rPPG framework that explicitly leverages physiological frequency priors for robust signal recovery. Specifically, FreqPhys first applies a Physiological Bandpass Filtering module to suppress out-of-band interference, and then performs Physiological Spectrum Modulation together with adaptive spectral selection to emphasize pulse-related frequency components while suppressing residual in-band noise. A Cross-domain Representation Learning module further fuses these spectral priors with deep time-domain features to capture informative spatial-temporal dependencies. Finally, a frequency-aware conditional diffusion process progressively reconstructs high-fidelity rPPG signals. Extensive experiments on six benchmarks demonstrate that FreqPhys yields significant improvements over state-of-the-art approaches, particularly under challenging motion conditions. This highlights the importance of explicitly modeling physiological frequency priors. The source code will be released.
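
The band-limiting idea behind a physiological bandpass step can be illustrated with an ideal frequency-domain filter. The sketch below is not the FreqPhys module: the function names, the naive DFT, and the 0.7 to 3.0 Hz pass band (roughly 42 to 180 beats per minute) are assumptions chosen only for illustration.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform, adequate for short clips."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns only the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def bandpass(signal, fs, low_hz=0.7, high_hz=3.0):
    """Zero every frequency bin outside [low_hz, high_hz].

    fs is the sampling rate in Hz (for video, the frame rate).
    The default band is an assumed heart-rate range, not a value from the paper.
    """
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        # Bins k and n - k carry the same physical frequency magnitude.
        freq = min(k, n - k) * fs / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    return idft(spectrum)
```

Calling `bandpass(trace, fs=30.0)` on a 30 fps skin-color trace would keep only components in the assumed pulse band; a practical system would more likely use an FFT routine and a filter with a smoother roll-off.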

86. 【2604.00530】AceTone: Bridging Words and Colors for Conditional Image Grading

Link: https://arxiv.org/abs/2604.00530

Authors: Tianren Ma, Mingxiang Liao, Xijin Zhang, Qixiang Ye

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: interpret image style, style and emotion, color grading, Color, Color affects

Comments: Accepted by CVPR 2026. Project Page: [this http URL](http://github.com/martian422/AceTone)

Abstract:Color affects how we interpret image style and emotion. Previous color grading methods rely on patch-wise recoloring or fixed filter banks, struggling to generalize across creative intents or align with human aesthetic preferences. In this study, we propose AceTone, the first approach that supports multimodal conditioned color grading within a unified framework. AceTone formulates grading as a generative color transformation task, where a model directly produces 3D-LUTs conditioned on text prompts or reference images. We develop a VQ-VAE based tokenizer which compresses a $3\times32^3$ LUT vector to 64 discrete tokens with $\Delta E < 2$ fidelity. We further build a large-scale dataset, AceTone-800K, and train a vision-language model to predict LUT tokens, followed by reinforcement learning to align outputs with perceptual fidelity and aesthetics. Experiments show that AceTone achieves state-of-the-art performance on both text-guided and reference-guided grading tasks, improving LPIPS by up to 50% over existing methods. Human evaluations confirm that AceTone's results are visually pleasing and stylistically coherent, demonstrating a new pathway toward language-driven, aesthetic-aligned color grading.
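
For readers unfamiliar with 3D LUTs: a $3\times32^3$ LUT stores one RGB output for each of $32^3$ grid colors, i.e. 98,304 numbers, and a color is graded by interpolating between the eight surrounding grid entries. The sketch below shows only that generic lookup, not AceTone's tokenizer or model; `build_lut`, `apply_lut`, and `lut_value_count` are hypothetical helper names.

```python
def lut_value_count(size, channels=3):
    """Number of scalars stored in a size^3 LUT with the given output channels."""
    return channels * size ** 3


def build_lut(size, transform):
    """Sample transform(r, g, b) -> (r', g', b') on a size^3 grid over [0, 1]^3."""
    step = 1.0 / (size - 1)
    return [
        [[transform(r * step, g * step, b * step) for b in range(size)]
         for g in range(size)]
        for r in range(size)
    ]


def apply_lut(lut, color):
    """Grade one RGB color (components in [0, 1]) by trilinear interpolation."""
    size = len(lut)
    # Continuous grid position of the color along each axis, clamped to the grid.
    pos = [min(max(c, 0.0), 1.0) * (size - 1) for c in color]
    # Lower corner of the enclosing grid cell (kept inside the grid).
    lo = [min(int(p), size - 2) for p in pos]
    frac = [p - l for p, l in zip(pos, lo)]
    out = [0.0, 0.0, 0.0]
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                # Weight of this cell corner is the product of per-axis weights.
                weight = (
                    (frac[0] if dr else 1.0 - frac[0])
                    * (frac[1] if dg else 1.0 - frac[1])
                    * (frac[2] if db else 1.0 - frac[2])
                )
                corner = lut[lo[0] + dr][lo[1] + dg][lo[2] + db]
                for ch in range(3):
                    out[ch] += weight * corner[ch]
    return tuple(out)
```

For example, `build_lut(32, lambda r, g, b: (1 - r, 1 - g, 1 - b))` produces a color-inverting LUT, and `apply_lut` then grades individual pixels with it.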

87. 【2604.00528】Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

Link: https://arxiv.org/abs/2604.00528

Authors: Haibo Wang, Zihao Lin, Zhiyang Xu, Lifu Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: natural language descriptions, aims to localize, scenes via natural, language descriptions, localize objects

Abstract:3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the target across 2D frames. Crucially, to overcome the multi-view coverage deficit caused by strict VLM semantic tracking, we introduce the Semantic-Anchored Geometric Expansion, a mechanism that first anchors the target in a reference video clip and then leverages multi-view geometry to propagate its spatial location across unobserved frames. This enables the agent to "Build" the target's 3D representation by aggregating these multi-view features via camera parameters, directly mapping 2D visual cues to 3D coordinates. Furthermore, to ensure rigorous assessment, we identify flaws such as reference ambiguity and category errors in existing benchmarks and manually refine the incorrect queries. Extensive experiments on ScanRefer and Nr3D demonstrate that our framework, relying entirely on open-source models, significantly outperforms previous zero-shot methods and even surpasses fully supervised baselines.

88. 【2604.00519】Learnability-Guided Diffusion for Dataset Distillation

Link: https://arxiv.org/abs/2604.00519

Authors: Jeffrey A. Chan-Santiago, Mubarak Shah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Training machine learning, machine learning models, expensive and time-consuming, machine learning, Training

Comments: This paper has been accepted to CVPR 2026

Abstract:Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled data, either by promoting diversity or matching training gradients. However, existing approaches produce redundant training signals, where samples convey overlapping information. Empirically, disjoint subsets of distilled datasets capture 80-90% overlapping signals. This redundancy stems from optimizing visual diversity or average training dynamics without accounting for similarity across samples, leading to datasets where multiple samples share similar information rather than complementary knowledge. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through successive stages. Starting from a small set, we train a model and generate new samples guided by learnability scores that identify what the current model can learn from, creating an adaptive curriculum. We introduce Learnability-Guided Diffusion (LGD), which balances training utility for the current model with validity under a reference model to generate curriculum-aligned samples. Our approach reduces redundancy by 39.1%, promotes specialization across training stages, and achieves state-of-the-art results on ImageNet-1K (60.1%), ImageNette (87.2%), and ImageWoof (72.9%). Our code is available on our project page this https URL.

89. 【2604.00517】Toward Optimal Sampling Rate Selection and Unbiased Classification for Precise Animal Activity Recognition

Link: https://arxiv.org/abs/2604.00517

Authors: Axiu Mao, Meilu Zhu, Lei Shen, Xiaoshuai Wang, Tomas Norton, Kai Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: deep learning techniques, livestock management efficiency, wearable sensor-aided animal, demonstrated promising performance, improving livestock management

Comments: 26 pages, 14 figures

Abstract:With the rapid advancements in deep learning techniques, wearable sensor-aided animal activity recognition (AAR) has demonstrated promising performance, thereby improving livestock management efficiency as well as animal health and welfare monitoring. However, existing research often prioritizes overall performance, overlooking the fact that classification accuracies for specific animal behavioral categories may remain unsatisfactory. This issue typically stems from suboptimal sampling rates or class imbalance problems. To address these challenges and achieve high classification accuracy across all individual behaviors in farm animals, we propose a novel Individual-Behavior-Aware Network (IBA-Net). This network enhances the recognition of each specific behavior by simultaneously customizing features and calibrating the classifier. Specifically, considering that different behaviors require varying sampling rates to achieve optimal performance, we design a Mixture-of-Experts (MoE)-based Feature Customization (MFC) module. This module adaptively fuses data from multiple sampling rates, capturing customized features tailored to various animal behaviors. Additionally, to mitigate classifier bias toward majority classes caused by class imbalance, we develop a Neural Collapse-driven Classifier Calibration (NC3) module. This module introduces a fixed equiangular tight frame (ETF) classifier during the classification stage, maximizing the angles between pair-wise classifier vectors and thereby improving the classification performance for minority classes. To validate the effectiveness of IBA-Net, we conducted experiments on three public datasets covering goat, cattle, and horse activity recognition. The results demonstrate that our method consistently outperforms existing approaches across all datasets.
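
A fixed equiangular tight frame classifier can be made concrete with the simplex ETF commonly used in the neural collapse literature: K unit-norm class vectors whose pairwise cosine similarity is exactly -1/(K-1), the largest equal separation possible. The sketch below is an illustration under that assumption, not the paper's NC3 module; `simplex_etf`, `cosine`, and `classify` are hypothetical names, and the construction omits the optional rotation into a higher-dimensional feature space.

```python
import math


def simplex_etf(num_classes):
    """Return num_classes unit vectors (as rows) forming a simplex ETF.

    Row i is sqrt(K / (K - 1)) * (e_i - (1 / K) * ones), with K = num_classes.
    """
    k = num_classes
    if k < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(feature, classifier):
    """Predict the class whose fixed classifier vector scores highest."""
    scores = [sum(f * w for f, w in zip(feature, row)) for row in classifier]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the vectors are fixed rather than learned, no class (majority or minority) can claim a larger angular region of the classifier during training, which is the intuition behind using such a frame against class imbalance.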

90. 【2604.00514】MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning

Link: https://arxiv.org/abs/2604.00514

Authors: Kyeonghun Kim, Hyeonseok Jung, Youngung Han, Junsu Lim, YeonJu Jean, Seongbin Park, Eunseob Choi, Hyunsu Go, SeoYoung Ju, Seohyoung Park, Gyeongmin Kim, MinJu Kwon, KyungSeok Yuh, Soo Yong Kim, Ken Ying-Kai Liao, Nam-Joon Kim, Hyuk-Jae Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Training deep learning, Computed Tomography, Training deep, deep learning models, models for three-dimensional

Comments: 5 pages, 3 figures. Accepted at ICEIC 2026

Abstract:Training deep learning models for three-dimensional (3D) medical imaging, such as Computed Tomography (CT), is fundamentally challenged by the scarcity of labeled data. While pre-training on natural images is common, it results in a significant domain shift, limiting performance. Self-Supervised Learning (SSL) on unlabeled medical data has emerged as a powerful solution, but prominent frameworks often fail to exploit the inherent 3D nature of CT scans. These methods typically process 3D scans as a collection of independent 2D slices, an approach that fundamentally discards critical axial coherence and the 3D structural context. To address this limitation, we propose the Masked Autoencoder for Enhanced Self-supervised Medical Image Learning (MAESIL), a novel self-supervised learning framework designed to capture 3D structural information efficiently. The core innovation is the 'superpatch', a 3D chunk-based input unit that balances 3D context preservation with computational efficiency. Our framework partitions the volume into superpatches and employs a 3D masked autoencoder with a dual-masking strategy to learn comprehensive spatial representations. We validated our approach on three diverse large-scale public CT datasets. Our experimental results show that MAESIL demonstrates significant improvements over existing methods such as AE, VAE and VQ-VAE in key reconstruction metrics such as PSNR and SSIM. This establishes MAESIL as a robust and practical pre-training solution for 3D medical imaging tasks.

91. 【2604.00513】MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding

Link: https://arxiv.org/abs/2604.00513

Authors: Junxian Wu, Chenghan Fu, Zhanheng Nie, Daoze Zhang, Bowen Wan, Wanxian Guan, Chuan Yu, Jian Xu, Bo Zheng

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: exploring general representations, attracted increasing attention, exploring general, rapid growth, attracted increasing

Comments: 10 pages, 6 figures

Abstract:With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, thereby limiting their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.

92. 【2604.00509】RT-GS: Gaussian Splatting with Reflection and Transmittance Primitives

Link: https://arxiv.org/abs/2604.00509

Authors: Kunnong Zeng, Chensheng Peng, Yichen Xie, Masayoshi Tomizuka, Cem Yuksel

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: reconstructing diffuse scenes, Gaussian Splatting, simultaneously model specular, model specular reflections, model specular reflection

Abstract:Gaussian Splatting is a powerful tool for reconstructing diffuse scenes, but it struggles to simultaneously model specular reflections and the appearance of objects behind semi-transparent surfaces. These specular reflections and transmittance are essential for realistic novel view synthesis, and existing methods do not properly incorporate the underlying physical processes to simulate them. To address this issue, we propose RT-GS, a unified framework that integrates a microfacet material model and ray tracing to jointly model specular reflection and transmittance in Gaussian Splatting. We accomplish this by using separate Gaussian primitives for reflections and transmittance, which allow modeling distant reflections and reconstructing objects behind transparent surfaces concurrently. We utilize a differentiable ray tracing framework to obtain the specular reflection and transmittance appearance. Our experiments demonstrate that our method successfully produces reflections and recovers objects behind transparent surfaces in complex environments, achieving significant qualitative improvements over prior methods where these specular light interactions are prominent.

93. 【2604.00507】RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection

Link: https://arxiv.org/abs/2604.00507

Authors: Jihwan Park, Chanhyeong Yang, Jinyoung Park, Taehoon Song, Hyunwoo J. Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Weakly-supervised Human-Object Interaction, scalable scene understanding, Weakly-supervised Human-Object, detection is essential, scene understanding

Comments: Accepted at CVPR2026

Abstract:Weakly-supervised Human-Object Interaction (HOI) detection is essential for scalable scene understanding, as it learns interactions from only image-level annotations. Due to the lack of localization signals, prior works typically rely on an external object detector to generate candidate pairs and then infer their interactions through pairwise reasoning. However, this framework often struggles to scale due to the substantial computational cost incurred by enumerating numerous instance pairs. In addition, it suffers from false positives arising from non-interactive combinations, which hinder accurate instance-level HOI reasoning. To address these issues, we introduce Relational Grounding Transformer (RegFormer), a versatile interaction recognition module for efficient and accurate HOI reasoning. Under image-level supervision, RegFormer leverages spatially grounded signals as guidance for the reasoning process and promotes locality-aware interaction learning. By learning localized interaction cues, our module distinguishes humans, objects, and their interactions, enabling direct transfer from image-level interaction reasoning to precise and efficient instance-level reasoning without additional training. Our extensive experiments and analyses demonstrate that RegFormer effectively learns spatial cues for instance-level interaction reasoning, operates with high efficiency, and even achieves performance comparable to fully supervised models. Our code is available at this https URL.

94. 【2604.00503】PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training

Link: https://arxiv.org/abs/2604.00503

Authors: Weifu Fu, Jinyang Li, Bin-Bin Gao, Jialin Li, Yuhuan Lin, Hanqiu Deng, Wenbing Tao, Yong Liu, Chengjie Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rare categories, fixed classes, classes but faces, scarcity of image-text, image-text pairs

Abstract:Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text pairs for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, prolonging the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal detector supporting both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the development cycle. We introduce two prompt-enriched training strategies: Intra-Batch Parallel Prompting (IBP) at the iteration level and Dynamic Memory-Driven Prompting (DMD) at the overall training level. These strategies enable simultaneous modeling of multiple prompt routes, facilitating parallel alignment with diverse real-world usage scenarios. Comprehensive experiments demonstrate that PET-DINO exhibits competitive zero-shot object detection capabilities across various prompt-based detection protocols. These strengths can be attributed to inheritance-based philosophy and prompt-enriched training strategies, which play a critical role in building an effective generic object detector. Project page: this https URL.

95. 【2604.00495】PC-SAM: Patch-Constrained Fine-Grained Interactive Road Segmentation in High-Resolution Remote Sensing Images

Link: https://arxiv.org/abs/2604.00495

Authors: Chengcheng Lv, Rushi Li, Mincheng Wu, Xiufang Shi, Zhenyu Wen, Shibo He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: fully automatic, segmentation, Road, fully automatic segmentation, road segmentation

Abstract:Road masks obtained from remote sensing images effectively support a wide range of downstream tasks. In recent years, most studies have focused on improving the performance of fully automatic segmentation models for this task, achieving significant gains. However, current fully automatic methods are still insufficient for identifying certain challenging road segments and often produce false positive and false negative regions. Moreover, fully automatic segmentation does not support local segmentation of regions of interest or refinement of existing masks. Although the SAM model is widely used as an interactive segmentation model and performs well on natural images, it shows poor performance in remote sensing road segmentation and cannot support fine-grained local refinement. To address these limitations, we propose PC-SAM, which integrates fully automatic road segmentation and interactive segmentation within a unified framework. By carefully designing a fine-tuning strategy, the influence of point prompts is constrained to their corresponding patches, overcoming the inability of the original SAM to perform fine local corrections and enabling fine-grained interactive mask refinement. Extensive experiments on several representative remote sensing road segmentation datasets demonstrate that, when combined with point prompts, PC-SAM significantly outperforms state-of-the-art fully automatic models in road mask segmentation, while also providing flexible local mask refinement and local road segmentation. The code will be available at this https URL.

96. 【2604.00494】ARGS: Auto-Regressive Gaussian Splatting via Parallel Progressive Next-Scale Prediction

Link: https://arxiv.org/abs/2604.00494

Authors: Quanyuan Ruan, Kewei Shi, Jiabao Lei, Xifeng Gao, Xiaoguang Han

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated strong potential, images have demonstrated, coarse input, demonstrated strong, strong potential

Abstract:Auto-regressive frameworks for next-scale prediction of 2D images have demonstrated strong potential for producing diverse and sophisticated content by progressively refining a coarse input. However, extending this paradigm to 3D object generation remains largely unexplored. In this paper, we introduce auto-regressive Gaussian splatting (ARGS), a framework for making next-scale predictions in parallel for generation according to levels of detail. We propose a Gaussian simplification strategy and reverse the simplification to guide next-scale generation. Benefiting from the use of hierarchical trees, the generation process requires only \(\mathcal{O}(\log n)\) steps, where \(n\) is the number of points. Furthermore, we propose a tree-based transformer to predict the tree structure auto-regressively, allowing leaf nodes to attend to their internal ancestors to enhance structural consistency. Extensive experiments demonstrate that our approach effectively generates multi-scale Gaussian representations with controllable levels of detail, visual fidelity, and a manageable time consumption budget.
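
The \(\mathcal{O}(\log n)\) step count follows from refining every node of a tree level in parallel: if each step splits each node into a fixed number of children, the number of nodes grows geometrically, so reaching \(n\) points takes about \(\log_b n\) steps. The sketch below only checks that arithmetic; the function name and the fixed branching factor `b = 2` are assumptions, not details taken from ARGS.

```python
def steps_to_reach(num_points, branching=2):
    """Count parallel refinement steps needed to grow one root to num_points leaves.

    Each step replaces every current node with `branching` children,
    so the node count after s steps is branching ** s.
    """
    if num_points < 1 or branching < 2:
        raise ValueError("need num_points >= 1 and branching >= 2")
    steps, count = 0, 1
    while count < num_points:
        count *= branching
        steps += 1
    return steps
```

With a branching factor of 2, about a thousand points need ten steps and about a million need twenty, which is why a coarse-to-fine tree stays cheap in the number of sequential passes.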

97. 【2604.00493】A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

Link: https://arxiv.org/abs/2604.00493

Authors: Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Chest X-rays, imaging examinations worldwide, frequently performed imaging, performed imaging examinations, rising imaging volumes

Comments: Code: [this https URL](https://github.com/YBZh/CheXOne); Models: [this https URL](https://huggingface.co/StanfordAIMI/CheXOne)

Abstract:Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.

98. 【2604.00479】All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

Link: https://arxiv.org/abs/2604.00479

Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Peter Tu, Jing Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: notably Group Relative, Group Relative Policy, Relative Policy Optimization, Reinforcement Learning, Group Relative

Comments: Accepted to CVPR 2026

Abstract:Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite this promise, the underlying mechanisms that drive the effectiveness of RL models, as well as their limitations, remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models: the former engage in deeper yet narrower reasoning, while base models, despite being less refined along individual paths, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: this https URL

99. 【2604.00469】Automated Detection of Multiple Sclerosis Lesions on 7-tesla MRI Using U-net and Transformer-based Segmentation

Link: https://arxiv.org/abs/2604.00469

Authors: Michael Maynord, Minghui Liu, Cornelia Fermüller, Seongjin Choi, Yuxin Zeng, Shishir Dahal, Daniel M. Harrison

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: MRI improves visualization, MRI improves, white matter lesions, Lesion Segmentation Tool, automated segmentation tools

Comments: 31 pages, 3 figures, 3 tables. Inference code and model weights available at [this https URL](https://github.com/maynord/7T-MS-lesion-segmentation)

Abstract:Ultra-high field 7-tesla (7T) MRI improves visualization of multiple sclerosis (MS) white matter lesions (WML) but differs sufficiently in contrast and artifacts from 1.5-3T imaging, suggesting that widely used automated segmentation tools may not translate directly. We analyzed 7T FLAIR scans and generated reference WML masks from Lesion Segmentation Tool (LST) outputs followed by expert manual revision. As external comparators, we applied LST-LPA and the more recent LST-AI ensemble, both originally developed on lower-field data. We then trained 3D UNETR and SegFormer transformer-based models on 7T FLAIR at multiple resolutions (0.5x0.5x0.5 mm^3, 1.0x1.0x1.0 mm^3, and 1.5x1.5x2.0 mm^3) and evaluated all methods using voxel-wise and lesion-wise metrics from the BraTS 2023 framework. On the held-out test set at native 0.5x0.5x0.5 mm^3 resolution, 7T-trained transformers achieved competitive overlap with LST-AI while recovering additional small lesions that were missed by classical methods, at the cost of some boundary variability and occasional artifact-related false positives. On a held-out 7T test set, our best transformer model (SegFormer) achieved a voxel-wise Dice of 0.61 and lesion-wise Dice of 0.20, improving on the classical LST-LPA tool (Dice 0.39, lesion-wise Dice 0.02). Performance decreased for models trained on downsampled images, underscoring the value of native 7T resolution for small-lesion detection. By releasing our 7T-trained models, we aim to provide a reproducible, ready-to-use resource for automated lesion quantification in ultra-high field MS research (this https URL).
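
For readers unfamiliar with the voxel-wise Dice figures quoted above, the following is a minimal sketch of the standard Dice overlap on flattened binary masks. It is an illustration only: it does not reproduce the BraTS 2023 lesion-wise metric, and the function name `dice` and the convention of returning 1.0 for two empty masks are assumptions, not details taken from the paper.

```python
from typing import Sequence


def dice(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Voxel-wise Dice coefficient 2|A n B| / (|A| + |B|) for two binary
    masks given as flat sequences of 0/1 values."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(round(dice(predicted, reference), 4))
```

A Dice of 0 means no overlapping voxels and 1 means identical masks, so the reported 0.61 versus 0.39 reflects substantially better voxel overlap for the 7T-trained model.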

100. 【2604.00455】First Logit Boosting: Visual Grounding Method to Mitigate Object Hallucination in Large Vision-Language Models

Link: https://arxiv.org/abs/2604.00455

Authors: Jiwoo Ha, Jongwoo Baek, Jinhyun So

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent Large Vision-Language, Large Vision-Language Models, Recent Large, demonstrated remarkable performance, Large Vision-Language

Comments: 19 pages, 13 figures

Abstract:Recent Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks that require understanding both visual and linguistic inputs. However, object hallucination, the generation of nonexistent objects in answers, remains a persistent challenge. Although several approaches such as retraining and external grounding methods have been proposed to mitigate this issue, they still suffer from high data costs or structural complexity. Training-free methods such as Contrastive Decoding (CD) are more cost-effective, avoiding additional training or external models, but still suffer from long-term decay, where visual grounding weakens and language priors dominate as the generation progresses. In this paper, we propose First Logit Boosting (FLB), a simple yet effective training-free technique designed to alleviate long-term decay in LVLMs. FLB stores the logit of the first generated token and adds it to subsequent token predictions, effectively mitigating long-term decay of visual information. We observe that FLB (1) sustains the visual information embedded in the first token throughout generation, and (2) suppresses hallucinated words through the stabilizing effect of the "The" token. Experimental results show that FLB significantly reduces object hallucination across various tasks, benchmarks, and backbone models. Notably, it causes negligible inference overhead, making it highly applicable to real-time multimodal systems. Code is available at this https URL
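
The core idea, storing the first-step logits and adding them to later predictions, can be sketched with a toy greedy decoder. This is one reading of the abstract, not the released implementation: the function names, the toy model, and the scalar weight `alpha` are all hypothetical, and the paper may combine the stored logits with later ones differently (for example on top of contrastive decoding).

```python
from typing import Callable, List, Sequence


def decode_with_first_logit_boost(
    next_logits: Callable[[List[int]], Sequence[float]],
    max_tokens: int,
    alpha: float = 1.0,
) -> List[int]:
    """Greedy decoding in which the logits of the first step are stored
    and added (scaled by the assumed weight alpha) at every later step."""
    tokens: List[int] = []
    first_logits: List[float] = []
    for step in range(max_tokens):
        logits = list(next_logits(tokens))
        if step == 0:
            first_logits = logits  # remember the first-step logits
        else:
            logits = [l + alpha * f for l, f in zip(logits, first_logits)]
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens


def toy_model(tokens: List[int]) -> List[float]:
    """Hypothetical 3-token model: the first step favors token 1 (standing
    in for a visually grounded choice); later steps drift toward token 0
    (standing in for a language prior)."""
    return [0.0, 2.0, 0.5] if not tokens else [1.5, 0.0, 1.0]


if __name__ == "__main__":
    print(decode_with_first_logit_boost(toy_model, 3, alpha=0.0))
    print(decode_with_first_logit_boost(toy_model, 3, alpha=1.0))
```

With `alpha=0.0` the toy decoder drifts to the prior-favored token after the first step; with `alpha=1.0` the stored first-step logits keep the initial choice competitive, which mirrors the "long-term decay" mitigation the abstract describes.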

101. 【2604.00452】Out of Sight, Out of Track: Adversarial Attacks on Propagation-based Multi-Object Trackers via Query State Manipulation

Link: https://arxiv.org/abs/2604.00452

Authors: Halima Bouzidi, Haoyu Liu, Yonatan Gizachew Achamyeleh, Praneetsai Vasu Iddamsetty, Mohammad Abdullah Al Faruque

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced Multi-Object Tracking, Multi-Object Tracking, long-range temporal modeling, Recent, MOT

Comments: Accepted for presentation at CVPR 2026 (main track)

Abstract:Recent Tracking-by-Query-Propagation (TBP) methods have advanced Multi-Object Tracking (MOT) by enabling end-to-end (E2E) pipelines with long-range temporal modeling. However, this reliance on query propagation introduces unexplored architectural vulnerabilities to adversarial attacks. We present FADE, a novel attack framework designed to exploit these specific vulnerabilities. FADE employs two attack strategies targeting core TBP mechanisms: (i) Temporal Query Flooding: Generates spurious temporally consistent track queries to exhaust the tracker's limited query budget, forcing it to terminate valid tracks. (ii) Temporal Memory Corruption: Directly attacks the query updater's memory by severing temporal links via state de-correlation and erasing the learned feature identity of matched tracks. Furthermore, we introduce a differentiable pipeline to optimize these attacks for physical-world realizability by leveraging simulations of advanced perception sensor spoofing. Experiments on MOT17 and MOT20 benchmarks demonstrate that FADE is highly effective against state-of-the-art TBP trackers, causing significant identity switches and track terminations.

102. 【2604.00416】Learning Humanoid Navigation from Human Data

Link: https://arxiv.org/abs/2604.00416

Authors: Weizhuo Wang, Yanjie Ze, C. Karen Liu, Monroe Kennedy III

Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: human walking data, robot data, walking data, traverse diverse, hours of human

Comments: 8 pages, 8 figures

Abstract:We present EgoNav, a system that enables a humanoid robot to traverse diverse, unseen environments by learning entirely from 5 hours of human walking data, with no robot data or finetuning. A diffusion model predicts distributions of plausible future trajectories conditioned on past trajectory, a 360 deg visual memory fusing color, depth, and semantics, and video features from a frozen DINOv3 backbone that capture appearance cues invisible to depth sensors. A hybrid sampling scheme achieves real-time inference in 10 denoising steps, and a receding-horizon controller selects paths from the predicted distribution. We validate EgoNav through offline evaluations, where it outperforms baselines in collision avoidance and multi-modal coverage, and through zero-shot deployment on a Unitree G1 humanoid across unseen indoor and outdoor environments. Behaviors such as waiting for doors to open, navigating around crowds, and avoiding glass walls emerge naturally from the learned prior. We will release the dataset and trained models. Our website: this https URL

103. 【2604.00404】The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation

Link: https://arxiv.org/abs/2604.00404

Authors: Xusheng He, Canyang Wu, Jinrong Zhang, Weili Guan, Jianlong Wu, Liqiang Nie

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: PVUW MeViS-Text Challenge, report presents, presents our winning, winning solution, MeViS-Text Challenge

Comments: 1st Place Solution for the 5th PVUW MeViS-Text Challenge (CVPR 2026 Workshop)

Abstract:This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training-free pipeline that combines strong multimodal large language models with SAM3. Our method contains three stages. First, Gemini-3.1 Pro decomposes each target event into instance-level grounding targets, selects the frame where the target is most clearly visible, and generates a discriminative description. Second, SAM3-agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates the mask through the whole video. Third, a refinement stage uses Qwen3.5-Plus and behavior-level verification to correct ambiguous or semantically inconsistent predictions. Without task-specific fine-tuning, our method ranks first on the PVUW 2026 MeViS-Text test set, achieving a Final score of 0.909064 and a JF score of 0.7897. The code is available at this https URL.

104. 【2604.00402】COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

Link: https://arxiv.org/abs/2604.00402

Authors: Seohyoung Park, Jaeyeol Lim, Seoyoung Ju, Kyeonghun Kim, Nam-Joon Kim, Hyuk-Jae Lee

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Developing robust models, Waymo Open Motion, Developing robust, including South Korea, Open Motion Dataset

Comments: 4 pages, 2 figures. Accepted at ICEIC 2026

Abstract:Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the unique traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy leads to performance degradation when state-of-the-art models trained on Western data are deployed in different geographic contexts. In this work, we investigate the adaptability of Query-Centric Trajectory Prediction (QCNet) when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. Experimental results demonstrate that leveraging pretrained knowledge significantly improves prediction performance. Specifically, selectively fine-tuning the decoder while freezing the encoder yields the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared to training from scratch. This study provides practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.

105. 【2604.00397】Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions

Link: https://arxiv.org/abs/2604.00397

Authors: Yuchen Yang, Shuangyang Zhong, Haijun Yu, Langcuomu Suo, Hongbin Han, Florian Putz, Yixing Huang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: automated brain metastases, exhibit suboptimal performance, Deep learning, demonstrated significant potential, imaging protocols

Comments: 5 figures and 1 table

Abstract:Background: Deep learning has demonstrated significant potential for automated brain metastases (BM) segmentation; however, models trained at a singular institution often exhibit suboptimal performance at various sites due to disparities in scanner hardware, imaging protocols, and patient demographics. The goal of this work is to create a domain adaptation framework that will allow for BM segmentation to be used across multiple institutions. Methods: We propose a VAE-MMD preprocessing pipeline that combines variational autoencoders (VAE) with maximum mean discrepancy (MMD) loss, incorporating skip connections and self-attention mechanisms alongside nnU-Net segmentation. The method was tested on 740 patients from four public databases: Stanford, UCSF, UCLM, and PKG, evaluated by domain classifier's accuracy, sensitivity, precision, F1/F2 scores, surface Dice (sDice), and 95th percentile Hausdorff distance (HD95). Results: VAE-MMD reduced domain classifier accuracy from 0.91 to 0.50, indicating successful feature alignment across institutions. Reconstructed volumes attained a PSNR greater than 36 dB, maintaining anatomical accuracy. The combined method raised the mean F1 by 11.1% (0.700 to 0.778), the mean sDice by 7.93% (0.7121 to 0.7686), and reduced the mean HD95 by 65.5% (11.33 to 3.91 mm) across all four centers compared to the baseline nnU-Net. Conclusions: VAE-MMD effectively diminishes cross-institutional data heterogeneity and enhances BM segmentation generalization across volumetric, detection, and boundary-level metrics without necessitating target-domain labels, thereby overcoming a significant obstacle to the clinical implementation of AI-assisted segmentation.
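
The maximum mean discrepancy term mentioned in the Methods can be illustrated with a small, dependency-free sketch. This is the textbook biased estimator of squared MMD with a Gaussian (RBF) kernel, not the authors' pipeline; the function names, the kernel choice, and the bandwidth `sigma` are assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(a: Vector, b: Vector, sigma: float = 1.0) -> float:
    """Gaussian kernel exp(-||a - b||^2 / (2 * sigma^2))."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector], sigma: float = 1.0) -> float:
    """Biased estimate of squared MMD between two non-empty samples:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")

    def mean_kernel(us: Sequence[Vector], vs: Sequence[Vector]) -> float:
        return sum(rbf_kernel(u, v, sigma) for u in us for v in vs) / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    site_a = [[0.0], [0.1]]  # hypothetical latent features from one institution
    site_b = [[5.0], [5.1]]  # hypothetical latent features from another
    print(mmd_squared(site_a, site_a), mmd_squared(site_a, site_b))
```

Minimizing a term like this pushes latent features from different institutions toward the same distribution, which is consistent with the reported drop in domain classifier accuracy from 0.91 to 0.50 (chance level for a balanced two-way decision).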

Submission history: [v1] Wed, 1 Apr 2026 02:25:25 UTC (5,330 KB), submitted by Yixing Huang

106. 【2604.00396】VLM-in-the-Loop: A Plug-In Quality Assurance Module for ECG Digitization Pipelines

Link: https://arxiv.org/abs/2604.00396

Authors: Jiachen Li, Shihao Li, Soovadeep Bakshi, Wei Li, Dongmei Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: strong benchmark numbers, existing methods collapse, ECG digitization, benchmark numbers, unlock billions

Comments: none

Abstract:ECG digitization could unlock billions of archived clinical records, yet existing methods collapse on real-world images despite strong benchmark numbers. We introduce VLM-in-the-Loop, a plug-in quality assurance module that wraps any digitization backend with closed-loop VLM feedback via a standardized interface, requiring no modification to the underlying digitizer. The core mechanism is tool grounding: anchoring VLM assessment in quantitative evidence from domain-specific signal analysis tools. In a controlled ablation on 200 records with paired ground truth, tool grounding raises verdict consistency from 71% to 89% and doubles fidelity separation (ΔPCC 0.03 → 0.08), with the effect replicating across three VLMs (Claude Opus 4, GPT-4o, Gemini 2.5 Pro), confirming a pattern-level rather than model-specific gain. Deployed across four backends, the module improves every one: 29.4% of borderline leads improved on our pipeline; 41.2% of failed limb leads recovered on ECG-Digitiser; valid leads per image doubled on Open-ECG-Digitizer (2.5 → 5.8). On 428 real clinical HCM images, the integrated system reaches 98.0% Excellent quality. Both the plug-in architecture and tool-grounding mechanism are domain-parametric, suggesting broader applicability wherever quality criteria are objectively measurable.

107. 【2604.00395】Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompt: The 1st Winner for 5th PVUW MOSE Challenge

Link: https://arxiv.org/abs/2604.00395

Authors: Jinrong Zhang, Canyang Wu, Xusheng He, Weili Guan, Jianlong Wu, Liqiang Nie

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Object Segmentation, Complex Video Object, Object Segmentation task, Video Object, Object Segmentation

Comments: 1st Place Solution for the 5th PVUW MOSE Challenge (CVPR 2026 Workshop)

Abstract:In the Complex Video Object Segmentation task, researchers are required to track and segment specific targets within cluttered environments, which rigorously tests a method's capability for target comprehension and environmental adaptability. Although SAM3, the current state-of-the-art solution, exhibits unparalleled segmentation performance and robustness on conventional targets, it underperforms on tiny and semantic-dominated objects. The root cause of this limitation lies in SAM3's insufficient comprehension of these specific target types. To address this issue, we propose TEP: Advancing Complex Video Object Segmentation via Tracking-Enhanced Prompts. As a training-free approach, TEP leverages external tracking models and Multimodal Large Language Models to introduce tracking-enhanced prompts, thereby alleviating the difficulty SAM3 faces in understanding these challenging targets. Our method achieved first place (56.91%) on the test set of the PVUW Challenge 2026: Complex Video Object Segmentation Track.

108. 【2604.00383】Mine-JEPA: In-Domain Self-Supervised Learning for Mine-Like Object Classification in Side-Scan Sonar

Link: https://arxiv.org/abs/2604.00383

Authors: Taeyoun Kwon, Youngwon Choi, Hyeonyu Kim, Myeongkyun Cho, Junhyeok Choi, Moon Hwan Kim

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision problem characterized, large domain gap, Side-scan sonar, extreme data scarcity, problem characterized

Comments: 9 pages, 3 figures, 6 tables. Accepted at CVPR 2026 MACVi Workshop

Abstract:Side-scan sonar (SSS) mine classification is a challenging maritime vision problem characterized by extreme data scarcity and a large domain gap from natural images. While self-supervised learning (SSL) and general-purpose vision foundation models have shown strong performance in general vision and several specialized domains, their use in SSS remains largely unexplored. We present Mine-JEPA, the first in-domain SSL pipeline for SSS mine classification, using SIGReg, a regularization-based SSL loss, to pretrain on only 1,170 unlabeled sonar images. In the binary mine vs. non-mine setting, Mine-JEPA achieves an F1 score of 0.935, outperforming fine-tuned DINOv3 (0.922), a foundation model pretrained on 1.7B images. For 3-class mine-like object classification, Mine-JEPA reaches 0.820 with synthetic data augmentation, again outperforming fine-tuned DINOv3 (0.810). We further observe that applying in-domain SSL to foundation models degrades performance by 10 to 13 percentage points, suggesting that stronger pretrained models do not always benefit from additional domain adaptation. In addition, Mine-JEPA with a compact ViT-Tiny backbone achieves competitive performance while using 4x fewer parameters than DINOv3. These results suggest that carefully designed in-domain self-supervised learning is a viable alternative to much larger foundation models in data-scarce maritime sonar imagery.

109. 【2604.00382】mmAnomaly: Leveraging Visual Context for Robust Anomaly Detection in the Non-Visual World with mmWave Radar

Link: https://arxiv.org/abs/2604.00382

Authors: Tarik Reza Toha, Shao-Jung (Louie) Lu, Mahathir Monjur, Shahriar Nirjon

Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Keywords: walls-where traditional cameras, traditional cameras fail, cameras fail due, radar enables human, enables human sensing

Comments: Accepted at the 24th ACM/IEEE International Conference on Embedded Artificial Intelligence and Sensing Systems (SenSys 2026)

Abstract:mmWave radar enables human sensing in non-visual scenarios (e.g., through clothing or certain types of walls) where traditional cameras fail due to occlusion or privacy limitations. However, robust anomaly detection with mmWave remains challenging, as signal reflections are influenced by material properties, clutter, and multipath interference, producing complex, non-Gaussian distortions. Existing methods lack contextual awareness and misclassify benign signal variations as anomalies. We present mmAnomaly, a multi-modal anomaly detection framework that combines mmWave radar with RGBD input to incorporate visual context. Our system extracts semantic cues, such as scene geometry and material properties, using a fast ResNet-based classifier, and uses a conditional latent diffusion model to synthesize the expected mmWave spectrum for the given visual context. A dual-input comparison module then identifies spatial deviations between real and generated spectra to localize anomalies. We evaluate mmAnomaly on two multi-modal datasets across three applications: concealed weapon localization, through-wall intruder localization, and through-wall fall localization. The system achieves up to 94% F1 score and sub-meter localization error, demonstrating robust generalization across clothing, occlusions, and cluttered environments. These results establish mmAnomaly as an accurate and interpretable framework for context-aware anomaly detection in mmWave sensing.

110. 【2604.00381】UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration

Link: https://arxiv.org/abs/2604.00381

Authors: Daehyun Kim, Youngmin Kim, Yoon Ju Oh, Tae Hyun Kim

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Under-display cameras, imaging sensor underneath, full-screen designs, designs by positioning

Comments: We propose UCMNet, an uncertainty-aware adaptive framework that restores high-frequency details in regions with varying levels of degradation in under-display camera images

Abstract:Under-display cameras (UDCs) allow for full-screen designs by positioning the imaging sensor underneath the display. Nonetheless, light diffraction and scattering through the various display layers result in spatially varying and complex degradations, which significantly reduce high-frequency details. Current PSF-based physical modeling techniques and frequency-separation networks are effective at reconstructing low-frequency structures and maintaining overall color consistency. However, they still face challenges in recovering fine details when dealing with complex, spatially varying degradation. To solve this problem, we propose a lightweight Uncertainty-aware Context-Memory Network (UCMNet) for UDC image restoration. Unlike previous methods that apply uniform restoration, UCMNet performs uncertainty-aware adaptive processing to restore high-frequency details in regions with varying degradations. The estimated uncertainty maps, learned through an uncertainty-driven loss, quantify spatial uncertainty induced by diffraction and scattering, and guide the Memory Bank to retrieve region-adaptive context from the Context Bank. This process enables effective modeling of the non-uniform degradation characteristics inherent to UDC imaging. Leveraging this uncertainty as a prior, UCMNet achieves state-of-the-art performance on multiple benchmarks with 30% fewer parameters than previous models. Project page: this https URL.

111. 【2604.00372】Dynamic Graph Neural Network with Adaptive Features Selection for RGB-D Based Indoor Scene Recognition

Link: https://arxiv.org/abs/2604.00372

Authors: Qiong Liu, Ruofei Xiong, Xingzhen Chen, Muyao Peng, You Yang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-modality of color, local features, great importance, importance in recent, recent research

Comments: none

Abstract:The combination of color and depth modalities, i.e., RGB-D, is of great importance in recent research on indoor scene recognition. In this kind of data representation, the depth map describes the 3D structure of scenes and the geometric relations among objects. Previous works showed that local features of both modalities are vital for improving recognition accuracy. However, the problem of adaptively selecting and effectively exploiting these key local features remains open in this field. In this paper, a dynamic graph model with an adaptive node selection mechanism is proposed to solve this problem. In this model, a dynamic graph is built to model the relations among objects and the scene, and an adaptive node selection method is proposed to take key local features from both the RGB and depth modalities for graph modeling. After that, these nodes are grouped into three different levels, representing near or far relations among objects. Moreover, the graph model is updated dynamically according to attention weights. Finally, the updated and optimized features of the RGB and depth modalities are fused together for indoor scene recognition. Experiments are performed on public datasets including SUN RGB-D and NYU Depth v2. Extensive results demonstrate that our method achieves superior performance compared to state-of-the-art methods, and show that the proposed method is able to exploit crucial local features from both the RGB and depth modalities.

112. 【2604.00371】Neural Reconstruction of LiDAR Point Clouds under Jamming Attacks via Full-Waveform Representation and Simultaneous Laser Sensing

Link: https://arxiv.org/abs/2604.00371

Authors: Ryo Yoshida, Takami Sato, Wenlun Zhang, Yuki Hayakawa, Shota Nagai, Takahiro Kado, Taro Beppu, Ibuki Fujioka, Yunshan Zhong, Kentaro Yoshioka

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving perception, Jamming attacks, critical for autonomous, remain vulnerable, vulnerable to spoofing

Comments: none

Abstract:LiDAR sensors are critical for autonomous driving perception, yet remain vulnerable to spoofing attacks. Jamming attacks inject high-frequency laser pulses that completely blind LiDAR sensors by overwhelming authentic returns with malicious signals. We discover that while point clouds become randomized, the underlying full-waveform data retains distinguishable signatures between attack and legitimate signals. In this work, we propose PULSAR-Net, capable of reconstructing authentic point clouds under jamming attacks by leveraging previously underutilized intermediate full-waveform representations and simultaneous laser sensing in modern LiDAR systems. PULSAR-Net adopts a novel U-Net architecture with axial spatial attention mechanisms specifically designed to identify attack-induced signals from authentic object returns in the full-waveform representation. To address the lack of full-waveform representations in existing LiDAR datasets under jamming attacks, we introduce a physics-aware dataset generation pipeline that synthesizes realistic full-waveform representations under jamming attacks. Despite being trained exclusively on synthetic data, PULSAR-Net achieves reconstruction rates of 92% and 73% for vehicles obscured by jamming attacks in real-world static and driving scenarios, respectively.

113. 【2604.00363】A Dual-Stream Transformer Architecture for Illumination-Invariant TIR-LiDAR Person Tracking

Link: https://arxiv.org/abs/2604.00363

Authors: Yuki Minase, Kanji Tanaka

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: mobile robots operating, autonomous mobile robots, unpredictable environments, Robust person tracking, critical capability

Comments: 6 pages, 4 figures, technical report

Abstract:Robust person tracking is a critical capability for autonomous mobile robots operating in diverse and unpredictable environments. While RGB-D tracking has shown high precision, its performance severely degrades under challenging illumination conditions, such as total darkness or intense backlighting. To achieve all-weather robustness, this paper proposes a novel Thermal-Infrared and Depth (TIR-D) tracking architecture that leverages the standard sensor suite of SLAM-capable robots, namely LiDAR and TIR cameras. A major challenge in TIR-D tracking is the scarcity of annotated multi-modal datasets. To address this, we introduce a sequential knowledge transfer strategy that evolves structural priors from a large-scale thermal-trained model into the TIR-D domain. By employing a differential learning rate strategy, referred to as "Fine-grained Differential Learning Rate Strategy", we effectively preserve pre-trained feature extraction capabilities while enabling rapid adaptation to geometric depth cues. Experimental results demonstrate that our proposed TIR-D tracker achieves superior performance, with an Average Overlap (AO) of 0.700 and a Success Rate (SR) of 58.7%, significantly outperforming conventional RGB-transfer and single-modality baselines. Our approach provides a practical and resource-efficient solution for robust human-following in all-weather robotics applications.
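
A differential learning rate setup of the kind described above is commonly expressed as per-group learning rates. The sketch below shows the general pattern only and is not the authors' configuration: the parameter names, prefixes, and scale factors are hypothetical, and the abstract does not give the actual rates. The output uses the list-of-dicts shape with `params` and `lr` keys that PyTorch optimizers accept as parameter groups, with names standing in for tensors so the example runs without any dependencies.

```python
from typing import Dict, List


def build_param_groups(
    param_names: List[str],
    base_lr: float,
    lr_scale_by_prefix: Dict[str, float],
) -> List[dict]:
    """Group parameters by learning-rate scale. The longest matching name
    prefix decides the scale; unmatched parameters use base_lr."""
    names_by_scale: Dict[float, List[str]] = {}
    for name in param_names:
        matches = [p for p in lr_scale_by_prefix if name.startswith(p)]
        scale = lr_scale_by_prefix[max(matches, key=len)] if matches else 1.0
        names_by_scale.setdefault(scale, []).append(name)
    return [
        {"params": names, "lr": base_lr * scale}
        for scale, names in sorted(names_by_scale.items())
    ]


if __name__ == "__main__":
    # Hypothetical layer names: pre-trained thermal backbone vs. new depth branch.
    names = [
        "backbone.layer1.weight",
        "backbone.layer4.weight",
        "depth_branch.conv.weight",
        "head.fc.weight",
    ]
    scales = {"backbone.": 0.1, "backbone.layer4.": 0.5, "depth_branch.": 1.0}
    for group in build_param_groups(names, base_lr=1e-3, lr_scale_by_prefix=scales):
        print(group)
```

Small rates on early pre-trained layers limit how far they move from their initial weights, while larger rates on newly added components let them adapt quickly, which is the trade-off the abstract describes.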

114. 【2604.00360】VADMamba++: Efficient Video Anomaly Detection via Hybrid Modeling in Grayscale Space

Link: https://arxiv.org/abs/2604.00360

Authors: Jihao Lyu, Minghua Zhao, Jing Hu, Yifei Chen, Shuangli Du, Cheng Shi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Anomaly Detection, Anomaly Detection, Video Anomaly, single proxy task, achieving high accuracy

Abstract:VADMamba pioneered the introduction of Mamba to Video Anomaly Detection (VAD), achieving high accuracy and fast inference through hybrid proxy tasks. Nevertheless, its heavy reliance on optical flow as auxiliary input and on inter-task fusion scoring limits its applicability in single-proxy-task settings. In this paper, we introduce VADMamba++, an efficient VAD method based on the Gray-to-RGB paradigm that enforces a Single-Channel to Three-Channel reconstruction mapping, designed for a single proxy task and operating without auxiliary inputs. This paradigm compels the model to infer color appearances from grayscale structures, allowing anomalies to be more effectively revealed through dual inconsistencies between structure and chromatic cues. Specifically, VADMamba++ reconstructs grayscale frames into the RGB space to simultaneously discriminate structural geometry and chromatic fidelity, thereby enhancing sensitivity to explicit visual anomalies. We further design a hybrid modeling backbone that integrates Mamba, CNN, and Transformer modules to capture diverse normal patterns while suppressing the appearance of anomalies. Furthermore, an intra-task fusion scoring strategy integrates explicit future-frame prediction errors with implicit quantized feature errors, further improving accuracy under a single task setting. Extensive experiments on three benchmark datasets demonstrate that VADMamba++ outperforms state-of-the-art methods while meeting both performance and efficiency requirements, especially under a strict single-task setting with only frame-level inputs.

115. 【2604.00313】Label-efficient underwater species classification with semi-supervised learning on frozen foundation model embeddings

Link: https://arxiv.org/abs/2604.00313

Authors: Thomas Manuel Rost

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated species classification, dataset rarely transfer, Automated species, classification from underwater, underwater imagery

Abstract:Automated species classification from underwater imagery is bottlenecked by the cost of expert annotation, and supervised models trained on one dataset rarely transfer to new conditions. We investigate whether semi-supervised methods operating on frozen foundation model embeddings can close this annotation gap with minimal labeling effort. Using DINOv3 ViT-B embeddings with no fine-tuning, we propagate a small set of labeled seeds through unlabeled data via nearest-neighbor-based self-training and evaluate on the AQUA20 benchmark (20 marine species). With fewer than 5% of the training labels, self-training on frozen embeddings closes much of the gap to a fully supervised ConvNeXt baseline trained on the entire labeled dataset; at full supervision, the gap narrows to a few percentage points, with several species exceeding the supervised baseline. Class separability in the embedding space, measured by ROC-AUC, is high even at extreme label scarcity, indicating that the frozen representations capture discriminative structure well before decision boundaries can be reliably estimated. Our approach requires no training, no domain-specific data engineering, and no underwater-adapted models, establishing a practical, immediately deployable baseline for label-efficient marine species recognition. All results are reported on the held-out test set over 100 random seed initializations.
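
Editor's sketch: the nearest-neighbor self-training step described above can be illustrated roughly as follows. This is a minimal toy version, not the authors' code. The function names, the greedy closest-first ordering, and the `max_dist` threshold are all assumptions made for illustration; the abstract only states that labeled seeds are propagated through unlabeled frozen embeddings via nearest-neighbor-based self-training.

```python
import math


def nearest_labeled(point, labeled):
    """Return (distance, label) of the closest labeled embedding."""
    return min((math.dist(point, emb), lab) for emb, lab in labeled)


def self_train(seeds, unlabeled, max_dist=1.0):
    """Grow the labeled set by repeatedly pseudo-labeling the unlabeled
    embedding that lies closest to any already-labeled embedding.

    seeds:     list of (embedding, label) pairs with expert labels
    unlabeled: list of embeddings (tuples of floats) from a frozen encoder
    max_dist:  hypothetical confidence threshold; farther points stay unlabeled
    """
    labeled = list(seeds)
    remaining = list(unlabeled)
    while remaining:
        # Pick the unlabeled point with the smallest distance to the labeled set.
        dist, label, idx = min(
            nearest_labeled(p, labeled) + (i,) for i, p in enumerate(remaining)
        )
        if dist > max_dist:
            break  # nothing left that is confidently close to a known class
        labeled.append((remaining.pop(idx), label))
    return labeled, remaining


# Toy 2-D "embeddings": two seeds, four unlabeled points.
seeds = [((0.0, 0.0), "fish"), ((10.0, 10.0), "coral")]
unlabeled = [(0.5, 0.0), (1.2, 0.0), (9.5, 10.0), (5.0, 5.0)]
labeled, leftover = self_train(seeds, unlabeled, max_dist=1.0)
```

In this toy run, the point at (1.2, 0.0) is too far from the original seed but gets labeled once its neighbor (0.5, 0.0) has been pseudo-labeled, which is the propagation effect that self-training relies on. The ambiguous point at (5.0, 5.0) stays unlabeled.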

116. 【2604.00298】SANA I2I: A Text Free Flow Matching Framework for Paired Image to Image Translation with a Case Study in Fetal MRI Artifact Reduction

Link: https://arxiv.org/abs/2604.00298

Authors: Italo Felix Santos, Gilson Antonio Giraldi, Heron Werner Junior

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: removing textual conditioning, extends the SANA, SANA family, framework that extends, family by removing

Abstract:We propose SANA-I2I, a text-free high-resolution image-to-image generation framework that extends the SANA family by removing textual conditioning entirely. In contrast to SanaControlNet, which combines text and image-based control, SANA-I2I relies exclusively on paired source-target images to learn a conditional flow-matching model in latent space. The model learns a conditional velocity field that maps one image distribution to another, enabling supervised image translation without reliance on language prompts. We evaluate the proposed approach on the challenging task of fetal MRI motion artifact reduction. To enable paired training in this application, where real paired data are difficult to acquire, we adopt a synthetic data generation strategy based on the method proposed by Duffy et al., which simulates realistic motion artifacts in fetal magnetic resonance imaging (MRI). Experimental results demonstrate that SANA-I2I effectively suppresses motion artifacts while preserving anatomical structure, achieving competitive performance in few inference steps. These results highlight the efficiency and suitability of our proposed flow-based, text-free generative models for supervised image-to-image tasks in medical imaging.
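
Editor's sketch: to make the "conditional velocity field" idea concrete, here is a toy flow-matching example under a common simplifying assumption, namely a straight-line path between a paired source and target. The abstract does not specify the probability path or how the source image conditions the model, so the linear path, the function names, and the Euler sampler below are illustrative assumptions, not the SANA-I2I implementation.

```python
def flow_matching_pair(source, target, t):
    """Build one training example for a linear (straight-line) path.

    source, target: paired images flattened to lists of floats (same length)
    t:              time in [0, 1]; t=0 is the source, t=1 is the target
    Returns (x_t, velocity): the interpolated point and the regression target.
    """
    x_t = [(1.0 - t) * s + t * g for s, g in zip(source, target)]
    velocity = [g - s for s, g in zip(source, target)]  # derivative of x_t in t
    return x_t, velocity


def euler_translate(source, velocity_fn, steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(source), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy "images" with two pixels each.
source, target = [0.0, 2.0], [1.0, 0.0]
x_t, velocity = flow_matching_pair(source, target, t=0.25)

# With a perfect (oracle) velocity field, a few Euler steps reach the target.
translated = euler_translate(source, lambda x, t: velocity, steps=4)
```

A trained network would replace the oracle lambda. Because the assumed path is straight, few integration steps suffice in this toy case, which is the intuition behind few-step inference in flow-based models.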

117. 【2604.00279】The Geometry of Compromise: Unlocking Generative Capabilities via Controllable Modality Alignment

Link: https://arxiv.org/abs/2604.00279

Authors: Hongyuan Liu, Qinli Yang, Wen Li, Zhong Zhang, Jiaming Liu, Wei Han, Zhili Qin, Jinxia Guo, Junming Shao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remain geometrically separated, shared embedding space, representations remain geometrically, CLIP learn, gap

Abstract:Vision-Language Models (VLMs) such as CLIP learn a shared embedding space for images and text, yet their representations remain geometrically separated, a phenomenon known as the modality gap. This gap limits tasks requiring cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing approaches can partially improve cross-modal compatibility; however, we show through geometric analysis that they primarily reduce the global centroid offset while leaving the underlying distributional mismatch intact. We decompose the modality gap into a Centroid Gap and a Distribution Gap, and demonstrate that the Distribution Gap is the true predictor of cross-modal task quality ($R^2 = 0.986$), whereas the commonly used Raw Gap is misleading ($R^2 = 0.691$). Motivated by this observation, we propose TPC-CMA (Three-Phase Curriculum for Cross-Modal Alignment), a fine-tuning framework that explicitly reduces both components. The proposed CMA jointly mitigates centroid offsets and reshapes the distributional structure, while a three-phase curriculum with gradient-aware scheduling progressively introduces alignment during training to enable stable optimization. Experiments demonstrate that our method significantly improves cross-modal alignment. With $\alpha_{\text{target}} = 0.05$, the modality gap is reduced by 66.6% with only a 4.84% accuracy drop. Under stronger alignment ($\alpha_{\text{target}} = 0.5$), the gap is reduced by 82.3%, clustering ARI improves from 0.318 to 0.516, and captioning CIDEr increases by 57.1% over the original model. Our code and pre-trained models will be made publicly available upon acceptance.
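
Editor's sketch: the claim that post-processing can remove the centroid offset while leaving a distributional mismatch can be illustrated with a toy example. The abstract does not define the paper's Centroid Gap or Distribution Gap, so the Euclidean centroid distance and the per-dimension variance used below are stand-ins chosen for illustration only.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def centroid_gap(image_emb, text_emb):
    """Euclidean distance between the two modality centroids."""
    return math.dist(centroid(image_emb), centroid(text_emb))


def recenter(vectors):
    """Subtract the modality centroid, the simplest post-processing fix."""
    c = centroid(vectors)
    return [[x - m for x, m in zip(v, c)] for v in vectors]


def spread(vectors):
    """Per-dimension variance, a crude stand-in for distribution shape."""
    c = centroid(vectors)
    n = len(vectors)
    return [sum((v[d] - c[d]) ** 2 for v in vectors) / n for d in range(len(c))]


# Toy embeddings: images vary along axis 0, texts vary along axis 1.
images = [[1.0, 0.0], [3.0, 0.0]]
texts = [[0.0, 4.0], [0.0, 6.0]]

raw_gap = centroid_gap(images, texts)                      # offset before recentering
fixed_gap = centroid_gap(recenter(images), recenter(texts))  # offset after recentering
image_spread = spread(recenter(images))
text_spread = spread(recenter(texts))
```

After recentering, the centroid offset vanishes, yet the two modalities still occupy different directions of the space (their variances sit on different axes). That residual mismatch is the kind of structure a centroid-only fix cannot address.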

118. 【2604.00276】Excite, Attend and Segment (EASe): Domain-Agnostic Fine-Grained Mask Discovery with Feature Calibration and Self-Supervised Upsampling

Link: https://arxiv.org/abs/2604.00276

Authors: Deepank Singh, Anurag Nihal, Vedhus Hoskere

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: leveraged foundation models, increasingly leveraged foundation, improve salient object, salient object discovery, foundation models

Abstract:Unsupervised segmentation approaches have increasingly leveraged foundation models (FM) to improve salient object discovery. However, these methods often falter in scenes with complex, multi-component morphologies, where fine-grained structural detail is indispensable. Many state-of-the-art unsupervised segmentation pipelines rely on mask discovery approaches that utilize coarse, patch-level representations. These coarse representations inherently suppress the fine-grained detail required to resolve such complex morphologies. To overcome this limitation, we propose Excite, Attend and Segment (EASe), an unsupervised domain-agnostic semantic segmentation framework for easy fine-grained mask discovery across challenging real-world scenes. EASe utilizes novel Semantic-Aware Upsampling with Channel Excitation (SAUCE) to excite low-resolution FM feature channels for selective calibration and attends across spatially-encoded image and FM features to recover full-resolution semantic representations. Finally, EASe segments the aggregated features into multi-granularity masks using a novel training-free Cue-Attentive Feature Aggregator (CAFE) which leverages SAUCE attention scores as a semantic grouping signal. EASe, together with SAUCE and CAFE, operates directly on pixel-level feature representations to enable accurate fine-grained dense semantic mask discovery. Our evaluation demonstrates superior performance of EASe over previous state-of-the-art (SOTA) methods across major standard benchmarks and diverse datasets with complex morphologies. Code is available at this https URL

119. 【2604.00270】OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

Link: https://arxiv.org/abs/2604.00270

Authors: Taiting Lu, Kaiyuan Lin, Yuxin Tian, Yubo Wang, Muchuan Wang, Sharique Khatri, Akshit Kartik, Yixi Wang, Amey Santosh Rane, Yida Wang, Yifan Yang, Yi-Chao Chen, Yincheng Jin, Mahanth Gowda

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent large multimodal, large multimodal models, made rapid progress, Printed Circuit Board, Recent large

Abstract:Recent large multimodal models (LMMs) have made rapid progress in visual grounding, document understanding, and diagram reasoning tasks. However, their ability to convert Printed Circuit Board (PCB) schematic diagrams into machine-readable spatially weighted netlist graphs, jointly capturing component attributes, connectivity, and geometry, remains largely underexplored, even though such graph representations are the backbone of practical electronic design automation (EDA) workflows. To bridge this gap, we introduce OmniSch, the first comprehensive benchmark designed to assess LMMs on schematic understanding and spatial netlist graph construction. OmniSch contains 1,854 real-world schematic diagrams and includes four tasks: (1) visual grounding for schematic entities, with 109.9K grounded instances aligning 423.4K diagram semantic labels to their visual regions; (2) diagram-to-graph reasoning, understanding the topological relationships among diagram elements; (3) geometric reasoning, constructing layout-dependent weights for each connection; and (4) tool-augmented agentic reasoning for visual search, invoking external tools to accomplish (1)-(3). Our results reveal substantial gaps in current LMMs' ability to interpret schematic engineering artifacts, including unreliable fine-grained grounding, brittle layout-to-graph parsing, inconsistent global connectivity reasoning, and inefficient visual exploration.

120. 【2604.00267】Omni-MMSI: Toward Identity-attributed Social Interaction Understanding

Link: https://arxiv.org/abs/2604.00267

Authors: Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James Matthew Rehg, Yapeng Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires comprehensive social, comprehensive social interaction, speech input, social interaction understanding, social interaction

Comments: Accepted to CVPR 2026. Project page: [this https URL](https://sampson-lee.github.io/omni-mmsi-project-page)

Abstract:We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is saying what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social cues with tools and conducts chain-of-thought social reasoning. To facilitate this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of the existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and counterparts on Omni-MMSI. Project page: this https URL.

121. 【2604.00265】Benchmarking Interaction, Beyond Policy: a Reproducible Benchmark for Collaborative Instance Object Navigation

Link: https://arxiv.org/abs/2604.00265

Authors: Edoardo Zorzi, Francesco Taioli, Yiming Wang, Marco Cristani, Alessandro Farinelli, Alberto Castellini, Loris Bazzani

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Collaborative Instance Object, Instance Object Navigation, similar object instances, enables an explicit, separate assessment

Abstract:We propose Question-Asking Navigation (QAsk-Nav), the first reproducible benchmark for Collaborative Instance Object Navigation (CoIN) that enables an explicit, separate assessment of embodied navigation and collaborative question asking. CoIN tasks an embodied agent with reaching a target specified in free-form natural language under partial observability, using only egocentric visual observations and interactive natural-language dialogue with a human, where the dialogue can help to resolve ambiguity among visually similar object instances. Existing CoIN benchmarks focus primarily on navigation success and offer no support for consistent evaluation of collaborative interaction. To address this limitation, QAsk-Nav provides (i) a lightweight question-asking protocol scored independently of navigation, (ii) an enhanced navigation protocol with realistic, diverse, high-quality target descriptions, and (iii) an open-source dataset that includes 28,000 quality-checked reasoning and question-asking traces for training and analyzing the interactive capabilities of CoIN models. Using the proposed QAsk-Nav benchmark, we develop Light-CoNav, a lightweight unified model for collaborative navigation that is 3x smaller and 70x faster than existing modular methods, while outperforming state-of-the-art CoIN approaches in generalization to unseen objects and environments. Project page at this https URL

122. 【2604.00250】PRISM: Differentiable Analysis-by-Synthesis for Fixel Recovery in Diffusion MRI

Link: https://arxiv.org/abs/2604.00250

Authors: Mohamed Abouagour, Atharva Shah, Eleftherios Garyfallidis

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion MRI microstructure, Diffusion MRI, MRI microstructure fitting, MRI microstructure, limits fiber peak

Comments: 10 pages, 1 figure, 2 tables

Abstract:Diffusion MRI microstructure fitting is nonconvex and often performed voxelwise, which limits fiber peak recovery in narrow crossings. This work introduces PRISM, a differentiable analysis-by-synthesis framework that fits an explicit multi-compartment forward model end-to-end over spatial patches. The model combines cerebrospinal fluid (CSF), gray matter, up to K white-matter fiber compartments (stick-and-zeppelin), and a restricted compartment, with explicit fiber directions and soft model selection via repulsion and sparsity priors. PRISM supports a fast MSE objective and a Rician negative log-likelihood (NLL) that jointly learns sigma without oracle information. A lightweight nuisance calibration module (smooth bias field and per-measurement scale/offset) is included for robustness and regularized to identity in clean-data tests. On synthetic crossing-fiber data (SNR=30; five methods, 16 crossing angles), PRISM achieves 3.5 degrees best-match angular error with 95% recall, which is 1.9x lower than the best baseline (MSMT-CSD, 6.8 degrees, 83% recall); in NLL mode with learned sigma, error drops to 2.3 degrees with 99% recall, resolving crossings down to 20 degrees. On the DiSCo1 phantom (NLL mode), PRISM improves connectivity correlation over CSD baselines at all four tracking angles (best r=.934 at 25 degrees vs. .920 for MSMT-CSD). Whole-brain HCP fitting (~741k voxels, MSE mode) completes in ~12 min on a single GPU with near-identical results across random seeds.
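
Editor's note: for readers unfamiliar with the Rician NLL mentioned above, the standard per-measurement form is given below as background. The symbols are assumptions for illustration: $x$ is the measured magnitude signal, $\nu$ is the noise-free signal predicted by a forward model, $\sigma$ is the noise level, and $I_0$ is the modified Bessel function of the first kind of order zero. How PRISM parameterizes or regularizes this objective is not stated in the abstract.

```latex
-\log p(x \mid \nu, \sigma)
  = -\log x + 2\log\sigma
  + \frac{x^{2} + \nu^{2}}{2\sigma^{2}}
  - \log I_0\!\left(\frac{x\,\nu}{\sigma^{2}}\right)
```

Because $\sigma$ appears in every term except $-\log x$, it can in principle be optimized jointly with the signal parameters, which matches the abstract's description of learning sigma without oracle information.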

123. 【2604.00243】UCell: rethinking generalizability and scaling of bio-medical vision models

Link: https://arxiv.org/abs/2604.00243

Authors: Nicholas Kuang, Vanessa Scalon, Ji Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Keywords: deep learning field, modern deep learning, deep learning, learning field, models

Abstract:The modern deep learning field is a scale-centric one. Larger models have been shown to consistently perform better than smaller models of similar architecture. In many sub-domains of biomedical research, however, model scaling is bottlenecked by the amount of available training data and the high cost associated with generating and validating additional high-quality data. Despite this practical hurdle, the majority of ongoing research still focuses on building bigger foundation models, whereas the alternative of improving the ability of small models has been under-explored. Here we experiment with building models with 10-30M parameters, tiny by modern standards, to perform the single-cell segmentation task. An important design choice is the incorporation of a recursive structure into the model's forward computation graph, leading to a more parameter-efficient architecture. We found that for single-cell segmentation, on multiple benchmarks, our small model, UCell, matches the performance of models 10-20 times its size, with similar generalizability to unseen out-of-domain data. More importantly, we found that UCell can be trained from scratch using only a set of microscopy imaging data, without relying on massive pretraining on natural images, and therefore decouples the model building from any external commercial interests. Finally, we examined and confirmed the adaptability of UCell by performing a wide range of one-shot and few-shot fine-tuning experiments on a diverse set of small datasets. Implementation is available at this https URL

124. 【2604.00199】QUEST: A robust attention formulation using query-modulated spherical attention

Link: https://arxiv.org/abs/2604.00199

Authors: Hariprasath Govindarajan, Per Sidén, Jacob Roll, Fredrik Lindsten

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Transformer model architecture, deep learning, attention, attention mechanism, simple Transformer models

Comments: Accepted to ICLR 2026

Abstract:The Transformer model architecture has become one of the most widely used in deep learning and the attention mechanism is at its core. The standard attention formulation uses a softmax operation applied to a scaled dot product between query and key vectors. We explore the role played by norms of the queries and keys, which can cause training instabilities when they arbitrarily increase. We demonstrate how this can happen even in simple Transformer models, in the presence of easy-to-learn spurious patterns in the data. We propose a new attention formulation, QUEry-modulated Spherical aTtention (QUEST), that constrains the keys to a hyperspherical latent space, while still allowing individual tokens to flexibly control the sharpness of the attention distribution. QUEST can be easily used as a drop-in replacement for standard attention. We focus on vision applications while also exploring other domains to highlight the method's generality. We show that (1) QUEST trains without instabilities and (2) produces models with improved performance (3) that are robust to data corruptions and adversarial attacks.
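
Editor's sketch: one plausible reading of "keys on a hypersphere, sharpness controlled per token" is to L2-normalize the keys and leave the queries unnormalized, so that the query norm behaves like an inverse temperature. The exact QUEST parameterization is not given in the abstract, so the code below is an assumption-laden toy for a single query, not the published formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def unit(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def spherical_attention(query, keys, values):
    """Single-query attention with keys projected onto the unit sphere.

    The query is left unnormalized, so its norm acts as an inverse
    temperature: a longer query yields a sharper attention distribution.
    """
    scores = [sum(q * k for q, k in zip(query, unit(key))) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


keys = [[3.0, 0.0], [0.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0]]

_, w_short = spherical_attention([1.0, 0.0], keys, values)
_, w_long = spherical_attention([4.0, 0.0], keys, values)

# Inflating key norms has no effect once keys live on the sphere.
big_keys = [[30.0, 0.0], [0.0, 5.0]]
_, w_big = spherical_attention([1.0, 0.0], big_keys, values)
```

In this toy, growing key norms cannot inflate the logits, which addresses the instability source described in the abstract, while the longer query still concentrates more weight on its best-matching key.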

125. 【2604.00175】Sit-to-Stand Transitions Detection and Duration Measurement Using Smart Lacelock Sensor

Link: https://arxiv.org/abs/2604.00175

Authors: Md Rafi Islam, Md Rejwanul Haque, Elizabeth Choma, Shannon Hayes, Siobhan McMahon, Xiangrong Shen, Edward Sazonov

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: experience age-related declines, Postural stability, Smart Lacelock sensor, independent living, stability during movement

Comments: 10 pages, 11 figures

Abstract:Postural stability during movement is fundamental to independent living, fall prevention, and overall health, particularly among older adults who experience age-related declines in balance, muscle strength, and mobility. Among daily functional activities, the Sit-to-Stand (SiSt) transition is a critical indicator of lower-limb strength, musculoskeletal health, and fall risk, making it an essential parameter for assessing functional capacity and monitoring physical decline in aging populations. This study presents a methodology for SiSt transition detection and duration measurement using the Smart Lacelock sensor, a lightweight, shoe-mounted device that integrates a load cell, accelerometer, and gyroscope for motion analysis. The methodology was evaluated in 16 older adults (mean age 76.84 years, SD 3.45 years) performing SiSt tasks within the Short Physical Performance Battery (SPPB) protocol. Features extracted from multimodal signals were used to train and evaluate four machine learning classifiers using 4-fold participant-independent cross-validation to classify SiSt transitions and measure their duration. The bagged tree classifier achieved an accuracy of 0.98 and an F1 score of 0.8 in classifying SiSt transitions. For correctly classified transitions, the mean absolute error in duration measurement was 0.047 seconds (SD 0.07 seconds). These findings highlight the potential of the Smart Lacelock sensor for real-world fall-risk assessment and mobility monitoring in older adults.
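
Editor's sketch: "participant-independent" cross-validation means that all samples from a given person fall on the same side of the train/test split. A minimal way to build such folds is shown below. The round-robin assignment and the function name are illustrative assumptions; the abstract does not say how participants were allocated to the four folds.

```python
def participant_folds(participant_ids, n_folds=4):
    """Assign whole participants to folds so no person spans train and test.

    participant_ids: one ID per sample (for example, per sensor window)
    Yields (train_indices, test_indices) for each fold.
    """
    unique_ids = sorted(set(participant_ids))
    # Round-robin assignment of participants (not samples) to folds.
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique_ids)}
    for fold in range(n_folds):
        test = [i for i, pid in enumerate(participant_ids) if fold_of[pid] == fold]
        train = [i for i, pid in enumerate(participant_ids) if fold_of[pid] != fold]
        yield train, test


# 16 hypothetical participants with 3 samples each, mirroring the study size.
ids = [p for p in range(16) for _ in range(3)]
folds = list(participant_folds(ids, n_folds=4))
```

With 16 participants and 4 folds, each test fold holds out 4 people. Splitting by sample instead would let windows from the same person leak into both sets and typically inflate the reported accuracy.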

126. 【2604.00172】Suppressing Non-Semantic Noise in Masked Image Modeling Representations

Link: https://arxiv.org/abs/2604.00172

Authors: Martine Hjelkrem-Tan, Marius Aasan, Rwiddhi Chakraborty, Gabriel Y. Arteaga, Changkyu Choi, Adín Ramírez Rivera

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Masked Image Modeling, self-supervised vision paradigm, ubiquitous self-supervised vision, Masked Image, Image Modeling

Comments: Published in CVPR 2026

Abstract:Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantically Orthogonal Artifact Projection (SOAP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOAP is a post-hoc suppression method, requires zero training, and can be attached to any model as a single linear head.
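
Editor's sketch: the general idea of suppressing a PCA-derived direction with a single linear map can be illustrated as follows. This toy finds the leading principal direction of features from "non-semantic" inputs by power iteration and projects it out of a patch representation. The number of components, the choice of inputs, and all names are assumptions for illustration; the abstract does not give SOAP's actual construction.

```python
import math


def top_component(features, iters=100):
    """Leading principal direction of mean-centered features (power iteration).

    Assumes the features are not all identical, so the norm below is nonzero.
    """
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    centered = [[f[d] - mean[d] for d in range(dim)] for f in features]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Multiply by the (unnormalized) covariance without forming it: X^T (X v).
        proj = [sum(x * y for x, y in zip(row, v)) for row in centered]
        w = [sum(p * row[d] for p, row in zip(proj, centered)) for d in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project_out(vector, direction):
    """Remove the component of `vector` along a unit-norm `direction`."""
    coeff = sum(a * b for a, b in zip(vector, direction))
    return [a - coeff * b for a, b in zip(vector, direction)]


# Hypothetical features of non-semantic images: all variation lies on axis 0.
non_semantic = [[-2.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]
artifact_direction = top_component(non_semantic)

# A patch representation with both artifact (axis 0) and semantic (axis 1) content.
cleaned = project_out([3.0, 5.0], artifact_direction)
```

Since projecting out a fixed direction is the linear map $I - uu^\top$, it can be folded into one linear layer applied after the encoder, consistent with the abstract's description of a post-hoc head that needs no training.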

127. 【2604.00161】Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

Link: https://arxiv.org/abs/2604.00161

Authors: Longwei Xu, Feng Feng, Shaojie Zhang, Xin Chen, Hang Li, Anan Du, Hailong Yu, Pei Fu, Zhenbo Luo, Jian Luan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Optical Character Recognition, Optical Character, modern vision-language models, visual question answering, support downstream reasoning

Abstract:Optical Character Recognition (OCR) is increasingly regarded as a foundational capability for modern vision-language models (VLMs), enabling them not only to read text in images but also to support downstream reasoning in real-world visual question answering (VQA). However, practical applications further require reliable text anchors, i.e., accurately grounding queried text to its corresponding spatial region. To systematically evaluate this capability, we introduce TextAnchor-Bench (TABench), a benchmark for fine-grained text-region grounding, which reveals that both general-purpose and OCR-specific VLMs still struggle to establish accurate and stable text anchors. To address this limitation, we propose Q-Mask, a precise OCR framework built upon a causal query-driven mask decoder (CQMD). Inspired by chain-of-thought reasoning, Q-Mask performs causal visual decoding that sequentially generates query-conditioned visual masks before producing the final OCR output. This visual CoT paradigm disentangles where the text is from what the text is, enforcing grounded evidence acquisition prior to recognition and enabling explicit text anchor construction during inference. To train CQMD, we construct TextAnchor-26M, a large-scale dataset of image-text pairs annotated with fine-grained masks corresponding to specific textual elements, encouraging stable text-region correspondences and injecting strong spatial priors into VLM training. Extensive experiments demonstrate that Q-Mask substantially improves text anchoring and understanding across diverse visual scenes.

128. 【2604.00093】RawGen: Learning Camera Raw Image Generation

Link: https://arxiv.org/abs/2604.00093

Authors: Dongyoung Kim, Junyong Lee, Abhijith Punnappurath, Mahmoud Afifi, Sangmin Han, Alex Levinshtein, Michael S. Brown

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image signal processors, onboard image signal, signal processors, Cameras capture scene-referred, processed by onboard

Abstract:Cameras capture scene-referred linear raw images, which are processed by onboard image signal processors (ISPs) into display-referred 8-bit sRGB outputs. Although raw data is more faithful for low-level vision tasks, collecting large-scale raw datasets remains a major bottleneck, as existing datasets are limited and tied to specific camera hardware. Generative models offer a promising way to address this scarcity -- however, existing diffusion frameworks are designed to synthesize photo-finished sRGB images rather than physically consistent linear representations. This paper presents RawGen, to our knowledge the first diffusion-based framework enabling text-to-raw generation for arbitrary target cameras, alongside sRGB-to-raw inversion. RawGen leverages the generative priors of large-scale sRGB diffusion models to synthesize physically meaningful linear outputs, such as CIE XYZ or camera-specific raw representations, via specialized processing in latent and pixel spaces. To handle unknown and diverse ISP pipelines and photo-finishing effects in diffusion-model training data, we build a many-to-one inverse-ISP dataset where multiple sRGB renditions of the same scene generated using diverse ISP parameters are anchored to a common scene-referred target. Fine-tuning a conditional denoiser and specialized decoder on this dataset allows RawGen to obtain camera-centric linear reconstructions that effectively invert the rendering pipeline. We demonstrate RawGen's superior performance over traditional inverse-ISP methods that assume a fixed ISP. Furthermore, we show that augmenting training pipelines with RawGen's scalable, text-driven synthetic data can benefit downstream low-level vision tasks.

129. 【2604.00086】Hierarchical Pre-Training of Vision Encoders with Large Language Models

Link: https://arxiv.org/abs/2604.00086

Authors: Eugene Lee, Ting-Yu Chang, Jui-Huang Tsai, Jiajie Diao, Chen-Yi Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: experienced significant advancements, scalable vision encoders, vision encoders, vision encoder, treat vision encoders

Comments: 17 pages, 14 figures, accepted to Computer Vision and Pattern Recognition Conference (CVPR) Workshops 2026. 5th MMFM Workshop: What is Next in Multimodal Foundation Models?

Abstract:The field of computer vision has experienced significant advancements through scalable vision encoders and multimodal pre-training frameworks. However, existing approaches often treat vision encoders and large language models (LLMs) as independent modules, limiting the integration of hierarchical visual features. In this work, we propose HIVE (Hierarchical Pre-Training of Vision Encoders), a novel framework that enhances vision-language alignment by introducing hierarchical cross-attention between the vision encoder and LLM. Unlike conventional methods that flatten image embeddings, HIVE enables structured feature fusion across multiple layers, improving gradient flow and representation learning. To optimize this interaction, we introduce a three-stage training strategy that progressively aligns the vision encoder with the LLM, ensuring stable optimization and effective multimodal fusion. Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and ScienceQA. Our results highlight the benefits of hierarchical feature integration, paving the way for more efficient and expressive vision-language models.

130. 【2604.00055】Generalizable Dense Reward for Long-Horizon Robotic Tasks

Link: https://arxiv.org/abs/2604.00055

Authors: Silong Yong, Stephen Sheng, Carl Qi, Xiaojie Wang, Evan Sheehan, Anurag Shivaprasad, Yaqi Xie, Katia Sycara, Yesh Dattatreya

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Existing robotic foundation, robotic foundation policies, Existing robotic, large-scale imitation learning, robotic foundation

Comments: Project page: [this https URL](https://silongyong.github.io/vllr_project_page/)

Abstract:Existing robotic foundation policies are trained primarily via large-scale imitation learning. While such models demonstrate strong capabilities, they often struggle with long-horizon tasks due to distribution shift and error accumulation. Reinforcement learning (RL) can finetune these models, but it does not work well across diverse tasks without manual reward engineering. We propose VLLR, a dense reward framework combining (1) an extrinsic reward from Large Language Models (LLMs) and Vision-Language Models (VLMs) for task progress recognition, and (2) an intrinsic reward based on policy self-certainty. VLLR uses LLMs to decompose tasks into verifiable subtasks and then VLMs to estimate progress, which initializes the value function during a brief warm-up phase and avoids prohibitive inference cost during full training; self-certainty provides per-step intrinsic guidance throughout PPO finetuning. Ablation studies reveal complementary benefits: VLM-based value initialization primarily improves task completion efficiency, while self-certainty primarily enhances success rates, particularly on out-of-distribution tasks. On the CHORES benchmark covering mobile manipulation and navigation, VLLR achieves up to 56% absolute success rate gains over the pretrained policy, up to 5% gains over state-of-the-art RL finetuning methods on in-distribution tasks, and up to 10% gains on out-of-distribution tasks, all without manual reward engineering. Additional visualizations can be found at this https URL

131. 【2603.19660】Semantic Audio-Visual Navigation in Continuous Environments

Link: https://arxiv.org/abs/2603.19660

Authors: Yichen Zeng, Hebaixu Wang, Meng Liu, Yu Zhou, Chen Gao, Kehan Chen, Gongping Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: enables embodied agents, navigation enables embodied, Semantic Audio-Visual Navigation, navigate toward sound-emitting, leveraging both auditory

Comments: This paper has been accepted to CVPR 2026

Abstract: Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at this https URL.

132. 【2604.01167】AdaLoRA-QAT: Adaptive Low-Rank and Quantization-Aware Segmentation

Link: https://arxiv.org/abs/2604.01167

Authors: Prantik Deb, Srimanth Dhondy, N. Ramakrishna, Anu Kapoor, Raju S. Bapi, Tapabrata Chakraborti

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Chest X-ray, settings remains challenging, remains challenging due, clinical settings remains, deploying large foundation

Comments: Accepted to ISBI 2026 (Oral Presentation)

Abstract: Chest X-ray (CXR) segmentation is an important step in computer-aided diagnosis, yet deploying large foundation models in clinical settings remains challenging due to computational constraints. We propose AdaLoRA-QAT, a two-stage fine-tuning framework that combines adaptive low-rank encoder adaptation with full quantization-aware training. Adaptive rank allocation improves parameter efficiency, while selective mixed-precision INT8 quantization preserves structural fidelity crucial for clinical reliability. Evaluated across large-scale CXR datasets, AdaLoRA-QAT achieves 95.6% Dice, matching full-precision SAM decoder fine-tuning while reducing trainable parameters by $16.6\times$ and yielding $2.24\times$ model compression. A Wilcoxon signed-rank test confirms that quantization does not significantly degrade segmentation accuracy. These results demonstrate that AdaLoRA-QAT effectively balances accuracy, efficiency, and structural trustworthiness, enabling compact and deployable foundation models for medical image segmentation. Code and pretrained models are available at: this https URL
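For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of how the metric is usually computed for a pair of binary segmentation masks. It is not the authors' evaluation code; the function name `dice_coefficient` and the toy masks are illustrative assumptions, and the empty-mask convention is a common choice rather than something stated in the abstract.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        # Both masks empty: counted as perfect agreement (a convention,
        # not something taken from the paper).
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 2x3 masks, flattened row by row.
predicted = [0, 1, 1, 1, 0, 0]
reference = [0, 1, 1, 0, 0, 1]
print(round(dice_coefficient(predicted, reference), 4))  # 0.6667
```

Here two of the three foreground pixels in each mask overlap, so the score is 2 * 2 / (3 + 3).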

133. 【2604.00359】AI-assisted Human-in-the-Loop Web Platform for Structural Characterization in Hard drive design

Link: https://arxiv.org/abs/2604.00359

Authors: Utkarsh Pratiush, Huaixun Huyan, Maryam Zahiri Azar, Esmeralda Yitamben, Allen Bourez, Sergei V Kalinin, Vasfi Burak Ozdol

Categories: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scanning transmission electron, transmission electron microscopy, define device performance, Scanning transmission, enabling nanoscale analysis

Abstract: Scanning transmission electron microscopy (STEM) has become a cornerstone instrument for semiconductor materials metrology, enabling nanoscale analysis of complex multilayer structures that define device performance. Developing effective metrology workflows for such systems requires balancing automation with flexibility; rigid pipelines are brittle to sample variability, while purely manual approaches are slow and subjective. Here, we present a tunable human-AI-assisted workflow framework that enables modular and adaptive analysis of STEM images for device characterization. As an illustrative example, we demonstrate a workflow for automated layer thickness and interface roughness quantification in multilayer thin films. The system integrates gradient-based peak detection with interactive correction modules, allowing human input at the design stage while maintaining fully automated execution across samples. Implemented as a web-based interface, it processes TEM/EMD files directly, applies noise reduction and interface tracking algorithms, and outputs statistical roughness and thickness metrics with nanometer precision. This architecture exemplifies a general approach toward adaptive, reusable metrology workflows, bridging human insight and machine precision for scalable, standardized analysis in semiconductor manufacturing. The code is made available at this https URL.
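The gradient-based peak detection mentioned above can be pictured with a small sketch on a 1D intensity profile taken across the layers. This is only an illustration under stated assumptions, not the platform's implementation: the function names, the threshold, the pixel size of 0.5 nm, and the toy profile are all hypothetical, and roughness is taken here simply as the standard deviation of one interface's position across image columns.

```python
import statistics


def detect_interfaces(profile, threshold):
    """Return indices i where |profile[i+1] - profile[i]| is a local peak above threshold."""
    # Forward-difference gradient magnitude between neighbouring pixels.
    magnitude = [abs(profile[i + 1] - profile[i]) for i in range(len(profile) - 1)]
    interfaces = []
    for i in range(1, len(magnitude) - 1):
        is_peak = magnitude[i] > magnitude[i - 1] and magnitude[i] >= magnitude[i + 1]
        if is_peak and magnitude[i] >= threshold:
            interfaces.append(i)
    return interfaces


def layer_thicknesses(interfaces, nm_per_pixel):
    """Thickness of each layer between consecutive interfaces, in nanometres."""
    return [(b - a) * nm_per_pixel for a, b in zip(interfaces, interfaces[1:])]


def interface_roughness(positions, nm_per_pixel):
    """Standard deviation of one interface's position across columns, in nanometres."""
    return statistics.pstdev(positions) * nm_per_pixel


# Hypothetical profile: substrate (10), an 8-pixel layer (50), then a cap layer (20).
profile = [10] * 5 + [50] * 8 + [20] * 6
edges = detect_interfaces(profile, threshold=5.0)
print(edges)                              # [4, 12]
print(layer_thicknesses(edges, 0.5))      # [4.0]
print(interface_roughness([4, 5, 4, 5], 0.5))
```

In an interactive workflow of the kind the abstract describes, a user would presumably tune the threshold once at the design stage and then reuse it across samples.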

134. 【2604.00263】Feature-level Site Leakage Reduction for Cross-Hospital Chest X-ray Transfer via Self-Supervised Learning

Link: https://arxiv.org/abs/2604.00263

Authors: Ayoub Louaye Bouaziz, Lokmane Chebouba

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: chest X-ray models, chest X-ray, X-ray models, domain shift, work assumes invariance

Comments: Accepted at The 7th International Conference on Computing Systems and Applications [Algiers, 2026]

Abstract: Cross-hospital failure in chest X-ray models is often attributed to domain shift, yet most work assumes invariance without measuring it. This paper studies how to measure site leakage directly and how that measurement changes conclusions about transfer methods. We study multi-site self-supervised learning (SSL) and feature-level adversarial site confusion for cross-hospital transfer. We pretrain a ResNet-18 on NIH and CheXpert without pathology labels. We then freeze the encoder and train a linear pneumonia classifier on NIH only, evaluating transfer to RSNA. We quantify site leakage using a post hoc linear probe that predicts acquisition site from frozen backbone features $f$ and projection features $z$. Across 3 random seeds, multi-site SSL improves RSNA AUC from 0.6736 $\pm$ 0.0148 (ImageNet initialization) to 0.7804 $\pm$ 0.0197. Adding adversarial site confusion on $f$ reduces measured leakage but does not reliably improve AUC and increases variance. On $f$, site probe accuracy drops from 0.9890 $\pm$ 0.0021 (SSL-only) to 0.8504 $\pm$ 0.0051 (CanonicalF), where chance is 0.50. On $z$, probe accuracy drops from 0.8912 $\pm$ 0.0092 to 0.7810 $\pm$ 0.0250. These results show that measuring leakage changes how transfer methods should be interpreted: multi-site SSL drives transfer, while adversarial confusion exposes the limits of invariance assumptions.
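The idea of a post hoc linear probe for site leakage can be sketched as follows: hold the features fixed, fit a linear classifier to predict the acquisition site, and compare held-out accuracy with the 0.50 chance level. The code below is a rough illustration on synthetic 2D features, not the authors' pipeline; the helper names, the synthetic generator, and the plain gradient-descent logistic regression are all assumptions made for the example.

```python
import math
import random


def train_linear_probe(features, sites, lr=0.1, epochs=200):
    """Fit logistic regression on fixed features by batch gradient descent."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, sites):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, sites):
    """Fraction of samples whose site is predicted correctly by the probe."""
    correct = 0
    for x, y in zip(features, sites):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((z > 0) == (y == 1))
    return correct / len(features)


def synthetic_features(rng, site, count):
    """Hypothetical 'frozen features': dimension 0 leaks the site, dimension 1 is noise."""
    centre = 2.0 if site == 1 else -2.0
    return [[rng.gauss(centre, 0.5), rng.gauss(0.0, 1.0)] for _ in range(count)]


rng = random.Random(0)
train_x = synthetic_features(rng, 0, 100) + synthetic_features(rng, 1, 100)
train_y = [0] * 100 + [1] * 100
test_x = synthetic_features(rng, 0, 50) + synthetic_features(rng, 1, 50)
test_y = [0] * 50 + [1] * 50

w, b = train_linear_probe(train_x, train_y)
leakage = probe_accuracy(w, b, test_x, test_y)
print(f"site probe accuracy: {leakage:.2f} (chance is 0.50)")
```

A probe accuracy well above 0.50 indicates that site identity is linearly recoverable from the features, which is the sense of "leakage" used in the abstract.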

135. 【2604.00225】Pupil Design for Computational Wavefront Estimation

Link: https://arxiv.org/abs/2604.00225

Authors: Ali Almuallem, Nicholas Chimitt, Bole Ma, Qi Guo, Stanley H. Chan

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Establishing a precise, computational microscopy, adaptive optics, precise connection, connection between imaged

Abstract: Establishing a precise connection between imaged intensity and the incident wavefront is essential for emerging applications in adaptive optics, holography, computational microscopy, and non-line-of-sight imaging. While prior work has shown that breaking symmetries in pupil design enables wavefront recovery from a single intensity measurement, there is little guidance on how to design a pupil that improves wavefront estimation. In this work, we introduce a quantitative asymmetry metric to bridge this gap and, through an extensive empirical study and supporting analysis, demonstrate that increasing asymmetry enhances wavefront recoverability. We analyze the trade-offs in pupil design, including the impact on light throughput and performance under noise. Both large-scale simulations and optical bench experiments are carried out to support our findings.

136. 【2604.00070】Brain MR Image Synthesis with Multi-contrast Self-attention GAN

Link: https://arxiv.org/abs/2604.00070

Authors: Zaid A. Abod, Furqan Aziz

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Magnetic Resonance Imaging, multi-modal Magnetic Resonance, Resonance Imaging, Magnetic Resonance, complete multi-modal Magnetic

Comments: This work has been submitted to the IEEE for possible publication

Abstract: Accurate and complete multi-modal Magnetic Resonance Imaging (MRI) is essential for neuro-oncological assessment, as each contrast provides complementary anatomical and pathological information. However, acquiring all modalities (e.g., T1c, T1n, T2, T2f) for every patient is often impractical due to time, cost, and patient discomfort, potentially limiting comprehensive tumour evaluation. We propose 3D-MC-SAGAN (3D Multi-Contrast Self-Attention Generative Adversarial Network), a unified 3D multi-contrast synthesis framework that generates high-fidelity missing modalities from a single T2 input while explicitly preserving tumour characteristics. The model employs a multi-scale 3D encoder-decoder generator with residual connections and a novel Memory-Bounded Hybrid Attention (MBHA) block to capture long-range dependencies efficiently, and is trained with a WGAN-GP critic and an auxiliary contrast-conditioning branch to produce T2f, T1n, and T1c volumes within a single unified network. A frozen 3D U-Net-based segmentation module introduces a segmentation-consistency constraint to preserve lesion morphology. The composite objective integrates adversarial, reconstruction, perceptual, structural similarity, contrast-classification, and segmentation-guided losses to align global realism with tumour-preserving structure. Extensive evaluation on 3D brain MRI datasets demonstrates that 3D-MC-SAGAN achieves state-of-the-art quantitative performance and generates visually coherent, anatomically plausible contrasts with improved distribution-level realism. Moreover, it maintains tumour segmentation accuracy comparable to fully acquired multi-modal inputs, highlighting its potential to reduce acquisition burden while preserving clinically meaningful information.