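This post lists the latest papers collected daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
516 papers were updated today, including:
- Natural language processing: 131 papers
- Information retrieval: 14 papers
- Computer vision: 88 papers
Natural Language Processing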
1. 【2601.04160】All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection
Link: https://arxiv.org/abs/2601.04160
Authors: Yuechen Jiang, Zhiwei Liu, Yupeng Cao, Yueru He, Ziyang Xu, Chen Xu, Zhiyang Deng, Prayag Tiwari, Xi Chen, Alejandro Lopez-Lira, Jimin Huang, Junichi Tsujii, Sophia Ananiadou
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP)
Keywords: introduce RFC Bench, evaluating large language, RFC Bench, RFC Bench operates, large language models
Comments: 39 pages; 24 figures
Abstract:We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation under realistic news. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference free misinformation detection and comparison based diagnosis using paired original perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference free reasoning and advancing more reliable financial misinformation detection in real world settings.
2. 【2601.04157】FLEx: Language Modeling with Few-shot Language Explanations
Link: https://arxiv.org/abs/2601.04157
Authors: Adar Avsian, Christopher Richardson, Anirudh Sundar, Larry Heck
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: open-domain question answering, math problem solving, range of tasks, question answering, wide range
Abstract: Language models have become effective at a wide range of tasks, from math problem solving to open-domain question answering. However, they still make mistakes, and these mistakes are often repeated across related queries. Natural language explanations can help correct these errors, but collecting them at scale may be infeasible, particularly in domains where expert annotators are required. To address this issue, we introduce FLEx (Few-shot Language Explanations), a method for improving model behavior using a small number of explanatory examples. FLEx selects representative model errors using embedding-based clustering, verifies that the associated explanations correct those errors, and summarizes them into a prompt prefix that is prepended at inference-time. This summary guides the model to avoid similar errors on new inputs, without modifying model weights. We evaluate FLEx on CounterBench, GSM8K, and ReasonIF. We find that FLEx consistently outperforms chain-of-thought (CoT) prompting across all three datasets and reduces up to 83% of CoT's remaining errors.
3. 【2601.04135】LLMberjack: Guided Trimming of Debate Trees for Multi-Party Conversation Creation
Link: https://arxiv.org/abs/2601.04135
Authors: Leonardo Bottona, Nicolò Penzo, Bruno Lepri, Marco Guerini, Sara Tonelli
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: present LLMberjack, existing debates, originally structured, starting from existing, structured as reply
Comments: 9 pages, 3 figures
Abstract:We present LLMberjack, a platform for creating multi-party conversations starting from existing debates, originally structured as reply trees. The system offers an interactive interface that visualizes discussion trees and enables users to construct coherent linearized dialogue sequences while preserving participant identity and discourse relations. It integrates optional large language model (LLM) assistance to support automatic editing of the messages and speakers' descriptions. We demonstrate the platform's utility by showing how tree visualization facilitates the creation of coherent, meaningful conversation threads and how LLM support enhances output quality while reducing human effort. The tool is open-source and designed to promote transparent and reproducible workflows to create multi-party conversations, addressing a lack of resources of this type.
4. 【2601.04131】ContextFocus: Activation Steering for Contextual Faithfulness in Large Language Models
Link: https://arxiv.org/abs/2601.04131
Authors: Nikhil Anand, Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, Koyel Mukherjee
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, encode vast amounts, encode vast, Language Models
Abstract:Large Language Models (LLMs) encode vast amounts of parametric knowledge during pre-training. As world knowledge evolves, effective deployment increasingly depends on their ability to faithfully follow externally retrieved context. When such evidence conflicts with the model's internal knowledge, LLMs often default to memorized facts, producing unfaithful outputs. In this work, we introduce ContextFocus, a lightweight activation steering approach that improves context faithfulness in such knowledge-conflict settings while preserving fluency and efficiency. Unlike prior approaches, our solution requires no model finetuning and incurs minimal inference-time overhead, making it highly efficient. We evaluate ContextFocus on the ConFiQA benchmark, comparing it against strong baselines including ContextDPO, COIECD, and prompting-based methods. Furthermore, we show that our method is complementary to prompting strategies and remains effective on larger models. Extensive experiments show that ContextFocus significantly improves contextual-faithfulness. Our results highlight the effectiveness, robustness, and efficiency of ContextFocus in improving contextual-faithfulness of LLM outputs.
5. 【2601.04126】InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
Link: https://arxiv.org/abs/2601.04126
Authors: Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, Yan Lu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: practical AI assistants, interact with graphical, graphical interfaces, interfaces on behalf, behalf of users
Comments: Work In Progress
Abstract:GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well on generating a single webpage, building a realistic and functional website with many interconnected pages faces challenges. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seed with reference design image to ensure diversity. Our system also generates verifiable task evaluators enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of proposed system.
6. 【2601.04098】Layer-wise Positional Bias in Short-Context Language Modeling
Link: https://arxiv.org/abs/2601.04098
Authors: Maryam Rahimi, Mahdi Nouri, Yadollah Yaghoobzadeh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: semantic relevance, information from specific, positions, specific positions
Abstract:Language models often show a preference for using information from specific positions in the input regardless of semantic relevance. While positional bias has been studied in various contexts, from attention sinks to task performance degradation in long-context settings, prior work has not established how these biases evolve across individual layers and input positions, or how they vary independent of task complexity. We introduce an attribution-based framework to analyze positional effects in short-context language modeling. Using layer conductance with a sliding-window approach, we quantify how each layer distributes importance across input positions, yielding layer-wise positional importance profiles. We find that these profiles are architecture-specific, stable across inputs, and invariant to lexical scrambling. Characterizing these profiles, we find prominent recency bias that increases with depth and subtle primacy bias that diminishes through model depth. Beyond positional structure, we also show that early layers preferentially weight content words over function words across all positions, while later layers lose this word-type differentiation.
7. 【2601.04093】SearchAttack: Red-Teaming LLMs against Real-World Threats via Framing Unsafe Web Information-Seeking Tasks
Link: https://arxiv.org/abs/2601.04093
Authors: Yu Yan, Sheng Sun, Mingfeng Li, Zheming Yang, Chiwei Zhu, Fei Ma, Benfeng Xu, Min Liu
Categories: Computation and Language (cs.CL)
Keywords: people have suffered, mitigate this issue, increasingly aware, unreliability gap, open and knowledge-intensive
Comments: We find that the key to jailbreak the LLM is objectifying its safety responsibility, thus we delegate the open-web to inject harmful semantics and get the huge gain from unmoderated web resources
Abstract: Recently, people have suffered and become increasingly aware of the unreliability gap in LLMs for open and knowledge-intensive tasks, and thus turn to search-augmented LLMs to mitigate this issue. However, when the search engine is triggered for harmful tasks, the outcome is no longer under the LLM's control. Once the returned content directly contains targeted, ready-to-use harmful takeaways, the LLM's safeguards cannot withdraw that exposure. Motivated by this dilemma, we identify web search as a critical attack surface and propose SearchAttack for red-teaming. SearchAttack outsources the harmful semantics to web search, retaining only the query's skeleton and fragmented clues, and further steers LLMs to reconstruct the retrieved content via structural rubrics to achieve malicious goals. Extensive experiments are conducted to red-team the search-augmented LLMs for responsible vulnerability assessment. Empirically, SearchAttack demonstrates strong effectiveness in attacking these systems.
8. 【2601.04086】KDCM: Reducing Hallucination in LLMs through Explicit Reasoning Structures
Link: https://arxiv.org/abs/2601.04086
Authors: Jinbo Hao, Kai Yang, Qingzhen Su, Yifan Li, Chao Jiang
Categories: Computation and Language (cs.CL)
Keywords: large language models, large language, focuses on errors, errors induced, framework that focuses
Abstract:To mitigate hallucinations in large language models (LLMs), we propose a framework that focuses on errors induced by prompts. Our method extends a chain-style knowledge distillation approach by incorporating a programmable module that guides knowledge graph exploration. This module is embedded as executable code within the reasoning prompt, allowing the model to leverage external structured knowledge during inference. Based on this design, we develop an enhanced distillation-based reasoning framework that explicitly regulates intermediate reasoning steps, resulting in more reliable predictions. We evaluate the proposed approach on multiple public benchmarks using GPT-4 and LLaMA-3.3. Experimental results show that code-guided reasoning significantly improves contextual modeling and reduces prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 increase by 15.64%, 13.38%, and 13.28%, respectively, with scores exceeding 95% across several evaluation settings. These findings indicate that the proposed method effectively constrains erroneous reasoning while improving both accuracy and interpretability.
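Note on the metric: the abstract above reports HIT@1, HIT@3, and HIT@5. As a rough illustration of how such a metric is commonly computed (a minimal sketch, not the authors' evaluation code), HIT@k is the fraction of queries whose gold answer appears among the top k ranked candidates. The function name `hit_at_k` and the toy data are assumptions made for this example.

```python
from typing import Sequence


def hit_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold answer is among the top-k candidates."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for candidates, answer in zip(ranked, gold) if answer in candidates[:k])
    return hits / len(gold)


# Toy data invented for illustration only (not taken from the paper).
ranked = [
    ["paris", "lyon", "nice"],
    ["rome", "milan", "turin"],
    ["bonn", "berlin", "munich"],
]
gold = ["paris", "turin", "berlin"]

for k in (1, 2, 3):
    print(f"HIT@{k} = {hit_at_k(ranked, gold, k):.3f}")
```

By construction HIT@k cannot decrease as k grows, which is why HIT@5 scores are at least as high as HIT@1 scores for the same system.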
9. 【2601.04073】Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts
Link: https://arxiv.org/abs/2601.04073
Authors: Zhihao Zhu, Jiafeng Liang, Shixin Jiang, Jinlan Fu, Ming Liu, Guanglu Sun, See-Kiong Ng, Bing Qin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Multimodal Models, Large Multimodal, demonstrated impressive capabilities, Multimodal Models, demonstrated impressive
Comments: 10 pages, 5 figures
Abstract:Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode termed textual inertia, where once a textual hallucination occurs in the thinking process, models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. To systematically investigate this, we propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs spanning both native reasoning architectures and prompt-driven paradigms to evaluate their self-reflection capabilities. The results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. To mitigate this, we introduce Active Visual-Context Refinement, a training-free inference paradigm which orchestrates an active visual re-grounding mechanism to enforce fine-grained verification coupled with an adaptive context refinement strategy to summarize and denoise the reasoning history. Experiments demonstrate that our approach significantly stifles hallucination propagation and enhances reasoning robustness.
10. 【2601.04056】Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion
Link: https://arxiv.org/abs/2601.04056
Authors: Yuanfeng Xu, Yuhao Chen, Liang Lin, Guangrun Wang
Categories: Computation and Language (cs.CL)
Keywords: hinders the development, autoregressive approaches, Masked Language Models, diffusion approaches
Comments: 10 pages, 5 figures
Abstract: The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a Variable-Rate Noise Schedule, conditioned on these evolving semantic priors. Crucially, we introduce a Stochastic Mixed-Modal Transport strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
11. 【2601.04055】Modular Prompt Optimization: Optimizing Structured Prompts with Section-Local Textual Gradients
Link: https://arxiv.org/abs/2601.04055
Authors: Prith Sharma, Austin Z. Henley
Categories: Computation and Language (cs.CL)
Keywords: Prompt quality plays, smaller open-source instruction-tuned, controlling the behavior, Prompt, quality plays
Abstract:Prompt quality plays a central role in controlling the behavior, reliability, and reasoning performance of large language models (LLMs), particularly for smaller open-source instruction-tuned models that depend heavily on explicit structure. While recent work has explored automatic prompt optimization using textual gradients and self-refinement, most existing methods treat prompts as monolithic blocks of text, making it difficult to localize errors, preserve critical instructions, or prevent uncontrolled prompt growth. We introduce Modular Prompt Optimization (MPO), a schema-based prompt optimization framework that treats prompts as structured objects composed of fixed semantic sections, including system role, context, task description, constraints, and output format. MPO applies section-local textual gradients, generated by a critic language model, to refine each section independently while keeping the overall prompt schema fixed. Section updates are consolidated through de-duplication to reduce redundancy and interference between components, yielding an interpretable and robust optimization process. We evaluate MPO on two reasoning benchmarks, ARC-Challenge and MMLU, using LLaMA-3 8B-Instruct and Mistral-7B-Instruct as solver models. Across both benchmarks and models, MPO consistently outperforms an untuned structured prompt and the TextGrad baseline, achieving substantial accuracy gains without modifying model parameters or altering prompt structure. These results demonstrate that maintaining a fixed prompt schema while applying localized, section-wise optimization is an effective and practical approach for improving reasoning performance in small open-source LMs.
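Note on the idea: the abstract above describes refining each prompt section independently while the overall schema stays fixed. The sketch below only illustrates that structural constraint and is not the MPO implementation. The `revise` callable is a hypothetical stand-in for the critic model plus the rewrite step, and the per-section line de-duplication is a simplifying assumption about how consolidation might work.

```python
from typing import Callable, Dict

# Section names follow the list given in the abstract; the identifiers are assumed.
SECTIONS = ("system_role", "context", "task_description", "constraints", "output_format")


def dedupe_lines(text: str) -> str:
    """Drop blank lines and case-insensitive duplicate lines, keeping first occurrences."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(line.strip())
    return "\n".join(kept)


def optimize_step(prompt: Dict[str, str], revise: Callable[[str, str], str]) -> Dict[str, str]:
    """Revise every section on its own; the set and order of sections never changes."""
    if set(prompt) != set(SECTIONS):
        raise ValueError("prompt must contain exactly the fixed schema sections")
    return {name: dedupe_lines(revise(name, prompt[name])) for name in SECTIONS}


def toy_revise(name: str, text: str) -> str:
    """Hypothetical reviser: only touches 'constraints' and emits a redundant line."""
    if name == "constraints":
        return text + "\nShow your reasoning.\nshow your reasoning."
    return text


prompt = {name: f"Placeholder for {name}." for name in SECTIONS}
new_prompt = optimize_step(prompt, toy_revise)
print(new_prompt["constraints"])
```

Because each update sees only one section, an edit to the constraints cannot overwrite the output format, which is the localization property the abstract emphasizes.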
12. 【2601.04052】Stable Language Guidance for Vision-Language-Action Models
Link: https://arxiv.org/abs/2601.04052
Authors: Zhihao Zhan, Yuhao Chen, Jiaying Zhou, Qinhan Lv, Hao Liu, Keze Wang, Liang Lin, Guangrun Wang
Categories: Robotics (cs.RO); Computation and Language (cs.CL)
Keywords: generalized robotic control, demonstrated impressive capabilities, remain notoriously brittle, models have demonstrated, robotic control
Abstract: Vision-Language-Action (VLA) models have demonstrated impressive capabilities in generalized robotic control; however, they remain notoriously brittle to linguistic perturbations. We identify a critical "modality collapse" phenomenon where strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. To address this, we propose Residual Semantic Steering (RSS), a probabilistic framework that disentangles physical affordance from semantic execution. RSS introduces two theoretical innovations: (1) Monte Carlo Syntactic Integration, which approximates the true semantic posterior via dense, LLM-driven distributional expansion, and (2) Residual Affordance Steering, a dual-stream decoding mechanism that explicitly isolates the causal influence of language by subtracting the visual affordance prior. Theoretical analysis suggests that RSS effectively maximizes the mutual information between action and intent while suppressing visual distractors. Empirical results across diverse manipulation benchmarks demonstrate that RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
13. 【2601.04043】When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life
Link: https://arxiv.org/abs/2601.04043
Authors: Xinyue Lou, Jinan Xu, Jingyi Yin, Xiaolong Wang, Zhaolu Kang, Youwei Liao, Yixuan Wang, Xiangyu Shi, Fengran Mo, Su Yao, Kaiyu Huang
Categories: Computation and Language (cs.CL)
Keywords: Multimodal Large Language, Large Language Models, Large Language, perpetually overhanging human, sword of Damocles
Abstract:As Multimodal Large Language Models (MLLMs) become an indispensable assistant in human life, the unsafe content generated by MLLMs poses a danger to human behavior, perpetually overhanging human society like a sword of Damocles. To investigate and evaluate the safety impact of MLLMs responses on human behavior in daily life, we introduce SaLAD, a multimodal safety benchmark which contains 2,013 real-world image-text samples across 10 common categories, with a balanced design covering both unsafe scenarios and cases of oversensitivity. It emphasizes realistic risk exposure, authentic visual inputs, and fine-grained cross-modal reasoning, ensuring that safety risks cannot be inferred from text alone. We further propose a safety-warning-based evaluation framework that encourages models to provide clear and informative safety warnings, rather than generic refusals. Results on 18 MLLMs demonstrate that the top-performing models achieve a safe response rate of only 57.2% on unsafe queries. Moreover, even popular safety alignment methods limit effectiveness of the models in our scenario, revealing the vulnerabilities of current MLLMs in identifying dangerous behaviors in daily life. Our dataset is available at this https URL.
14. 【2601.04036】Analyzing and Improving Cross-lingual Knowledge Transfer for Machine Translation
Link: https://arxiv.org/abs/2601.04036
Authors: David Stap
Categories: Computation and Language (cs.CL)
Keywords: representations remains challenging, make knowledge accessible, remains challenging, effective cross-lingual representations, cross-lingual representations remains
Comments: PhD dissertation defended on November 26th, 2025
Abstract:Multilingual machine translation systems aim to make knowledge accessible across languages, yet learning effective cross-lingual representations remains challenging. These challenges are especially pronounced for low-resource languages, where limited parallel data constrains generalization and transfer. Understanding how multilingual models share knowledge across languages requires examining the interaction between representations, data availability, and training strategies. In this thesis, we study cross-lingual knowledge transfer in neural models and develop methods to improve robustness and generalization in multilingual settings, using machine translation as a central testbed. We analyze how similarity between languages influences transfer, how retrieval and auxiliary supervision can strengthen low-resource translation, and how fine-tuning on parallel data can introduce unintended trade-offs in large language models. We further examine the role of language diversity during training and show that increasing translation coverage improves generalization and reduces off-target behavior. Together, this work highlights how modeling choices and data composition shape multilingual learning and offers insights toward more inclusive and resilient multilingual NLP systems.
15. 【2601.04029】SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency
Link: https://arxiv.org/abs/2601.04029
Authors: Jonggeun Lee, Junseong Pyo, Gyuhyeon Seo, Yohan Jo
Categories: Computation and Language (cs.CL)
Keywords: conversations remains unexplored, multi-turn conversations remains, speech generation quality, Large Audio-Language Models, Large Audio-Language
Comments: 28 pages
Abstract:Large Audio-Language Models (LALMs) as judges have emerged as a prominent approach for evaluating speech generation quality, yet their ability to assess speaker consistency across multi-turn conversations remains unexplored. We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency in multi-turn dialogues through three tasks reflecting real-world requirements. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech, with controlled acoustic difficulty. Evaluating nine widely-used LALMs, we find that models struggle to reliably detect acoustic inconsistencies. For instance, given audio samples of the same speaker's turns, some models overpredict inconsistency, whereas others are overly lenient. Models further struggle to identify the exact turns that are problematic. When other interlocutors' turns are provided together, performance degrades dramatically as models prioritize textual coherence over acoustic cues, failing to detect even obvious gender switches for a speaker. On the other hand, models perform substantially better in choosing the audio that best matches the speaker among several acoustic variants, demonstrating inherent acoustic discrimination capabilities. These findings expose a significant bias in LALMs: they tend to prioritize text over acoustics, revealing fundamental modality imbalances that need to be addressed to build reliable audio-language judges.
16. 【2601.04025】Simulated Students in Tutoring Dialogues: Substance or Illusion?
Link: https://arxiv.org/abs/2601.04025
Authors: Alexander Scarlatos, Jaewook Lee, Simon Woodhead, Andrew Lan
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: large language models, Advances in large, language models, innovations in education, large language
Abstract:Advances in large language models (LLMs) enable many new innovations in education. However, evaluating the effectiveness of new technology requires real students, which is time-consuming and hard to scale up. Therefore, many recent works on LLM-powered tutoring solutions have used simulated students for both training and evaluation, often via simple prompting. Surprisingly, little work has been done to ensure or even measure the quality of simulated students. In this work, we formally define the student simulation task, propose a set of evaluation metrics that span linguistic, behavioral, and cognitive aspects, and benchmark a wide range of student simulation methods on these metrics. We experiment on a real-world math tutoring dialogue dataset, where both automated and human evaluation results show that prompting strategies for student simulation perform poorly; supervised fine-tuning and preference optimization yield much better but still limited performance, motivating future work on this challenging task.
17. 【2601.03997】VotIE: Information Extraction from Meeting Minutes
Link: https://arxiv.org/abs/2601.03997
Authors: José Pedro Evans, Luís Filipe Cunha, Purificação Silvano, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, Ricardo Campos
Categories: Computation and Language (cs.CL)
Keywords: local democratic processes, Municipal meeting minutes, democratic processes, meeting minutes record, decisions in local
Abstract: Municipal meeting minutes record key decisions in local democratic processes. Unlike parliamentary proceedings, which typically adhere to standardized formats, they encode voting outcomes in highly heterogeneous, free-form narrative text that varies widely across municipalities, posing significant challenges for automated extraction. In this paper, we introduce VotIE (Voting Information Extraction), a new information extraction task aimed at identifying structured voting events in narrative deliberative records, and establish the first benchmark for this task using Portuguese municipal minutes, building on the recently introduced CitiLink corpus. Our experiments yield two key findings. First, under standard in-domain evaluation, fine-tuned encoders, specifically XLM-R-CRF, achieve the strongest performance, reaching 93.2% macro F1, outperforming generative approaches. Second, in a cross-municipality setting that evaluates transfer to unseen administrative contexts, these models suffer substantial performance degradation, whereas few-shot LLMs demonstrate greater robustness, with significantly smaller declines in performance. Despite this generalization advantage, the high computational cost of generative models currently constrains their practicality. As a result, lightweight fine-tuned encoders remain a more practical option for large-scale, real-world deployment. To support reproducible research in administrative NLP, we publicly release our benchmark, trained models, and evaluation framework.
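Note on the metric: the abstract above reports macro F1. As a reminder of what that number averages, here is a minimal label-level sketch: F1 is computed per class and the classes are then averaged with equal weight. The paper's benchmark may well score spans or events rather than single labels, so this is only an illustration. The class names in the toy data are invented.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels invented for illustration only.
gold = ["FOR", "FOR", "AGAINST", "ABSTAIN"]
pred = ["FOR", "AGAINST", "AGAINST", "ABSTAIN"]
print(f"macro F1 = {macro_f1(gold, pred):.4f}")
```

Because every class counts equally, a rare class that is handled badly pulls macro F1 down more than it would pull down plain accuracy.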
18. 【2601.03986】Benchmark^2: Systematic Evaluation of LLM Benchmarks
Link: https://arxiv.org/abs/2601.03986
Authors: Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng
Categories: Computation and Language (cs.CL)
Keywords: Capability Alignment Deviation, evaluating large language, large language models, Cross-Benchmark Ranking Consistency, rapid proliferation
Abstract:The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark's ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.
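Note on the idea: the abstract above describes a metric that checks whether a benchmark ranks models in line with peer benchmarks. The abstract does not give the formula, so the sketch below is an assumption rather than the paper's definition: it averages Kendall's tau (the tau-a variant, where tied pairs count as neither concordant nor discordant) between the target benchmark and each peer. All names and scores are hypothetical.

```python
from itertools import combinations
from typing import Dict, Sequence


def kendall_tau(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Kendall tau-a over the models scored by both benchmarks."""
    models = sorted(set(a) & set(b))
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        sign = (a[m] - a[n]) * (b[m] - b[n])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


def ranking_consistency(target: Dict[str, float], peers: Sequence[Dict[str, float]]) -> float:
    """Mean rank agreement between one benchmark and its peers (assumed definition)."""
    if not peers:
        raise ValueError("at least one peer benchmark is required")
    return sum(kendall_tau(target, peer) for peer in peers) / len(peers)


# Hypothetical scores for three models on a target benchmark and two peers.
target = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5}
peer_1 = {"model_a": 80.0, "model_b": 60.0, "model_c": 40.0}
peer_2 = {"model_a": 30.0, "model_b": 50.0, "model_c": 40.0}
print(f"consistency = {ranking_consistency(target, [peer_1, peer_2]):.3f}")
```

A rank correlation is used here because benchmarks report scores on different scales, and only the induced ordering of models is comparable across them.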
19. 【2601.03981】RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection
Link: https://arxiv.org/abs/2601.03981
Authors: Song-Duo Ma, Yi-Hung Liu, Hsin-Yu Lin, Pin-Yu Chen, Hong-Yan Huang, Shau-Yung Hsu, Yun-Nung Chen
Categories: Computation and Language (cs.CL)
Keywords: LLM-generated misinformation, efficiently combat, combat the spread, spread of LLM-generated, present RADAR
Abstract:To efficiently combat the spread of LLM-generated misinformation, we present RADAR, a retrieval-augmented detector with adversarial refinement for robust fake news detection. Our approach employs a generator that rewrites real articles with factual perturbations, paired with a lightweight detector that verifies claims using dense passage retrieval. To enable effective co-evolution, we introduce verbal adversarial feedback (VAF). Rather than relying on scalar rewards, VAF issues structured natural-language critiques; these guide the generator toward more sophisticated evasion attempts, compelling the detector to adapt and improve. On a fake news detection benchmark, RADAR achieves 86.98% ROC-AUC, significantly outperforming general-purpose LLMs with retrieval. Ablation studies confirm that detector-side retrieval yields the largest gains, while VAF and few-shot demonstrations provide critical signals for robust training.
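Note on the metric: the abstract above reports ROC-AUC. One standard way to read that number is as the probability that a randomly chosen positive example (here, a fake article) receives a higher detector score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pairwise comparison. The labels and scores are toy values, not results from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy detector outputs invented for illustration (1 = fake, 0 = real).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(f"ROC-AUC = {roc_auc(labels, scores):.2f}")
```

The pairwise loop is quadratic in the number of examples, which is fine for a small illustration; sorting-based implementations are preferable at scale.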
20. 【2601.03979】SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems
Link: https://arxiv.org/abs/2601.03979
Authors: Andreea-Elena Bodea, Stephen Meisenbacher, Alexandra Klymenko, Florian Matthes
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: Large Language Models, natural language understanding, Language Models, Large Language, rapidly increasing interest
Comments: 17 pages, 3 figures, 5 tables. This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML 2026). The final version will be available on IEEE Xplore
Abstract:The continued promise of Large Language Models (LLMs), particularly in their natural language understanding and generation capabilities, has driven a rapidly increasing interest in identifying and developing LLM use cases. In an effort to complement the ingrained "knowledge" of LLMs, Retrieval-Augmented Generation (RAG) techniques have become widely popular. At its core, RAG involves the coupling of LLMs with domain-specific knowledge bases, whereby the generation of a response to a user question is augmented with contextual and up-to-date information. The proliferation of RAG has sparked concerns about data privacy, particularly with the inherent risks that arise when leveraging databases with potentially sensitive information. Numerous recent works have explored various aspects of privacy risks in RAG systems, from adversarial attacks to proposed mitigations. With the goal of surveying and unifying these works, we ask one simple question: What are the privacy risks in RAG, and how can they be measured and mitigated? To answer this question, we conduct a systematic literature review of RAG works addressing privacy, and we systematize our findings into a comprehensive set of privacy risks, mitigation techniques, and evaluation strategies. We supplement these findings with two primary artifacts: a Taxonomy of RAG Privacy Risks and a RAG Privacy Process Diagram. Our work contributes to the study of privacy in RAG not only by conducting the first systematization of risks and mitigations, but also by uncovering important considerations when mitigating privacy risks in RAG systems and assessing the current maturity of proposed mitigations.
21. 【2601.03973】Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control
Link: https://arxiv.org/abs/2601.03973
Authors: Changhao Jiang, Jiahao Chen, Zhenghao Xiang, Zhixiong Yang, Hanchen Wang, Jiabao Zhuang, Xinmeng Che, Jiajun Sun, Hui Li, Yifei Cao, Shihan Dou, Ming Zhang, Junjie Ye, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Keywords: Suno demonstrate strong, Recent commercial systems, hindering fair comparison, demonstrate strong capabilities, remains largely non-reproducible
Abstract:Recent commercial systems such as Suno demonstrate strong capabilities in long-form song generation, while academic research remains largely non-reproducible due to the lack of publicly available training data, hindering fair comparison and progress. To this end, we release a fully open-source system for long-form song generation with fine-grained style conditioning, including a licensed synthetic dataset, training and evaluation pipelines, and Muse, an easy-to-deploy song generation model. The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions paired with audio synthesized by SunoV5. We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens using MuCodec, without task-specific losses, auxiliary objectives, or additional architectural components. Our evaluations find that although Muse is trained with a modest data scale and model size, it achieves competitive performance on phoneme error rate, text-music style similarity, and audio aesthetic quality, while enabling controllable segment-level generation across different musical structures. All data, model weights, and training and evaluation pipelines will be publicly released, paving the way for continued progress in controllable long-form song generation research. The project repository is available at this https URL.
22. 【2601.03969】Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models
Link: https://arxiv.org/abs/2601.03969
Authors: Wei Wu, Liyi Chen, Congxi Xiao, Tianfu Wang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: achieved significant performance, significant performance gains, Large reasoning models, Large reasoning, enhanced by reinforcement
Abstract:Large reasoning models enhanced by reinforcement learning with verifiable rewards have achieved significant performance gains by extending their chain-of-thought. However, this paradigm incurs substantial deployment costs as models often exhibit excessive verbosity on simple queries. Existing efficient reasoning methods relying on explicit length penalties often introduce optimization conflicts and leave the generative mechanisms driving overthinking largely unexamined. In this paper, we identify a phenomenon termed length shift where models increasingly generate unnecessary reasoning on trivial inputs during training. To address this, we introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. This method targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities for complex problems. To complement this intervention and ensure stable convergence, we further incorporate auxiliary KL regularization and predictive dynamic sampling. Experimental results across multiple model scales demonstrate that our approach significantly pushes the efficiency-performance Pareto frontier outward. Notably, on the AIME-24, our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy and surpassing state-of-the-art efficient reasoning methods.
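To make the idea of truncating only the extreme tail of response lengths within fully correct rollout groups more concrete, here is a minimal sketch. The abstract does not say how the tail is identified, so the median-plus-MAD cutoff, the function name `truncate_length_outliers`, and the parameter `k` are all assumptions made for illustration, not the paper's rule.

```python
from statistics import median
from typing import List


def truncate_length_outliers(
    lengths: List[int], correct: List[bool], k: float = 3.0
) -> List[int]:
    """Cap extreme response lengths in one rollout group (illustrative only).

    Assumption: the "extreme tail" is anything above median + k * MAD.
    Groups that contain any incorrect rollout are left untouched.
    """
    if len(lengths) != len(correct):
        raise ValueError("lengths and correct must have the same size")
    if not lengths or not all(correct):
        return list(lengths)
    med = median(lengths)
    mad = median([abs(n - med) for n in lengths])  # median absolute deviation
    cutoff = int(med + k * mad)
    return [min(n, cutoff) for n in lengths]


group = [100, 110, 90, 105, 95, 900]
capped = truncate_length_outliers(group, [True] * 6)
```

In this toy group only the 900-token rollout exceeds the cutoff, so the other lengths pass through unchanged; a group with at least one wrong answer would be returned as-is.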
23. 【2601.03940】Large-Scale Aspect-Based Sentiment Analysis with Reasoning-Infused LLMs
Link: https://arxiv.org/abs/2601.03940
Authors: Paweł Liskowski, Krzysztof Jankowski
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: aspect-based sentiment analysis, real-life aspect-based sentiment, collection of powerful, real-life aspect-based, sentiment analysis
Abstract:We introduce Arctic-ABSA, a collection of powerful models for real-life aspect-based sentiment analysis (ABSA). Our models are tailored to commercial needs, trained on a large corpus of public data alongside carefully generated synthetic data, resulting in a dataset 20 times larger than SemEval14. We extend typical ABSA models by expanding the number of sentiment classes from the standard three (positive, negative, neutral) to five, adding mixed and unknown classes, while also jointly predicting overall text sentiment and supporting multiple languages. We experiment with reasoning injection by fine-tuning on Chain-of-Thought (CoT) examples and introduce a novel reasoning pretraining technique for encoder-only models that significantly improves downstream fine-tuning and generalization. Our 395M-parameter encoder and 8B-parameter decoder achieve up to 10 percentage points higher accuracy than GPT-4o and Claude 3.5 Sonnet, while setting new state-of-the-art results on the SemEval14 benchmark. A single multilingual model maintains 87-91% accuracy across six languages without degrading English performance. We release ABSA-mix, a large-scale benchmark aggregating 17 public ABSA datasets across 92 domains.
24. 【2601.03938】FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning
Link: https://arxiv.org/abs/2601.03938
Authors: Yujie Feng, Hao Wang, Jian Li, Xu Chu, Zhaolu Kang, Yiran Liu, Yasha Wang, Philip S. Yu, Xiao-Ming Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: enable sequential knowledge, sequential knowledge acquisition, Continual learning, large language models, aims to enable
Abstract:Continual learning (CL) for large language models (LLMs) aims to enable sequential knowledge acquisition without catastrophic forgetting. Memory replay methods are widely used for their practicality and effectiveness, but most rely on fixed, step-based heuristics that often misalign with the model's actual learning progress, since identical training steps can result in varying degrees of parameter change. Motivated by recent findings that LLM forgetting mirrors the Ebbinghaus human forgetting curve, we propose FOREVER (FORgEtting curVe-inspired mEmory Replay), a novel CL framework that aligns replay schedules with a model-centric notion of time. FOREVER defines model time using the magnitude of optimizer updates, allowing forgetting curve-inspired replay intervals to align with the model's internal evolution rather than raw training steps. Building on this approach, FOREVER incorporates a forgetting curve-based replay scheduler to determine when to replay and an intensity-aware regularization mechanism to adaptively control how to replay. Extensive experiments on three CL benchmarks and models ranging from 0.6B to 13B parameters demonstrate that FOREVER consistently mitigates catastrophic forgetting.
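The notion of "model time" measured by optimizer update magnitude, rather than by step count, can be sketched as follows. This is a toy under stated assumptions: the class name `ModelClock`, the L2 norm as the magnitude, and the geometrically expanding replay intervals (a common stand-in for forgetting-curve spacing) are illustrative choices that the abstract does not specify.

```python
import math
from typing import List


class ModelClock:
    """Tracks model time as the accumulated L2 norm of parameter updates."""

    def __init__(self, first_interval: float = 10.0, growth: float = 2.0):
        self.time = 0.0                  # accumulated update magnitude
        self.interval = first_interval   # current gap between replays
        self.next_replay = first_interval
        self.growth = growth             # assumed spacing factor

    def tick(self, update: List[float]) -> bool:
        """Advance the clock by one optimizer update; True means replay now."""
        self.time += math.sqrt(sum(d * d for d in update))
        if self.time >= self.next_replay:
            self.interval *= self.growth
            self.next_replay = self.time + self.interval
            return True
        return False


clock = ModelClock(first_interval=10.0, growth=2.0)
fired = [clock.tick([3.0, 4.0]) for _ in range(6)]  # each update has norm 5.0
```

With updates of constant norm the replays simply spread out over steps; the point of the design is that a run of large updates would advance the clock faster and bring the next replay forward, while small updates would delay it.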
25. 【2601.03928】FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
Link: https://arxiv.org/abs/2601.03928
Authors: Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: User Interface, Vision-Language Models, process increasingly high-resolution, increasingly high-resolution screenshots, shown remarkable performance
Comments: 14 pages, 13 figures
Abstract:Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
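The PosPad description above (each contiguous run of dropped visual tokens collapses into a single marker placed at the run's last index) can be illustrated with a small sketch. The marker string `<pos_pad>`, the function name `pospad`, and the use of plain strings as tokens are hypothetical; the actual implementation operates on visual token embeddings and is not described in the abstract.

```python
from typing import List, Tuple

PAD = "<pos_pad>"  # hypothetical marker name


def pospad(tokens: List[str], keep: List[bool]) -> Tuple[List[str], List[int]]:
    """Drop unkept tokens, leaving one marker per dropped run.

    Returns the shortened sequence and the original index of every element,
    so kept tokens retain their original positions.
    """
    if len(tokens) != len(keep):
        raise ValueError("tokens and keep must have the same length")
    out_tokens: List[str] = []
    out_positions: List[int] = []
    n = len(tokens)
    for i in range(n):
        if keep[i]:
            out_tokens.append(tokens[i])
            out_positions.append(i)
        elif i == n - 1 or keep[i + 1]:
            # Last index of a dropped run: emit a single marker here.
            out_tokens.append(PAD)
            out_positions.append(i)
    return out_tokens, out_positions


seq, pos = pospad(list("abcdefgh"),
                  [True, False, False, True, True, False, True, False])
```

Here `seq` keeps `a`, `d`, `e`, `g` with one marker standing in for each dropped run, and `pos` records the original indices, so the position sequence stays monotonic and has no silent gaps at the boundaries of kept tokens.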
26. 【2601.03926】Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models
Link: https://arxiv.org/abs/2601.03926
Authors: Haeun Jang, Hwan Chang, Hwanhee Lee
Categories: Computation and Language (cs.CL)
Keywords: Large Vision-Language Models, deployment of Large, Large Vision-Language, document question answering, dictate information disclosure
Abstract:The deployment of Large Vision-Language Models (LVLMs) for real-world document question answering is often constrained by dynamic, user-defined policies that dictate information disclosure based on context. While ensuring adherence to these explicit constraints is critical, existing safety research primarily focuses on implicit social norms or text-only settings, overlooking the complexities of multimodal documents. In this paper, we introduce Doc-PP (Document Policy Preservation Benchmark), a novel benchmark constructed from real-world reports requiring reasoning across heterogeneous visual and textual elements under strict non-disclosure policies. Our evaluation highlights a systemic Reasoning-Induced Safety Gap: models frequently leak sensitive information when answers must be inferred through complex synthesis or aggregated across modalities, effectively circumventing existing safety constraints. Furthermore, we identify that providing extracted text improves perception but inadvertently facilitates leakage. To address these vulnerabilities, we propose DVA (Decompose-Verify-Aggregation), a structural inference framework that decouples reasoning from policy verification. Experimental results demonstrate that DVA significantly outperforms standard prompting defenses, offering a robust baseline for policy-compliant document understanding.
27. 【2601.03914】When Models Decide and When They Bind: A Two-Stage Computation for Multiple-Choice Question-Answering
Link: https://arxiv.org/abs/2601.03914
Authors: Hugh Mee Wong, Rick Nouwen, Albert Gatt
Categories: Computation and Language (cs.CL)
Keywords: Multiple-choice question answering, conflating reasoning errors, Multiple-choice question, models implement MCQA, question answering
Comments: Under review
Abstract:Multiple-choice question answering (MCQA) is easy to evaluate but adds a meta-task: models must both solve the problem and output the symbol that *represents* the answer, conflating reasoning errors with symbol-binding failures. We study how language models implement MCQA internally using representational analyses (PCA, linear probes) as well as causal interventions. We find that option-boundary (newline) residual states often contain strong linearly decodable signals related to per-option correctness. Winner-identity probing reveals a two-stage progression: the winning *content position* becomes decodable immediately after the final option is processed, while the *output symbol* is represented closer to the answer emission position. Tests under symbol and content permutations support a two-stage mechanism in which models first select a winner in content space and then bind or route that winner to the appropriate symbol to emit.
28. 【2601.03908】Decide Then Retrieve: A Training-Free Framework with Uncertainty-Guided Triggering and Dual-Path Retrieval
Link: https://arxiv.org/abs/2601.03908
Authors: Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang
Categories: Computation and Language (cs.CL)
Keywords: enhances large language, single-path evidence construction, limiting performance gains, existing approaches indiscriminately, approaches indiscriminately trigger
Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, but existing approaches indiscriminately trigger retrieval and rely on single-path evidence construction, often introducing noise and limiting performance gains. In this work, we propose Decide Then Retrieve (DTR), a training-free framework that adaptively determines when retrieval is necessary and how external information should be selected. DTR leverages generation uncertainty to guide retrieval triggering and introduces a dual-path retrieval mechanism with adaptive information selection to better handle sparse and ambiguous queries. Extensive experiments across five open-domain QA benchmarks, multiple model scales, and different retrievers demonstrate that DTR consistently improves EM and F1 over standard RAG and strong retrieval-enhanced baselines, while reducing unnecessary retrievals. The code and data used in this paper are available at this https URL.
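A rough sketch of uncertainty-guided retrieval triggering, in the spirit of the abstract above: draft an answer without retrieval, score how uncertain the model was, and retrieve only when that score is high. The abstract does not define the uncertainty measure, so the mean negative log-likelihood used here, the function name `should_retrieve`, and the threshold value are assumptions for illustration only.

```python
import math
from typing import List


def should_retrieve(token_logprobs: List[float], threshold: float = 1.0) -> bool:
    """Decide whether to trigger retrieval for a drafted answer.

    token_logprobs: log-probabilities of the tokens in the draft answer.
    Assumption: uncertainty = mean negative log-likelihood per token.
    """
    if not token_logprobs:
        return True  # no draft to trust, so fall back to retrieval
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return mean_nll > threshold


confident_draft = [math.log(0.9)] * 5   # mean NLL is about 0.105
uncertain_draft = [math.log(0.2)] * 5   # mean NLL is about 1.609
decisions = (should_retrieve(confident_draft), should_retrieve(uncertain_draft))
```

Only the second draft crosses the threshold, so only that query would pay the cost of retrieval; the threshold would need tuning per model and task.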
29. 【2601.03905】Current Agents Fail to Leverage World Model as Tool for Foresight
Link: https://arxiv.org/abs/2601.03905
Authors: Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, Heng Ji
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: vision-language models increasingly, models increasingly face, increasingly face tasks, demand anticipating future, anticipating future states
Comments: 36 pages, 13 figures, 17 tables
Abstract:Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.
30. 【2601.03895】Adaptive-Boundary-Clipping GRPO: Ensuring Bounded Ratios for Stable and Generalizable Training
Link: https://arxiv.org/abs/2601.03895
Authors: Chi Liu, Xin Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Group Relative Policy, Relative Policy Optimization, Group Relative, Policy Optimization, Relative Policy
Comments: 10 pages, 4 figures
Abstract:Group Relative Policy Optimization (GRPO) has emerged as a popular algorithm for reinforcement learning with large language models (LLMs). However, upon analyzing its clipping mechanism, we argue that it is suboptimal in certain scenarios. With appropriate modifications, GRPO can be significantly enhanced to improve both flexibility and generalization. To this end, we propose Adaptive-Boundary-Clipping GRPO (ABC-GRPO), an asymmetric and adaptive refinement of the original GRPO framework. We demonstrate that ABC-GRPO achieves superior performance over standard GRPO on mathematical reasoning tasks using the Qwen3 LLMs. Moreover, ABC-GRPO maintains substantially higher entropy throughout training, thereby preserving the model's exploration capacity and mitigating premature convergence. To ease reproducibility, the implementation code is available online at this https URL.
31. 【2601.03874】Evaluating Small Decoder-Only Language Models for Grammar Correction and Text Simplification
Link: https://arxiv.org/abs/2601.03874
Authors: Anthony Lamelas
Categories: Computation and Language (cs.CL)
Keywords: Large language models, extremely popular recently, popular recently due, computation cost make, Large language
Comments: 9 pages, 12 figures
Abstract:Large language models have become extremely popular recently due to their ability to achieve strong performance on a variety of tasks, such as text generation and rewriting, but their size and computation cost make them difficult to access, deploy, and secure in many settings. This paper investigates whether small, decoder-only language models can provide an efficient alternative for the tasks of grammar correction and text simplification. The experiments in this paper focus on testing small language models out of the box, fine-tuned, and run sequentially on the JFLEG and ASSET datasets using established metrics. The results show that while SLMs may learn certain behaviors well, their performance remains below strong baselines and current LLMs. The results also show that SLMs struggle to retain meaning and are prone to hallucinations. These findings suggest that despite their efficiency advantages, current SLMs are not yet competitive with modern LLMs for rewriting, and further advances in training are required for SLMs to close the performance gap with today's LLMs.
32. 【2601.03872】Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning
Link: https://arxiv.org/abs/2601.03872
Authors: Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao
Categories: Computation and Language (cs.CL)
Keywords: large language models, integration of large, large language, significantly expanded, expanded the capabilities
Abstract:The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.
33. 【2601.03868】What Matters For Safety Alignment?
Link: https://arxiv.org/abs/2601.03868
Authors: Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Keywords: comprehensive empirical study, safety alignment capabilities, paper presents, presents a comprehensive, safety alignment
Abstract:This paper presents a comprehensive empirical study on the safety alignment capabilities. We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can elevate the attack success rate by 3.34x on average and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.
34. 【2601.03860】PartisanLens: A Multilingual Dataset of Hyperpartisan and Conspiratorial Immigration Narratives in European Media
Link: https://arxiv.org/abs/2601.03860
Authors: Michele Joshua Maggini, Paloma Piot, Anxo Pérez, Erik Bran Marino, Lúa Santamaría Montesinos, Ana Lisboa, Marta Vázquez Abuín, Javier Parapar, Pablo Gamallo
Categories: Computation and Language (cs.CL)
Keywords: Replacement Conspiracy Theories, Population Replacement Conspiracy, Conspiracy Theories, Population Replacement, Replacement Conspiracy
Abstract:Detecting hyperpartisan narratives and Population Replacement Conspiracy Theories (PRCT) is essential to addressing the spread of misinformation. These complex narratives pose a significant threat, as hyperpartisanship drives political polarisation and institutional distrust, while PRCTs directly motivate real-world extremist violence, making their identification critical for social cohesion and public safety. However, existing resources are scarce, predominantly English-centric, and often analyse hyperpartisanship, stance, and rhetorical bias in isolation rather than as interrelated aspects of political discourse. To bridge this gap, we introduce PartisanLens, the first multilingual dataset of 1,617 hyperpartisan news headlines in Spanish, Italian, and Portuguese, annotated in multiple political discourse aspects. We first evaluate the classification performance of widely used Large Language Models (LLMs) on this dataset, establishing robust baselines for the classification of hyperpartisan and PRCT narratives. In addition, we assess the viability of using LLMs as automatic annotators for this task, analysing their ability to approximate human annotation. Results highlight both their potential and current limitations. Next, moving beyond standard judgments, we explore whether LLMs can emulate human annotation patterns by conditioning them on socio-economic and ideological profiles that simulate annotator perspectives. Finally, we provide our resources and evaluation; PartisanLens supports future research on detecting partisan and conspiratorial narratives in European contexts.
35. 【2601.03858】What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs
Link: https://arxiv.org/abs/2601.03858
Authors: Seyed Mahed Mousavi, Simone Alghisi, Giuseppe Riccardi
Categories: Computation and Language (cs.CL)
Keywords: Continual Pre-Training, acquiring and updating, CPT, updating factual knowledge, Continual
Abstract:Continual Pre-Training (CPT) is widely used for acquiring and updating factual knowledge in LLMs. This practice treats loss as a proxy for knowledge learning, while offering no grounding in how knowledge changes during training. We study CPT as a knowledge learning process rather than solely an optimization problem. We construct a controlled, distribution-matched benchmark of factual documents and interleave diagnostic probes directly into the CPT loop, enabling epoch-level measurement of knowledge acquisition dynamics and changes in Out-Of-Domain (OOD) general skills (e.g., math). We further analyze how CPT reshapes knowledge circuits during training. Across three instruction-tuned LLMs and multiple CPT strategies, optimization and learning systematically diverge: loss decreases monotonically while factual learning is unstable and non-monotonic. Acquired facts are rarely consolidated, learning is strongly conditioned on prior exposure, and OOD performance degrades from early epochs. Circuit analysis reveals rapid reconfiguration of knowledge pathways across epochs, providing an explanation for narrow acquisition windows and systematic forgetting. These results show that loss optimization is misaligned with learning progress in CPT and motivate evaluation of stopping criteria based on task-level learning dynamics.
36. 【2601.03851】Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search
Link: https://arxiv.org/abs/2601.03851
Authors: Yu Guo, Shenghao Ye, Shuangwu Chen, Zijian Wen, Tao Zhang, Qirui Bai, Dong Jin, Yunpeng Hou, Huasen He, Jian Yang, Xiaobin Tan
Categories: Computation and Language (cs.CL)
Keywords: Table Question Answering, Question Answering, eliminating redundant cells, Table Question, extracts compact sub-tables
Comments: 16 pages, 5 figures
Abstract:Table Question Answering (TableQA) benefits significantly from table pruning, which extracts compact sub-tables by eliminating redundant cells to streamline downstream reasoning. However, existing pruning methods typically rely on sequential revisions driven by unreliable critique signals, often failing to detect the loss of answer-critical data. To address this limitation, we propose TabTrim, a novel table pruning framework which transforms table pruning from sequential revisions to gold trajectory-supervised parallel search. TabTrim derives a gold pruning trajectory using the intermediate sub-tables in the execution process of gold SQL queries, and trains a pruner and a verifier to make the step-wise pruning result align with the gold pruning trajectory. During inference, TabTrim performs parallel search to explore multiple candidate pruning trajectories and identify the optimal sub-table. Extensive experiments demonstrate that TabTrim achieves state-of-the-art performance across diverse tabular reasoning tasks: TabTrim-8B reaches 73.5% average accuracy, outperforming the strongest baseline by 3.2%, including 79.4% on WikiTQ and 61.2% on TableBench.
37. 【2601.03823】Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning
Link: https://arxiv.org/abs/2601.03823
Authors: Fei Wu, Zhenrong Zhang, Qikai Chang, Jianshu Zhang, Quan Liu, Jun Du
Categories: Computation and Language (cs.CL)
Keywords: outcome-based rewards lead, Reinforcement Learning, Learning with Verifiable, large language models, Verifiable Rewards
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) elicits long chain-of-thought reasoning in large language models (LLMs), but outcome-based rewards lead to coarse-grained advantage estimation. While existing approaches improve RLVR via token-level entropy or sequence-level length control, they lack a semantically grounded, step-level measure of reasoning progress. As a result, LLMs fail to distinguish necessary deduction from redundant verification: they may continue checking after reaching a correct solution and, in extreme cases, overturn a correct trajectory into an incorrect final answer. To remedy the lack of process supervision, we introduce a training-free probing mechanism that extracts intermediate confidence and correctness and combines them into a Step Potential signal that explicitly estimates the reasoning state at each step. Building on this signal, we propose Step Potential Advantage Estimation (SPAE), a fine-grained credit assignment method that amplifies potential gains, penalizes potential drops, and applies penalty after potential saturates to encourage timely termination. Experiments across multiple benchmarks show SPAE consistently improves accuracy while substantially reducing response length, outperforming strong RL baselines and recent efficient reasoning and token-level advantage estimation methods. The code is available at this https URL.
38. 【2601.03812】AI Generated Text Detection
Link: https://arxiv.org/abs/2601.03812
Authors: Adilkhan Alikhanov, Aidar Amangeldi, Diar Demeubay, Dilnaz Akhmetzhan, Nurbek Moldakhmetov, Omar Polat, Galymzhan Zharas
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: violates academic integrity, large language models, academic integrity, rapid development, development of large
Abstract:The rapid development of large language models has led to an increase in AI-generated text, with students increasingly using LLM-generated content as their own work, which violates academic integrity. This paper presents an evaluation of AI text detection methods, including both traditional machine learning models and transformer-based architectures. We utilize two datasets, HC3 and DAIGT v2, to build a unified benchmark and apply a topic-based data split to prevent information leakage. This approach ensures robust generalization across unseen domains. Our experiments show that TF-IDF logistic regression achieves a reasonable baseline accuracy of 82.87%. However, deep learning models outperform it. The BiLSTM classifier achieves an accuracy of 88.86%, while DistilBERT achieves a similar accuracy of 88.11% with the highest ROC-AUC score of 0.96, demonstrating the strongest overall performance. The results indicate that contextual semantic modeling is significantly superior to lexical features and highlight the importance of mitigating topic memorization through appropriate evaluation protocols. The limitations of this work are primarily related to dataset diversity and computational constraints. In future work, we plan to expand dataset diversity and utilize parameter-efficient fine-tuning methods such as LoRA. We also plan to explore smaller or distilled models and employ more efficient batching strategies and hardware-aware optimization.
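The topic-based split mentioned above can be illustrated with a short sketch: whole topics, not individual samples, are assigned to the test set, so no topic appears on both sides. The `topic` field, the function name `topic_split`, and the 20% test fraction are assumptions; the abstract does not describe how topics are derived for HC3 and DAIGT v2.

```python
import random
from typing import Dict, List, Tuple

Sample = Dict[str, str]


def topic_split(
    samples: List[Sample], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that train and test share no topic."""
    topics = sorted({s["topic"] for s in samples})
    rng = random.Random(seed)          # seeded for a reproducible split
    rng.shuffle(topics)
    n_test = max(1, round(len(topics) * test_fraction))
    test_topics = set(topics[:n_test])
    train = [s for s in samples if s["topic"] not in test_topics]
    test = [s for s in samples if s["topic"] in test_topics]
    return train, test


data = [{"topic": t, "text": f"{t}-{i}"} for t in "abcde" for i in range(3)]
train, test = topic_split(data)
```

A random per-sample split would let a classifier score well by memorizing topic vocabulary; holding out entire topics removes that shortcut, which is the leakage the abstract refers to.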
39. 【2601.03798】Where meaning lives: Layer-wise accessibility of psycholinguistic features in encoder and decoder language models
Link: https://arxiv.org/abs/2601.03798
Authors: Taisiia Tikhomirova, Dirk U. Wulff
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: encode psychologically meaningful, psychologically meaningful aspects, language models encode, models encode psychologically, theory and practice
Abstract:Understanding where transformer language models encode psychologically meaningful aspects of meaning is essential for both theory and practice. We conduct a systematic layer-wise probing study of 58 psycholinguistic features across 10 transformer models, spanning encoder-only and decoder-only architectures, and compare three embedding extraction methods. We find that apparent localization of meaning is strongly method-dependent: contextualized embeddings yield higher feature-specific selectivity and different layer-wise profiles than isolated embeddings. Across models and methods, final-layer representations are rarely optimal for recovering psycholinguistic information with linear probes. Despite these differences, models exhibit a shared depth ordering of meaning dimensions, with lexical properties peaking earlier and experiential and affective dimensions peaking later. Together, these results show that where meaning "lives" in transformer models reflects an interaction between methodological choices and architectural constraints.
40. 【2601.03792】VietMed-MCQ: A Consistency-Filtered Data Synthesis Framework for Vietnamese Traditional Medicine Evaluation
Link: https://arxiv.org/abs/2601.03792
Authors: Huynh Trung Kiet, Dao Sy Duy Minh, Nguyen Dinh Ha Duong, Le Hoang Minh Huy, Long Nguyen, Dien Dinh
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, demonstrated remarkable proficiency, Vietnamese Traditional Medicine, demonstrated remarkable
Comments: 11 pages, 4 figures. Dataset and code released
Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance significantly degrades in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily due to the scarcity of high-quality, structured benchmarks. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency check mechanism. Unlike previous synthetic datasets, our framework incorporates a dual-model validation approach to ensure reasoning consistency through independent answer verification, though the substring-based evidence checking has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and underwent validation by one medical expert and four students, achieving 94.2 percent approval with substantial inter-rater agreement (Fleiss' kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. Results reveal that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, highlighting cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.
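The abstract reports inter-rater agreement as Fleiss' kappa. For readers unfamiliar with the statistic, the following is a minimal sketch of the standard formula. The function name and the input layout are illustrative choices, and no data from the paper is reproduced here.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement by chance, from the overall category proportions.
    n_categories = len(counts[0])
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cats)

    if p_e == 1.0:
        # All ratings fall into one category; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)
```

With five raters (as in the abstract's one expert plus four students) and two categories such as approve and reject, each row of `counts` would be a pair that sums to 5.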
41. 【2601.03791】Do LLMs Really Memorize Personally Identifiable Information? Revisiting PII Leakage with a Cue-Controlled Memorization Framework
Link: https://arxiv.org/abs/2601.03791
Authors: Xiaoyu Luo, Yiyi Chen, Qiongxiu Li, Johannes Bjerva
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Personally Identifiable Information, Large Language Models, Personally Identifiable, Identifiable Information, Large Language
Comments: 20 pages, 13 figures
Abstract:Large Language Models (LLMs) have been reported to "leak" Personally Identifiable Information (PII), with successful PII reconstruction often interpreted as evidence of memorization. We propose a principled revision of memorization evaluation for LLMs, arguing that PII leakage should be evaluated under low lexical cue conditions, where target PII cannot be reconstructed through prompt-induced generalization or pattern completion. We formalize Cue-Resistant Memorization (CRM) as a cue-controlled evaluation framework and a necessary condition for valid memorization evaluation, explicitly conditioning on prompt-target overlap cues. Using CRM, we conduct a large-scale multilingual re-evaluation of PII leakage across 32 languages and multiple memorization paradigms. Revisiting reconstruction-based settings, including verbatim prefix-suffix completion and associative reconstruction, we find that their apparent effectiveness is driven primarily by direct surface-form cues rather than by true memorization. When such cues are controlled for, reconstruction success diminishes substantially. We further examine cue-free generation and membership inference, both of which exhibit extremely low true positive rates. Overall, our results suggest that previously reported PII leakage is better explained by cue-driven behavior than by genuine memorization, highlighting the importance of cue-controlled evaluation for reliably quantifying privacy-relevant memorization in LLMs.
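To make the idea of a "low lexical cue" condition concrete, here is a rough sketch that measures how much of a target string is already given away by the prompt and keeps only pairs below a cutoff. The token-overlap measure, the function names, and the cutoff are assumptions for illustration; the abstract does not specify how CRM quantifies prompt-target overlap.

```python
import re


def token_overlap(prompt, target):
    """Fraction of distinct target tokens that also occur in the prompt."""
    def tokenize(text):
        return set(re.findall(r"\w+", text.lower()))

    target_tokens = tokenize(target)
    if not target_tokens:
        return 0.0
    return len(target_tokens & tokenize(prompt)) / len(target_tokens)


def low_cue_subset(pairs, max_overlap=0.0):
    """Keep (prompt, target) pairs whose overlap does not exceed the cutoff."""
    return [(p, t) for p, t in pairs if token_overlap(p, t) <= max_overlap]
```

Under this stand-in measure, a prompt that already contains a person's name would count as a strong cue for an email address built from that name, so a successful reconstruction in that case says little about memorization.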
42. 【2601.03790】NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning
Link: https://arxiv.org/abs/2601.03790
Authors: Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Neologism-aware machine translation, translate source sentences, Neologism-aware machine, machine translation, machine translation aims
Comments:
Abstract:Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation using a Wiktionary search tool. Specifically, we first create a new dataset for neologism-aware machine translation and develop a search tool based on Wiktionary. The new dataset covers 16 languages and 75 translation directions and is derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search tool is also constructed from around 3 million cleaned records of the Wiktionary dump. We then use it for training the translation agent with reinforcement learning (RL) and evaluating the accuracy of neologism-aware machine translation. Based on this, we also propose an RL training framework that contains a novel reward design and an adaptive rollout generation approach by leveraging "translation difficulty" to further improve the translation quality of translation agents using our search tool.
43. 【2601.03786】Compact Example-Based Explanations for Language Models
Link: https://arxiv.org/abs/2601.03786
Authors: Loris Schoenegger, Benjamin Roth
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: estimation methods quantify, influence estimation methods, Training data, Training data influence, data influence estimation
Comments: 8 pages
Abstract:Training data influence estimation methods quantify the contribution of training documents to a model's output, making them a promising source of information for example-based explanations. As humans cannot interpret thousands of documents, only a small subset of the training data can be presented as an explanation. Although the choice of which documents to include directly affects explanation quality, previous evaluations of such systems have largely ignored any selection strategies. To address this, we propose a novel selection relevance score, a retraining-free metric that quantifies how useful a set of examples is for explaining a model's output. We validate this score through fine-tuning experiments, confirming that it can predict whether a set of examples supports or undermines the model's predictions. Using this metric, we further show that common selection strategies often underperform random selection. Motivated by this finding, we propose a strategy that balances influence and representativeness, enabling better use of selection budgets than naively selecting the highest-ranking examples.
44. 【2601.03785】Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents
Link: https://arxiv.org/abs/2601.03785
Authors: Dehao Tao, Guoliang Ma, Yongfeng Huang, Minghu Jiang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language model, continuity-a stable thematic, stable thematic frame, temporally adjacent exchanges-yet, exhibit topic continuity-a
Comments:
Abstract:Human-agent dialogues often exhibit topic continuity (a stable thematic frame that evolves through temporally adjacent exchanges), yet most large language model (LLM) agent memory systems fail to preserve it. Existing designs follow a fragmentation-compensation paradigm: they first break dialogue streams into isolated utterances for storage, then attempt to restore coherence via embedding-based retrieval. This process irreversibly damages narrative and causal flow, while biasing retrieval towards lexical similarity. We introduce Membox, a hierarchical memory architecture centered on a Topic Loom that continuously monitors dialogue in a sliding-window fashion, grouping consecutive same-topic turns into coherent "memory boxes" at storage time. Sealed boxes are then linked by a Trace Weaver into long-range event-timeline traces, recovering macro-topic recurrences across discontinuities. Experiments on LoCoMo demonstrate that Membox achieves up to 68% F1 improvement on temporal reasoning tasks, outperforming competitive baselines (e.g., Mem0, A-MEM). Notably, Membox attains these gains while using only a fraction of the context tokens required by existing methods, highlighting a superior balance between efficiency and effectiveness. By explicitly modeling topic continuity, Membox offers a cognitively motivated mechanism for enhancing both coherence and efficiency in LLM agents.
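The storage-time grouping of consecutive same-topic turns can be pictured with a toy sketch. Everything here is an assumption for illustration: the word-level Jaccard similarity stands in for whatever topic detector the Topic Loom actually uses, and the names `group_into_boxes`, `window`, and `threshold` are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def group_into_boxes(turns, window=2, threshold=0.2):
    """Group consecutive turns into boxes while the topic appears to continue.

    A new turn joins the current box if it is similar enough to the last
    `window` turns of that box; otherwise the box is sealed and a new one
    is started.
    """
    boxes = []
    for turn in turns:
        if boxes:
            recent = " ".join(boxes[-1][-window:])
            if jaccard(recent, turn) >= threshold:
                boxes[-1].append(turn)
                continue
        boxes.append([turn])
    return boxes
```

The point of the sketch is only the control flow: turns are never stored in isolation, so adjacent exchanges on one topic stay together instead of being reassembled later by retrieval.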
45. 【2601.03783】HearSay Benchmark: Do Audio LLMs Leak What They Hear?
Link: https://arxiv.org/abs/2601.03783
Authors: Jin Wang, Liang Lin, Kaiwen Luo, Weiliu Wang, Yitian Chen, Moayad Aloqaily, Xuehai Tang, Zhenhong Zhou, Kun Wang, Li Sun, Qingsong Wen
Categories: Computation and Language (cs.CL)
Keywords: Audio Large Language, Large Language Models, Large Language, remain largely unexplored, achieved remarkable progress
Comments:
Abstract:While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step to investigate whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces HearSay, a comprehensive benchmark constructed from over 22,000 real-world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline involving automated profiling and human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on HearSay yield three critical findings: (1) Significant Privacy Leakage: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender and effectively profiling social attributes. (2) Insufficient Safety Mechanisms: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy-intruding requests, exhibiting near-zero refusal rates for physiological traits. (3) Reasoning Amplifies Risk: Chain-of-Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations. These findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment. The codes and dataset are available at this https URL
46. 【2601.03779】Tracing the complexity profiles of different linguistic phenomena through the intrinsic dimension of LLM representations
Link: https://arxiv.org/abs/2601.03779
Authors: Marco Baroni, Emily Cheng, Iria deDios-Flores, Francesca Franzon
Categories: Computation and Language (cs.CL)
Keywords: differentially characterize formal, layers differentially characterize, intrinsic dimension, explore the intrinsic, differentially characterize
Comments:
Abstract:We explore the intrinsic dimension (ID) of LLM representations as a marker of linguistic complexity, asking if different ID profiles across LLM layers differentially characterize formal and functional complexity. We find the formal contrast between sentences with multiple coordinated or subordinated clauses to be reflected in ID differences whose onset aligns with a phase of more abstract linguistic processing independently identified in earlier work. The functional contrasts between sentences characterized by right branching vs. center embedding or unambiguous vs. ambiguous relative clause attachment are also picked up by ID, but in a less marked way, and they do not correlate with the same processing phase. Further experiments using representational similarity and layer ablation confirm the same trends. We conclude that ID is a useful marker of linguistic complexity in LLMs, that it makes it possible to differentiate between different types of complexity, and that it points to similar stages of linguistic processing across disparate LLMs.
47. 【2601.03775】Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations
Link: https://arxiv.org/abs/2601.03775
Authors: Pingjun Hong, Benjamin Roth
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, true decision process, prior studies suggest, model true decision
Comments:
Abstract:Large Language Models (LLMs) can produce verbalized self-explanations, yet prior studies suggest that such rationales may not reliably reflect the model's true decision process. We ask whether these explanations nevertheless help users predict model behavior, operationalized as counterfactual simulatability. Using StrategyQA, we evaluate how well humans and LLM judges can predict a model's answers to counterfactual follow-up questions, with and without access to the model's chain-of-thought or post-hoc explanations. We compare LLM-generated counterfactuals with pragmatics-based perturbations as alternative ways to construct test cases for assessing the potential usefulness of explanations. Our results show that self-explanations consistently improve simulation accuracy for both LLM judges and humans, but the degree and stability of gains depend strongly on the perturbation strategy and judge strength. We also conduct a qualitative analysis of free-text justifications written by human users when predicting the model's behavior, which provides evidence that access to explanations helps humans form more accurate predictions on the perturbed questions.
48. 【2601.03752】Evaluation of Multilingual LLMs Personalized Text Generation Capabilities Targeting Groups and Social-Media Platforms
Link: https://arxiv.org/abs/2601.03752
Authors: Dominik Macko
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: generate multilingual coherent, multilingual coherent text, recent years, generate multilingual, multilingual coherent
Comments:
Abstract:The capabilities of large language models to generate coherent multilingual text have continuously improved in recent years, which raises concerns about their potential misuse. Previous research has shown that they can be misused for generation of personalized disinformation in multiple languages. It has also been observed that personalization negatively affects detectability of machine-generated texts; however, this has been studied in the English language only. In this work, we examine this phenomenon across 10 languages, focusing not only on potential misuse of personalization capabilities, but also on the potential benefits they offer. Overall, we cover 1080 combinations of various personalization aspects in the prompts, for which the texts are generated by 16 distinct language models (17,280 texts in total). Our results indicate that there are differences in personalization quality of the generated texts when targeting demographic groups and when targeting social-media platforms across languages. Personalization towards platforms affects detectability of the generated texts to a greater extent, especially in English, where the personalization quality is the highest.
49. 【2601.03746】Whose Facts Win? LLM Source Preferences under Knowledge Conflicts
Link: https://arxiv.org/abs/2601.03746
Authors: Jakob Schuster, Vagrant Gautam, Katja Markert
Categories: Computation and Language (cs.CL)
Keywords: large language models, retrieval-augmented generation pipelines, language models, generation pipelines, large language
Comments: Data and code: https://github.com/JaSchuste/llm-source-preference
Abstract:As large language models (LLMs) are more frequently used in retrieval-augmented generation pipelines, it is increasingly relevant to study their behavior under knowledge conflicts. Thus far, the role of the source of the retrieved information has gone unexamined. We address this gap with a novel framework to investigate how source preferences affect LLM resolution of inter-context knowledge conflicts in English, motivated by interdisciplinary research on credibility. With a comprehensive, tightly-controlled evaluation of 13 open-weight LLMs, we find that LLMs prefer institutionally-corroborated information (e.g., government or newspaper sources) over information from people and social media. However, these source preferences can be reversed by simply repeating information from less credible sources. To mitigate repetition effects and maintain consistent preferences, we propose a novel method that reduces repetition bias by up to 99.8%, while also maintaining at least 88.8% of original preferences. We release all data and code to encourage future work on credibility and source preferences in knowledge-intensive NLP.
50. 【2601.03743】O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL
Link: https://arxiv.org/abs/2601.03743
Authors: Yi Yao, He Zhu, Piaohong Wang, Jincheng Ren, Xinlong Yang, Qianben Chen, Xiaowan Li, Dingfeng Shi, Jiaxian Li, Qiexiang Wang, Sinuo Wang, Xinpeng Liu, Jiaqi Wu, Minghao Liu, Wangchunshu Zhou
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, open-source large language, large language, largely attributed, attributed to disparities
Comments: 22 pages
Abstract:The performance gap between closed-source and open-source large language models (LLMs) is largely attributed to disparities in access to high-quality training data. To bridge this gap, we introduce a novel framework for the automated synthesis of sophisticated, research-grade instructional data. Our approach centers on a multi-agent workflow where collaborative AI agents simulate complex tool-integrated reasoning to generate diverse and high-fidelity data end-to-end. Leveraging this synthesized data, we develop a two-stage training strategy that integrates supervised fine-tuning with a novel reinforcement learning method, designed to maximize model alignment and capability. Extensive experiments demonstrate that our framework empowers open-source models across multiple scales, enabling them to achieve new state-of-the-art performance on the major deep research benchmark. This work provides a scalable and effective pathway for advancing open-source LLMs without relying on proprietary data or models.
51. 【2601.03733】RadDiff: Describing Differences in Radiology Image Sets with Natural Language
Link: https://arxiv.org/abs/2601.03733
Authors: Xiaoxian Shen, Yuhui Zhang, Sahithi Ankireddy, Xiaohan Wang, Maya Varma, Henry Guo, Curtis Langlotz, Serena Yeung-Levy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Keywords: image sets differ, generating clinical insights, radiology image sets, sets differ, differ is critical
Comments:
Abstract:Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff's versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.
52. 【2601.03727】Stuttering-Aware Automatic Speech Recognition for Indonesian Language
Link: https://arxiv.org/abs/2601.03727
Authors: Fadhil Muhammad, Alwin Djuliansah, Adrian Aryaputra Hamzah, Kurniawati Azizah
Categories: Computation and Language (cs.CL)
Keywords: achieved remarkable performance, Automatic speech recognition, virtually non-existent, Automatic speech, systems have achieved
Comments: Preprint
Abstract:Automatic speech recognition systems have achieved remarkable performance on fluent speech but continue to degrade significantly when processing stuttered speech, a limitation that is particularly acute for low-resource languages like Indonesian where specialized datasets are virtually non-existent. To overcome this scarcity, we propose a data augmentation framework that generates synthetic stuttered audio by injecting repetitions and prolongations into fluent text through a combination of rule-based transformations and large language models followed by text-to-speech synthesis. We apply this synthetic data to fine-tune a pre-trained Indonesian Whisper model using transfer learning, enabling the architecture to adapt to dysfluent acoustic patterns without requiring large-scale real-world recordings. Our experiments demonstrate that this targeted synthetic exposure consistently reduces recognition errors on stuttered speech while maintaining performance on fluent segments, validating the utility of synthetic data pipelines for developing more inclusive speech technologies in under-represented languages.
53. 【2601.03717】MIND: From Passive Mimicry to Active Reasoning through Capability-Aware Multi-Perspective CoT Distillation
Link: https://arxiv.org/abs/2601.03717
Authors: Jin Cui, Jiaqi Guo, Jiepeng Zhou, Ruixuan Yang, Jiayi Lu, Jiajun Xu, Jiangcheng Song, Boran Zhao, Pengju Ren
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Language Models, Large Language, practical resource constraints, smaller models
Comments: 13 pages, 8 figures
Abstract:While Large Language Models (LLMs) have emerged with remarkable capabilities in complex tasks through Chain-of-Thought reasoning, practical resource constraints have sparked interest in transferring these abilities to smaller models. However, achieving both domain performance and cross-domain generalization remains challenging. Existing approaches typically restrict students to following a single golden rationale and treat different reasoning paths independently. Due to distinct inductive biases and intrinsic preferences, alongside the student's evolving capacity and reasoning preferences during training, a teacher's "optimal" rationale could act as out-of-distribution noise. This misalignment leads to a degeneration of the student's latent reasoning distribution, causing suboptimal performance. To bridge this gap, we propose MIND, a capability-adaptive framework that transitions distillation from passive mimicry to active cognitive construction. We synthesize diverse teacher perspectives through a novel "Teaching Assistant" network. By employing a Feedback-Driven Inertia Calibration mechanism, this network utilizes inertia-filtered training loss to align supervision with the student's current adaptability, effectively enhancing performance while mitigating catastrophic forgetting. Extensive experiments demonstrate that MIND achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks, and our sophisticated latent space analysis further confirms the mechanism of reasoning ability internalization.
54. 【2601.03714】Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR
Link: https://arxiv.org/abs/2601.03714
Authors: Yunhao Liang, Ruixuan Ying, Bo Li, Hong Li, Kai Yan, Qingwen Li, Min Yang, Okamoto Satoshi, Zhe Cui, Shiwen Ni
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: exceeding ten times, achieve high-ratio vision-text, tokens exceeding ten, mapping approach, claiming to decode
Comments:
Abstract:DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch - which drives DeepSeek-OCR's performance?" By employing sentence-level and word-level semantic corruption, we isolate the model's intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR's performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR's capability boundaries and offers essential insights for future optimizations of the vision-text compression paradigm. We release all data, results and scripts used in this study at this https URL.
55. 【2601.03707】AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions
Link: https://arxiv.org/abs/2601.03707
Authors: Hengxing Cai, Yijie Rao, Ligang Huang, Zanyang Zhong, Jinhan Dong, Jingjun Tan, Wenhao Lu, Renxin Zhong
Categories: Computation and Language (cs.CL)
Keywords: Unmanned Aerial Vehicle, Existing Unmanned Aerial, Existing Unmanned, Vision-Language Navigation, large-scale UAV VLN
Comments:
Abstract:Existing Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) datasets face issues such as dependence on virtual environments, lack of naturalness in instructions, and limited scale. To address these challenges, we propose AirNav, a large-scale UAV VLN benchmark constructed from real urban aerial data, rather than synthetic environments, with natural and diverse instructions. Additionally, we introduce AirVLN-R1, which combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to enhance performance and generalization. The feasibility of the model is preliminarily evaluated through real-world tests. Our dataset and code are publicly available.
56. 【2601.03700】ADEPT: Adaptive Dynamic Early-Exit Process for Transformers
Link: https://arxiv.org/abs/2601.03700
Authors: Sangmin Yoo, Srikanth Malla, Chiho Choi, Wei D. Lu, Joon Hee Choi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: models imposes significant, significant computational workloads, imposes significant computational, large language models, language models imposes
Comments: 11 figures, 8 tables, 22 pages
Abstract:The inference of large language models imposes significant computational workloads, often requiring the processing of billions of parameters. Although early-exit strategies have proven effective in reducing computational demands by halting inference earlier, they apply either to only the first token in the generation phase or at the prompt level in the prefill phase. Thus, the Key-Value (KV) cache for skipped layers remains a bottleneck for subsequent token generation, limiting the benefits of early exit. We introduce ADEPT (Adaptive Dynamic Early-exit Process for Transformers), a novel approach designed to overcome this issue and enable dynamic early exit in both the prefill and generation phases. The proposed adaptive token-level early-exit mechanism adjusts computation dynamically based on token complexity, optimizing efficiency without compromising performance. ADEPT further enhances the KV generation procedure by decoupling sequential dependencies in skipped layers, making token-level early exit more practical. Experimental results demonstrate that ADEPT improves efficiency by up to 25% in language generation tasks and achieves a 4x speed-up in downstream classification tasks, with up to a 45% improvement in performance.
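As background for the term "early exit", here is a generic sketch of a confidence-based exit rule for a single token: stop at the first layer whose intermediate prediction is confident enough. This is a common textbook-style criterion used as a stand-in; it is not ADEPT's mechanism, which the abstract describes only as depending on token complexity, and it does not model the KV-cache handling. The function names and the threshold are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_layer(per_layer_logits, threshold=0.9):
    """Return (layer_index, predicted_class) for one token.

    per_layer_logits[i] holds the logits an intermediate classifier would
    produce after layer i. Exit at the first layer whose top probability
    reaches `threshold`; otherwise fall back to the last layer.
    """
    if not per_layer_logits:
        raise ValueError("need logits for at least one layer")
    best = 0
    for index, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            return index, best
    return len(per_layer_logits) - 1, best
```

An "easy" token exits after few layers and a "hard" token runs the full stack, which is where the compute savings of token-level early exit come from.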
57. 【2601.03699】RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models
Link: https://arxiv.org/abs/2601.03699
Authors: Quy-Anh Dang, Chris Ngo, Truong-Son Hy
Categories: Computation and Language (cs.CL)
Keywords: large language models, language models, safety-critical applications, ensuring their robustness, large language
Comments:
Abstract:As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment. Code: this https URL
58. 【2601.03698】Evaluation Framework for AI Creativity: A Case Study Based on Story Generation
Link: https://arxiv.org/abs/2601.03698
Authors: Pharath Sathya, Yin Jou Huang, Fei Cheng
Categories: Computation and Language (cs.CL)
Keywords: Evaluating creative text, text generation remains, existing reference-based metrics, reference-based metrics fail, Evaluating creative
Comments: Work in progress
Abstract:Evaluating creative text generation remains a challenge because existing reference-based metrics fail to capture the subjective nature of creativity. We propose a structured evaluation framework for AI story generation comprising four components (Novelty, Value, Adherence, and Resonance) and eleven sub-components. Using controlled story generation via "Spike Prompting" and a crowdsourced study of 115 readers, we examine how different creative components shape both immediate and reflective human creativity judgments. Our findings show that creativity is evaluated hierarchically rather than cumulatively, with different dimensions becoming salient at different stages of judgment, and that reflective evaluation substantially alters both ratings and inter-rater agreement. Together, these results support the effectiveness of our framework in revealing dimensions of creativity that are obscured by reference-based evaluation.
59. 【2601.03682】From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs
Link: https://arxiv.org/abs/2601.03682
Authors: Shaojie Wang,Liang Zhang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Recent studies reveal, Recent studies, logical relationship understanding, exhibit limited logical, logical reasoning abilities
Abstract:Recent studies reveal that large language models (LLMs) exhibit limited logical reasoning abilities in mathematical problem-solving, instead often relying on pattern-matching and memorization. We systematically analyze this limitation, focusing on logical relationship understanding, which is a core capability underlying genuine logical reasoning, and reveal that errors related to this capability account for over 90% of incorrect predictions, with Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) failing to substantially reduce these errors. To address this bottleneck, we propose First-Step Logical Reasoning (FSLR), a lightweight training framework targeting logical relationship understanding. Our key insight is that the first planning step (identifying which variables to use and which operation to apply) encourages the model to derive logical relationships directly from the problem statement. By training models on this isolated step, FSLR provides explicit supervision for logical relationship understanding, unlike CoT-SFT, which implicitly embeds such relationships within complete solution trajectories. Extensive experiments across multiple models and datasets demonstrate that FSLR consistently outperforms CoT-SFT under both in-distribution and out-of-distribution settings, with average improvements of 3.2% and 4.6%, respectively. Moreover, FSLR achieves 4-6x faster training and reduces training token consumption by over 80%.
60. 【2601.03676】Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis
Link: https://arxiv.org/abs/2601.03676
Authors: Yifan Wei,Li Du,Xiaoyan Yu,Yang Feng,Angsheng Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models, power-law distribution, Entropy-based Post-training data
Comments: The code and data for our methods and experiments are available at https://github.com/weiyifan1023/STEPS
Abstract:Large Language Models (LLMs) and agent-based systems often struggle with compositional generalization due to a data bottleneck in which complex skill combinations follow a long-tailed, power-law distribution, limiting both instruction-following performance and generalization in agent-centric tasks. To address this challenge, we propose STEPS, a Skill Taxonomy guided Entropy-based Post-training data Synthesis framework for generating compositionally challenging data. STEPS explicitly targets compositional generalization by uncovering latent relationships among skills and organizing them into an interpretable, hierarchical skill taxonomy using structural information theory. Building on this taxonomy, we formulate data synthesis as a constrained information maximization problem, selecting skill combinations that maximize marginal structural information within the hierarchy while preserving semantic coherence. Experiments on challenging instruction-following benchmarks show that STEPS outperforms existing data synthesis baselines, while also yielding improved compositional generalization in downstream agent-based evaluations.
61. 【2601.03672】Sandwich Reasoning: An Answer-Reasoning-Answer Approach for Low-Latency Query Correction
Link: https://arxiv.org/abs/2601.03672
Authors: Chen Zhang,Kepu Zhang,Jiatong Zhang,Xiao Zhang,Jun Xu
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: critical entry point, demanding high accuracy, high accuracy strictly, modern search pipelines, real-time latency constraints
Abstract:Query correction is a critical entry point in modern search pipelines, demanding high accuracy strictly within real-time latency constraints. Chain-of-Thought (CoT) reasoning improves accuracy but incurs prohibitive latency for real-time query correction. A potential solution is to output an answer before reasoning to reduce latency; however, under autoregressive decoding, the early answer is independent of subsequent reasoning, preventing the model from leveraging its reasoning capability to improve accuracy. To address this issue, we propose Sandwich Reasoning (SandwichR), a novel approach that explicitly aligns a fast initial answer with post-hoc reasoning, enabling low-latency query correction without sacrificing reasoning-aware accuracy. SandwichR follows an Answer-Reasoning-Answer paradigm, producing an initial correction, an explicit reasoning process, and a final refined correction. To align the initial answer with post-reasoning insights, we design a consistency-aware reinforcement learning (RL) strategy: a dedicated consistency reward enforces alignment between the initial and final corrections, while margin-based rejection sampling prioritizes borderline samples where reasoning drives the most impactful corrective gains. Additionally, we construct a high-quality query correction dataset, addressing the lack of specialized benchmarks for complex query correction. Experimental results demonstrate that SandwichR achieves SOTA accuracy comparable to standard CoT while delivering a 40-70% latency reduction, resolving the latency-accuracy trade-off in online search.
62. 【2601.03671】NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models
Link: https://arxiv.org/abs/2601.03671
Authors: Weiqi Liu,Yongliang Miao,Haiyan Zhao,Yanguang Liu,Mengnan Du
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: large language models, Neuron-level interpretation, individual neurons respond, language models, large language
Abstract:Neuron-level interpretation in large language models (LLMs) is fundamentally challenged by widespread polysemanticity, where individual neurons respond to multiple distinct semantic concepts. Existing single-pass interpretation methods struggle to faithfully capture such multi-concept behavior. In this work, we propose NeuronScope, a multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process. NeuronScope explicitly deconstructs neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines each explanation using neuron activation feedback. Experiments demonstrate that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation compared to single-pass baselines.
63. 【2601.03670】DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management
Link: https://arxiv.org/abs/2601.03670
Authors: Zhitong Chen,Kai Yin,Xiangjue Dong,Chengkai Liu,Xiangpeng Li,Yiming Xiao,Bo Li,Junwei Ma,Ali Mostafavi,James Caverlee
Categories: Computation and Language (cs.CL)
Keywords: Accurate question answering, Accurate question, disaster management requires, existing benchmarks built, management requires reasoning
Abstract:Accurate question answering (QA) in disaster management requires reasoning over uncertain and conflicting information, a setting poorly captured by existing benchmarks built on clean evidence. We introduce DisastQA, a large-scale benchmark of 3,000 rigorously verified questions (2,000 multiple-choice and 1,000 open-ended) spanning eight disaster types. The benchmark is constructed via a human-LLM collaboration pipeline with stratified sampling to ensure balanced coverage. Models are evaluated under varying evidence conditions, from closed-book to noisy evidence integration, enabling separation of internal knowledge from reasoning under imperfect information. For open-ended QA, we propose a human-verified keypoint-based evaluation protocol emphasizing factual completeness over verbosity. Experiments with 20 models reveal substantial divergences from general-purpose leaderboards such as MMLU-Pro. While recent open-weight models approach proprietary systems in clean settings, performance degrades sharply under realistic noise, exposing critical reliability gaps for disaster response. All code, data, and evaluation resources are available at this https URL.
64. 【2601.03669】eTracer: Towards Traceable Text Generation via Claim-Level Grounding
Link: https://arxiv.org/abs/2601.03669
Authors: Bohao Chu,Qianli Wang,Hendrik Damm,Hui Wang,Ula Muhabbek,Elisabeth Livingstone,Christoph M. Friedrich,Norbert Fuhr
Categories: Computation and Language (cs.CL)
Keywords: high-stakes biomedical domain, efficiently verified, biomedical domain, high-stakes biomedical, grounding
Comments: ACL 2026 Conference Submission (8 main pages)
Abstract:How can system-generated responses be efficiently verified, especially in the high-stakes biomedical domain? To address this challenge, we introduce eTracer, a plug-and-play framework that enables traceable text generation by grounding claims against contextual evidence. Through post-hoc grounding, each response claim is aligned with contextual evidence that either supports or contradicts it. Building on claim-level grounding results, eTracer not only enables users to precisely trace responses back to their contextual source but also quantifies response faithfulness, thereby enabling the verifiability and trustworthiness of generated responses. Experiments show that our claim-level grounding approach alleviates the limitations of conventional grounding methods in aligning generated statements with contextual sentence-level evidence, resulting in substantial improvements in overall grounding quality and user verification efficiency. The code and data are available at this https URL.
65. 【2601.03666】e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings
Link: https://arxiv.org/abs/2601.03666
Authors: Haonan Chen,Sicheng Gao,Radu Timofte,Tetsuya Sakai,Zhicheng Dou
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Modern information systems, Modern information, types of items, text query, video clip
Abstract:Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at this https URL.
66. 【2601.03649】SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation
Link: https://arxiv.org/abs/2601.03649
Authors: Gengyang Li,Wang Cai,Yifeng Gao,Yunfang Wu
Categories: Computation and Language (cs.CL)
Keywords: increase inference cost, substantially increase inference, prompting improves reasoning, prompting improves, inference cost
Comments: 14 pages, 8 figures
Abstract:Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token "/think", indicating an information bottleneck. Building on this observation, SyncThink monitors the model's own reasoning-transition signal and terminates reasoning. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy using 656 generated tokens and 28.68 s latency, compared to 61.22 percent, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 absolute accuracy by preventing over-thinking.
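To put the reported averages side by side, the short Python sketch below derives the relative savings implied by the figures quoted in the abstract. The numbers are copied from the abstract; the variable and function names are illustrative only and do not come from the paper or its code.

```python
# Average figures quoted in the SyncThink abstract.
SYNCTHINK = {"accuracy": 62.00, "tokens": 656, "latency_s": 28.68}
FULL_COT = {"accuracy": 61.22, "tokens": 2141, "latency_s": 92.01}


def relative_reduction(baseline: float, value: float) -> float:
    """Return the percentage by which `value` is smaller than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Savings of SyncThink relative to full CoT decoding.
token_saving = relative_reduction(FULL_COT["tokens"], SYNCTHINK["tokens"])
latency_saving = relative_reduction(FULL_COT["latency_s"], SYNCTHINK["latency_s"])
speedup = FULL_COT["latency_s"] / SYNCTHINK["latency_s"]
accuracy_delta = SYNCTHINK["accuracy"] - FULL_COT["accuracy"]

print(f"generated tokens saved: {token_saving:.1f}%")
print(f"latency saved:          {latency_saving:.1f}%")
print(f"speedup:                {speedup:.2f}x")
print(f"accuracy change:        {accuracy_delta:+.2f} points")
```

Read this way, the abstract's numbers correspond to roughly 69% fewer generated tokens and about a 3.2x speedup, with average accuracy 0.78 points higher than full CoT decoding.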
67. 【2601.03648】ELO: Efficient Layer-Specific Optimization for Continual Pretraining of Multilingual LLMs
Link: https://arxiv.org/abs/2601.03648
Authors: HanGyeol Yoo,ChangSu Choi,Minjun Kim,Seohyun Song,SeungWoo Song,Inho Won,Jongyoul Park,Cheoneum Park,KyungTae Lim
Categories: Computation and Language (cs.CL)
Keywords: efficient layer-specific optimization, enhance continual pretraining, multilingual large language, large language models, layer-specific optimization
Comments: 12 pages, Accepted to EACL 2026 (Industrial Track)
Abstract:We propose an efficient layer-specific optimization (ELO) method designed to enhance continual pretraining (CP) for specific languages in multilingual large language models (MLLMs). This approach addresses the common challenges of high computational cost and degradation of source language performance associated with traditional CP. The ELO method consists of two main stages: (1) ELO Pretraining, where a small subset of specific layers, identified in our experiments as the critically important first and last layers, is detached from the original MLLM and trained on the target language. This significantly reduces not only the number of trainable parameters but also the total parameters computed during the forward pass, minimizing GPU memory consumption and accelerating the training process. (2) Layer Alignment, where the newly trained layers are reintegrated into the original model, followed by a brief full fine-tuning step on a small dataset to align the parameters. Experimental results demonstrate that the ELO method achieves a training speedup of up to 6.46 times compared to existing methods, while improving target language performance by up to 6.2% on qualitative benchmarks and effectively preserving source language (English) capabilities.
68. 【2601.03645】LLM-MC-Affect: LLM-Based Monte Carlo Modeling of Affective Trajectories and Latent Ambiguity for Interpersonal Dynamic Insight
Link: https://arxiv.org/abs/2601.03645
Authors: Yu-Zheng Lin,Bono Po-Jen Shih,John Paul Martin Encinas,Elizabeth Victoria Abraham Achom,Karan Himanshu Patel,Jesus Horacio Pacheco,Sicong Shao,Jyotikrishna Dass,Soheil Salehi,Pratik Satam
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: Emotional coordination, real time, core property, property of human, shapes how relational
Abstract:Emotional coordination is a core property of human interaction that shapes how relational meaning is constructed in real time. While text-based affect inference has become increasingly feasible, prior approaches often treat sentiment as a deterministic point estimate for individual speakers, failing to capture the inherent subjectivity, latent ambiguity, and sequential coupling found in mutual exchanges. We introduce LLM-MC-Affect, a probabilistic framework that characterizes emotion not as a static label, but as a continuous latent probability distribution defined over an affective space. By leveraging stochastic LLM decoding and Monte Carlo estimation, the methodology approximates these distributions to derive high-fidelity sentiment trajectories that explicitly quantify both central affective tendencies and perceptual ambiguity. These trajectories enable a structured analysis of interpersonal coupling through sequential cross-correlation and slope-based indicators, identifying leading or lagging influences between interlocutors. To validate the interpretive capacity of this approach, we utilize teacher-student instructional dialogues as a representative case study, where our quantitative indicators successfully distill high-level interaction insights such as effective scaffolding. This work establishes a scalable and deployable pathway for understanding interpersonal dynamics, offering a generalizable solution that extends beyond education to broader social and behavioral research.
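The two ingredients named in the abstract, Monte Carlo estimation over stochastic decoding and sequential cross-correlation between speakers, can be illustrated with a small Python sketch. This is not the authors' implementation: `toy_scorer` is a hypothetical stand-in for one stochastic LLM sentiment call, and every function name, the sample count, and the lag range are assumptions made for illustration.

```python
import random
import statistics


def toy_scorer(utterance: str, rng: random.Random) -> float:
    """Hypothetical stand-in for one stochastic LLM sentiment score."""
    return rng.gauss(0.3, 0.1)


def mc_affect(score_once, utterance, n_samples=50, seed=0):
    """Estimate central tendency and ambiguity from repeated noisy scores."""
    rng = random.Random(seed)
    samples = [score_once(utterance, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def lagged_correlation(a, b, lag):
    """Correlate a[t] with b[t + lag]; a positive lag means `a` leads `b`."""
    if lag >= 0:
        xs, ys = a[: len(a) - lag], b[lag:]
    else:
        xs, ys = a[-lag:], b[: len(b) + lag]
    return pearson(xs, ys)


def best_lag(a, b, max_lag=3):
    """Return the lag in [-max_lag, max_lag] with the highest correlation."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda k: lagged_correlation(a, b, k))


# Toy trajectories: the student echoes the teacher one turn later.
teacher = [0.1, 0.5, -0.2, 0.7, 0.0, -0.4, 0.3, 0.6, -0.1, 0.2]
student = [0.0] + teacher[:-1]

mean_score, ambiguity = mc_affect(toy_scorer, "Nice try, look at step two again.")
print(f"mean={mean_score:.2f} ambiguity(std)={ambiguity:.2f}")
print("lag with strongest coupling:", best_lag(teacher, student))
```

In this toy setup the strongest coupling appears at a lag of one turn, which is the kind of leading or lagging signal the abstract describes. A real pipeline would replace `toy_scorer` with sampled model outputs.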
69. 【2601.03641】Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning
Link: https://arxiv.org/abs/2601.03641
Authors: Zheng Wu,Xingyu Lou,Xinbei Ma,Yansi Li,Weiwen Liu,Weinan Zhang,Jun Wang,Zhuosheng Zhang
Categories: Computation and Language (cs.CL)
Keywords: Large Language Model, Large Language, Language Model, agents significantly extend, dynamic environments
Abstract:Large Language Model (LLM)-based agents significantly extend the utility of LLMs by interacting with dynamic environments. However, enabling agents to continually learn new tasks without catastrophic forgetting remains a critical challenge, known as the stability-plasticity dilemma. In this work, we argue that this dilemma fundamentally arises from the failure to explicitly distinguish between common knowledge shared across tasks and conflicting knowledge introduced by task-specific interference. To address this, we propose Agent-Dice, a parameter fusion framework based on directional consensus evaluation. Concretely, Agent-Dice disentangles knowledge updates through a two-stage process: geometric consensus filtering to prune conflicting gradients, and curvature-based importance weighting to amplify shared semantics. We provide a rigorous theoretical analysis that establishes the validity of the proposed fusion scheme and offers insight into the origins of the stability-plasticity dilemma. Extensive experiments on GUI agents and tool-use agent domains demonstrate that Agent-Dice exhibits outstanding continual learning performance with minimal computational overhead and parameter updates.
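The abstract does not spell out the fusion rule, so the Python sketch below only illustrates the general idea of filtering conflicting update directions before a weighted merge. It swaps in a simple per-coordinate sign-agreement rule for the paper's geometric consensus filtering, and plain positive scalars for its curvature-based importance weights. The function name, arguments, and toy values are all hypothetical.

```python
def consensus_merge(task_updates, importances=None):
    """Merge per-task parameter updates coordinate by coordinate.

    Assumption: a sign-agreement filter stands in for "geometric consensus",
    and `importances` (positive scalars, one per task) stand in for
    curvature-based weights. This is not the Agent-Dice algorithm itself.
    """
    if importances is None:
        importances = [1.0] * len(task_updates)
    dim = len(task_updates[0])
    merged = []
    for j in range(dim):
        column = [update[j] for update in task_updates]
        # Elect a direction from the importance-weighted sum of updates.
        weighted_sum = sum(w * v for w, v in zip(importances, column))
        if weighted_sum == 0:
            merged.append(0.0)  # no consensus direction: leave unchanged
            continue
        sign = 1.0 if weighted_sum > 0 else -1.0
        # Prune updates that point against the elected direction.
        kept = [(w, v) for w, v in zip(importances, column) if v * sign > 0]
        total_weight = sum(w for w, _ in kept)
        # Importance-weighted average of the agreeing updates.
        merged.append(sum(w * v for w, v in kept) / total_weight)
    return merged


# Three toy task updates over three parameters.
updates = [
    [0.2, -0.5, 0.1],
    [0.4, 0.3, 0.1],
    [0.3, -0.1, -0.1],
]
print(consensus_merge(updates))
print(consensus_merge(updates, importances=[1.0, 3.0, 1.0]))
```

With uniform weights, the second coordinate keeps only the two negative updates and discards the conflicting positive one. Raising the second task's importance flips the elected direction, which shows how the weighting and the filtering interact in this simplified version.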
70. 【2601.03630】Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases
Link: https://arxiv.org/abs/2601.03630
Authors: Hui Huang,Xuanxin Wu,Muyun Yang,Yuki Arase
Categories: Computation and Language (cs.CL)
Keywords: Large Reasoning Models, Large Reasoning, systematic comparison investigating, investigating whether Large, Reasoning Models
Comments: 11 pages, 4 figures
Abstract:This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior instruction-following capabilities in evaluation contexts; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) LRMs still exhibit strong biases toward superficial quality. To improve the robustness against biases, we propose PlanJudge, an evaluation strategy that prompts the model to generate an explicit evaluation plan before execution. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in both LRMs and standard LLMs.
71. 【2601.03627】Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines
Link: https://arxiv.org/abs/2601.03627
Authors: Jean Seo,Gibaeg Kim,Kihun Shin,Seungseop Lim,Hyunkyung Lee,Wooseok Han,Jongwon Lee,Eunho Yang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: introduce EPAG, designed for Evaluating, Pre-consultation Ability, framework designed, Ability of LLMs
Comments: EACL 2026 Industry
Abstract:We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that an increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline on this https URL, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.
72. 【2601.03615】Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation
Link: https://arxiv.org/abs/2601.03615
Authors: Binh Nguyen,Thai Le
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: Audio Language Models, Language Models, Audio Language, offer a promising, classifiers by providing
Comments: Preprint for ACL 2026 submission
Abstract:Audio Language Models (ALMs) offer a promising shift towards explainable audio deepfake detection (ADD), moving beyond black-box classifiers by providing some level of transparency into their predictions via reasoning traces. This necessitates a new class of model robustness analysis: robustness of the predictive reasoning under adversarial attacks, which goes beyond the existing paradigm that mainly focuses on shifts in the final predictions (e.g., fake vs. real). To analyze such reasoning shifts, we introduce a forensic auditing framework to evaluate the robustness of ALMs' reasoning under adversarial attacks in three inter-connected dimensions: acoustic perception, cognitive coherence, and cognitive dissonance. Our systematic analysis reveals that explicit reasoning does not universally enhance robustness. Instead, we observe a bifurcation: for models exhibiting robust acoustic perception, reasoning acts as a defensive "shield", protecting them from adversarial attacks. However, for others, it imposes a performance "tax", particularly under linguistic attacks, which reduce cognitive coherence and increase attack success rate. Crucially, even when classification fails, high cognitive dissonance can serve as a silent alarm, flagging potential manipulation. Overall, this work provides a critical evaluation of the role of reasoning in forensic audio deepfake analysis and its vulnerabilities.
73. 【2601.03605】DiVA: Fine-grained Factuality Verification with Agentic-Discriminative Verifier
Link: https://arxiv.org/abs/2601.03605
Authors: Hui Huang,Muyun Yang,Yuki Arase
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, fueling growing interest, advancements of Large, Language Models
Abstract:Despite the significant advancements of Large Language Models (LLMs), their factuality remains a critical challenge, fueling growing interest in factuality verification. Existing research on factuality verification primarily conducts binary judgments (e.g., correct or incorrect), which fails to distinguish varying degrees of error severity. This limits its utility for applications such as fine-grained evaluation and preference optimization. To bridge this gap, we propose the Agentic Discriminative Verifier (DiVA), a hybrid framework that synergizes the agentic search capabilities of generative models with the precise scoring aptitude of discriminative models. We also construct a new benchmark, FGVeriBench, as a robust testbed for fine-grained factuality verification. Experimental results on FGVeriBench demonstrate that our DiVA significantly outperforms existing methods on factuality verification for both general and multi-hop questions.
74. 【2601.03597】From Chains to Graphs: Self-Structured Reasoning for General-Domain LLMs
Link: https://arxiv.org/abs/2601.03597
Authors: Yingjian Chen,Haoran Liu,Yinhong Liu,Sherry T. Tong,Aosong Feng,Jinghui Lu,Juntao Zhang,Yusuke Iwasawa,Yutaka Matsuo,Irene Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, strong reasoning ability, Language Models, reasoning
Abstract:Large Language Models (LLMs) show strong reasoning ability in open-domain question answering, yet their reasoning processes are typically linear and often logically inconsistent. In contrast, real-world reasoning requires integrating multiple premises and solving subproblems in parallel. Existing methods, such as Chain-of-Thought (CoT), express reasoning in a linear textual form, which may appear coherent but frequently leads to inconsistent conclusions. Recent approaches rely on externally provided graphs and do not explore how LLMs can construct and use their own graph-structured reasoning, particularly in open-domain QA. To fill this gap, we explore graph-structured reasoning of LLMs in general-domain question answering. We propose Self-Graph Reasoning (SGR), a framework that enables LLMs to explicitly represent their reasoning process as a structured graph before producing the final answer. We further construct a graph-structured reasoning dataset that merges multiple candidate reasoning graphs into refined graph structures for model training. Experiments on five QA benchmarks across both general and specialized domains show that SGR consistently improves reasoning consistency and yields a 17.74% gain over the base model. The LLaMA-3.3-70B model fine-tuned with SGR performs comparably to GPT-4o and surpasses Claude-3.5-Haiku, demonstrating the effectiveness of graph-structured reasoning.
75. 【2601.03595】Controllable LLM Reasoning via Sparse Autoencoder-Based Steering
Link: https://arxiv.org/abs/2601.03595
Authors: Yi Fang,Wenjie Wang,Mingfeng Xue,Boyi Deng,Fengli Xu,Dayiheng Liu,Fuli Feng
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Reasoning Models, exhibit human-like cognitive, human-like cognitive reasoning, Large Reasoning, Reasoning Models
Comments: Under Review
Abstract:Large Reasoning Models (LRMs) exhibit human-like cognitive reasoning strategies (e.g., backtracking, cross-verification) during the reasoning process, which improve their performance on complex tasks. Currently, reasoning strategies are autonomously selected by LRMs themselves. However, such autonomous selection often produces inefficient or even erroneous reasoning paths. To make reasoning more reliable and flexible, it is important to develop methods for controlling reasoning strategies. Existing methods struggle to control fine-grained reasoning strategies due to conceptual entanglement in LRMs' hidden states. To address this, we leverage Sparse Autoencoders (SAEs) to decompose strategy-entangled hidden states into a disentangled feature space. To identify the few strategy-specific features from the vast pool of SAE features, we propose SAE-Steering, an efficient two-stage feature identification pipeline. SAE-Steering first recalls features that amplify the logits of strategy-specific keywords, filtering out over 99% of features, and then ranks the remaining features by their control effectiveness. Using the identified strategy-specific features as control vectors, SAE-Steering outperforms existing methods by over 15% in control effectiveness. Furthermore, controlling reasoning strategies can redirect LRMs from erroneous paths to correct ones, achieving a 7% absolute accuracy improvement.
76. 【2601.03589】OLA: Output Language Alignment in Code-Switched LLM Interactions
Link: https://arxiv.org/abs/2601.03589
Authors: Juhyun Oh,Haneul Yoo,Faiz Ghifari Haznitrama,Alice Oh
Categories: Computation and Language (cs.CL)
Keywords: poses fundamental challenges, challenges for large, language, output language, LLMs
Abstract:Code-switching, alternating between languages within a conversation, is natural for multilingual users, yet poses fundamental challenges for large language models (LLMs). When a user code-switches in their prompt to an LLM, they typically do not specify the expected language of the LLM response, and thus LLMs must infer the output language from contextual and pragmatic cues. We find that current LLMs systematically fail to align with this expectation, responding in undesired languages even when cues are clear to humans. We introduce OLA, a benchmark to evaluate LLMs' Output Language Alignment in code-switched interactions. OLA focuses on Korean-English code-switching and spans cases from simple intra-sentential mixing to instruction-content mismatches. Even frontier models frequently misinterpret implicit language expectations, exhibiting a bias toward non-English responses. We further show this bias generalizes beyond Korean to Chinese and Indonesian pairs. Models also show instability through mid-response switching and language intrusions. Chain-of-Thought prompting fails to resolve these errors, indicating weak pragmatic reasoning about output language. However, Code-Switching Aware DPO with minimal data (about 1K examples) substantially reduces misalignment, suggesting these failures stem from insufficient alignment rather than fundamental limitations. Our results highlight the need to align multilingual LLMs with users' implicit expectations in real-world code-switched interactions.
77. 【2601.03578】PsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health Ethics
Link: https://arxiv.org/abs/2601.03578
Authors: Yaling Shen,Stephanie Fong,Yiwen Jiang,Zimu Wang,Feilong Tang,Qingyang Xu,Xiangyu Zhao,Zhongxing Xu,Jiahe Liu,Jinpeng Hu,Dominic Dwyer,Zongyuan Ge
Categories: Computation and Language (cs.CL)
Keywords: applications necessitates robust, necessitates robust frameworks, evaluating professional safety, health applications necessitates, large language models
Comments: 17 pages
Abstract:The increasing integration of large language models (LLMs) into mental health applications necessitates robust frameworks for evaluating professional safety alignment. Current evaluative approaches primarily rely on refusal-based safety signals, which offer limited insight into the nuanced behaviors required in clinical practice. In mental health, clinically inadequate refusals can be perceived as unempathetic and discourage help-seeking. To address this gap, we move beyond refusal-centric metrics and introduce PsychEthicsBench, the first principle-grounded benchmark based on Australian psychology and psychiatry guidelines, designed to evaluate LLMs' ethical knowledge and behavioral responses through multiple-choice and open-ended tasks with fine-grained ethicality annotations. Empirical results across 14 models show that refusal rates are poor indicators of ethical behavior, revealing a significant divergence between safety triggers and clinical appropriateness. Notably, we find that domain-specific fine-tuning can degrade ethical robustness, as several specialized models underperform their base backbones in ethical alignment. PsychEthicsBench provides a foundation for systematic, jurisdiction-aware evaluation of LLMs in mental health, encouraging more responsible development in this domain.
78. 【2601.03570】How Do Large Language Models Learn Concepts During Continual Pre-Training?
Link: https://arxiv.org/abs/2601.03570
Authors: Barry Menglong Yao (1), Sha Li (2), Yunzhi Yao (3), Minqian Liu (2), Zaishuo Xia (1), Qifan Wang (4), Lifu Huang (1) ((1) UC Davis, (2) Virginia Tech, (3) UCLA, (4) Meta AI)
Categories: Computation and Language (cs.CL)
Keywords: abstract mental representations, Human beings primarily, abstract mental, primarily understand, understand the world
Comments: 12 pages, 19 figures
Abstract: Human beings primarily understand the world through concepts (e.g., dog), abstract mental representations that structure perception, reasoning, and learning. However, how large language models (LLMs) acquire, retain, and forget such concepts during continual pretraining remains poorly understood. In this work, we study how individual concepts are acquired and forgotten, as well as how multiple concepts interact through interference and synergy. We link these behavioral dynamics to LLMs' internal Concept Circuits, computational subgraphs associated with specific concepts, and incorporate Graph Metrics to characterize circuit structure. Our analysis reveals: (1) LLM concept circuits provide a non-trivial, statistically significant signal of concept learning and forgetting; (2) concept circuits exhibit a stage-wise temporal pattern during continual pretraining, with an early increase followed by gradual decrease and stabilization; (3) concepts with larger learning gains tend to exhibit greater forgetting under subsequent training; (4) semantically similar concepts induce stronger interference than weakly related ones; (5) concepts differ in their transferability, with some significantly facilitating the learning of others. Together, our findings offer a circuit-level view of concept learning dynamics and inform the design of more interpretable and robust concept-aware training strategies for LLMs.
79. 【2601.03559】DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs
Link: https://arxiv.org/abs/2601.03559
Authors: Shidong Cao, Hongzhan Lin, Yuxuan Gu, Ziyang Luo, Jing Ma
Categories: Computation and Language (cs.CL)
Keywords: mathematical problem solving, early mistakes propagate, mistakes propagate irreversibly, improves multi-step mathematical, multi-step mathematical problem
Comments: DiffCoT improves multi-step LLM reasoning by applying diffusion-based iterative denoising to correct intermediate Chain-of-Thought steps
Abstract:Chain-of-Thought (CoT) reasoning improves multi-step mathematical problem solving in large language models but remains vulnerable to exposure bias and error accumulation, as early mistakes propagate irreversibly through autoregressive decoding. In this work, we propose DiffCoT, a diffusion-styled CoT framework that reformulates CoT reasoning as an iterative denoising process. DiffCoT integrates diffusion principles at the reasoning-step level via a sliding-window mechanism, enabling unified generation and retrospective correction of intermediate steps while preserving token-level autoregression. To maintain causal consistency, we further introduce a causal diffusion noise schedule that respects the temporal structure of reasoning chains. Extensive experiments on three multi-step CoT reasoning benchmarks across diverse model backbones demonstrate that DiffCoT consistently outperforms existing CoT preference optimization methods, yielding improved robustness and error-correction capability in CoT reasoning.
80. 【2601.03553】Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios
Link: https://arxiv.org/abs/2601.03553
Authors: Sangyub Lee, Heedou Kim, Hyeoncheol Kim
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Language Models, Large Language, operations remains absent, remains absent
Comments: This work was accepted at AAAI 2026 social good track
Abstract: The use of Large Language Models (LLMs) in police operations is growing, yet an evaluation framework tailored to police operations remains absent. While LLMs' responses may not always be legally incorrect, their unverified use can still lead to severe issues such as unlawful arrests and improper evidence collection. To address this, we propose PAS (Police Action Scenarios), a systematic framework covering the entire evaluation process. Applying this framework, we constructed a novel QA dataset from over 8,000 official documents and established key metrics validated through statistical analysis with police expert judgements. Experimental results show that commercial LLMs struggle with our new police-related tasks, particularly in providing fact-based recommendations. This study highlights the necessity of an expandable evaluation framework to ensure reliable AI-driven police operations. We release our data and prompt template.
81. 【2601.03549】EASLT: Emotion-Aware Sign Language Translation
Link: https://arxiv.org/abs/2601.03549
Authors: Guobin Tu, Di Weng
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: complex cross-modal task, cross-modal task requiring, Non-Manual Signals, Sign Language Translation, Manual Signals
Comments:
Abstract: Sign Language Translation (SLT) is a complex cross-modal task requiring the integration of Manual Signals (MS) and Non-Manual Signals (NMS). While recent gloss-free SLT methods have made strides in translating manual gestures, they frequently overlook the semantic criticality of facial expressions, resulting in ambiguity when distinct concepts share identical manual articulations. To address this, we present EASLT (Emotion-Aware Sign Language Translation), a framework that treats facial affect not as auxiliary information, but as a robust semantic anchor. Unlike methods that relegate facial expressions to a secondary role, EASLT incorporates a dedicated emotional encoder to capture continuous affective dynamics. These representations are integrated via a novel Emotion-Aware Fusion (EAF) module, which adaptively recalibrates spatio-temporal sign features based on affective context to resolve semantic ambiguities. Extensive evaluations on the PHOENIX14T and CSL-Daily benchmarks demonstrate that EASLT establishes advanced performance among gloss-free methods, achieving BLEU-4 scores of 26.15 and 22.80, and BLEURT scores of 61.0 and 57.8, respectively. Ablation studies confirm that explicitly modeling emotion effectively decouples affective semantics from manual dynamics, significantly enhancing translation fidelity. Code is available at this https URL.
82. 【2601.03546】Value-Action Alignment in Large Language Models under Privacy-Prosocial Conflict
Link: https://arxiv.org/abs/2601.03546
Authors: Guanyu Chen, Chenxiao Yu, Xiyang Hu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Keywords: Large language models, simulate decision-making tasks, decision-making tasks involving, tasks involving personal, Large language
Comments:
Abstract:Large language models (LLMs) are increasingly used to simulate decision-making tasks involving personal data sharing, where privacy concerns and prosocial motivations can push choices in opposite directions. Existing evaluations often measure privacy-related attitudes or sharing intentions in isolation, which makes it difficult to determine whether a model's expressed values jointly predict its downstream data-sharing actions as in real human behaviors. We introduce a context-based assessment protocol that sequentially administers standardized questionnaires for privacy attitudes, prosocialness, and acceptance of data sharing within a bounded, history-carrying session. To evaluate value-action alignments under competing attitudes, we use multi-group structural equation modeling (MGSEM) to identify relations from privacy concerns and prosocialness to data sharing. We propose Value-Action Alignment Rate (VAAR), a human-referenced directional agreement metric that aggregates path-level evidence for expected signs. Across multiple LLMs, we observe stable but model-specific Privacy-PSA-AoDS profiles, and substantial heterogeneity in value-action alignment.
83. 【2601.03543】EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory
Link: https://arxiv.org/abs/2601.03543
Authors: Ye Shen, Dun Pei, Yiqiu Guo, Junying Wang, Yijin Guo, Zicheng Zhang, Qi Jia, Jun Zhou, Guangtao Zhai
Categories: Computation and Language (cs.CL)
Keywords: large language models, leveraging long-range conversational, long-range conversational memory, lack systematic evaluation, language models
Comments: 14 pages, 7 figures, 8 tables
Abstract: Despite recent advances in understanding and leveraging long-range conversational memory, existing benchmarks still lack systematic evaluation of large language models (LLMs) across diverse memory dimensions, particularly in multi-session settings. In this work, we propose EvolMem, a new benchmark for assessing multi-session memory capabilities of LLMs and agent systems. EvolMem is grounded in cognitive psychology and encompasses both declarative and non-declarative memory, further decomposed into multiple fine-grained abilities. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. This framework enables scalable generation of multi-session conversations with controllable complexity, accompanied by sample-specific evaluation guidelines. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions. Moreover, agent memory mechanisms do not necessarily enhance LLMs' capabilities and often exhibit notable efficiency limitations. Data and code will be released at this https URL.
84. 【2601.03542】Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models
Link: https://arxiv.org/abs/2601.03542
Authors: Xukai Liu, Ye Liu, Jipeng Zhang, Yanghai Zhang, Kai Zhang, Qi Liu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, facts remains unclear, internally compose multiple, compose multiple facts, multiple facts remains
Comments: 16 pages, 18 figures
Abstract: Large language models (LLMs) perform well on multi-hop reasoning, yet how they internally compose multiple facts remains unclear. Recent work proposes the hop-aligned circuit hypothesis, suggesting that bridge entities are computed sequentially across layers before later-hop answers. Through systematic analyses on real-world multi-hop queries, we show that this hop-aligned assumption does not generalize: later-hop answer entities can become decodable earlier than bridge entities, a phenomenon we call layer-order inversion, which strengthens with total hops. To explain this behavior, we propose a probabilistic recall-and-extract framework that models multi-hop reasoning as broad probabilistic recall in shallow MLP layers followed by selective extraction in deeper attention layers. This framework is empirically validated through systematic probing analyses, reinterpreting prior layer-wise decoding evidence, explaining chain-of-thought gains, and providing a mechanistic diagnosis of multi-hop failures despite correct single-hop knowledge. Code is available at this https URL.
85. 【2601.03540】DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing
Link: https://arxiv.org/abs/2601.03540
Authors: Hongzhi Zhang, Yuanze Hu, Tinghai Zhang, Jia Fu, Tao Wang, Junwei Jing, Zhaoxin Fan, Qi Wang, Ruiming Tang, Han Li, Guorui Zhou, Kun Gai
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Language Models, Large Language, evolution of Large, progress in Deep
Comments:
Abstract:The evolution of Large Language Models (LLMs) towards autonomous agents has catalyzed progress in Deep Research. While retrieval capabilities are well-benchmarked, the post-retrieval synthesis stage--where agents must digest massive amounts of context and consolidate fragmented evidence into coherent, long-form reports--remains under-evaluated due to the subjectivity of open-ended writing. To bridge this gap, we introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We leverage high-quality survey papers as gold standards, reverse-engineering research requests and constructing "Oracle Contexts" from their bibliographies to isolate synthesis from retrieval noise. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization), transforming subjective judgment into verifiable metrics. Experiments across 96 tasks reveal that synthesizing information from hundreds of references remains a significant challenge. Our results demonstrate that agentic plan-and-write workflows significantly outperform single-turn generation, effectively reducing hallucinations and improving adherence to complex structural constraints.
86. 【2601.03537】STAR-S: Improving Safety Alignment through Self-Taught Reasoning on Safety Rules
Link: https://arxiv.org/abs/2601.03537
Authors: Di Wu, Yanyan Zhao, Xin Lu, Mingzhe Li, Bing Qin
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, deployment of Large, Defending against jailbreak, Language Models
Comments: 19 pages, 4 figures
Abstract: Defending against jailbreak attacks is crucial for the safe deployment of Large Language Models (LLMs). Recent research has attempted to improve safety by training models to reason over safety rules before responding. However, a key issue lies in determining what form of safety reasoning effectively defends against jailbreak attacks, which is difficult to explicitly design or directly obtain. To address this, we propose STAR-S (Self-TAught Reasoning based on Safety rules), a framework that integrates the learning of safety rule reasoning into a self-taught loop. The core of STAR-S involves eliciting reasoning and reflection guided by safety rules, then leveraging fine-tuning to enhance safety reasoning. Repeating this process creates a synergistic cycle. Improvements in the model's reasoning and interpretation of safety rules allow it to produce better reasoning data under safety rule prompts, which is then utilized for further training. Experiments show that STAR-S effectively defends against jailbreak attacks, outperforming baselines. Code is available at: this https URL.
87. 【2601.03534】Persona-aware and Explainable Bikeability Assessment: A Vision-Language Model Approach
Link: https://arxiv.org/abs/2601.03534
Authors: Yilong Dai, Ziyi Wang, Chenguang Wang, Kexin Zhou, Yiheng Qian, Susu Xu, Xiang Yan
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Keywords: creating cyclist-friendly cities, advancing sustainable urban, sustainable urban transportation, requires incorporating users', incorporating users' perceptions
Comments:
Abstract:Bikeability assessment is essential for advancing sustainable urban transportation and creating cyclist-friendly cities, and it requires incorporating users' perceptions of safety and comfort. Yet existing perception-based bikeability assessment approaches face key limitations in capturing the complexity of road environments and adequately accounting for heterogeneity in subjective user perceptions. This paper proposes a persona-aware Vision-Language Model framework for bikeability assessment with three novel contributions: (i) theory-grounded persona conditioning based on established cyclist typology that generates persona-specific explanations via chain-of-thought reasoning; (ii) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and (iii) AI-enabled data augmentation that creates controlled paired data to isolate infrastructure variable impacts. To test and validate this framework, we developed a panoramic image-based crowdsourcing system and collected 12,400 persona-conditioned assessments from 427 cyclists. Experiment results show that the proposed framework offers competitive bikeability rating prediction while uniquely enabling explainable factor attribution.
88. 【2601.03531】PALM-Bench: A Comprehensive Benchmark for Personalized Audio-Language Models
Link: https://arxiv.org/abs/2601.03531
Authors: Yuwen Wang, Xinyuan Qian, Tian-Hao Zhang, Jiaran Gao, Yuchen Pan, Xin Wang, Zhou Pan, Chen Wei, Yiming Wang
Categories: Computation and Language (cs.CL)
Keywords: Large Audio-Language Models, demonstrated strong performance, Large Audio-Language, Audio-Language Models, understanding and generation
Comments: Under review
Abstract: Large Audio-Language Models (LALMs) have demonstrated strong performance in audio understanding and generation. Yet, our extensive benchmarking reveals that their behavior is largely generic (e.g., summarizing spoken content) and fails to adequately support personalized question answering (e.g., summarizing what my best friend says). In contrast, humans condition their interpretation and decision-making on each individual's personal context. To bridge this gap, we formalize the task of Personalized LALMs (PALM) for recognizing personal concepts and reasoning within personal context. Moreover, we create the first benchmark (PALM-Bench) to foster methodological advances in PALM and enable structured evaluation on several tasks across multi-speaker scenarios. Our extensive experiments on representative open-source LALMs show that existing training-free prompting and supervised fine-tuning strategies, while yielding improvements, remain limited in modeling personalized knowledge and transferring it across tasks robustly. Data and code will be released.
89. 【2601.03515】Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents
Link: https://arxiv.org/abs/2601.03515
Authors: Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik Hamann, Jingrui He, Hanghang Tong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: multimodal large language, large language model, evolves over time, long-term conversational memory, memory
Comments: 34 pages, 18 figures
Abstract:Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.
90. 【2601.03511】IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation
Link: https://arxiv.org/abs/2601.03511
Authors: Hossein Hosseini Kasnavieh, Gholamreza Haffari, Chris Leckie, Adel N. Toosi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: produce sufficiently high-quality, sufficiently high-quality output, specific LLM, major challenge, produce sufficiently
Comments:
Abstract: A major challenge for the operation of large language models (LLMs) is how to predict whether a specific LLM will produce sufficiently high-quality output for a given query. Existing approaches rely on external classifiers, most commonly BERT-based models, which suffer from limited context windows, constrained representational capacity, and additional computational overhead. We propose IntroLM, a method that uses introspective tokens to enable causal language models to predict their own output quality during the prefilling phase without affecting generation. By introducing token-conditional LoRA that activates only for the introspective token, the model learns to predict the output quality for a given query while preserving the original backbone behavior and avoiding external evaluators. On question answering benchmarks, IntroLM applied to Qwen3 8B achieves a ROC AUC of 90 percent for success prediction, outperforming a DeBERTa classifier by 14 percent. When integrated into multi-model routing systems, IntroLM achieves superior cost-performance tradeoffs, reducing latency by up to 33 percent and large model usage by up to 50 percent at matched reliability.
91. 【2601.03506】Reasoning Pattern Alignment Merging for Adaptive Reasoning
Link: https://arxiv.org/abs/2601.03506
Authors: Zhaofeng Zhong, Wei Yuan, Tong Chen, Xiangyu Zhao, Quoc Viet Hung Nguyen, Hongzhi Yin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Recent large reasoning, incurring unnecessary computation, made substantial progress, Recent large, complex reasoning tasks
Comments: 16 pages, 4 figures
Abstract:Recent large reasoning models (LRMs) have made substantial progress in complex reasoning tasks, yet they often generate lengthy reasoning paths for every query, incurring unnecessary computation and latency. Existing speed-up approaches typically rely on retraining the model or designing sophisticated prompting, which are either prohibitively expensive or highly sensitive to the input and prompt formulation. In this work, we study model merging as a lightweight alternative for efficient reasoning: by combining a long chain-of-thought (Long-CoT) reasoning model with a Short-CoT instruction model, we obtain an adaptive reasoner without training from scratch or requiring large-scale additional data. Building on this idea, we propose Reasoning Pattern Alignment Merging (RPAM), a layer-wise model merging framework based on feature alignment to facilitate query-adaptive reasoning. RPAM first constructs a small pattern-labeled calibration set that assigns each query an appropriate reasoning pattern. It then optimizes layer-wise merging coefficients by aligning the merged model's intermediate representations with those of the selected model, while a contrastive objective explicitly pushes them away from the non-selected model. Experiments on seven widely used reasoning benchmarks show that RPAM substantially reduces inference cost while maintaining strong performance. Upon article acceptance, we will provide open-source code to reproduce experiments for RPAM.
92. 【2601.03505】Beyond Perplexity: A Lightweight Benchmark for Knowledge Retention in Supervised Fine-Tuning
Link: https://arxiv.org/abs/2601.03505
Authors: Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Farinaz Koushanfar
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Language Models, Large Language, injecting domain knowledge, standard approach
Comments:
Abstract: Supervised Fine-Tuning (SFT) is a standard approach for injecting domain knowledge into Large Language Models (LLMs). However, relying on validation perplexity to monitor training is often insufficient, as it confounds stylistic mimicry with genuine factual internalization. To address this, we introduce the Knowledge Retention (KR) Test, a lightweight, corpus-grounded evaluation framework designed to distinguish factual learning from linguistics. KR-Test utilizes automatically generated contrastive examples to measure likelihood preferences for correct versus incorrect continuations, requiring no instruction tuning or generative decoding. We validate the framework's integrity through a "blind vs. oracle" baseline analysis. Furthermore, we demonstrate the diagnostic capabilities of KR-Test by analyzing the training dynamics of Low-Rank Adaptation (LoRA). By exposing the fine-grained dissociation between linguistic convergence and knowledge retention, KR-Test enhances the interpretability of fine-tuning dynamics.
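The abstract describes measuring whether a model prefers a correct continuation over an incorrect one by likelihood. A minimal sketch of that kind of check follows; it is not the paper's implementation. The tuple layout, the `preference_rate` name, and the lookup-table scorer standing in for a real language model are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical layout: (context, correct continuation, incorrect continuation).
ContrastivePair = Tuple[str, str, str]


def preference_rate(
    pairs: Iterable[ContrastivePair],
    log_likelihood: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the correct continuation scores strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one contrastive pair")
    wins = sum(
        1
        for context, correct, incorrect in pairs
        if log_likelihood(context, correct) > log_likelihood(context, incorrect)
    )
    return wins / len(pairs)


# Toy stand-in for a model: a fixed table of made-up log-likelihoods.
TOY_SCORES = {
    ("The capital of France is", " Paris"): -1.0,
    ("The capital of France is", " Rome"): -4.0,
    ("Water boils at", " 100 degrees Celsius"): -3.5,
    ("Water boils at", " 50 degrees Celsius"): -2.0,
}


def toy_log_likelihood(context: str, continuation: str) -> float:
    """Look up the assumed log-likelihood of a continuation given its context."""
    return TOY_SCORES[(context, continuation)]


pairs = [
    ("The capital of France is", " Paris", " Rome"),
    ("Water boils at", " 100 degrees Celsius", " 50 degrees Celsius"),
]
rate = preference_rate(pairs, toy_log_likelihood)
print(f"preference rate: {rate:.2f}")
```

In practice the scorer would sum token log-probabilities from the fine-tuned model; because only likelihoods are compared, no generation or instruction following is needed, which matches the abstract's description.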
93. 【2601.03496】STELLA: Self-Reflective Terminology-Aware Framework for Building an Aerospace Information Retrieval Benchmark
Link: https://arxiv.org/abs/2601.03496
Authors: Bongmin Kim
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Aerospace Information Retrieval, public information retrieval, aerospace industry heavily, industry heavily rely, Information Retrieval Benchmark
Comments: 25 pages, 2 figures
Abstract: Tasks in the aerospace industry heavily rely on searching and reusing large volumes of technical documents, yet there is no public information retrieval (IR) benchmark that reflects the terminology- and query-intent characteristics of this domain. To address this gap, this paper proposes the STELLA (Self-Reflective TErminoLogy-Aware Framework for BuiLding an Aerospace Information Retrieval Benchmark) framework. Using this framework, we introduce the STELLA benchmark, an aerospace-specific IR evaluation set constructed from NASA Technical Reports Server (NTRS) documents via a systematic pipeline that comprises document layout detection, passage chunking, terminology dictionary construction, synthetic query generation, and cross-lingual extension. The framework generates two types of queries: the Terminology Concordant Query (TCQ), which includes the terminology verbatim to evaluate lexical matching, and the Terminology Agnostic Query (TAQ), which utilizes the terminology's description to assess semantic matching. This enables a disentangled evaluation of the lexical and semantic matching capabilities of embedding models. In addition, we combine Chain-of-Density (CoD) and the Self-Reflection method with query generation to improve quality and implement a hybrid cross-lingual extension that reflects real user querying practices. Evaluation of seven embedding models on the STELLA benchmark shows that large decoder-based embedding models exhibit the strongest semantic understanding, while lexical matching methods such as BM25 remain highly competitive in domains where exact lexical matching of technical terms is crucial. The STELLA benchmark provides a reproducible foundation for reliable performance evaluation and improvement of embedding models in aerospace-domain IR tasks. The STELLA benchmark can be found at this https URL.
94. 【2601.03493】Submodular Evaluation Subset Selection in Automatic Prompt Optimization
Link: https://arxiv.org/abs/2601.03493
Authors: Jinming Nian, Zhiyuan Peng, Hongwei Shang, Dae Hoon Park, Yi Fang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: task performance measured, Automatic prompt optimization, optimization reduces manual, randomly sampled evaluation, manual prompt engineering
Comments:
Abstract:Automatic prompt optimization reduces manual prompt engineering, but relies on task performance measured on a small, often randomly sampled evaluation subset as its main source of feedback signal. Despite this, how to select that evaluation subset is usually treated as an implementation detail. We study evaluation subset selection for prompt optimization from a principled perspective and propose SESS, a submodular evaluation subset selection method. We frame selection as maximizing an objective set function and show that, under mild conditions, it is monotone and submodular, enabling greedy selection with theoretical guarantees. Across GSM8K, MATH, and GPQA-Diamond, submodularly selected evaluation subsets can yield better optimized prompts than random or heuristic baselines.
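Greedy selection under a monotone submodular objective, as mentioned in the abstract, can be sketched as follows. The abstract does not state SESS's actual objective, so this example substitutes a standard facility-location function over an assumed pairwise similarity matrix; `facility_location`, `greedy_select`, and the toy matrix are illustrative names and data, not the paper's.

```python
from typing import List, Sequence


def facility_location(selected: Sequence[int], sim: List[List[float]]) -> float:
    """Sum, over every item, of its best similarity to any selected item."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim: List[List[float]], k: int) -> List[int]:
    """Pick up to k indices, each time adding the item with the largest marginal gain."""
    selected: List[int] = []
    remaining = set(range(len(sim)))
    for _ in range(min(k, len(sim))):
        base = facility_location(selected, sim)
        best = max(
            sorted(remaining),  # sorted so ties resolve deterministically
            key=lambda j: facility_location(selected + [j], sim) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy similarity matrix (assumed): items 0-1 form one cluster, items 2-3 another.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
]
picked = greedy_select(sim, k=2)
print(picked)
```

With this toy matrix the two selected items come from different clusters, which is the behavior a coverage-style objective is meant to encourage. For monotone submodular objectives, this greedy procedure carries the classical (1 - 1/e) approximation guarantee.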
95. 【2601.03483】CALM: Culturally Self-Aware Language Models
Link: https://arxiv.org/abs/2601.03483
Authors: Lingzhi Shen, Xiaohao Cai, Yunfei Long, Imran Razzak, Guanming Chen, Shoaib Jameel
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Keywords: diverse cultural contexts, capacity to understand, understand and adapt, adapt to diverse, Cultural
Comments:
Abstract:Cultural awareness in language models is the capacity to understand and adapt to diverse cultural contexts. However, most existing approaches treat culture as static background knowledge, overlooking its dynamic and evolving nature. This limitation reduces their reliability in downstream tasks that demand genuine cultural sensitivity. In this work, we introduce CALM, a novel framework designed to endow language models with cultural self-awareness. CALM disentangles task semantics from explicit cultural concepts and latent cultural signals, shaping them into structured cultural clusters through contrastive learning. These clusters are then aligned via cross-attention to establish fine-grained interactions among related cultural features and are adaptively integrated through a Mixture-of-Experts mechanism along culture-specific dimensions. The resulting unified representation is fused with the model's original knowledge to construct a culturally grounded internal identity state, which is further enhanced through self-prompted reflective learning, enabling continual adaptation and self-correction. Extensive experiments conducted on multiple cross-cultural benchmark datasets demonstrate that CALM consistently outperforms state-of-the-art methods.
96. 【2601.03481】Self-Explaining Hate Speech Detection with Moral Rationales
Link: https://arxiv.org/abs/2601.03481
Authors: Francielle Vargas, Jackson Trager, Diego Alves, Surendrabikram Thapa, Matteo Guida, Berk Atil, Daryna Dementieva, Andrew Smart, Ameeta Agrawal
Categories: Computation and Language (cs.CL)
Keywords: Hate speech detection, surface-level lexical features, speech detection, Hate speech, cultural contextualization
Comments:
Abstract: Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance (e.g., +0.9 and +1.5 F1, respectively) while substantially enhancing explanation faithfulness, increasing IoU F1 (+7.4 pp) and Token F1 (+5.0 pp). Although explanations become more concise, sufficiency improves (+2.3 pp) and fairness remains stable, indicating more faithful rationales without performance or bias trade-offs.
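The abstract reports rationale quality with IoU-based and token-level F1 scores. A minimal sketch of the two underlying overlap measures for a single example follows, assuming rationales are represented as sets of token indices; the function names are hypothetical, and the paper's exact aggregation (for instance, any IoU threshold used before computing IoU F1) is not specified in the abstract.

```python
from typing import Set


def token_iou(pred: Set[int], gold: Set[int]) -> float:
    """Intersection over union of predicted and gold rationale token indices."""
    union = pred | gold
    if not union:
        return 1.0  # both rationales empty: treated as perfect agreement
    return len(pred & gold) / len(union)


def token_f1(pred: Set[int], gold: Set[int]) -> float:
    """Harmonic mean of token-level precision and recall."""
    if not pred and not gold:
        return 1.0  # same empty-vs-empty convention as token_iou
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: the model highlights tokens 1-3, the annotator marked tokens 2-4.
predicted = {1, 2, 3}
annotated = {2, 3, 4}
print(token_iou(predicted, annotated), token_f1(predicted, annotated))
```

Here two of the four tokens in the union overlap, so the IoU is 0.5, while precision and recall are both 2/3, giving a token F1 of 2/3.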
97. 【2601.03474】SegNSP: Revisiting Next Sentence Prediction for Linear Text Segmentation
Link: https://arxiv.org/abs/2601.03474
Authors: José Isidro, Filipe Cunha, Purificação Silvano, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, Ricardo Campos
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: semantically meaningful units, dividing continuous text, natural language processing, Linear text segmentation, language processing
Comments:
Abstract: Linear text segmentation is a long-standing problem in natural language processing (NLP), focused on dividing continuous text into coherent and semantically meaningful units. Despite its importance, the task remains challenging due to the complexity of defining topic boundaries, the variability in discourse structure, and the need to balance local coherence with global context. These difficulties hinder downstream applications such as summarization, information retrieval, and question answering. In this work, we introduce SegNSP, framing linear text segmentation as a next sentence prediction (NSP) task. Although NSP has largely been abandoned in modern pre-training, its explicit modeling of sentence-to-sentence continuity makes it a natural fit for detecting topic boundaries. We propose a label-agnostic NSP approach, which predicts whether the next sentence continues the current topic without requiring explicit topic labels, and enhance it with a segmentation-aware loss combined with harder negative sampling to better capture discourse continuity. Unlike recent proposals that leverage NSP alongside auxiliary topic classification, our approach avoids task-specific supervision. We evaluate our model against established baselines on two datasets, CitiLink-Minutes, for which we establish the first segmentation benchmark, and WikiSection. On CitiLink-Minutes, SegNSP achieves a B-$F_1$ of 0.79, closely aligning with human-annotated topic transitions, while on WikiSection it attains a B-$F_1$ of 0.65, outperforming the strongest reproducible baseline, TopSeg, by 0.17 absolute points. These results demonstrate competitive and robust performance, highlighting the effectiveness of modeling sentence-to-sentence continuity for improving segmentation quality and supporting downstream NLP applications.
98. 【2601.03471】EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning
Link: https://arxiv.org/abs/2601.03471
Authors: Mingyang Wei,Dehai Min,Zewen Liu,Yuzhang Xie,Guanchen Wu,Carl Yang,Max S. Y. Lau,Qi He,Lu Cheng,Wei Jin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: requires synthesizing study, infer disease burden, transmission dynamics, Reliable epidemiological reasoning, Reliable epidemiological
Comments: 21 pages, 3 figures, 12 tables
Click to view abstract
Abstract:Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The subsets respectively evaluate text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control. Experiments on ten open models reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.
99. 【2601.03464】Prompting Underestimates LLM Capability for Time Series Classification
Link: https://arxiv.org/abs/2601.03464
Authors: Dan Schumacher,Erfan Nourbakhsh,Rocky Slavin,Anthony Rios
Categories: Computation and Language (cs.CL)
Keywords: meaningful temporal structure, encode meaningful temporal, large language models, raising doubts, temporal structure
Comments: 8 pages + Appendix and References, 9 figures
Click to view abstract
Abstract:Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model's representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.
100. 【2601.03448】Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks
Link: https://arxiv.org/abs/2601.03448
Authors: Atsuki Yamaguchi,Maggie Mi,Nikolaos Aletras
Categories: Computation and Language (cs.CL)
Keywords: generate text sequences, datasets to generate, Language models, text sequences
Comments:
Click to view abstract
Abstract:Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.
101. 【2601.03444】Grading Scale Impact on LLM-as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale
Link: https://arxiv.org/abs/2601.03444
Authors: Weiyue Li,Minda Zhao,Weixuan Dong,Jiahui Cai,Yuze Wei,Michael Pocress,Yi Li,Wanyan Yuan,Xiaoyue Wang,Ruoyu Hou,Kaiyuan Lou,Wenqi Zeng,Yutong Yang,Yilun Du,Mengyu Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Keywords: Large language models, Large language, prior works demonstrate, language models, automated evaluators
Comments:
Click to view abstract
Abstract:Large language models (LLMs) are increasingly used as automated evaluators, yet prior works demonstrate that these LLM judges often lack consistency in scoring when the prompt is altered. However, the effect of the grading scale itself remains underexplored. We study the LLM-as-a-judge problem by comparing two kinds of raters: humans and LLMs. We collect ratings from both groups on three scales and across six benchmarks that include objective, open-ended subjective, and mixed tasks. Using intraclass correlation coefficients (ICC) to measure absolute agreement, we find that LLM judgments are not perfectly consistent across scales on subjective benchmarks, and that the choice of scale substantially shifts human-LLM agreement, even when within-group panel reliability is high. Aggregated over tasks, the grading scale of 0-5 yields the strongest human-LLM alignment. We further demonstrate that pooled reliability can mask benchmark heterogeneity and reveal systematic subgroup differences in alignment across gender groups, strengthening the importance of scale design and sub-level diagnostics as essential components of LLM-as-a-judge protocols.
102. 【2601.03435】The Critical Role of Aspects in Measuring Document Similarity
Link: https://arxiv.org/abs/2601.03435
Authors: Eftekhar Hossain,Tarnika Hazra,Ahatesham Bhuiyan,Santu Karmaker
Categories: Computation and Language (cs.CL)
Keywords: requires conditioning document, traditional holistic approach, measuring document similarity, conditioning document similarity, interpretable framework
Comments: 24 Pages, 10 Figures, 10 Tables
Click to view abstract
Abstract:We introduce ASPECTSIM, a simple and interpretable framework that requires conditioning document similarity on an explicitly specified aspect, in contrast to the traditional holistic approach to measuring document similarity. Experimenting with a newly constructed benchmark of 26K aspect-document pairs, we found that ASPECTSIM, when implemented with direct GPT-4o prompting, achieves substantially higher human-machine agreement ($\approx$80% higher) than holistic similarity without explicit aspects. These findings underscore the importance of explicitly accounting for aspects when measuring document similarity and highlight the need to revise standard practice. Next, we conducted a large-scale meta-evaluation using 16 smaller open-source LLMs and 9 embedding models with a focus on making ASPECTSIM accessible and reproducible. While directly prompting LLMs to produce ASPECTSIM scores turned out to be ineffective (20-30% human-machine agreement), a simple two-stage refinement improved their agreement by $\approx$140%. Nevertheless, agreement remains well below that of GPT-4o-based models, indicating that smaller open-source LLMs still lag behind large proprietary models in capturing aspect-conditioned similarity.
103. 【2601.03424】Spectral Archaeology: The Causal Topology of Model Evolution
Link: https://arxiv.org/abs/2601.03424
Authors: Valentin Noël
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Behavioral benchmarks, Behavioral, PTCC
Comments: 45 pages, 15 figures, Under Review
Click to view abstract
Abstract:Behavioral benchmarks tell us what a model does, but not how. We introduce a training-free mechanistic probe using attention-graph spectra. Treating each layer as a token graph, we compute algebraic connectivity ($\lambda_2$), smoothness, and spectral entropy. Across 12 models and 10 languages, these measures yield stable "spectral fingerprints" that expose discontinuities missed by standard evaluation. We report four results. (1) Models undergoing specific curriculum transitions (e.g., code-to-chat) show an English-only, syntax-triggered connectivity failure on non-canonical constructions, reaching $\Delta\lambda_2 \approx -0.76$. We term this scar Passive-Triggered Connectivity Collapse (PTCC). Analysis of the Phi lineage reveals that PTCC appears and resolves across developmental stages, implicating brittle curriculum shifts rather than synthetic data per se. (2) PTCC reflects a specialization trade-off: strengthened formal routing at the expense of stylistic flexibility. (3) We identify four recurrent processing strategies; simple frozen-threshold rules enable perfect forensic identification across lineages. (4) Mechanistically, PTCC localizes to a sparse Layer 2 "compensatory patch" of heads that fails under syntactic stress; activation steering can partially restore connectivity, recovering $\approx 38\%$ of lost information flow. Finally, dominant topological regimes track tokenization density more than language identity, suggesting "healthy" geometry varies systematically across scripts. Overall, attention-graph spectra provide a practical tool for auditing and training-regime verification.
104. 【2601.03423】Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models
Link: https://arxiv.org/abs/2601.03423
Authors: Sasha Ronaghi,Chloe Stanwyck,Asad Aali,Amir Ronaghi,Miguel Fuentes,Tina Hernandez-Boussard,Emily Alsentzer
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: fine-tuning requires costly, requires costly retraining, Adapting language models, Adapting language, domain through continued
Comments: 29 pages, 3 figures
Click to view abstract
Abstract:Adapting language models to the clinical domain through continued pretraining and fine-tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6% over UniTE, +41.4% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity.
105. 【2601.03418】PCoA: A New Benchmark for Medical Aspect-Based Summarization With Phrase-Level Context Attribution
Link: https://arxiv.org/abs/2601.03418
Authors: Bohao Chu,Sameh Frihat,Tabea M. G. Pakull,Hendrik Damm,Meijie Li,Ula Muhabbek,Georg Lodde,Norbert Fuhr
Categories: Computation and Language (cs.CL)
Keywords: effective verification requires, verification requires precise, high-stakes medical domains, requires precise attribution, summaries remains challenging
Comments: ACL 2026 Conference Submission (8 main pages)
Click to view abstract
Abstract:Verifying system-generated summaries remains challenging, as effective verification requires precise attribution to the source context, which is especially crucial in high-stakes medical domains. To address this challenge, we introduce PCoA, an expert-annotated benchmark for medical aspect-based summarization with phrase-level context attribution. PCoA aligns each aspect-based summary with its supporting contextual sentences and contributory phrases within them. We further propose a fine-grained, decoupled evaluation framework that independently assesses the quality of generated summaries, citations, and contributory phrases. Through extensive experiments, we validate the quality and consistency of the PCoA dataset and benchmark several large language models on the proposed task. Experimental results demonstrate that PCoA provides a reliable benchmark for evaluating system-generated summaries with phrase-level context attribution. Furthermore, comparative experiments show that explicitly identifying relevant sentences and contributory phrases before summarization can improve overall quality. The data and code are available at this https URL.
106. 【2601.03417】Implicit Graph, Explicit Retrieval: Towards Efficient and Interpretable Long-horizon Memory for Large Language Models
Link: https://arxiv.org/abs/2601.03417
Authors: Xin Zhang,Kailai Yang,Hao Li,Chenyue Li,Qiyu Wei,Sophia Ananiadou
Categories: Computation and Language (cs.CL)
Keywords: applications increasingly require, Long-horizon applications increasingly, increasingly require large, require large language, long contexts
Comments: 11 pages, 5 figures
Click to view abstract
Abstract:Long-horizon applications increasingly require large language models (LLMs) to answer queries when relevant evidence is sparse and dispersed across very long contexts. Existing memory systems largely follow two paradigms: explicit structured memories offer interpretability but often become brittle under long-context overload, while latent memory mechanisms are efficient and stable yet difficult to inspect. We propose LatentGraphMem, a memory framework that combines implicit graph memory with explicit subgraph retrieval. LatentGraphMem stores a graph-structured memory in latent space for stability and efficiency, and exposes a task-specific subgraph retrieval interface that returns a compact symbolic subgraph under a fixed budget for downstream reasoning and human inspection. During training, an explicit graph view is materialized to interface with a frozen reasoner for question-answering supervision. At inference time, retrieval is performed in latent space and only the retrieved subgraph is externalized. Experiments on long-horizon benchmarks across multiple model scales show that LatentGraphMem consistently outperforms representative explicit-graph and latent-memory baselines, while enabling parameter-efficient adaptation and flexible scaling to larger reasoners without introducing large symbolic artifacts.
107. 【2601.03403】Tigrinya Number Verbalization: Rules, Algorithm, and Implementation
Link: https://arxiv.org/abs/2601.03403
Authors: Fitsum Gaim,Issayas Tesfamariam
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: ordinal number verbalization, present a systematic, systematic formalization, cardinal and ordinal, computational resources
Comments:
Click to view abstract
Abstract:We present a systematic formalization of Tigrinya cardinal and ordinal number verbalization, addressing a gap in computational resources for the language. This work documents the canonical rules governing the expression of numerical values in spoken Tigrinya, including the conjunction system, scale words, and special cases for dates, times, and currency. We provide a formal algorithm for number-to-word conversion and release an open-source implementation. Evaluation of frontier large language models (LLMs) reveals significant gaps in their ability to accurately verbalize Tigrinya numbers, underscoring the need for explicit rule documentation. This work serves language modeling, speech synthesis, and accessibility applications targeting Tigrinya-speaking communities.
108. 【2601.03401】Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms
Link: https://arxiv.org/abs/2601.03401
Authors: Ruihan Zhang,Jun Sun
Categories: Computation and Language (cs.CL)
Keywords: Large language models, heterogeneous text corpora, Large language, raising serious concerns, proprietary or personal
Comments:
Click to view abstract
Abstract:Large language models (LLMs) are increasingly trained on massive, heterogeneous text corpora, raising serious concerns about the unauthorised use of proprietary or personal data during model training. In this work, we address the problem of data protection against unwanted model learning in a realistic black-box setting. We propose Disclaimer Injection, a novel data-level defence that renders text unlearnable to LLMs. Rather than relying on model-side controls or explicit data removal, our approach exploits the models' own alignment mechanisms by injecting carefully designed alignment-triggering disclaimers that prevent effective learning. Through layer-wise analysis, we find that fine-tuning on such protected data induces persistent activation of alignment-related layers, causing alignment constraints to override task learning even on common inputs. Consequently, models trained on such data exhibit substantial and systematic performance degradation compared to standard fine-tuning. Our results identify alignment behaviour as a previously unexplored lever for data protection and, to our knowledge, present the first practical method for restricting data learnability at LLM scale without requiring access to or modification of the training pipeline.
109. 【2601.03396】Breaking the Assistant Mold: Modeling Behavioral Variation in LLM Based Procedural Character Generation
Link: https://arxiv.org/abs/2601.03396
Authors: Maan Qraitem,Kate Saenko,Bryan A. Plummer
Categories: Computation and Language (cs.CL)
Keywords: Procedural content generation, generation remains underexplored, enabled vast virtual, vast virtual worlds, Procedural content
Comments:
Click to view abstract
Abstract:Procedural content generation has enabled vast virtual worlds through levels, maps, and quests, but large-scale character generation remains underexplored. We identify two alignment-induced biases in existing methods: a positive moral bias, where characters uniformly adopt agreeable stances (e.g. always saying lying is bad), and a helpful assistant bias, where characters invariably answer questions directly (e.g. never refusing or deflecting). While such tendencies suit instruction-following systems, they suppress dramatic tension and yield predictable characters, stemming from maximum likelihood training and assistant fine-tuning. To address this, we introduce PersonaWeaver, a framework that disentangles world-building (roles, demographics) from behavioral-building (moral stances, interactional styles), yielding characters with more diverse reactions and moral stances, as well as second-order diversity in stylistic markers like length, tone, and punctuation. Code: this https URL
110. 【2601.03388】Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models
Link: https://arxiv.org/abs/2601.03388
Authors: Zhibo Hu,Chen Wang,Yanfeng Shu,Hye-young Paik,Liming Zhu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: human decision making, influence human decision, Earlier research, metaphors influence human, influence large language
Comments: 17 pages, 7 figures
Click to view abstract
Abstract:Earlier research has shown that metaphors influence human decision making, which raises the question of whether metaphors also influence the reasoning pathways of large language models (LLMs), considering that their training data contain a large number of metaphors. In this work, we investigate the problem in the scope of the emergent misalignment problem, where LLMs can generalize patterns learned from misaligned content in one domain to another domain. We discover a strong causal relationship between metaphors in training data and the misalignment degree of LLMs' reasoning contents. With interventions using metaphors in pre-training, fine-tuning and re-alignment phases, models' cross-domain misalignment degrees change significantly. As we delve deeper into the causes behind this phenomenon, we observe that there is a connection between metaphors and the activation of global and local latent features of large reasoning models. By monitoring these latent features, we design a detector that predicts misaligned content with high accuracy.
111. 【2601.03369】RiskCueBench: Benchmarking Anticipatory Reasoning from Early Risk Cues in Video-Language Models
Link: https://arxiv.org/abs/2601.03369
Authors: Sha Luo,Yogesh Prabhu,Tim Ossowski,Kaiping Chen,Junjie Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: centered social media, ensuring public safety, video centered social, social media, preventing real world
Comments:
Click to view abstract
Abstract:With the rapid growth of video-centered social media, the ability to anticipate risky events from visual data is a promising direction for ensuring public safety and preventing real-world accidents. Prior work has extensively studied supervised video risk assessment across domains such as driving, protests, and natural disasters. However, many existing datasets provide models with access to the full video sequence, including the accident itself, which substantially reduces the difficulty of the task. To better reflect real-world conditions, we introduce a new video understanding benchmark, RiskCueBench, in which videos are carefully annotated to identify a risk signal clip, defined as the earliest moment that indicates a potential safety concern. Experimental results reveal a significant gap in current systems' ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting important challenges for deploying video risk prediction models in practice.
112. 【2601.03368】A path to natural language through tokenisation and transformers
Link: https://arxiv.org/abs/2601.03368
Authors: David S. Berman,Alexander G. Stapleton
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Keywords: Zipf and Heaps', exhibit striking regularities, emergence of Zipf, languages exhibit striking, including notably
Comments: 19 pages, 7 figures, 2 tables
Click to view abstract
Abstract:Natural languages exhibit striking regularities in their statistical structure, including notably the emergence of Zipf's and Heaps' laws. Despite this, it remains broadly unclear how these properties relate to the modern tokenisation schemes used in contemporary transformer models. In this note, we analyse the information content (as measured by the Shannon entropy) of various corpora under the assumption of a Zipfian frequency distribution, and derive a closed-form expression for the slot entropy expectation value. We then empirically investigate how byte-pair encoding (BPE) transforms corpus statistics, showing that recursive applications of BPE drive token frequencies toward a Zipfian power law while inducing a characteristic growth pattern in empirical entropy. Utilizing the ability of transformers to learn context-dependent token probability distributions, we train language models on corpora tokenised at varying BPE depths, revealing that the model predictive entropies increasingly agree with Zipf-derived predictions as the BPE depth increases. Attention-based diagnostics further indicate that deeper tokenisation reduces local token dependencies, bringing the empirical distribution closer to the weakly dependent (near IID) regime. Together, these results clarify how BPE acts not only as a compression mechanism but also as a statistical transform that reconstructs key informational properties of natural language.
113. 【2601.03324】Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64
Link: https://arxiv.org/abs/2601.03324
Authors: Bugra Kilictas,Faruk Alpay
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, deployment of Large, Memory Wall, Virtual Tensor Core
Comments: 14 pages, 2 figures. Code and data available at [this https URL](https://github.com/farukalpay/stories100m)
Click to view abstract
Abstract:The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall", the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high-level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand-tuned NEON SIMD kernels, we achieve a form of "Software-Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero-copy loader eliminates initialization latency. Experimental results on a 110M parameter model demonstrate a stable throughput of 60 tokens/second on M2 hardware. While proprietary hardware accelerators (e.g., Apple AMX) can achieve higher peak throughput, our architecture provides a fully open, portable, and deterministic reference implementation for studying the memory bottleneck on general-purpose ARM silicon, meeting the 200ms psycholinguistic latency threshold without opaque dependencies.
114. 【2601.03288】How Real is Your Jailbreak? Fine-grained Jailbreak Evaluation with Anchored Reference
Link: https://arxiv.org/abs/2601.03288
Authors: Songyang Liu,Chaozhuo Li,Rui Pu,Litian Zhang,Chenxu Wang,Zejian Chen,Yuting Zhang,Yiming Hei
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: Large Language Models, Language Models, methods largely rely, Large Language, current automated evaluation
Comments: 7 pages, 3 figures, preprint
Click to view abstract
Abstract:Jailbreak attacks present a significant challenge to the safety of Large Language Models (LLMs), yet current automated evaluation methods largely rely on coarse classifications that focus mainly on harmfulness, leading to substantial overestimation of attack success. To address this problem, we propose FJAR, a fine-grained jailbreak evaluation framework with anchored references. We first categorized jailbreak responses into five fine-grained categories: Rejective, Irrelevant, Unhelpful, Incorrect, and Successful, based on the degree to which the response addresses the malicious intent of the query. This categorization serves as the basis for FJAR. Then, we introduce a novel harmless tree decomposition approach to construct high-quality anchored references by breaking down the original queries. These references guide the evaluator in determining whether the response genuinely fulfills the original query. Extensive experiments demonstrate that FJAR achieves the highest alignment with human judgment and effectively identifies the root causes of jailbreak failures, providing actionable guidance for improving attack strategies.
115. 【2601.03286】HyperCLOVA X 32B Think
Link: https://arxiv.org/abs/2601.03286
Authors: NAVER Cloud HyperCLOVA X Team
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: vision-language model designed, cultural context, linguistic and cultural, Korean linguistic, agentic ability
Comments: Technical Report
Click to view abstract
Abstract:In this report, we present HyperCLOVA X 32B Think, a vision-language model designed with particular emphasis on reasoning within the Korean linguistic and cultural context, as well as agentic ability. HyperCLOVA X 32B Think is pre-trained with a strong focus on reasoning capabilities and subsequently post-trained to support multimodal understanding, enhanced reasoning, agentic behaviors, and alignment with human preferences. Experimental evaluations against comparably sized models demonstrate that our model achieves strong performance on Korean text-to-text and vision-to-text benchmarks, as well as on agent-oriented evaluation tasks. By open-sourcing HyperCLOVA X 32B Think, we aim to support broader adoption and facilitate further research and innovation across both academic and industrial communities.
116. 【2601.03276】Topic Segmentation Using Generative Language Models
Link: https://arxiv.org/abs/2601.03276
Authors: Pierre Mackenzie,Maya Shah,Patrick Frenett
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: generative Large Language, Large Language Models, Large Language, generative Large, Language Models
Comments:
Click to view abstract
Abstract:Topic segmentation using generative Large Language Models (LLMs) remains relatively unexplored. Previous methods use semantic similarity between sentences, but such models lack the long range dependencies and vast knowledge found in LLMs. In this work, we propose an overlapping and recursive prompting strategy using sentence enumeration. We also support the adoption of the boundary similarity evaluation metric. Results show that LLMs can be more effective segmenters than existing methods, but issues remain to be solved before they can be relied upon for topic segmentation.
117. 【2601.03274】LLM_annotate: A Python package for annotating and analyzing fiction characters
Link: https://arxiv.org/abs/2601.03274
Authors: Hannes Rosenbusch
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, Python package, language models, analyzing the personality, personality of fiction
Comments:
Click to view abstract
Abstract:LLM_annotate is a Python package for analyzing the personality of fiction characters with large language models. It standardizes workflows for annotating character behaviors in full texts (e.g., books and movie scripts), inferring character traits, and validating annotation/inference quality via a human-in-the-loop GUI. The package includes functions for text chunking, LLM-based annotation, character name disambiguation, quality scoring, and computation of character-level statistics and embeddings. Researchers can use any LLM, commercial, open-source, or custom, within LLM_annotate. Through tutorial examples using The Simpsons Movie and the novel Pride and Prejudice, I demonstrate the usage of the package for efficient and reproducible character analyses.
118. 【2601.03273】GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators
Link: https://arxiv.org/abs/2601.03273
Authors: Naseem Machlovi,Maryam Saleki,Ruhul Amin,Mohamed Rahouti,Shawqi Al-Maliki,Junaid Qadir,Mohamed M. Abdallah,Ala Al-Fuqaha
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Keywords: safer moderation systems, daily life, distinguishing between naive, censorship boundaries, large language models
Comments:
Click to view abstract
Abstract:As large language models (LLMs) become deeply embedded in daily life, the urgent need for safer moderation systems, distinguishing naive from harmful requests while upholding appropriate censorship boundaries, has never been greater. While existing LLMs can detect harmful or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial bias, and broader safety concerns. We also present GemmaGuard (GGuard), a QLoRA fine-tuned version of Gemma3-12B trained on GuardEval, to assess content moderation with fine-grained labels. Our evaluation shows that GGuard achieves a macro F1 score of 0.832, substantially outperforming leading moderation models, including OpenAI Moderator (0.64) and Llama Guard (0.61). We show that multi-perspective, human-centered safety benchmarks are critical for reducing biased and inconsistent moderation decisions. GuardEval and GGuard together demonstrate that diverse, representative data materially improve safety, fairness, and robustness on complex, borderline cases.
119. 【2601.03272】Less is more: Not all samples are effective for evaluation
Link: https://arxiv.org/abs/2601.03272
Authors: Wentang Song, Jinqiang Li, Kele Huang, Junhui Lin, Shengxiang Wu, Zhongshi Xie
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, versatility of Large, numerous specialized evaluation, Large Language, spurred the development
Abstract:The versatility of Large Language Models (LLMs) in vertical domains has spurred the development of numerous specialized evaluation benchmarks. However, these benchmarks often suffer from significant semantic redundancy and impose high computational costs during evaluation. Existing compression methods, such as tinyBenchmarks, depend critically on correctness labels from multiple historical models evaluated on the full test set, making them inapplicable in cold-start scenarios, such as the introduction of a new task, domain, or model with no prior evaluation history. To address this limitation, we propose a history-free test set compression framework that requires no prior model performance data. Our method begins by fine-tuning a base LLM on a small amount of domain-specific data to internalize task-relevant semantics. It then generates high-level semantic embeddings for all original test samples using only their raw textual content. In this domain-adapted embedding space, we perform task-aware clustering and introduce a novel dataset X-ray mechanism that analyzes cluster geometry to dynamically calibrate the compression intensity based on the intrinsic redundancy of the benchmark. Experiments on professional-domain datasets, notably a large-scale 3GPP communications benchmark, demonstrate that our approach effectively identifies and removes redundant samples, reducing evaluation cost by over 90% while preserving high fidelity to the full benchmark.
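To make the clustering step concrete, here is a rough sketch of one way a clustered test set could be compressed: keep, for each cluster, the sample nearest the cluster centroid. This is an assumption for illustration, not the authors' method. The embeddings and cluster assignments are hypothetical toy values, and the paper's fine-tuning, task-aware clustering, and X-ray calibration steps are not reproduced.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def select_representatives(embeddings, cluster_ids):
    """Return, for each cluster, the index of the sample closest to its centroid."""
    clusters = {}
    for idx, cid in enumerate(cluster_ids):
        clusters.setdefault(cid, []).append(idx)
    kept = []
    for cid in sorted(clusters):
        members = clusters[cid]
        center = centroid([embeddings[i] for i in members])
        kept.append(min(members, key=lambda i: math.dist(embeddings[i], center)))
    return kept


# Hypothetical 2-D embeddings and cluster assignments (assumed given).
embeddings = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.1], [5.0, 5.0], [5.0, 5.4], [5.0, 5.1]]
cluster_ids = [0, 0, 0, 1, 1, 1]
kept = select_representatives(embeddings, cluster_ids)
print(f"kept {len(kept)} of {len(embeddings)} samples: {kept}")
```

In this sketch the compression ratio is fixed by the number of clusters; the abstract describes calibrating it from cluster geometry instead.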
120. 【2601.03270】Advances and Challenges in Semantic Textual Similarity: A Comprehensive Survey
Link: https://arxiv.org/abs/2601.03270
Authors: Lokendra Kumar, Neelesh S. Upadhye, Kannan Piedy
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Keywords: Semantic Textual Similarity, Textual Similarity, Semantic Textual, contrastive learning, research has expanded
Comments: 16 pages, 2 figures
Abstract:Semantic Textual Similarity (STS) research has expanded rapidly since 2021, driven by advances in transformer architectures, contrastive learning, and domain-specific techniques. This survey reviews progress across six key areas: transformer-based models, contrastive learning, domain-focused solutions, multi-modal methods, graph-based approaches, and knowledge-enhanced techniques. Recent transformer models such as FarSSiBERT and DeBERTa-v3 have achieved remarkable accuracy, while contrastive methods like AspectCSE have established new benchmarks. Domain-adapted models, including CXR-BERT for medical texts and Financial-STS for finance, demonstrate how STS can be effectively customized for specialized fields. Moreover, multi-modal, graph-based, and knowledge-integrated models further enhance semantic understanding and representation. By organizing and analyzing these developments, the survey provides valuable insights into current methods, practical applications, and remaining challenges. It aims to guide researchers and practitioners alike in navigating rapid advancements, highlighting emerging trends and future opportunities in the evolving field of STS.
121. 【2601.03269】The Instruction Gap: LLMs get lost in Following Instruction
Link: https://arxiv.org/abs/2601.03269
Authors: Vishesh Tripathi, Uday Allu, Biddwan Ahmed
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, natural language understanding, Large Language, natural language, language understanding
Abstract:Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, yet their deployment in enterprise environments reveals a critical limitation: inconsistent adherence to custom instructions. This study presents a comprehensive evaluation of 13 leading LLMs across instruction compliance, response accuracy, and performance metrics in real-world RAG (Retrieval-Augmented Generation) scenarios. Through systematic testing with samples and enterprise-grade evaluation protocols, we demonstrate that instruction following varies dramatically across models, with Claude-Sonnet-4 and GPT-5 achieving the highest results. Our findings reveal the "instruction gap" - a fundamental challenge where models excel at general tasks but struggle with precise instruction adherence required for enterprise deployment. This work provides practical insights for organizations deploying LLM-powered solutions and establishes benchmarks for instruction-following capabilities across major model families.
122. 【2601.03268】WRAVAL -- WRiting Assist eVALuation
Link: https://arxiv.org/abs/2601.03268
Authors: Gabriel Benedict, Matthew Butler, Naved Merchant, Eetu Salama-Laine
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Language Models, shifted language model, Large Language, Small Language Models, Language Models
Abstract:The emergence of Large Language Models (LLMs) has shifted language model evaluation toward reasoning and problem-solving tasks as measures of general intelligence. Small Language Models (SLMs) -- defined here as models under 10B parameters -- typically score 3-4 times lower than LLMs on these metrics. However, we demonstrate that these evaluations fail to capture SLMs' effectiveness in common industrial applications, such as tone modification tasks (e.g., funny, serious, professional). We propose an evaluation framework specifically designed to highlight SLMs' capabilities in non-reasoning tasks where predefined evaluation datasets don't exist. Our framework combines novel approaches in data generation, prompt-tuning, and LLM-based evaluation to demonstrate the potential of task-specific finetuning. This work provides practitioners with tools to effectively benchmark both SLMs and LLMs for practical applications, particularly in edge and private computing scenarios. Our implementation is available at: this https URL.
123. 【2601.03267】OpenAI GPT-5 System Card
Link: https://arxiv.org/abs/2601.03267
Authors: Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Alexey Ivanov, Alexi Christakis, Alistair Gillespie, Allison Tam, Ally Bennett, Alvin Wan, Alyssa Huang, Amy McDonald Sandjideh, Amy Yang, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrei Gheorghe, Andres Garcia Garcia, Andrew Braunstein, Andrew Liu, Andrew Schmidt, Andrey Mereskin, Andrey Mishchenko, Andy Applebaum, Andy Rogerson, Ann Rajan, Annie Wei, Anoop Kotha, Anubha Srivastava, Anushree Agrawal, Arun Vijayvergiya, Ashley Tyra, Ashvin Nair, Avi Nayak, Ben Eggers, Bessie Ji, Beth Hoover, Bill Chen, Blair Chen, Boaz Barak, Borys Minaiev, Botao Hao, Bowen Baker, Brad Lightcap, Brandon McKinzie, Brandon Wang, Brendan Quinn, Brian Fioca, Brian Hsu, Brian Yang, Brian Yu, Brian Zhang, Brittany Brenner, Callie Riggins Zetino, Cameron Raymond, Camillo Lugaresi, Carolina Paz, Cary Hudson, Cedric Whitney, Chak Li, Charles Chen, Charlotte Cole, Chelsea Voss, Chen Ding, Chen Shen, Chengdu Huang, Chris Colby, Chris Hallacy, Chris Koch, Chris Lu, Christina Kaplan, Christina Kim, CJ Minott-Henriques, Cliff Frey, Cody Yu, Coley Czarnecki, Colin Reid, Colin Wei, Cory Decareaux, Cristina Scheau
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: card published alongside, alongside the OpenAI, published alongside, system card published, August
Abstract:This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.
124. 【2601.03266】Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support
Link: https://arxiv.org/abs/2601.03266
Authors: Alif Munim, Jun Ma, Omar Ibrahim, Alhusain Abdalla, Shuolin Yin, Leo Chen, Bo Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, cloud-based infrastructure, Large language, rapidly advanced, hindered by privacy
Abstract:Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but often require large model sizes that limit their use in resource-constrained clinical settings. Here, we benchmark two on-device LLMs, gpt-oss-20b and gpt-oss-120b, across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5 and o4-mini) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b on general diagnostic data. Across tasks, gpt-oss models achieve performance comparable to or exceeding DeepSeek-R1 and o4-mini despite being substantially smaller. In addition, fine-tuning remarkably improves the diagnostic accuracy of gpt-oss-20b, enabling it to approach the performance of GPT-5. These findings highlight the potential of on-device LLMs to deliver accurate, adaptable, and privacy-preserving clinical decision support, offering a practical pathway for broader integration of LLMs into routine clinical practice.
125. 【2601.03265】Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models
Link: https://arxiv.org/abs/2601.03265
Authors: Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh, David Zhang, Eric Hsin, Li Chen, Ankit Jain, Matt Fredrikson, Akash Bharadwaj
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Keywords: Large Language Model, red teaming methodology, constrained example-based approach, Large Language, effective policy-based framework
Comments: Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025
Abstract:This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more expansive and effective policy-based framework. By leveraging an attack LLM to generate a high volume of diverse adversarial prompts and then fine-tuning this attack model with a preference dataset, Jailbreak-Zero achieves Pareto optimality across the crucial objectives of policy coverage, attack strategy diversity, and prompt fidelity to real user inputs. The empirical evidence demonstrates the superiority of this method, showcasing significantly higher attack success rates against both open-source and proprietary models like GPT-4o and Claude 3.5 when compared to existing state-of-the-art techniques. Crucially, Jailbreak-Zero accomplishes this while producing human-readable and effective adversarial prompts with minimal need for human intervention, thereby presenting a more scalable and comprehensive solution for identifying and mitigating the safety vulnerabilities of LLMs.
126. 【2601.03263】Internal Reasoning vs. External Control: A Thermodynamic Analysis of Sycophancy in Large Language Models
Link: https://arxiv.org/abs/2601.03263
Authors: Edward Y. Chang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models exhibit, prioritizing agreeableness, agreeableness over correctness
Comments: 15 pages, 4 figures, 11 tables
Abstract:Large Language Models exhibit sycophancy: prioritizing agreeableness over correctness. Current remedies evaluate reasoning outcomes: RLHF rewards correct answers, self-correction critiques outputs. All require ground truth, which is often unavailable at inference time and vulnerable to the same biases. We explore evaluating the reasoning process instead. Regulated Causal Anchoring (RCA) verifies whether outputs follow from their reasoning traces, without requiring ground truth. Sycophancy manifests as trace-output inconsistency: models derive one answer but output another to please users. RCA detects this inconsistency, achieving 0.0% sycophancy while accepting 88% of valid hints. We identify two failures invisible to outcome evaluation: Inverse Scaling (frontier models sycophant more because rationalization requires capability) and the Final Output Gap (correct reasoning precedes sycophantic output). Traditional self-correction reduces these failures to 7-9% but cannot eliminate them because the model critiques itself with the same biases. RCA's process evaluation operates at inference time, requires no ground truth, and uses an independent judge that breaks the self-reinforcing bias loop: three properties that outcome evaluation lacks.
127. 【2601.03262】Roles of MLLMs in Visually Rich Document Retrieval for RAG: A Survey
Link: https://arxiv.org/abs/2601.03262
Authors: Xiantao Zhang
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Visually rich documents, challenge retrieval-augmented generation, brittle OCR, Visually rich, Multimodal Large Language
Comments: 18 pages; accepted at AACL-IJCNLP 2025 (main conference)
Abstract:Visually rich documents (VRDs) challenge retrieval-augmented generation (RAG) with layout-dependent semantics, brittle OCR, and evidence spread across complex figures and structured tables. This survey examines how Multimodal Large Language Models (MLLMs) are being used to make VRD retrieval practical for RAG. We organize the literature into three roles: Modality-Unifying Captioners, Multimodal Embedders, and End-to-End Representers. We compare these roles along retrieval granularity, information fidelity, latency and index size, and compatibility with reranking and grounding. We also outline key trade-offs and offer some practical guidance on when to favor each role. Finally, we identify promising directions for future research, including adaptive retrieval units, model size reduction, and the development of evaluation methods.
128. 【2601.03261】DeepResearch-Slice: Bridging the Retrieval-Utilization Gap via Explicit Text Slicing
Link: https://arxiv.org/abs/2601.03261
Authors: Shuo Lu, Yinuo Xu, Jianjie Cheng, Lingxiao He, Meng Wang, Jian Liang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: maximize retrieval probability, Deep Research agents, agents predominantly optimize, predominantly optimize search, optimize search policies
Comments: Ongoing work
Abstract:Deep Research agents predominantly optimize search policies to maximize retrieval probability. However, we identify a critical bottleneck: the retrieval-utilization gap, where models fail to use gold evidence even after it is retrieved, due to context blindness in noisy environments. To bridge this gap, we propose DeepResearch-Slice, a simple yet effective neuro-symbolic framework. Unlike implicit attention, our approach predicts precise span indices to perform a deterministic hard filter before reasoning. Extensive evaluations across six benchmarks show substantial robustness gains. Applying our method to frozen backbones yields a 73 percent relative improvement, from 19.1 percent to 33.0 percent, effectively mitigating noise without requiring parameter updates to the reasoning model. These results highlight the need for explicit grounding mechanisms in open-ended research.
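The idea of a deterministic hard filter over predicted span indices can be sketched as follows. This is a toy illustration under stated assumptions, not the DeepResearch-Slice implementation: spans are taken to be character offsets `(start, end)`, the example document is invented, and the span is derived with `str.index` as a stand-in for a model prediction. The second helper checks the reported relative improvement arithmetic.

```python
def hard_filter(text, spans):
    """Keep only the character ranges named by (start, end) spans, in document order."""
    kept = []
    for start, end in sorted(spans):
        if 0 <= start < end <= len(text):  # silently skip malformed spans
            kept.append(text[start:end])
    return " ".join(kept)


def relative_improvement(before, after):
    """Fractional gain of `after` over `before`."""
    return (after - before) / before


# Invented noisy retrieval result; only the middle sentence is useful evidence.
doc = "Ad: buy now. The capital of France is Paris. Unrelated footer."
start = doc.index("The capital")            # stand-in for a predicted start index
end = doc.index("Paris.") + len("Paris.")   # stand-in for a predicted end index
filtered = hard_filter(doc, [(start, end)])

gain = relative_improvement(19.1, 33.0)     # figures reported in the abstract
print(filtered)
print(f"relative improvement: {gain:.1%}")
```

Because the filter is plain slicing, the reasoning model never sees the discarded text, which is the contrast the abstract draws with implicit attention.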
129. 【2601.03260】SciNetBench: A Relation-Aware Benchmark for Scientific Literature Retrieval Agents
Link: https://arxiv.org/abs/2601.03260
Authors: Chenyang Shao, Yong Li, Fengli Xu
Categories: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Keywords: rapid development, Deep Research, retrieval, advanced research tools, retrieval agents
Abstract:The rapid development of AI agents has spurred the development of advanced research tools, such as Deep Research. Achieving this requires a nuanced understanding of the relations within scientific literature, which surpasses the scope of keyword-based or embedding-based retrieval. Existing retrieval agents mainly focus on the content-level similarities and are unable to decode critical relational dynamics, such as identifying corroborating or conflicting studies or tracing technological lineages, all of which are essential for a comprehensive literature review. Consequently, this fundamental limitation often results in a fragmented knowledge structure, misleading sentiment interpretation, and inadequate modeling of collective scientific progress. To investigate relation-aware retrieval more deeply, we propose SciNetBench, the first Scientific Network Relation-aware Benchmark for literature retrieval agents. Constructed from a corpus of over 18 million AI papers, our benchmark systematically evaluates three levels of relations: ego-centric retrieval of papers with novel knowledge structures, pair-wise identification of scholarly relationships, and path-wise reconstruction of scientific evolutionary trajectories. Through extensive evaluation of three categories of retrieval agents, we find that their accuracy on relation-aware retrieval tasks often falls below 20%, revealing a core shortcoming of current retrieval paradigms. Notably, further experiments on the literature review tasks demonstrate that providing agents with relational ground truth leads to a substantial 23.4% performance improvement in the review quality, validating the critical importance of relation-aware retrieval. We publicly release our benchmark at this https URL to support future research on advanced retrieval systems.
130. 【2601.03469】Content vs. Form: What Drives the Writing Score Gap Across Socioeconomic Backgrounds? A Generated Panel Approach
Link: https://arxiv.org/abs/2601.03469
Authors: Nadav Kunievsky, Pedro Pertusi
Categories: Econometrics (econ.EM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: socioeconomic backgrounds exhibit, backgrounds exhibit persistent, exhibit persistent gaps, socioeconomic backgrounds, backgrounds exhibit
Abstract:Students from different socioeconomic backgrounds exhibit persistent gaps in test scores, gaps that can translate into unequal educational and labor-market outcomes later in life. In many assessments, performance reflects not only what students know, but also how effectively they can communicate that knowledge. This distinction is especially salient in writing assessments, where scores jointly reward the substance of students' ideas and the way those ideas are expressed. As a result, observed score gaps may conflate differences in underlying content with differences in expressive skill. A central question, therefore, is how much of the socioeconomic-status (SES) gap in scores is driven by differences in what students say versus how they say it. We study this question using a large corpus of persuasive essays written by U.S. middle- and high-school students. We introduce a new measurement strategy that separates content from style by leveraging large language models to generate multiple stylistic variants of each essay. These rewrites preserve the underlying arguments while systematically altering surface expression, creating a "generated panel" that introduces controlled within-essay variation in style. This approach allows us to decompose SES gaps in writing scores into contributions from content and style. We find an SES gap of 0.67 points on a 1-6 scale. Approximately 69% of the gap is attributable to differences in essay content quality, style differences account for 26% of the gap, and differences in evaluation standards across SES groups account for the remaining 5%. These patterns seem stable across demographic subgroups and writing tasks. More broadly, our approach shows how large language models can be used to generate controlled variation in observational data, enabling researchers to isolate and quantify the contributions of otherwise entangled factors.
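The reported percentages can be turned back into score points with simple arithmetic. The snippet below only restates the figures in the abstract (a 0.67-point gap split 69% / 26% / 5%); the variable names are illustrative, and the point values it derives are computed here rather than quoted from the paper.

```python
# Figures as reported in the abstract.
gap = 0.67  # SES gap in points on the 1-6 scale
shares = {"content": 0.69, "style": 0.26, "evaluation standards": 0.05}

# Convert each share of the gap into score points.
points = {name: gap * share for name, share in shares.items()}

for name, value in points.items():
    print(f"{name}: {value:.2f} points ({shares[name]:.0%} of the gap)")
```

Under this reading, content accounts for roughly 0.46 points, style for about 0.17, and evaluation standards for about 0.03.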
131. 【2601.03277】MixRx: Predicting Drug Combination Interactions with LLMs
Link: https://arxiv.org/abs/2601.03277
Authors: Risha Surana, Cameron Saidock, Hugo Chacon
Categories: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, multi-drug patient history, classify drug combination, drug combination interactions
Abstract:MixRx uses Large Language Models (LLMs) to classify drug combination interactions as Additive, Synergistic, or Antagonistic, given a multi-drug patient history. We evaluate the performance of four models: GPT-2, Mistral Instruct 2.0, and their fine-tuned counterparts. Our results show potential for such an application, with the fine-tuned Mistral Instruct 2.0 model achieving an average accuracy of 81.5% on standard and perturbed datasets. This paper aims to further develop an emerging area of research that evaluates whether LLMs can be used for biological prediction tasks.
Information Retrieval
1. 【2601.04019】Modeling Behavioral Patterns in News Recommendations Using Fuzzy Neural Networks
Link: https://arxiv.org/abs/2601.04019
Authors: Kevin Innerebner, Stephan Bartl, Markus Reiter-Haas, Elisabeth Lex
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Keywords: black-box models, offering little transparency, editorial decision-making, increasingly driven, driven by black-box
Comments: Accepted for the IR for Good track at ECIR'26
Abstract:News recommender systems are increasingly driven by black-box models, offering little transparency for editorial decision-making. In this work, we introduce a transparent recommender system that uses fuzzy neural networks to learn human-readable rules from behavioral data for predicting article clicks. By extracting the rules at configurable thresholds, we can control rule complexity and thus, the level of interpretability. We evaluate our approach on two publicly available news datasets (i.e., MIND and EB-NeRD) and show that we can accurately predict click behavior compared to several established baselines, while learning human-readable rules. Furthermore, we show that the learned rules reveal news consumption patterns, enabling editors to align content curation goals with target audience behavior.
2. 【2601.03903】Unleashing the Potential of Neighbors: Diffusion-based Latent Neighbor Generation for Session-based Recommendation
Link: https://arxiv.org/abs/2601.03903
Authors: Yuhan Yang, Jie Zou, Guojia An, Jiwei Wei, Yang Yang, Heng Tao Shen
Categories: Information Retrieval (cs.IR)
Keywords: current session interactions, latent neighbors, Session-based recommendation aims, current session, Session-based recommendation
Comments: This paper has been accepted by KDD 2026
Abstract:Session-based recommendation aims to predict the next item that anonymous users may be interested in, based on their current session interactions. Recent studies have demonstrated that retrieving neighbor sessions to augment the current session can effectively alleviate the data sparsity issue and improve recommendation performance. However, existing methods typically rely on explicitly observed session data, neglecting latent neighbors - not directly observed but potentially relevant within the interest space - thereby failing to fully exploit the potential of neighbor sessions in recommendation. To address the above limitation, we propose a novel model of diffusion-based latent neighbor generation for session-based recommendation, named DiffSBR. Specifically, DiffSBR leverages two diffusion modules, including retrieval-augmented diffusion and self-augmented diffusion, to generate high-quality latent neighbors. In the retrieval-augmented diffusion module, we leverage retrieved neighbors as guiding signals to constrain and reconstruct the distribution of latent neighbors. Meanwhile, we adopt a training strategy that enables the retriever to learn from the feedback provided by the generator. In the self-augmented diffusion module, we explicitly guide the generation of latent neighbors by injecting the current session's multi-modal signals through contrastive learning. After obtaining the generated latent neighbors, we utilize them to enhance session representations for improving session-based recommendation. Extensive experiments on four public datasets show that DiffSBR generates effective latent neighbors and improves recommendation performance against state-of-the-art baselines.
3. 【2601.03793】Prompt Tuning without Labeled Samples for Zero-Shot Node Classification in Text-Attributed Graphs
Link: https://arxiv.org/abs/2601.03793
Authors: Sethupathy Parameswaran, Suresh Sundaram, Yuan Fang
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Keywords: grouping articles published, articles published online, real-world applications, social networks, grouping articles
Comments: Accepted by WSDM 2026
Abstract:Node classification is a fundamental problem in information retrieval with many real-world applications, such as community detection in social networks, grouping articles published online and product categorization in e-commerce. Zero-shot node classification in text-attributed graphs (TAGs) presents a significant challenge, particularly due to the absence of labeled data. In this paper, we propose a novel Zero-shot Prompt Tuning (ZPT) framework to address this problem by leveraging a Universal Bimodal Conditional Generator (UBCG). Our approach begins with pre-training a graph-language model to capture both the graph structure and the associated textual descriptions of each node. Following this, a conditional generative model is trained to learn the joint distribution of nodes in both graph and text modalities, enabling the generation of synthetic samples for each class based solely on the class name. These synthetic node and text embeddings are subsequently used to perform continuous prompt tuning, facilitating effective node classification in a zero-shot setting. Furthermore, we conduct extensive experiments on multiple benchmark datasets, demonstrating that our framework performs better than existing state-of-the-art baselines. We also provide ablation studies to validate the contribution of the bimodal generator. The code is provided at: this https URL.
4. 【2601.03748】Bridging OLAP and RAG: A Multidimensional Approach to the Design of Corpus Partitioning
Link: https://arxiv.org/abs/2601.03748
Authors: Dario Maio, Stefano Rizzi
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: large-scale document collections, Retrieval-Augmented Generation, document collections, comprising millions, millions of text
Abstract:Retrieval-Augmented Generation (RAG) systems are increasingly deployed on large-scale document collections, often comprising millions of documents and tens of millions of text chunks. In industrial-scale retrieval platforms, scalability is typically addressed through horizontal sharding and a combination of Approximate Nearest-Neighbor search, hybrid indexing, and optimized metadata filtering. Although effective from an efficiency perspective, these mechanisms rely on bottom-up, similarity-driven organization and lack a conceptual rationale for corpus partitioning. In this paper, we claim that the design of large-scale RAG systems may benefit from the combination of two orthogonal strategies: semantic clustering, which optimizes locality in embedding space, and multidimensional partitioning, which governs where retrieval should occur based on conceptual dimensions such as time and organizational context. Although such dimensions are already implicitly present in current systems, they are used in an ad hoc and poorly structured manner. We propose the Dimensional Fact Model (DFM) as a conceptual framework to guide the design of multidimensional partitions for RAG corpora. The DFM provides a principled way to reason about facts, dimensions, hierarchies, and granularity in retrieval-oriented settings. This framework naturally supports hierarchical routing and controlled fallback strategies, ensuring that retrieval remains robust even in the presence of incomplete metadata, while transforming the search process from a 'black-box' similarity matching into a governable and deterministic workflow. This work is intended as a position paper; its goal is to bridge the gap between OLAP-style multidimensional modeling and modern RAG architectures, and to stimulate further research on principled, explainable, and governable retrieval strategies at scale.
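Hierarchical routing with controlled fallback, as described above, might look like the following in its simplest form. This is a hypothetical sketch, not the Dimensional Fact Model itself: the two dimensions (`year`, `department`), the partition keys, the chunk ids, and the fallback order are all assumptions made for illustration. Similarity search inside the selected partitions is left out.

```python
def route(partitions, year=None, department=None):
    """Pick the chunk ids to search, falling back to coarser levels when metadata is missing."""
    # Finest level: both dimensions known and the partition exists.
    if year is not None and department is not None and (year, department) in partitions:
        return sorted(partitions[(year, department)])
    # Coarser level: roll up over departments for the given year.
    if year is not None:
        hits = [c for (y, _), chunks in partitions.items() if y == year for c in chunks]
        if hits:
            return sorted(hits)
    # Last resort: search the whole corpus.
    return sorted(c for chunks in partitions.values() for c in chunks)


# Hypothetical corpus partitioned by (year, department).
partitions = {
    (2023, "legal"): ["c1", "c2"],
    (2023, "sales"): ["c3"],
    (2024, "legal"): ["c4"],
}

print(route(partitions, 2023, "legal"))  # finest partition
print(route(partitions, year=2023))      # roll-up over departments
print(route(partitions))                 # no metadata: full corpus
```

The routing decision here depends only on explicit metadata, so it is deterministic and inspectable, which matches the governability argument the abstract makes.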
5. 【2601.03730】Perception-Aware Bias Detection for Query Suggestions
Link: https://arxiv.org/abs/2601.03730
Authors: Fabian Haak, Philipp Schaer
Categories: Information Retrieval (cs.IR)
Keywords: bias detection, query suggestions, Bias, bias detection research, detection
Comments: 13 pages (pp. 130-142); 2 figures; 2 tables; Workshop paper (BIAS 2021) published in CCIS vol. 1418 (Springer)
Abstract:Bias in web search has been in the spotlight of bias detection research for quite a while. At the same time, little attention has been paid to query suggestions in this regard. Awareness of the problem of biased query suggestions has been raised. Likewise, there is a rising need for automatic bias detection approaches. This paper builds on the pipeline for bias detection in query suggestions of person-related search developed by Bonart et al. The sparseness and lack of contextual metadata of query suggestions make them a difficult subject for bias detection. Furthermore, query suggestions are perceived very briefly and subliminally. To overcome these issues, perception-aware metrics are introduced. Consequently, the enhanced pipeline is able to better detect systematic topical bias in search engine query suggestions for person-related searches. The results of an analysis performed with the developed pipeline confirm this assumption. Due to the perception-aware bias detection metrics, findings produced by the pipeline can be assumed to reflect bias that users would discern.
6. 【2601.03628】Global research trends and collaborations in Fibrodysplasia Ossificans Progressiva: A bibliometric analysis (1989-2023)
Link: https://arxiv.org/abs/2601.03628
Authors: Muneer Ahmad, Undie Felicia Nkatv, Sajid Saleem
Subjects: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Keywords: Fibrodysplasia Ossificans Progressiva, debilitating genetic disorder, genetic disorder characterized, Fibrodysplasia Ossificans, Ossificans Progressiva
Comments: 23 pages, 4 figures, Research article
Click to view abstract
Abstract:Fibrodysplasia Ossificans Progressiva (FOP) is a rare and debilitating genetic disorder characterized by the progressive formation of bone in muscles and connective tissues. This scientometric analysis examines the global research trends on FOP between 1989 and 2023 using bibliographic data from Web of Science. The study highlights key patterns in publication productivity, influential journals, institutions, and the geographical distribution of research. The findings reveal that the United States leads both in terms of total publications and citation impact, with significant contributions from the UK, Italy, Japan, and other European countries. Additionally, the analysis identifies the major document types, including articles and reviews, and evaluates the collaborative efforts across institutions. The study offers valuable insights into the global research landscape of FOP, providing a foundation for future studies and international collaborations.
7. 【2601.03608】Shielded RecRL: Explanation Generation for Recommender Systems without Ranking Degradation
Link: https://arxiv.org/abs/2601.03608
Authors: Ansh Tiwari, Ayush Chauhan
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: reinforcement learning approach, original ranking performance, system original ranking, introduce Shielded RecRL, system original
Comments:
Click to view abstract
Abstract:We introduce Shielded RecRL, a reinforcement learning approach to generate personalized explanations for recommender systems without sacrificing the system's original ranking performance. Unlike prior RLHF-based recommender methods that directly optimize item rankings, our two-tower architecture keeps the recommender's ranking model intact while a language model learns to produce helpful explanations. We design a composite reward signal combining explanation length, content relevance, and coherence, and apply proximal policy optimization (PPO) with a KL-divergence constraint to fine-tune a large language model with only 0.4% of its parameters trainable via LoRA adapters. In experiments on an Amazon Books dataset (approximately 50K interactions in the fantasy and romance genres), Shielded RecRL improved the relative click-through rate (CTR) by 22.5% (1.225x over baseline) while keeping the recommender's item-ranking behavior virtually unchanged. An extensive ablation study confirms that our gradient shielding strategy and reward design effectively balance explanation quality and policy drift. Our results demonstrate that Shielded RecRL enhances user-facing aspects of recommendations through rich, personalized explanations without degrading core recommendation accuracy.
8. 【2601.03600】ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification
Link: https://arxiv.org/abs/2601.03600
Authors: Xiao Lin, Philip Li, Zhichen Zeng, Tingwei Li, Tianxin Wei, Xuying Ning, Gaotang Li, Yuzhong Chen, Hanghang Tong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: large language models, remain highly susceptible, rich safety alignment, compromise safety guardrails, large language
Comments:
Click to view abstract
Abstract:Despite rich safety alignment strategies, large language models (LLMs) remain highly susceptible to jailbreak attacks, which compromise safety guardrails and pose serious security risks. Existing detection methods mainly detect jailbreak status relying on jailbreak templates present in the training data. However, few studies address the more realistic and challenging zero-shot jailbreak detection setting, where no jailbreak templates are available during training. This setting better reflects real-world scenarios where new attacks continually emerge and evolve. To address this challenge, we propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. We uncover safety-relevant layers, identify specific modules that inherently encode zero-shot discriminative signals, and localize informative safety tokens. Building upon these insights, we introduce ALERT (Amplification-based Jailbreak Detector), an efficient and effective zero-shot jailbreak detector that introduces two independent yet complementary classifiers on amplified representations. Extensive experiments on three safety benchmarks demonstrate that ALERT achieves consistently strong zero-shot detection performance. Specifically, (i) across all datasets and attack strategies, ALERT reliably ranks among the top two methods, and (ii) it outperforms the second-best baseline by at least 10% in average Accuracy and F1-score, and sometimes by up to 40%.
9. 【2601.03496】STELLA: Self-Reflective Terminology-Aware Framework for Building an Aerospace Information Retrieval Benchmark
Link: https://arxiv.org/abs/2601.03496
Authors: Bongmin Kim
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Aerospace Information Retrieval, public information retrieval, aerospace industry heavily, industry heavily rely, Information Retrieval Benchmark
Comments: 25 pages, 2 figures
Click to view abstract
Abstract:Tasks in the aerospace industry heavily rely on searching and reusing large volumes of technical documents, yet there is no public information retrieval (IR) benchmark that reflects the terminology- and query-intent characteristics of this domain. To address this gap, this paper proposes the STELLA (Self-Reflective TErminoLogy-Aware Framework for BuiLding an Aerospace Information Retrieval Benchmark) framework. Using this framework, we introduce the STELLA benchmark, an aerospace-specific IR evaluation set constructed from NASA Technical Reports Server (NTRS) documents via a systematic pipeline that comprises document layout detection, passage chunking, terminology dictionary construction, synthetic query generation, and cross-lingual extension. The framework generates two types of queries: the Terminology Concordant Query (TCQ), which includes the terminology verbatim to evaluate lexical matching, and the Terminology Agnostic Query (TAQ), which utilizes the terminology's description to assess semantic matching. This enables a disentangled evaluation of the lexical and semantic matching capabilities of embedding models. In addition, we combine Chain-of-Density (CoD) and the Self-Reflection method with query generation to improve quality and implement a hybrid cross-lingual extension that reflects real user querying practices. Evaluation of seven embedding models on the STELLA benchmark shows that large decoder-based embedding models exhibit the strongest semantic understanding, while lexical matching methods such as BM25 remain highly competitive in domains where exact lexical matching of technical terms is crucial. The STELLA benchmark provides a reproducible foundation for reliable performance evaluation and improvement of embedding models in aerospace-domain IR tasks. The STELLA benchmark can be found at this https URL.
10. 【2601.03479】Efficient Sequential Recommendation for Long Term User Interest Via Personalization
Link: https://arxiv.org/abs/2601.03479
Authors: Qiang Zhang, Hanchao Yu, Ivan Ji, Chen Yuan, Yi Zhang, Chihuang Liu, Xiaolong Wang, Christopher E. Lambert, Ren Chen, Chen Kovacs, Xinzhu Bei, Renqin Cai, Rui Li, Lizhu Zhang, Xiangjun Fan, Qunshu Zhang, Benyu Zhang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: large language model, generative recommender, years have witnessed, witnessed success, large language
Comments: ICDM 2025
Click to view abstract
Abstract:Recent years have witnessed the success of sequential modeling, generative recommenders, and large language models for recommendation. Though the scaling law has been validated for sequential models, it showed inefficiency in computational capacity when considering real-world applications like recommendation, due to the non-linear (quadratic) increasing nature of the transformer model. To improve the efficiency of the sequential model, we introduced a novel approach to sequential recommendation that leverages personalization techniques to enhance efficiency and performance. Our method compresses long user interaction histories into learnable tokens, which are then combined with recent interactions to generate recommendations. This approach significantly reduces computational costs while maintaining high recommendation accuracy. Our method could be applied to existing transformer based recommendation models, e.g., HSTU and HLLM. Extensive experiments on multiple sequential models demonstrate its versatility and effectiveness. Source code is available at this https URL.
11. 【2601.03474】SegNSP: Revisiting Next Sentence Prediction for Linear Text Segmentation
Link: https://arxiv.org/abs/2601.03474
Authors: José Isidro, Filipe Cunha, Purificação Silvano, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, Ricardo Campos
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: semantically meaningful units, dividing continuous text, natural language processing, Linear text segmentation, language processing
Comments:
Click to view abstract
Abstract:Linear text segmentation is a long-standing problem in natural language processing (NLP), focused on dividing continuous text into coherent and semantically meaningful units. Despite its importance, the task remains challenging due to the complexity of defining topic boundaries, the variability in discourse structure, and the need to balance local coherence with global context. These difficulties hinder downstream applications such as summarization, information retrieval, and question answering. In this work, we introduce SegNSP, framing linear text segmentation as a next sentence prediction (NSP) task. Although NSP has largely been abandoned in modern pre-training, its explicit modeling of sentence-to-sentence continuity makes it a natural fit for detecting topic boundaries. We propose a label-agnostic NSP approach, which predicts whether the next sentence continues the current topic without requiring explicit topic labels, and enhance it with a segmentation-aware loss combined with harder negative sampling to better capture discourse continuity. Unlike recent proposals that leverage NSP alongside auxiliary topic classification, our approach avoids task-specific supervision. We evaluate our model against established baselines on two datasets, CitiLink-Minutes, for which we establish the first segmentation benchmark, and WikiSection. On CitiLink-Minutes, SegNSP achieves a B-$F_1$ of 0.79, closely aligning with human-annotated topic transitions, while on WikiSection it attains a B-$F_1$ of 0.65, outperforming the strongest reproducible baseline, TopSeg, by 0.17 absolute points. These results demonstrate competitive and robust performance, highlighting the effectiveness of modeling sentence-to-sentence continuity for improving segmentation quality and supporting downstream NLP applications.
12. 【2601.03262】Roles of MLLMs in Visually Rich Document Retrieval for RAG: A Survey
Link: https://arxiv.org/abs/2601.03262
Authors: Xiantao Zhang
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Visually rich documents, challenge retrieval-augmented generation, brittle OCR, Visually rich, Multimodal Large Language
Comments: 18 pages; accepted at AACL-IJCNLP 2025 (main conference)
Click to view abstract
Abstract:Visually rich documents (VRDs) challenge retrieval-augmented generation (RAG) with layout-dependent semantics, brittle OCR, and evidence spread across complex figures and structured tables. This survey examines how Multimodal Large Language Models (MLLMs) are being used to make VRD retrieval practical for RAG. We organize the literature into three roles: Modality-Unifying Captioners, Multimodal Embedders, and End-to-End Representers. We compare these roles along retrieval granularity, information fidelity, latency and index size, and compatibility with reranking and grounding. We also outline key trade-offs and offer some practical guidance on when to favor each role. Finally, we identify promising directions for future research, including adaptive retrieval units, model size reduction, and the development of evaluation methods.
13. 【2601.03259】LLMDiRec: LLM-Enhanced Intent Diffusion for Sequential Recommendation
Link: https://arxiv.org/abs/2601.03259
Authors: Bo-Chian Chen, Manel Slokom
Subjects: Information Retrieval (cs.IR)
Keywords: advanced diffusion-based approaches, Existing sequential recommendation, Existing sequential, underlying user behavior, Large Language Models
Comments: Under review
Click to view abstract
Abstract:Existing sequential recommendation models, even advanced diffusion-based approaches, often struggle to capture the rich semantic intent underlying user behavior, especially for new users or long-tail items. This limitation stems from their reliance on ID-based embeddings, which lack semantic grounding. We introduce LLMDiRec, a new approach that addresses this gap by integrating Large Language Models (LLMs) into an intent-aware diffusion model. Our approach combines collaborative signals from ID embeddings with rich semantic representations from LLMs, using a dynamic fusion mechanism and a multi-task objective to align both views. We run extensive experiments on five public datasets. We demonstrate that LLMDiRec outperforms state-of-the-art algorithms, with particularly strong improvements in capturing complex user intents and enhancing recommendation performance for long-tail items.
14. 【2601.03258】Enhancing Retrieval-Augmented Generation with Two-Stage Retrieval: FlashRank Reranking and Query Expansion
Link: https://arxiv.org/abs/2601.03258
Authors: Sherine George
Subjects: Information Retrieval (cs.IR)
Keywords: ground generated responses, Retrieval-Augmented Generation, large language model, limited LLM context, couples a retriever
Comments: 3 pages, 1 figure, 3 tables
Click to view abstract
Abstract:Retrieval-Augmented Generation (RAG) couples a retriever with a large language model (LLM) to ground generated responses in external evidence. While this framework enhances factuality and domain adaptability, it faces a key bottleneck: balancing retrieval recall with limited LLM context. Retrieving too few passages risks missing critical context, while retrieving too many overwhelms the prompt window, diluting relevance and increasing cost. We propose a two-stage retrieval pipeline that integrates LLM-driven query expansion to improve candidate recall and FlashRank, a fast marginal-utility reranker that dynamically selects an optimal subset of evidence under a token budget. FlashRank models document utility as a weighted combination of relevance, novelty, brevity, and cross-encoder evidence. Together, these modules form a generalizable solution that increases answer accuracy, faithfulness, and computational efficiency.
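To make the idea of selecting evidence by marginal utility under a token budget concrete, here is a minimal greedy sketch. It is not the paper's FlashRank implementation: the weights, the Jaccard-based novelty term, the whitespace token count, and the `select_passages` helper are all assumptions chosen for illustration, and the cross-encoder term from the abstract is omitted.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of words."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_passages(candidates, token_budget, w_rel=0.6, w_nov=0.3, w_brev=0.1):
    """Greedily pick passages by marginal utility until the budget is used up.

    candidates: list of (text, relevance) pairs, relevance in [0, 1].
    Tokens are approximated by whitespace splitting (an assumption).
    """
    selected = []
    used = 0
    remaining = list(candidates)
    while remaining:
        best, best_utility = None, None
        for text, relevance in remaining:
            tokens = text.split()
            if used + len(tokens) > token_budget:
                continue  # this passage no longer fits
            # Novelty drops as the passage overlaps with already chosen evidence.
            overlap = max(
                (jaccard(set(tokens), set(s.split())) for s in selected),
                default=0.0,
            )
            novelty = 1.0 - overlap
            brevity = 1.0 / (1.0 + len(tokens))
            utility = w_rel * relevance + w_nov * novelty + w_brev * brevity
            if best_utility is None or utility > best_utility:
                best, best_utility = (text, relevance), utility
        if best is None:
            break  # nothing fits in the remaining budget
        selected.append(best[0])
        used += len(best[0].split())
        remaining.remove(best)
    return selected


docs = [
    ("the cat sat on the mat", 0.9),
    ("the cat sat on a mat", 0.85),
    ("dogs bark loudly at night", 0.7),
]
print(select_passages(docs, token_budget=12))
```

In this toy run the near-duplicate second passage loses to the less relevant but novel third one, which is the behavior a marginal-utility criterion is meant to produce.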
Computer Vision
1. 【2601.04194】Choreographing a World of Dynamic Objects
Link: https://arxiv.org/abs/2601.04194
Authors: Yanzhe Lyu, Chen Geng, Karthik Dharmarajan, Yunzhi Zhang, Hadi Alzayer, Shangzhe Wu, Jiajun Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Keywords: world are constantly, constantly evolving, Dynamic objects, CHOReographing Dynamic objects, dynamics
Comments:
Click to view abstract
Abstract:Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we present a universal generative pipeline, CHORD, for CHOReographing Dynamic objects and scenes and synthesizing this type of phenomenon. Traditional rule-based graphics pipelines to create these dynamics are based on category-specific heuristics, yet are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories of interest. Our approach instead inherits the universality from the video generative models by proposing a distillation-based pipeline to extract the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a diverse range of multi-body 4D dynamics, show its advantage compared to existing methods, and demonstrate its applicability in generating robotics manipulation policies. Project page: this https URL
2. 【2601.04185】ImLoc: Revisiting Visual Localization with Image-based Representation
Link: https://arxiv.org/abs/2601.04185
Authors: Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: effective geometric reasoning, Existing visual localization, difficult to update, visual localization methods, revisit visual localization
Comments: Code will be available at [this https URL](https://github.com/cvg/Hierarchical-Localization)
Click to view abstract
Abstract:Existing visual localization methods are typically either 2D image-based, which are easy to build and maintain but limited in effective geometric reasoning, or 3D structure-based, which achieve high accuracy but require a centralized reconstruction and are difficult to update. In this work, we revisit visual localization with a 2D image-based representation and propose to augment each image with estimated depth maps to capture the geometric structure. Supported by the effective use of dense matchers, this representation is not only easy to build and maintain, but also achieves the highest accuracy in challenging conditions. With compact compression and a GPU-accelerated LO-RANSAC implementation, the whole pipeline is efficient in both storage and computation and allows for a flexible trade-off between accuracy and memory efficiency. Our method achieves a new state-of-the-art accuracy on various standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes. Code will be available at this https URL.
3. 【2601.04159】ToTMNet: FFT-Accelerated Toeplitz Temporal Mixing Network for Lightweight Remote Photoplethysmography
Link: https://arxiv.org/abs/2601.04159
Authors: Vladimir Frants, Sos Agaian, Karen Panetta
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: blood volume pulse, Remote photoplethysmography, facial videos captured, estimates a blood, volume pulse
Comments:
Click to view abstract
Abstract:Remote photoplethysmography (rPPG) estimates a blood volume pulse (BVP) waveform from facial videos captured by commodity cameras. Although recent deep models improve robustness compared to classical signal-processing approaches, many methods increase computational cost and parameter count, and attention-based temporal modeling introduces quadratic scaling with respect to the temporal length. This paper proposes ToTMNet, a lightweight rPPG architecture that replaces temporal attention with an FFT-accelerated Toeplitz temporal mixing layer. The Toeplitz operator provides full-sequence temporal receptive field using a linear number of parameters in the clip length and can be applied in near-linear time using circulant embedding and FFT-based convolution. ToTMNet integrates the global Toeplitz temporal operator into a compact gated temporal mixer that combines a local depthwise temporal convolution branch with gated global Toeplitz mixing, enabling efficient long-range temporal filtering while only having 63k parameters. Experiments on two datasets, UBFC-rPPG (real videos) and SCAMPS (synthetic videos), show that ToTMNet achieves strong heart-rate estimation accuracy with a compact design. On UBFC-rPPG intra-dataset evaluation, ToTMNet reaches 1.055 bpm MAE with Pearson correlation 0.996. In a synthetic-to-real setting (SCAMPS to UBFC-rPPG), ToTMNet reaches 1.582 bpm MAE with Pearson correlation 0.994. Ablation results confirm that the gating mechanism is important for effectively using global Toeplitz mixing, especially under domain shift. The main limitation of this preprint study is the use of only two datasets; nevertheless, the results indicate that Toeplitz-structured temporal mixing is a practical and efficient alternative to attention for rPPG.
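The claim that a Toeplitz operator can be applied in near-linear time via circulant embedding and FFT-based convolution can be checked with a small standalone sketch. This is a generic textbook construction, not the ToTMNet layer itself; the function names and the hand-written radix-2 FFT are illustrative choices made to keep the example dependency-free. A Toeplitz matrix of size n is defined by its first column and first row, 2n - 1 values in total, which is the "linear number of parameters" the abstract refers to.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (unnormalized)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def toeplitz_matvec_fft(col, row, x):
    """Compute T @ x where T[i][j] = col[i-j] if i >= j else row[j-i]."""
    n = len(x)
    m = 1
    while m < 2 * n - 1:
        m *= 2
    # First column of the m x m circulant matrix that contains T in its top-left block.
    v = list(col) + [0.0] * (m - (2 * n - 1)) + list(reversed(row[1:]))
    xp = list(x) + [0.0] * (m - n)
    fv, fx = fft(v), fft(xp)
    y = fft([p * q for p, q in zip(fv, fx)], invert=True)
    return [val.real / m for val in y[:n]]


def toeplitz_matvec_naive(col, row, x):
    """Quadratic-time reference used only to check the FFT version."""
    n = len(x)
    return [
        sum((col[i - j] if i >= j else row[j - i]) * x[j] for j in range(n))
        for i in range(n)
    ]


col = [1.0, 2.0, 3.0]  # first column of T
row = [1.0, 4.0, 5.0]  # first row of T (row[0] == col[0])
x = [1.0, 0.0, -1.0]
print(toeplitz_matvec_naive(col, row, x))
print(toeplitz_matvec_fft(col, row, x))
```

Both calls should agree up to floating-point error; the FFT path costs O(m log m) for the padded length m instead of O(n^2).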
4. 【2601.04153】Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning
Link: https://arxiv.org/abs/2601.04153
Authors: Yifan Wang, Yanyu Li, Sergey Tulyakov, Yun Fu, Anil Kag
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Direct Preference Optimization, enhancing visual fidelity, Direct Preference, recently improved, generation by enhancing
Comments:
Click to view abstract
Abstract:Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy-to-game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and collapse -- without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.
5. 【2601.04151】Klear: Unified Multi-Task Audio-Video Joint Generation
Link: https://arxiv.org/abs/2601.04151
Authors: Jun Wang, Chunyu Qiang, Yuxin Guo, Yiran Wang, Xijuan Zeng, Chen Zhang, Pengfei Wan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
Keywords: progressed rapidly, challenges still remain, substantial challenges, suffer audio-visual asynchrony, audio-visual correspondence modeling
Comments:
Click to view abstract
Abstract:Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Klear and delve into three axes--model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we adopt a progressive multitask regime--random modality masking to joint optimization across tasks, and a multistage curriculum, yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal collapse. For datasets, we present the first large-scale audio-video dataset with dense captions, and introduce a novel automated data-construction pipeline which annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Klear scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings while generalizing robustly to out-of-distribution scenarios. Across tasks, it outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.
6. 【2601.04137】Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test
Link: https://arxiv.org/abs/2601.04137
Authors: Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, Weishi Mi, Qingpo Wuwu, Peidong Jia, Yulin Luo, Kevin Zhang, Zhiyuan Qin, Yong Dai, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Embodied Turing Test, models gain momentum, downstream embodied tasks, Turing Test, Human Turing Test
Comments:
Click to view abstract
Abstract:As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, video foundation models still have two critical questions unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow,wo,val). Building upon 609 robot manipulation data, Wow-wo-val examines five core abilities, including perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability, which achieves a high Pearson Correlation between the overall score and human preference (0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to $\approx$ 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between the generated videos and the real world, highlighting the urgency and necessity of benchmarking World Model in Embodied AI.
7. 【2601.04127】Pixel-Wise Multimodal Contrastive Learning for Remote Sensing Images
Link: https://arxiv.org/abs/2601.04127
Authors: Leandro Stival, Ricardo da Silva Torres, Helio Pedrini
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: including satellite image, Satellites continuously generate, including satellite, satellite image time, Satellites continuously
Comments: 21 pages, 9 Figures
Click to view abstract
Abstract:Satellites continuously generate massive volumes of data, particularly for Earth observation, including satellite image time series (SITS). However, most deep learning models are designed to process either entire images or complete time series sequences to extract meaningful features for downstream tasks. In this study, we propose a novel multimodal approach that leverages pixel-wise two-dimensional (2D) representations to encode visual property variations from SITS more effectively. Specifically, we generate recurrence plots from pixel-based vegetation index time series (NDVI, EVI, and SAVI) as an alternative to using raw pixel values, creating more informative representations. Additionally, we introduce PIxel-wise Multimodal Contrastive (PIMC), a new multimodal self-supervision approach that produces effective encoders based on two-dimensional pixel time series representations and remote sensing imagery (RSI). To validate our approach, we assess its performance on three downstream tasks: pixel-level forecasting and classification using the PASTIS dataset, and land cover classification on the EuroSAT dataset. Moreover, we compare our results to state-of-the-art (SOTA) methods on all downstream tasks. Our experimental results show that the use of 2D representations significantly enhances feature extraction from SITS, while contrastive learning improves the quality of representations for both pixel time series and RSI. These findings suggest that our multimodal method outperforms existing models in various Earth observation tasks, establishing it as a robust self-supervision framework for processing both SITS and RSI. Code available on
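As a rough illustration of turning a pixel's vegetation-index time series into a 2D recurrence plot, the sketch below computes NDVI with its standard definition, (NIR - Red) / (NIR + Red), and then builds a thresholded recurrence matrix. The threshold `eps`, the binary form of the plot, and the sample reflectance values are assumptions; the abstract does not say which recurrence-plot variant the authors use.

```python
def ndvi(nir, red):
    """Per-timestep NDVI for one pixel: (NIR - Red) / (NIR + Red)."""
    return [
        (n - r) / (n + r) if (n + r) != 0 else 0.0
        for n, r in zip(nir, red)
    ]


def recurrence_plot(series, eps):
    """Binary recurrence matrix: R[i][j] = 1 when |x_i - x_j| <= eps."""
    return [[1 if abs(a - b) <= eps else 0 for b in series] for a in series]


# Hypothetical reflectance values for a single pixel at three dates.
nir = [0.6, 0.8, 0.6]
red = [0.2, 0.2, 0.2]

series = ndvi(nir, red)
for line in recurrence_plot(series, eps=0.05):
    print(line)
```

The resulting matrix is symmetric with ones on the diagonal, and repeated states of the pixel (here the first and third dates) show up as off-diagonal ones, which is what lets an image encoder consume a time series.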
8. 【2601.04126】InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
Link: https://arxiv.org/abs/2601.04126
Authors: Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, Yan Lu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: practical AI assistants, interact with graphical, graphical interfaces, interfaces on behalf, behalf of users
Comments: Work In Progress
Click to view abstract
Abstract:GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well on generating a single webpage, building a realistic and functional website with many interconnected pages faces challenges. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seed with reference design image to ensure diversity. Our system also generates verifiable task evaluators enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of the proposed system.
9. 【2601.04121】MORPHFED: Federated Learning for Cross-institutional Blood Morphology Analysis
Link: https://arxiv.org/abs/2601.04121
Authors: Gabriel Ansah, Eden Ruffell, Delmiro Fernandez-Reyes, Petru Manescu
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: support hematological diagnostics, Automated blood morphology, diagnostics in low, middle-income countries, staining variability
Comments:
Click to view abstract
Abstract:Automated blood morphology analysis can support hematological diagnostics in low- and middle-income countries (LMICs) but remains sensitive to dataset shifts from staining variability, imaging differences, and rare morphologies. Building centralized datasets to capture this diversity is often infeasible due to privacy regulations and data-sharing restrictions. We introduce a federated learning framework for white blood cell morphology analysis that enables collaborative training across institutions without exchanging training data. Using blood films from multiple clinical sites, our federated models learn robust, domain-invariant representations while preserving complete data privacy. Evaluations across convolutional and transformer-based architectures show that federated training achieves strong cross-site performance and improved generalization to unseen institutions compared to centralized training. These findings highlight federated learning as a practical and privacy-preserving approach for developing equitable, scalable, and generalizable medical imaging AI in resource-limited healthcare environments.
10. 【2601.04118】GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning
Link: https://arxiv.org/abs/2601.04118
Authors: Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuxin Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Remote Sensing Vision-Language, Sensing Vision-Language Models, Remote Sensing, complex spatial tasks, evolution of Remote
Abstract:The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.
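The option permutation idea can be illustrated with a toy consistency score: shuffle the answer options and check whether the model keeps choosing the same option content rather than the same position. The abstract does not give the actual reward formula, so `consistency_reward` and the two stand-in answer functions below are hypothetical.

```python
import itertools

def consistency_reward(question, options, answer_fn, max_perms=6):
    """Fraction of option orderings on which answer_fn picks the same option text."""
    picks = []
    for perm in itertools.islice(itertools.permutations(options), max_perms):
        idx = answer_fn(question, list(perm))  # index into the permuted list
        picks.append(perm[idx])
    most_common = max(set(picks), key=picks.count)
    return picks.count(most_common) / len(picks)

def positional_shortcut(question, options):
    # Stand-in for a model that always answers with the first option.
    return 0

def content_based(question, options):
    # Stand-in for a model that tracks the option content.
    return options.index("harbor")

options = ["harbor", "airport", "farmland"]
shortcut_score = consistency_reward("What is shown?", options, positional_shortcut)
content_score = consistency_reward("What is shown?", options, content_based)
print(shortcut_score, content_score)
```

A model relying on position scores low here, while one that follows the content scores 1.0, which is the behavior a consistency-style reward is meant to favor.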
11. 【2601.04090】Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
Link: https://arxiv.org/abs/2601.04090
Authors: Jiaxin Huang, Yuanbo Yang, Bangbang Yang, Lin Ma, Yuewen Ma, Yiyi Liao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: video diffusion models, bridges the strong, video diffusion, foundational reconstruction models, diffusion models
Comments: Project page: [this https URL](https://xdimlab.github.io/Gen3R/)
Abstract:We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
12. 【2601.04073】Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts
Link: https://arxiv.org/abs/2601.04073
Authors: Zhihao Zhu, Jiafeng Liang, Shixin Jiang, Jinlan Fu, Ming Liu, Guanglu Sun, See-Kiong Ng, Bing Qin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Multimodal Models, Large Multimodal, demonstrated impressive capabilities, Multimodal Models, demonstrated impressive
Comments: 10 pages, 5 figures
Abstract:Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode termed textual inertia, where once a textual hallucination occurs in the thinking process, models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. To systematically investigate this, we propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs spanning both native reasoning architectures and prompt-driven paradigms to evaluate their self-reflection capabilities. The results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. To mitigate this, we introduce Active Visual-Context Refinement, a training-free inference paradigm which orchestrates an active visual re-grounding mechanism to enforce fine-grained verification coupled with an adaptive context refinement strategy to summarize and denoise the reasoning history. Experiments demonstrate that our approach significantly stifles hallucination propagation and enhances reasoning robustness.
13. 【2601.04068】Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models
Link: https://arxiv.org/abs/2601.04068
Authors: Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Direct Preference Optimization, crucial for generating, Existing Direct Preference, Aligning, generating high-quality videos
Comments: Under Review
Abstract:Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.
14. 【2601.04065】Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation
Link: https://arxiv.org/abs/2601.04065
Authors: Raül Pérez-Gonzalo, Riccardo Magro, Andreas Espersen, Antonio Agudo
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: reduce energy output, degrade aerodynamic performance, minor surface damages, Reliable operation, accelerate blade wear
Comments: Accepted to WACV 2026
Abstract:Reliable operation of wind turbines requires frequent inspections, as even minor surface damages can degrade aerodynamic performance, reduce energy output, and accelerate blade wear. Central to automating these inspections is the accurate segmentation of turbine blades from visual data. This task is traditionally addressed through dense, pixel-wise deep learning models. However, such methods demand extensive annotated datasets, posing scalability challenges. In this work, we introduce an annotation-efficient segmentation approach that reframes the pixel-level task into a binary region classification problem. Image regions are generated using a fully unsupervised, interpretable Modular Adaptive Region Growing technique, guided by image-specific Adaptive Thresholding and enhanced by a Region Merging process that consolidates fragmented areas into coherent segments. To improve generalization and classification robustness, we introduce RegionMix, an augmentation strategy that synthesizes new training samples by combining distinct regions. Our framework demonstrates state-of-the-art segmentation accuracy and strong cross-site generalization by consistently segmenting turbine blades across distinct windfarms.
15. 【2601.04061】CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos
Link: https://arxiv.org/abs/2601.04061
Authors: Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, Yansong Tang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Latent Action Pretraining, Latent Action Models, Contrastive Latent Action, Existing Latent Action, Latent Action
Comments: Project page: [this https URL](https://lin-shan.com/CLAP/)
Abstract:Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from visual entanglement, capturing noise rather than manipulation skills. To address this, we propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. By employing contrastive learning, CLAP maps video transitions onto a quantized, physically executable codebook. Building on this representation, we introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation. Furthermore, we propose a Knowledge Matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning. Extensive experiments demonstrate that CLAP significantly outperforms strong baselines, enabling the effective transfer of skills from human videos to robotic execution.
16. 【2601.04033】Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model
Link: https://arxiv.org/abs/2601.04033
Authors: Yuan Wang, Borui Liao, Huijuan Huang, Jinda Lu, Ouxiang Li, Kuien Liu, Meng Wang, Xiang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, strategies have improved, post-training strategies, structural distortions, Recent
Abstract:Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generative video. To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortion evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.
17. 【2601.04005】Padé Neurons for Efficient Neural Models
Link: https://arxiv.org/abs/2601.04005
Authors: Onur Keleş, A. Murat Tekalp
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: Paons, neurons, point-wise activation, point-wise non-linear activation, McCulloch-Pitts neuron model
Comments: Accepted for Publication in IEEE TRANSACTIONS ON IMAGE PROCESSING; 13 pages, 8 figures
Abstract:Neural networks commonly employ the McCulloch-Pitts neuron model, which is a linear model followed by a point-wise non-linear activation. Various researchers have already advanced inherently non-linear neuron models, such as quadratic neurons, generalized operational neurons, generative neurons, and super neurons, which offer stronger non-linearity compared to point-wise activation functions. In this paper, we introduce a novel and better non-linear neuron model called Padé neurons (Paons), inspired by Padé approximants. Paons offer several advantages, such as diversity of non-linearity, since each Paon learns a different non-linear function of its inputs, and layer efficiency, since Paons provide stronger non-linearity in much fewer layers compared to piecewise linear approximation. Furthermore, Paons include all previously proposed neuron models as special cases, thus any neuron model in any network can be replaced by Paons. We note that there has been a proposal to employ the Padé approximation as a generalized point-wise activation function, which is fundamentally different from our model. To validate the efficacy of Paons, in our experiments, we replace classic neurons in some well-known neural image super-resolution, compression, and classification models based on the ResNet architecture with Paons. Our comprehensive experimental results and analyses demonstrate that neural models built by Paons provide better or equal performance than their classic counterparts with a smaller number of layers. The PyTorch implementation code for Paon is open-sourced at this https URL.
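For readers unfamiliar with Padé approximants, the sketch below evaluates the standard [2/2] Padé approximant of exp(x), a ratio of two quadratics, and compares it with the degree-4 Taylor polynomial that matches the same number of series terms. It only illustrates the rational-function idea that inspires Paons; it is not the neuron model from the paper, whose exact form the abstract does not give.

```python
import math

def pade_exp_22(x):
    """[2/2] Pade approximant of exp(x) around x = 0: a ratio of two quadratics."""
    num = 1 + x / 2 + x * x / 12
    den = 1 - x / 2 + x * x / 12  # has no real roots, so division is safe
    return num / den

def taylor_exp_4(x):
    """Degree-4 Taylor polynomial of exp(x) around x = 0."""
    return sum(x ** k / math.factorial(k) for k in range(5))

x = 1.0
pade_err = abs(pade_exp_22(x) - math.exp(x))
taylor_err = abs(taylor_exp_4(x) - math.exp(x))
print(f"Pade error: {pade_err:.6f}, Taylor error: {taylor_err:.6f}")
```

At x = 1 the rational form is the more accurate of the two, which hints at why a learnable ratio of polynomials can express strong non-linearity compactly.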
18. 【2601.03993】PosterVerse: A Full-Workflow Framework for Commercial-Grade Poster Generation with HTML-Based Scalable Typography
Link: https://arxiv.org/abs/2601.03993
Authors: Junle Liu, Peirong Zhang, Yuyi Zhang, Pengyu Yan, Hui Zhou, Xinyue Zhou, Fengjun Guo, Lianwen Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: informative content delivery, appeal with precise, informative content, content delivery, demands the seamless
Abstract:Commercial-grade poster design demands the seamless integration of aesthetic appeal with precise, informative content delivery. Current automated poster generation systems face significant limitations, including incomplete design workflows, poor text rendering accuracy, and insufficient flexibility for commercial applications. To address these challenges, we propose PosterVerse, a full-workflow, commercial-grade poster generation method that seamlessly automates the entire design process while delivering high-density and scalable text rendering. PosterVerse replicates professional design through three key stages: (1) blueprint creation using fine-tuned LLMs to extract key design elements from user requirements, (2) graphical background generation via customized diffusion models to create visually appealing imagery, and (3) unified layout-text rendering with an MLLM-powered HTML engine to guarantee high text accuracy and flexible customization. In addition, we introduce PosterDNA, a commercial-grade, HTML-based dataset tailored for training and validating poster design models. To the best of our knowledge, PosterDNA is the first Chinese poster generation dataset to introduce HTML typography files, enabling scalable text rendering and fundamentally solving the challenges of rendering small and high-density text. Experimental results demonstrate that PosterVerse consistently produces commercial-grade posters with appealing visuals, accurate text alignment, and customizable layouts, making it a promising solution for automating commercial poster design. The code and model are available at this https URL.
19. 【2601.03959】FUSION: Full-Body Unified Motion Prior for Body and Hands via Diffusion
Link: https://arxiv.org/abs/2601.03959
Authors: Enes Duran, Nikos Athanasiou, Muhammed Kocabas, Michael J. Black, Omid Taheri
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: motion, full-body motion, hand, full-body, hand motion
Abstract:Hands are central to interacting with our surroundings and conveying gestures, making their inclusion essential for full-body motion synthesis. Despite this, existing human motion synthesis methods fall short: some ignore hand motions entirely, while others generate full-body motions only for narrowly scoped tasks under highly constrained settings. A key obstacle is the lack of large-scale datasets that jointly capture diverse full-body motion with detailed hand articulation. While some datasets capture both, they are limited in scale and diversity. Conversely, large-scale datasets typically focus either on body motion without hands or on hand motions without the body. To overcome this, we curate and unify existing hand motion datasets with large-scale body motion data to generate full-body sequences that capture both hand and body. We then propose the first diffusion-based unconditional full-body motion prior, FUSION, which jointly models body and hand motion. Despite using a pose-based motion representation, FUSION surpasses state-of-the-art skeletal control models on the Keypoint Tracking task in the HumanML3D dataset and achieves superior motion naturalness. Beyond standard benchmarks, we demonstrate that FUSION can go beyond typical uses of motion priors through two applications: (1) generating detailed full-body motion including fingers during interaction given the motion of an object, and (2) generating Self-Interaction motions using an LLM to transform natural language cues into actionable motion constraints. For these applications, we develop an optimization pipeline that refines the latent space of our diffusion model to generate task-specific motions. Experiments on these tasks highlight precise control over hand motion while maintaining plausible full-body coordination. The code will be public.
20. 【2601.03955】ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation
Link: https://arxiv.org/abs/2601.03955
Authors: Xu Zhang, Cheng Da, Huan Yang, Kun Gai, Ming Lu, Zhan Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: sequential token streams, treating visual data, flat sequential token, generation largely follow, Existing
Comments: Technical report
Abstract:Existing 1D visual tokenizers for autoregressive (AR) generation largely follow the design principles of language modeling, as they are built directly upon transformers whose priors originate in language, yielding single-hierarchy latent tokens and treating visual data as flat sequential token streams. However, this language-like formulation overlooks key properties of vision, particularly the hierarchical and residual network designs that have long been essential for convergence and efficiency in visual models. To bring "vision" back to vision, we propose the Residual Tokenizer (ResTok), a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. The hierarchical representations obtained through progressively merging enable cross-level feature fusion at each layer, substantially enhancing representational capacity. Meanwhile, the semantic residuals between hierarchies prevent information overlap, yielding more concentrated latent distributions that are easier for AR modeling. Cross-level bindings consequently emerge without any explicit constraints. To accelerate the generation process, we further introduce a hierarchical AR generator that substantially reduces sampling steps by predicting an entire level of latent tokens at once rather than generating them strictly token-by-token. Extensive experiments demonstrate that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps. Code is available at this https URL.
21. 【2601.03928】FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
Link: https://arxiv.org/abs/2601.03928
Authors: Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: User Interface, Vision-Language Models, process increasingly high-resolution, increasingly high-resolution screenshots, shown remarkable performance
Comments: 14 pages, 13 figures
Abstract:Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
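The PosPad rule described above (each contiguous run of dropped tokens becomes one marker placed at the run's last index) can be sketched in a few lines. The marker string `<pos_pad>`, the function name `pospad`, and the toy token names are assumptions for illustration; the paper's actual token handling is not given in the abstract.

```python
MARKER = "<pos_pad>"  # hypothetical name for the special marker token

def pospad(tokens, keep):
    """Drop tokens where keep is False, collapsing each dropped run into one marker.

    Returns (position, token) pairs. Each marker carries the index of the
    last token in the run it replaces, so kept tokens retain their original
    positions.
    """
    out = []
    for i, (tok, k) in enumerate(zip(tokens, keep)):
        if k:
            out.append((i, tok))
        elif i + 1 == len(tokens) or keep[i + 1]:
            # This is the last index of a dropped run: emit a single marker.
            out.append((i, MARKER))
    return out

tokens = ["p0", "p1", "p2", "p3", "p4", "p5", "p6"]
keep = [True, False, False, True, True, False, False]
result = pospad(tokens, keep)
print(result)
```

Here seven tokens shrink to five entries, and the positions of `p3` and `p4` are unchanged, which is the continuity property the abstract emphasizes.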
22. 【2601.03915】HemBLIP: A Vision-Language Model for Interpretable Leukemia Cell Morphology Analysis
Link: https://arxiv.org/abs/2601.03915
Authors: Julie van Logtestijn, Petru Manescu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: limiting clinical trust, current deep learning, Microscopic evaluation, deep learning models, white blood cell
Abstract:Microscopic evaluation of white blood cell morphology is central to leukemia diagnosis, yet current deep learning models often act as black boxes, limiting clinical trust and adoption. We introduce HemBLIP, a vision language model designed to generate interpretable, morphology aware descriptions of peripheral blood cells. Using a newly constructed dataset of 14k healthy and leukemic cells paired with expert-derived attribute captions, we adapt a general-purpose VLM via both full fine-tuning and LoRA based parameter efficient training, and benchmark against the biomedical foundation model MedGEMMA. HemBLIP achieves higher caption quality and morphological accuracy, while LoRA adaptation provides further gains with significantly reduced computational cost. These results highlight the promise of vision language models for transparent and scalable hematological diagnostics.
23. 【2601.03884】FLNet: Flood-Induced Agriculture Damage Assessment using Super Resolution of Satellite Images
Link: https://arxiv.org/abs/2601.03884
Authors: Sanidhya Ghosal, Anurag Sharma, Sushil Ghildiyal, Mukesh Saini
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Distributing government relief, government relief efforts, Distributing government, government relief, relief efforts
Comments: Accepted for oral presentation at the 10th International Conference on Computer Vision and Image Processing (CVIP 2025)
Abstract:Distributing government relief efforts after a flood is challenging. In India, crops are widely affected by floods, so rapid and accurate crop damage assessment is crucial for effective post-disaster agricultural management. Traditional manual surveys are slow and biased, while current satellite-based methods face challenges like cloud cover and low spatial resolution. To bridge this gap, this paper introduces FLNet, a novel deep learning based architecture that uses super-resolution to enhance the 10 m spatial resolution of Sentinel-2 satellite images to 3 m before classifying damage. We tested our model on the Bihar Flood Impacted Croplands Dataset (BFCD-22), and the results show that the critical "Full Damage" F1-score improves from 0.83 to 0.89, nearly matching the 0.89 score of commercial high-resolution imagery. This work presents a cost-effective and scalable solution, paving the way for a nationwide shift from manual to automated, high-fidelity damage assessment.
24. 【2601.03869】Bayesian Monocular Depth Refinement via Neural Radiance Fields
Link: https://arxiv.org/abs/2601.03869
Authors: Arun Muthukkumar
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Keywords: computer vision task, essential computer vision, Neural Radiance Fields, extended reality, vision task
Comments: IEEE 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2025). Oral presentation; Best Presenter Award
Abstract:Monocular depth estimation has applications in many fields, such as autonomous navigation and extended reality, making it an essential computer vision task. However, current methods often produce smooth depth maps that lack the fine geometric detail needed for accurate scene understanding. We propose MDENeRF, an iterative framework that refines monocular depth estimates using depth information from Neural Radiance Fields (NeRFs). MDENeRF consists of three components: (1) an initial monocular estimate for global structure, (2) a NeRF trained on perturbed viewpoints, with per-pixel uncertainty, and (3) Bayesian fusion of the noisy monocular and NeRF depths. We derive NeRF uncertainty from the volume rendering process to iteratively inject high-frequency fine details. Meanwhile, our monocular prior maintains global structure. We demonstrate superior performance on key metrics and experiments using indoor scenes from the SUN RGB-D dataset.
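As a rough illustration of the Bayesian fusion step, the sketch below combines two noisy depth estimates for a single pixel by inverse-variance weighting, which is the posterior for two independent Gaussian measurements. The Gaussian assumption, the function name `fuse_depth`, and the example numbers are mine; the abstract does not specify the paper's exact fusion rule.

```python
def fuse_depth(d_mono, var_mono, d_nerf, var_nerf):
    """Posterior mean and variance for two independent Gaussian depth estimates."""
    w_mono = 1.0 / var_mono  # precision of the monocular estimate
    w_nerf = 1.0 / var_nerf  # precision of the NeRF estimate
    var_post = 1.0 / (w_mono + w_nerf)
    d_post = var_post * (w_mono * d_mono + w_nerf * d_nerf)
    return d_post, var_post

# Hypothetical pixel: the monocular depth is less certain than the NeRF depth.
d_post, var_post = fuse_depth(d_mono=2.0, var_mono=0.04, d_nerf=2.2, var_nerf=0.01)
print(d_post, var_post)
```

The fused depth lands closer to the lower-variance NeRF estimate, and the fused variance is smaller than either input variance.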
25. 【2601.03824】IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting
Link: https://arxiv.org/abs/2601.03824
Authors: Wei Long, Haifeng Wu, Shiyin Jiang, Jinhua Zhang, Xinchun Ji, Shuhang Gu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Gaussian Splatting aims, Gaussian Splatting, directly predict Gaussian, Splatting aims, predict Gaussian parameters
Abstract:Generalizable 3D Gaussian Splatting aims to directly predict Gaussian parameters using a feed-forward network for scene reconstruction. Among these parameters, Gaussian means are particularly difficult to predict, so depth is usually estimated first and then unprojected to obtain the Gaussian sphere centers. Existing methods typically rely solely on a single warp to estimate depth probability, which hinders their ability to fully leverage cross-view geometric cues, resulting in unstable and coarse depth maps. To address this limitation, we propose IDESplat, which iteratively applies warp operations to boost depth probability estimation for accurate Gaussian mean prediction. First, to eliminate the inherent instability of a single warp, we introduce a Depth Probability Boosting Unit (DPBU) that integrates epipolar attention maps produced by cascading warp operations in a multiplicative manner. Next, we construct an iterative depth estimation process by stacking multiple DPBUs, progressively identifying potential depth candidates with high likelihood. As IDESplat iteratively boosts depth probability estimates and updates the depth candidates, the depth map is gradually refined, resulting in accurate Gaussian means. We conduct experiments on RealEstate10K, ACID, and DL3DV. IDESplat achieves outstanding reconstruction quality and state-of-the-art performance with real-time efficiency. On RE10K, it outperforms DepthSplat by 0.33 dB in PSNR, using only 10.7% of the parameters and 70% of the memory. Additionally, our IDESplat improves PSNR by 2.95 dB over DepthSplat on the DTU dataset in cross-dataset experiments, demonstrating its strong generalization ability.
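The effect of combining warp results "in a multiplicative manner" can be seen in a toy per-pixel example: multiplying probability vectors over depth candidates and renormalizing sharpens the peak where the warps agree. This is a simplified sketch under my own assumptions (plain probability vectors, the name `boost_depth_probability`), not the DPBU architecture itself.

```python
def boost_depth_probability(prob_maps):
    """Multiply per-candidate probabilities from several warps, then renormalize."""
    n = len(prob_maps[0])
    product = [1.0] * n
    for probs in prob_maps:
        product = [p * q for p, q in zip(product, probs)]
    total = sum(product)
    return [p / total for p in product]

# Two hypothetical warps scoring the same three depth candidates for one pixel.
warp_a = [0.2, 0.5, 0.3]
warp_b = [0.1, 0.6, 0.3]
boosted = boost_depth_probability([warp_a, warp_b])
print(boosted)
```

Both warps weakly prefer the second candidate; after multiplication its probability exceeds either individual estimate.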
26. 【2601.03811】EvalBlocks: A Modular Pipeline for Rapidly Evaluating Foundation Models in Medical Imaging
Link: https://arxiv.org/abs/2601.03811
Authors: Jan Tagscherer, Sarah de Boer, Lena Philipp, Fennie van der Graaf, Dré Peeters, Joeran Bosma, Lars Leijten, Bogdan Obreja, Ewoud Smit, Alessa Hering
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: requires continuous monitoring, Developing foundation models, imaging requires continuous, Developing foundation, requires continuous
Comments: Accepted at BVM 2026
Abstract:Developing foundation models in medical imaging requires continuous monitoring of downstream performance. Researchers are burdened with tracking numerous experiments, design choices, and their effects on performance, often relying on ad-hoc, manual workflows that are inherently slow and error-prone. We introduce EvalBlocks, a modular, plug-and-play framework for efficient evaluation of foundation models during development. Built on Snakemake, EvalBlocks supports seamless integration of new datasets, foundation models, aggregation methods, and evaluation strategies. All experiments and results are tracked centrally and are reproducible with a single command, while efficient caching and parallel execution enable scalable use on shared compute infrastructure. Demonstrated on five state-of-the-art foundation models and three medical imaging classification tasks, EvalBlocks streamlines model evaluation, enabling researchers to iterate faster and focus on model innovation rather than evaluation logistics. The framework is released as open source software at this https URL.
27. 【2601.03808】From Brute Force to Semantic Insight: Performance-Guided Data Transformation Design with LLMs
Link: https://arxiv.org/abs/2601.03808
Authors: Usha Shrestha, Dmitry Ignatov, Radu Timofte
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Large language models, Large language, data-aware augmentation remains, achieved notable performance, limiting factor
Abstract:Large language models (LLMs) have achieved notable performance in code synthesis; however, data-aware augmentation remains a limiting factor, handled via heuristic design or brute-force approaches. We introduce a performance-aware, closed-loop solution in the NNGPT ecosystem of projects that enables LLMs to autonomously engineer optimal transformations by internalizing empirical performance cues. We fine-tune LLMs with Low-Rank Adaptation on a novel repository of more than 6,000 empirically evaluated PyTorch augmentation functions, each annotated solely by downstream model accuracy. Training uses pairwise performance ordering (better-worse transformations), enabling alignment through empirical feedback without reinforcement learning, reward models, or symbolic objectives. This reduces the need for exhaustive search, achieving up to 600x fewer evaluated candidates than brute-force discovery while maintaining competitive peak accuracy and shifting generation from random synthesis to task-aligned design. Ablation studies show that structured Chain-of-Thought prompting introduces syntactic noise and degrades performance, whereas direct prompting ensures stable optimization in performance-critical code tasks. Qualitative and quantitative analyses demonstrate that the model internalizes semantic performance cues rather than memorizing syntax. These results show that LLMs can exhibit task-level reasoning through non-textual feedback loops, bypassing explicit symbolic rewards.
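One plausible way to derive "better-worse" pairs from accuracy-annotated augmentation functions is sketched below. The record format, the `min_gap` threshold for skipping near-ties, and the function name `build_pairs` are illustrative assumptions; the abstract does not describe how the authors construct their pairs.

```python
from itertools import combinations

def build_pairs(scored, min_gap=0.0):
    """Turn (code, accuracy) records into (better, worse) training pairs.

    Pairs whose accuracy gap is at most min_gap are skipped as uninformative.
    """
    pairs = []
    for (code_a, acc_a), (code_b, acc_b) in combinations(scored, 2):
        if abs(acc_a - acc_b) <= min_gap:
            continue
        better, worse = (code_a, code_b) if acc_a > acc_b else (code_b, code_a)
        pairs.append((better, worse))
    return pairs

# Hypothetical augmentation functions, identified by name, with downstream accuracy.
scored = [("flip", 0.81), ("blur", 0.74), ("rotate", 0.81), ("jitter", 0.79)]
pairs = build_pairs(scored, min_gap=0.01)
print(pairs)
```

The tied `flip` and `rotate` entries produce no pair, while every other combination yields an ordered example that a fine-tuning step could consume.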
28. 【2601.03784】A Comparative Study of 3D Model Acquisition Methods for Synthetic Data Generation of Agricultural Products
Link: https://arxiv.org/abs/2601.03784
Authors: Steven Moonen, Rob Salaets, Kenneth Batstone, Abdellatif Bey-Temsamani, Nick Michiels
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: computer vision systems, vision systems based, computer vision, artificial intelligence, increase production
Comments: 6 pages, 3 figures, 1 table, presented at 4th International Conference on Responsible Consumption and Production, [this https URL](https://link.springer.com/book/9783032173546)
Abstract:In the manufacturing industry, computer vision systems based on artificial intelligence (AI) are widely used to reduce costs and increase production. Training these AI models requires a large amount of training data that is costly to acquire and annotate, especially in high-variance, low-volume manufacturing environments. A popular approach to reduce the need for real data is the use of synthetic data that is generated by leveraging computer-aided design (CAD) models available in the industry. However, in the agricultural industry these models are not readily available, increasing the difficulty in leveraging synthetic data. In this paper, we present different techniques for substituting CAD files to create synthetic datasets. We measure their relative performance when used to train an AI object detection model to separate stones and potatoes in a bin picking environment. We demonstrate that highly representative 3D models acquired by scanning or by image-to-3D approaches can be used to generate synthetic data for training object detection models. Fine-tuning on a small real dataset can significantly improve the performance of the models, and can even yield similar performance when less representative models are used.
29. 【2601.03782】PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
Link: https://arxiv.org/abs/2601.03782
Authors: Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, Li Fei-Fei
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Humans anticipate, equally vital, world model, point flows, contemplated action
Abstract:Humans anticipate, from a glance and a contemplated action of their bodies, how the 3D world will respond, a capability that is equally vital for robotic manipulation. We introduce PointWorld, a large pre-trained 3D world model that unifies state and action in a shared 3D space as 3D point flows: given one or few RGB-D images and a sequence of low-level robot action commands, PointWorld forecasts per-pixel displacements in 3D that respond to the given actions. By representing actions as 3D point flows instead of embodiment-specific action spaces (e.g., joint positions), this formulation directly conditions on physical geometries of robots while seamlessly integrating learning across embodiments. To train our 3D world model, we curate a large-scale dataset spanning real and simulated robotic manipulation in open-world environments, enabled by recent advances in 3D vision and simulated environments, totaling about 2M trajectories and 500 hours across a single-arm Franka and a bimanual humanoid. Through rigorous, large-scale empirical studies of backbones, action representations, learning objectives, partial observability, data mixtures, domain transfers, and scaling, we distill design principles for large-scale 3D world modeling. With a real-time (0.1s) inference speed, PointWorld can be efficiently integrated in the model-predictive control (MPC) framework for manipulation. We demonstrate that a single pre-trained checkpoint enables a real-world Franka robot to perform rigid-body pushing, deformable and articulated object manipulation, and tool use, without requiring any demonstrations or post-training and all from a single image captured in-the-wild. Project website at this https URL.
30. 【2601.03781】MVP: Enhancing Video Large Language Models via Self-supervised Masked Video Prediction
Link: https://arxiv.org/abs/2601.03781
Authors: Xiaokun Sun, Zezhong Wu, Zewen Ding, Linli Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Reinforcement learning based, Video Large Language, Large Language, achieved significant success
Abstract:Reinforcement learning based post-training paradigms for Video Large Language Models (VideoLLMs) have achieved significant success by optimizing for visual-semantic tasks such as captioning or VideoQA. However, while these approaches effectively enhance perception abilities, they primarily target holistic content understanding, often lacking explicit supervision for intrinsic temporal coherence and inter-frame correlations. This tendency limits the models' ability to capture intricate dynamics and fine-grained visual causality. To explicitly bridge this gap, we propose a novel post-training objective: Masked Video Prediction (MVP). By requiring the model to reconstruct a masked continuous segment from a set of challenging distractors, MVP forces the model to attend to the sequential logic and temporal context of events. To support scalable training, we introduce a scalable data synthesis pipeline capable of transforming arbitrary video corpora into MVP training samples, and further employ Group Relative Policy Optimization (GRPO) with a fine-grained reward function to enhance the model's understanding of video context and temporal properties. Comprehensive evaluations demonstrate that MVP enhances video reasoning capabilities by directly reinforcing temporal reasoning and causal understanding.
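To make the masked-segment task concrete, here is a minimal sketch of how one multiple-choice sample might be built from a list of frames. The abstract does not say how distractors are constructed, so the choice below (the same frames in a wrong temporal order) is an assumption, as are the function name `make_mvp_sample`, the `<MASK>` placeholder, and every parameter.

```python
import math
import random


def make_mvp_sample(frames, seg_len, num_distractors=3, seed=0):
    """Mask one contiguous segment and build a multiple-choice question.

    Assumes all frames are distinct. Distractors are temporally shuffled
    copies of the masked segment (an illustrative choice, not the paper's).
    """
    if len(frames) < seg_len + 2:
        raise ValueError("need at least one context frame on each side")
    if math.factorial(seg_len) - 1 < num_distractors:
        raise ValueError("segment too short for that many distinct distractors")

    rng = random.Random(seed)
    start = rng.randrange(1, len(frames) - seg_len)
    target = frames[start:start + seg_len]
    context = frames[:start] + ["<MASK>"] + frames[start + seg_len:]

    distractors = []
    while len(distractors) < num_distractors:
        candidate = target[:]
        rng.shuffle(candidate)
        if candidate != target and candidate not in distractors:
            distractors.append(candidate)

    options = distractors + [target]
    rng.shuffle(options)
    return {"context": context, "options": options, "answer": options.index(target)}


# Frame ids stand in for real video frames.
sample = make_mvp_sample(list(range(8)), seg_len=3)
```

A reward for GRPO-style training could then be as simple as checking whether the model picks `sample["answer"]`; the paper's fine-grained reward function is not described in the abstract.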
31. 【2601.03741】I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing
Link: https://arxiv.org/abs/2601.03741
Authors: Jinghan Yu, Junhao Xiao, Chenyu Zhu, Jiaming Li, Jia Li, HanMing Deng, Xirui Wang, Guoli Jia, Jianjun Li, Zhiyuan Ma, Xiang Bai, Bowen Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Existing text-guided image, Existing text-guided, pixel-level inpainting paradigm, methods primarily rely, pixel-level inpainting
Abstract:Existing text-guided image editing methods primarily rely on an end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still significantly struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. This paradigm is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. We also construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.
32. 【2601.03736】HyperCOD: The First Challenging Benchmark and Baseline for Hyperspectral Camouflaged Object Detection
Link: https://arxiv.org/abs/2601.03736
Authors: Shuyan Bai, Tingfa Xu, Peifu Liu, Yuhao Qiu, Huiyan Bai, Huan Chen, Yanyan Peng, Jianan Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: camouflaged object detection, RGB-based camouflaged object, object detection struggles, cues are ambiguous, camouflaged object
Abstract:RGB-based camouflaged object detection struggles in real-world scenarios where color and texture cues are ambiguous. While hyperspectral imaging offers a powerful alternative by capturing fine-grained spectral signatures, progress in hyperspectral camouflaged object detection (HCOD) has been critically hampered by the absence of a dedicated, large-scale benchmark. To spur innovation, we introduce HyperCOD, the first challenging benchmark for HCOD. Comprising 350 high-resolution hyperspectral images, it features complex real-world scenarios with minimal objects, intricate shapes, severe occlusions, and dynamic lighting to challenge current models. The advent of foundation models like the Segment Anything Model (SAM) presents a compelling opportunity. To adapt SAM for HCOD, we propose HyperSpectral Camouflage-aware SAM (HSC-SAM). HSC-SAM ingeniously reformulates the hyperspectral image by decoupling it into a spatial map fed to SAM's image encoder and a spectral saliency map that serves as an adaptive prompt. This translation effectively bridges the modality gap. Extensive experiments show that HSC-SAM sets a new state-of-the-art on HyperCOD and generalizes robustly to other public HSI datasets. The HyperCOD dataset and our HSC-SAM baseline provide a robust foundation to foster future research in this emerging area.
33. 【2601.03733】RadDiff: Describing Differences in Radiology Image Sets with Natural Language
Link: https://arxiv.org/abs/2601.03733
Authors: Xiaoxian Shen, Yuhui Zhang, Sahithi Ankireddy, Xiaohan Wang, Maya Varma, Henry Guo, Curtis Langlotz, Serena Yeung-Levy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Keywords: image sets differ, generating clinical insights, radiology image sets, sets differ, differ is critical
Abstract:Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff's versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.
34. 【2601.03729】MATANet: A Multi-context Attention and Taxonomy-Aware Network for Fine-Grained Underwater Recognition of Marine Species
Link: https://arxiv.org/abs/2601.03729
Authors: Donghwan Lee, Byeongjin Kim, Geunhee Kim, Hyukjin Kwon, Nahyeon Maeng, Wooju Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: animals supports ecology, marine animals supports, supports ecology, biodiversity and habitat, habitat conservation
Abstract:Fine-grained classification of marine animals supports ecology, biodiversity and habitat conservation, and evidence-based policy-making. However, existing methods often overlook contextual interactions from the surrounding environment and insufficiently incorporate the hierarchical structure of marine biological taxonomy. To address these challenges, we propose MATANet (Multi-context Attention and Taxonomy-Aware Network), a novel model designed for fine-grained marine species classification. MATANet mimics expert strategies by using taxonomy and environmental context to interpret ambiguous features of underwater animals. It consists of two key components: a Multi-Context Environmental Attention Module (MCEAM), which learns relationships between regions of interest (ROIs) and their surrounding environments, and a Hierarchical Separation-Induced Learning Module (HSLM), which encodes taxonomic hierarchy into the feature space. MATANet combines instance and environmental features with taxonomic structure to enhance fine-grained classification. Experiments on the FathomNet2025, FAIR1M, and LifeCLEF2015-Fish datasets demonstrate state-of-the-art performance. The source code is available at: this https URL
35. 【2601.03728】CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval
Link: https://arxiv.org/abs/2601.03728
Authors: Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Yiwei Ma, Jiayi Ji, Chenyi Lei, Han Li, Xiaoshuai Sun
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Composed Image Retrieval, offering substantial advantages, single-modality retrieval systems, Composed Image, manipulation text
Abstract:Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.
36. 【2601.03718】Towards Real-world Lens Active Alignment with Unlabeled Data via Domain Adaptation
Link: https://arxiv.org/abs/2601.03718
Authors: Wenyong Lia, Qi Jiang, Weijian Hu, Kailun Yang, Zhanjun Zhang, Wenjun Tian, Kaiwei Wang, Jian Bai
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optics (physics.optics)
Keywords: high-precision optical systems, Adaptive Active Alignment, Active Alignment, key technology, Domain Adaptive Active
Abstract:Active Alignment (AA) is a key technology for the large-scale automated assembly of high-precision optical systems. Compared with labor-intensive per-model on-device calibration, a digital-twin pipeline built on optical simulation offers a substantial advantage in generating large-scale labeled data. However, complex imaging conditions induce a domain gap between simulation and real-world images, limiting the generalization of simulation-trained models. To address this, we propose augmenting a simulation baseline with minimal unlabeled real-world images captured at random misalignment positions, mitigating the gap from a domain adaptation perspective. We introduce Domain Adaptive Active Alignment (DA3), which utilizes an autoregressive domain transformation generator and an adversarial-based feature alignment strategy to distill real-world domain information via self-supervised learning. This enables the extraction of domain-invariant image degradation features to facilitate robust misalignment prediction. Experiments on two lens types reveal that DA3 improves accuracy by 46% over a purely simulation pipeline. Notably, it approaches the performance achieved with precisely labeled real-world data collected on 3 lens samples, while reducing on-device data collection time by 98.7%. The results demonstrate that domain adaptation effectively endows simulation-trained models with robust real-world performance, validating the digital-twin pipeline as a practical solution to significantly enhance the efficiency of large-scale optical assembly.
37. 【2601.03714】Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR
Link: https://arxiv.org/abs/2601.03714
Authors: Yunhao Liang, Ruixuan Ying, Bo Li, Hong Li, Kai Yan, Qingwen Li, Min Yang, Okamoto Satoshi, Zhe Cui, Shiwen Ni
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: exceeding ten times, achieve high-ratio vision-text, tokens exceeding ten, mapping approach, claiming to decode
Abstract:DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch - which drives DeepSeek-OCR's performance?" By employing sentence-level and word-level semantic corruption, we isolate the model's intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR's performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR's capability boundaries and offers essential insights for future optimizations of the vision-text compression paradigm. We release all data, results and scripts used in this study at this https URL.
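The idea of semantic corruption can be sketched in a few lines: destroy the linguistic regularities of a text while keeping the glyphs an OCR model has to read. The two functions below are assumptions for illustration only. The abstract does not specify the corruption procedure, so shuffling letters within words (word-level) and shuffling word order (sentence-level) are stand-ins, and the function names are hypothetical.

```python
import random


def corrupt_words(text, seed=0):
    """Word-level corruption: shuffle the characters inside each word."""
    rng = random.Random(seed)
    corrupted = []
    for word in text.split():
        chars = list(word)
        rng.shuffle(chars)
        corrupted.append("".join(chars))
    return " ".join(corrupted)


def corrupt_sentence(text, seed=0):
    """Sentence-level corruption: shuffle the order of the words."""
    rng = random.Random(seed)
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


original = "optical compression maps long text into few visual tokens"
word_level = corrupt_words(original)
sentence_level = corrupt_sentence(original)
```

Rendering `original`, `word_level`, and `sentence_level` as images and comparing OCR accuracy across them would separate what the model reads from what it guesses from language priors.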
38. 【2601.03713】BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion
Link: https://arxiv.org/abs/2601.03713
Authors: Qingyao Tian, Bingyu Yang, Huai Liao, Xinyan Huang, Junyong Li, Dong Yi, Hongbin Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: recently shown remarkable, shown remarkable performance, leveraging large-scale pretraining, recently shown, shown remarkable
Abstract:Vision-language models (VLMs) have recently shown remarkable performance in navigation and localization tasks by leveraging large-scale pretraining for semantic understanding. However, applying VLMs to 6-DoF endoscopic camera localization presents several challenges: 1) the lack of large-scale, high-quality, densely annotated, and localization-oriented vision-language datasets in real-world medical settings; 2) limited capability for fine-grained pose regression; and 3) high computational latency when extracting temporal features from past frames. To address these issues, we first construct the BREATH dataset, the largest in-vivo endoscopic localization dataset to date, collected in the complex human airway. Building on this dataset, we propose BREATH-VL, a hybrid framework that integrates semantic cues from VLMs with geometric information from vision-based registration methods for accurate 6-DoF pose estimation. Our motivation lies in the complementary strengths of both approaches: VLMs offer generalizable semantic understanding, while registration methods provide precise geometric alignment. To further enhance the VLM's ability to capture temporal context, we introduce a lightweight context-learning mechanism that encodes motion history as linguistic prompts, enabling efficient temporal reasoning without expensive video-level computation. Extensive experiments demonstrate that the vision-language module delivers robust semantic localization in challenging surgical scenes. Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline, while achieving competitive computational latency.
39. 【2601.03667】Rec: Egocentric Action Recognition using 2D Point Tracks
Link: https://arxiv.org/abs/2601.03667
Authors: Dennis Holzmann, Sven Wachsmuth
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: additional motion cue, approach for egocentric, motion cue, RGB appearance, Abstract
Comments: submitted to ICPR 2026
Abstract:We present a novel approach for egocentric action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and its associated point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for egocentric action understanding.
40. 【2601.03666】e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings
Link: https://arxiv.org/abs/2601.03666
Authors: Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, Zhicheng Dou
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Modern information systems, Modern information, types of items, text query, video clip
Abstract:Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at this https URL.
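Issue (i) in the abstract, similarity logits with modality-dependent sharpness, can be illustrated with a toy version of temperature scaling. This is only a sketch of the general idea of dividing similarities by a per-modality temperature before the softmax. The temperature values, the `TEMPERATURE` table, and the function names are hypothetical; how e5-omni actually calibrates or learns its temperatures is not described in the abstract.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-modality temperatures; in practice these would be learned
# or tuned, and the values here are placeholders.
TEMPERATURE = {"text": 0.05, "image": 0.07, "video": 0.10, "audio": 0.10}


def calibrated_probs(similarities, modality):
    """Scale cosine similarities by the candidate modality's temperature."""
    tau = TEMPERATURE[modality]
    return softmax([s / tau for s in similarities])


sims = [0.8, 0.6, 0.1]  # query-to-candidate cosine similarities
p_text = calibrated_probs(sims, "text")
p_audio = calibrated_probs(sims, "audio")
```

With identical similarities, the lower temperature produces a sharper distribution, which is why a single shared temperature can leave scores from different modalities on inconsistent scales.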
41. 【2601.03665】PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance
Link: https://arxiv.org/abs/2601.03665
Authors: Siddarth Nilol Kundur Satish, Devesh Jaiswal, Hongyu Chen, Abhishek Bakshi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Current video generation, unnatural object collisions, produce high-quality aesthetic, high-quality aesthetic videos, real-world physics dynamics
Comments: 9 pages, 2 figures, project page: [this https URL](https://github.com/CVFall2025-Project/PhysVideoGenerator)
Abstract:Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics, resulting in artifacts such as unnatural object collisions, inconsistent gravity, and temporal flickering. In this work, we propose PhysVideoGenerator, a proof-of-concept framework that explicitly embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2) directly from noisy diffusion latents. These predicted physics tokens are injected into the temporal attention layers of a DiT-based generator (Latte) via a dedicated cross-attention mechanism. Our primary contribution is demonstrating the technical feasibility of this joint training paradigm: we show that diffusion latents contain sufficient information to recover V-JEPA 2 physical representations, and that multi-task optimization remains stable over training. This report documents the architectural design, technical challenges, and validation of training stability, establishing a foundation for future large-scale evaluation of physics-aware generative models.
42. 【2601.03660】MGPC: Multimodal Network for Generalizable Point Cloud Completion With Modality Dropout and Progressive Decoding
Link: https://arxiv.org/abs/2601.03660
Authors: Jiangyuan Liu, Hongxuan Ma, Yuhao Zhao, Zhe Liu, Jian Wang, Wei Zou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: partial observations caused, Convolutional Neural Network, Point cloud completion, cloud completion aims, recover complete
Comments: Code and dataset are available at [this https URL](https://github.com/L-J-Yuan/MGPC)
Abstract:Point cloud completion aims to recover complete 3D geometry from partial observations caused by limited viewpoints and occlusions. Existing learning-based works, including 3D Convolutional Neural Network (CNN)-based, point-based, and Transformer-based methods, have achieved strong performance on synthetic benchmarks. However, due to the limitations of modality, scalability, and generative capacity, their generalization to novel objects and real-world scenarios remains challenging. In this paper, we propose MGPC, a generalizable multimodal point cloud completion framework that integrates point clouds, RGB images, and text within a unified architecture. MGPC introduces an innovative modality dropout strategy, a Transformer-based fusion module, and a novel progressive generator to improve robustness, scalability, and geometric modeling capability. We further develop an automatic data generation pipeline and construct MGPC-1M, a large-scale benchmark with over 1,000 categories and one million training pairs. Extensive experiments on MGPC-1M and in-the-wild data demonstrate that the proposed method consistently outperforms prior baselines and exhibits strong generalization under real-world conditions.
43. 【2601.03655】VideoMemory: Toward Consistent Video Generation via Memory Integration
Link: https://arxiv.org/abs/2601.03655
Authors: Jinsong Zhou, Yihua Du, Xinli Xu, Luozhou Wang, Zijie Zhuang, Yehang Zhang, Shuaibo Li, Xiaojun Hu, Bolan Su, Ying-cong Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Dynamic Memory Bank, Maintaining consistent characters, Maintaining consistent, environments across multiple, central challenge
Comments: Project page: [this https URL](https://hit-perfect.github.io/VideoMemory/)
Abstract:Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high-quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation. To evaluate this setting, we construct a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.
44. 【2601.03637】CrackSegFlow: Controllable Flow-Matching Synthesis for Generalizable Crack Segmentation with the CSF-50K Benchmark
Link: https://arxiv.org/abs/2601.03637
Authors: Babak Asadi, Peiyang Wu, Mani Golparvar-Fard, Ramez Hajj
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: scalable condition assessment, scarce pixel-level labels, severe domain shift, Automated crack segmentation, Automated crack
Abstract:Automated crack segmentation is essential for scalable condition assessment of pavements and civil infrastructure, yet practical deployment is limited by scarce pixel-level labels and severe domain shift across sensors, illumination, textures, and annotation conventions. This paper presents CrackSegFlow, a controllable flow-matching synthesis framework that generates photorealistic crack images conditioned on binary masks while preserving strict mask-image alignment. The generator combines topology-preserving mask injection with boundary-gated modulation to maintain thin-structure continuity and suppress texture-driven false positives. A second class-conditional flow-matching model synthesizes crack masks with explicit control over crack coverage, enabling balanced, topology-diverse paired data without additional manual annotation. We further inject crack masks into crack-free backgrounds to diversify illumination and surface artifacts and reduce false positives caused by shadows, joints, and pavement markings. Experiments on five benchmarks spanning four asphalt datasets and the crack class of a concrete-domain dataset demonstrate consistent improvements under an established hybrid CNN-Transformer segmentation backbone and a fixed training protocol. With real plus synthesized pairs, in-domain performance improves on average by 5.37 mIoU and 5.13 F1, and target-guided cross-domain synthesis yields average gains of 13.12 mIoU and 14.82 F1 using only limited target mask statistics. Compared with diffusion-based semantic synthesis, CrackSegFlow provides substantially faster deterministic sampling and improves fidelity and mask-image alignment for thin-structure crack geometry. Finally, we release CSF-50K, a public dataset of 50,000 paired crack images and pixel-accurate masks for large-scale benchmarking of generalizable crack segmentation.
45. 【2601.03633】MFC-RFNet: A Multi-scale Guided Rectified Flow Network for Radar Sequence Prediction
Link: https://arxiv.org/abs/2601.03633
Authors: Wenjie Luo, Chuanhu Deng, Chaorong Li, Rongyao Deng, Qiang Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Accurate and high-resolution, radar echo sequences, high-resolution precipitation nowcasting, economic planning, significant challenge
Comments:
Abstract:Accurate and high-resolution precipitation nowcasting from radar echo sequences is crucial for disaster mitigation and economic planning, yet it remains a significant challenge. Key difficulties include modeling complex multi-scale evolution, correcting inter-frame feature misalignment caused by displacement, and efficiently capturing long-range spatiotemporal context without sacrificing spatial fidelity. To address these issues, we present the Multi-scale Feature Communication Rectified Flow (RF) Network (MFC-RFNet), a generative framework that integrates multi-scale communication with guided feature fusion. To enhance multi-scale fusion while retaining fine detail, a Wavelet-Guided Skip Connection (WGSC) preserves high-frequency components, and a Feature Communication Module (FCM) promotes bidirectional cross-scale interaction. To correct inter-frame displacement, a Condition-Guided Spatial Transform Fusion (CGSTF) learns spatial transforms from conditioning echoes to align shallow features. The backbone adopts rectified flow training to learn near-linear probability-flow trajectories, enabling few-step sampling with stable fidelity. Additionally, lightweight Vision-RWKV (RWKV) blocks are placed at the encoder tail, the bottleneck, and the first decoder layer to capture long-range spatiotemporal dependencies at low spatial resolutions with moderate compute. Evaluations on four public datasets (SEVIR, MeteoNet, Shanghai, and CIKM) demonstrate consistent improvements over strong baselines, yielding clearer echo morphology at higher rain-rate thresholds and sustained skill at longer lead times. These results suggest that the proposed synergy of RF training with scale-aware communication, spatial alignment, and frequency-aware fusion presents an effective and robust approach for radar-based nowcasting.
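The rectified flow idea mentioned in the abstract (near-linear trajectories that allow few-step sampling) can be illustrated with a minimal sketch. This is the generic textbook formulation, not the MFC-RFNet implementation; the function names and the toy constant velocity field are assumptions made for illustration only.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the velocity field along the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a trained network: a perfectly straight (constant) velocity field.
x0 = [0.0, 0.0]
x1 = [2.0, -1.0]
const_v = velocity_target(x0, x1)
result = euler_sample(x0, lambda x, t: const_v, num_steps=2)
```

When the learned trajectories are close to straight, Euler integration with very few steps already lands near the target, which is the motivation for few-step sampling.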
46. 【2601.03625】Shape Classification using Approximately Convex Segment Features
Link: https://arxiv.org/abs/2601.03625
Authors: Bimal Kumar Ray
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: classification techniques based, existing object classification, object classification techniques, descriptive features rely, classification techniques
Comments:
Abstract: Existing object classification techniques based on descriptive features rely on object alignment to compute the similarity of objects for classification. This paper removes the need for object alignment by sorting the features. The object boundary is normalized and segmented into approximately convex segments, and the segments are then sorted in descending order of their length. The segment length, the number of extreme points in each segment, the segment area, and the base and width of each segment together form a bag of features that is used to measure the similarity between image boundaries. The proposed method is tested on datasets and acceptable results are observed.
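A minimal sketch of why sorting can stand in for alignment: if each segment is described by a feature tuple and the tuples are sorted by length, the result no longer depends on where the boundary traversal started. The tuple layout, the zero padding, and the L1 dissimilarity below are assumptions for illustration; the abstract does not specify the paper's actual similarity measure.

```python
# Hypothetical feature layout: (length, n_extreme_points, area, base, width)
PAD = (0.0, 0, 0.0, 0.0, 0.0)


def sorted_features(segments):
    """Sort segment feature tuples in descending order of segment length."""
    return sorted(segments, key=lambda s: s[0], reverse=True)


def dissimilarity(shape_a, shape_b):
    """Sum of absolute feature differences after sorting (assumed measure)."""
    fa, fb = sorted_features(shape_a), sorted_features(shape_b)
    n = max(len(fa), len(fb))
    fa = fa + [PAD] * (n - len(fa))  # pad the shorter list with empty segments
    fb = fb + [PAD] * (n - len(fb))
    return sum(abs(x - y) for sa, sb in zip(fa, fb) for x, y in zip(sa, sb))


# The same three segments, listed from two different starting points on the boundary.
shape = [(5.0, 2, 3.0, 4.0, 1.0), (9.0, 3, 7.5, 8.0, 2.0), (2.0, 1, 0.5, 1.5, 0.4)]
rotated = shape[1:] + shape[:1]
```

Here `dissimilarity(shape, rotated)` is zero because both orderings sort to the same sequence. Segments of equal length would need a tie-breaking rule, which this sketch does not address.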
47. 【2601.03617】Systematic Evaluation of Depth Backbones and Semantic Cues for Monocular Pseudo-LiDAR 3D Detection
Link: https://arxiv.org/abs/2601.03617
Authors: Samson Oseiwe Ajadalu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Keywords: object detection offers, single image, estimating metric depth, offers a low-cost, low-cost alternative
Comments: 7 pages, 4 figures
Abstract: Monocular 3D object detection offers a low-cost alternative to LiDAR, yet remains less accurate due to the difficulty of estimating metric depth from a single image. We systematically evaluate how depth backbones and feature engineering affect a monocular Pseudo-LiDAR pipeline on the KITTI validation split. Specifically, we compare NeWCRFs (supervised metric depth) against Depth Anything V2 Metric-Outdoor (Base) under an identical pseudo-LiDAR generation and PointRCNN detection protocol. NeWCRFs yields stronger downstream 3D detection, achieving 10.50% AP_3D at IoU = 0.7 on the Moderate split using grayscale intensity (Exp 2). We further test point-cloud augmentations using appearance cues (grayscale intensity) and semantic cues (instance segmentation confidence). Contrary to the expectation that semantics would substantially close the gap, these features provide only marginal gains, and mask-based sampling can degrade performance by removing contextual geometry. Finally, we report a depth-accuracy-versus-distance diagnostic using ground-truth 2D boxes (including Ped/Cyc), highlighting that coarse depth correctness does not fully predict strict 3D IoU. Overall, under an off-the-shelf LiDAR detector, depth-backbone choice and geometric fidelity dominate performance, outweighing secondary feature injection.
48. 【2601.03609】Unveiling Text in Challenging Stone Inscriptions: A Character-Context-Aware Patching Strategy for Binarization
Link: https://arxiv.org/abs/2601.03609
Authors: Pratyush Jena, Amal Joseph, Arnav Sharma, Ravi Kiran Sarvadevabhatla
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: popular first step, Binarization, text extraction, Indic, binarization performance
Comments:
Abstract: Binarization is a popular first step towards text extraction in historical artifacts. Stone inscription images pose severe challenges for binarization due to poor contrast between etched characters and the stone background, non-uniform surface degradation, distracting artifacts, and highly variable text density and layouts. These conditions frequently cause existing binarization techniques to fail or to struggle to isolate coherent character regions. Many approaches sub-divide the image into patches to improve text fragment resolution and binarization performance. With this in mind, we present a robust and adaptive patching strategy to binarize challenging Indic inscriptions. The patches from our approach are used to train an Attention U-Net for binarization. The attention mechanism allows the model to focus on subtle structural cues, while our dynamic sampling and patch selection method ensures that the model learns to overcome surface noise and layout irregularities. We also introduce a carefully annotated, pixel-precise dataset of Indic stone inscriptions at the character-fragment level. We demonstrate that our novel patching mechanism significantly boosts binarization performance across classical and deep learning baselines. Despite training only on a single-script Indic dataset, our model exhibits strong zero-shot generalization to other Indic and non-Indic scripts, highlighting its robustness and script-agnostic generalization capabilities. By producing clean, structured representations of inscription content, our method lays the foundation for downstream tasks such as script identification, OCR, and historical text analysis. Project page: this https URL
49. 【2601.03596】Adaptive Attention Distillation for Robust Few-Shot Segmentation under Environmental Perturbations
Link: https://arxiv.org/abs/2601.03596
Authors: Qianyu Guo, Jingrong Wu, Jieji Ren, Weifeng Ge, Wenqiang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Few-shot segmentation, segment specific targets, aims to rapidly, industrial inspection, rapidly learn
Comments: 12 pages, 5 figures
Abstract: Few-shot segmentation (FSS) aims to rapidly learn novel class concepts from limited examples to segment specific targets in unseen images, and has been widely applied in areas such as medical diagnosis and industrial inspection. However, existing studies largely overlook the complex environmental factors encountered in real-world scenarios, such as illumination, background, and camera viewpoint, which can substantially increase the difficulty of test images. As a result, models trained under laboratory conditions often fall short of practical deployment requirements. To bridge this gap, in this paper, an environment-robust FSS setting is introduced that explicitly incorporates challenging test cases arising from complex environments, such as motion blur, small objects, and camouflaged targets, to enhance the model's robustness under realistic, dynamic conditions. An environment-robust FSS benchmark (ER-FSS) is established, covering eight datasets across multiple real-world scenarios. In addition, an Adaptive Attention Distillation (AAD) method is proposed, which repeatedly contrasts and distills key shared semantics between known (support) and unknown (query) images to derive class-specific attention for novel categories. This strengthens the model's ability to focus on the correct targets in complex environments, thereby improving environmental robustness. Comparative experiments show that AAD improves mIoU by 3.3% to 8.5% across all datasets and settings, demonstrating superior performance and strong generalization. The source code and dataset are available at: this https URL.
50. 【2601.03590】Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions
Link: https://arxiv.org/abs/2601.03590
Authors: Zhongbin Guo, Zhen Yang, Yushan Li, Xinyue Zhang, Wenyu Gao, Jiacheng Wang, Chengzhi Li, Xiangrui Liu, Ping Jian
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Recent advancements, Large Language Models, spatial understanding originate, Spatial Intelligence, predominantly relied
Comments:
Abstract: Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input, comprising over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results of state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset SiT-Bench serves as a foundational resource to foster the development of spatially-grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at this https URL.
51. 【2601.03586】Detecting AI-Generated Images via Distributional Deviations from Real Images
Link: https://arxiv.org/abs/2601.03586
Authors: Yakun Niu, Yingjian Chen, Lei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: raising concerns, public trust, rapid advancement, concerns about misinformation, erosion of public
Comments:
Abstract: The rapid advancement of generative models has significantly enhanced the quality of AI-generated images, raising concerns about misinformation and the erosion of public trust. Detecting AI-generated images has thus become a critical challenge, particularly in terms of generalizing to unseen generative models. Existing methods using frozen pre-trained CLIP models show promise in generalization but treat the image encoder as a basic feature extractor, failing to fully exploit its potential. In this paper, we perform an in-depth analysis of the frozen CLIP image encoder (CLIP-ViT), revealing that it effectively clusters real images in a high-level, abstract feature space. However, it does not truly possess the ability to distinguish between real and AI-generated images. Based on this analysis, we propose a Masking-based Pre-trained model Fine-Tuning (MPFT) strategy, which introduces a Texture-Aware Masking (TAM) mechanism to mask textured areas containing generative model-specific patterns during fine-tuning. This approach compels CLIP-ViT to attend to the "distributional deviations" from authentic images for AI-generated image detection, thereby achieving enhanced generalization performance. Extensive experiments on the GenImage and UniversalFakeDetect datasets demonstrate that our method, fine-tuned with only a minimal number of images, significantly outperforms existing approaches, achieving up to 98.2% and 94.6% average accuracy on the two datasets, respectively.
52. 【2601.03579】SpatiaLoc: Leveraging Multi-Level Spatial Enhanced Descriptors for Cross-Modal Localization
Link: https://arxiv.org/abs/2601.03579
Authors: Tianyi Shang, Pengjie Xu, Zhaojun Deng, Zhenyu Li, Zhicong Chen, Lijun Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: natural language descriptions, clouds enables robots, point clouds enables, Cross-modal localization, text and point
Comments:
Abstract:Cross-modal localization using text and point clouds enables robots to localize themselves via natural language descriptions, with applications in autonomous navigation and interaction between humans and robots. In this task, objects often recur across text and point clouds, making spatial relationships the most discriminative cues for localization. Given this characteristic, we present SpatiaLoc, a framework utilizing a coarse-to-fine strategy that emphasizes spatial relationships at both the instance and global levels. In the coarse stage, we introduce a Bezier Enhanced Object Spatial Encoder (BEOSE) that models spatial relationships at the instance level using quadratic Bezier curves. Additionally, a Frequency Aware Encoder (FAE) generates spatial representations in the frequency domain at the global level. In the fine stage, an Uncertainty Aware Gaussian Fine Localizer (UGFL) regresses 2D positions by modeling predictions as Gaussian distributions with a loss function aware of uncertainty. Extensive experiments on KITTI360Pose demonstrate that SpatiaLoc significantly outperforms existing state-of-the-art (SOTA) methods.
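For readers unfamiliar with the quadratic Bezier curves named in the abstract, the sketch below evaluates the standard curve definition B(t) = (1-t)^2 P0 + 2(1-t)t P1 + t^2 P2. It shows only the curve itself; how BEOSE chooses control points or encodes the curves is not described in the abstract, and the function name and sample points here are illustrative assumptions.

```python
def quadratic_bezier(p0, p1, p2, t):
    """Evaluate a quadratic Bezier curve with control points p0, p1, p2 at t in [0, 1]."""
    u = 1.0 - t
    return tuple(u * u * a + 2.0 * u * t * b + t * t * c
                 for a, b, c in zip(p0, p1, p2))


# Hypothetical 2D control points: two object centers with a middle control point.
start, control, end = (0.0, 0.0), (1.0, 2.0), (2.0, 0.0)
midpoint = quadratic_bezier(start, control, end, 0.5)
```

The curve starts at `start`, ends at `end`, and is pulled toward `control` without passing through it.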
53. 【2601.03549】EASLT: Emotion-Aware Sign Language Translation
Link: https://arxiv.org/abs/2601.03549
Authors: Guobin Tu, Di Weng
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: complex cross-modal task, cross-modal task requiring, Non-Manual Signals, Sign Language Translation, Manual Signals
Comments:
Abstract: Sign Language Translation (SLT) is a complex cross-modal task requiring the integration of Manual Signals (MS) and Non-Manual Signals (NMS). While recent gloss-free SLT methods have made strides in translating manual gestures, they frequently overlook the semantic criticality of facial expressions, resulting in ambiguity when distinct concepts share identical manual articulations. To address this, we present EASLT (Emotion-Aware Sign Language Translation), a framework that treats facial affect not as auxiliary information, but as a robust semantic anchor. Unlike methods that relegate facial expressions to a secondary role, EASLT incorporates a dedicated emotional encoder to capture continuous affective dynamics. These representations are integrated via a novel Emotion-Aware Fusion (EAF) module, which adaptively recalibrates spatio-temporal sign features based on affective context to resolve semantic ambiguities. Extensive evaluations on the PHOENIX14T and CSL-Daily benchmarks demonstrate that EASLT establishes advanced performance among gloss-free methods, achieving BLEU-4 scores of 26.15 and 22.80, and BLEURT scores of 61.0 and 57.8, respectively. Ablation studies confirm that explicitly modeling emotion effectively decouples affective semantics from manual dynamics, significantly enhancing translation fidelity. Code is available at this https URL.
54. 【2601.03534】Persona-aware and Explainable Bikeability Assessment: A Vision-Language Model Approach
Link: https://arxiv.org/abs/2601.03534
Authors: Yilong Dai, Ziyi Wang, Chenguang Wang, Kexin Zhou, Yiheng Qian, Susu Xu, Xiang Yan
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Keywords: creating cyclist-friendly cities, advancing sustainable urban, sustainable urban transportation, requires incorporating users', incorporating users' perceptions
Comments:
Abstract:Bikeability assessment is essential for advancing sustainable urban transportation and creating cyclist-friendly cities, and it requires incorporating users' perceptions of safety and comfort. Yet existing perception-based bikeability assessment approaches face key limitations in capturing the complexity of road environments and adequately accounting for heterogeneity in subjective user perceptions. This paper proposes a persona-aware Vision-Language Model framework for bikeability assessment with three novel contributions: (i) theory-grounded persona conditioning based on established cyclist typology that generates persona-specific explanations via chain-of-thought reasoning; (ii) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and (iii) AI-enabled data augmentation that creates controlled paired data to isolate infrastructure variable impacts. To test and validate this framework, we developed a panoramic image-based crowdsourcing system and collected 12,400 persona-conditioned assessments from 427 cyclists. Experiment results show that the proposed framework offers competitive bikeability rating prediction while uniquely enabling explainable factor attribution.
55. 【2601.03528】CloudMatch: Weak-to-Strong Consistency Learning for Semi-Supervised Cloud Detection
Link: https://arxiv.org/abs/2601.03528
Authors: Jiayi Zhao, Changlu Chen, Jingsheng Li, Tianxiang Xue, Kun Zhan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: accurate pixel-level labels, annotating accurate pixel-level, pixel-level labels, high cost, cost of annotating
Comments: Journal of Applied Remote Sensing
Abstract:Due to the high cost of annotating accurate pixel-level labels, semi-supervised learning has emerged as a promising approach for cloud detection. In this paper, we propose CloudMatch, a semi-supervised framework that effectively leverages unlabeled remote sensing imagery through view-consistency learning combined with scene-mixing augmentations. An observation behind CloudMatch is that cloud patterns exhibit structural diversity and contextual variability across different scenes and within the same scene category. Our key insight is that enforcing prediction consistency across diversely augmented views, incorporating both inter-scene and intra-scene mixing, enables the model to capture the structural diversity and contextual richness of cloud patterns. Specifically, CloudMatch generates one weakly augmented view along with two complementary strongly augmented views for each unlabeled image: one integrates inter-scene patches to simulate contextual variety, while the other employs intra-scene mixing to preserve semantic coherence. This approach guides pseudolabel generation and enhances generalization. Extensive experiments show that CloudMatch achieves good performance, demonstrating its capability to utilize unlabeled data efficiently and advance semi-supervised cloud detection.
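The weak-to-strong consistency idea in the abstract (pseudo-labels from one weakly augmented view supervising two strongly augmented views) can be sketched per pixel as follows. This is a generic FixMatch-style sketch, not the CloudMatch code: the confidence threshold, the binary cross-entropy loss, and all function names are assumptions, and the inputs stand for per-pixel cloud probabilities that a real model would produce.

```python
import math


def pseudo_labels(weak_probs, threshold=0.95):
    """Turn weak-view cloud probabilities into (label, keep) pairs.

    keep is True only when the prediction is confident (assumed thresholding rule).
    """
    out = []
    for p in weak_probs:
        confidence = max(p, 1.0 - p)
        out.append((1 if p >= 0.5 else 0, confidence >= threshold))
    return out


def consistency_loss(weak_probs, strong_a, strong_b, threshold=0.95):
    """Mean binary cross-entropy of both strong views against weak-view pseudo-labels."""
    total, count = 0.0, 0
    for (y, keep), pa, pb in zip(pseudo_labels(weak_probs, threshold), strong_a, strong_b):
        if not keep:
            continue  # skip pixels where the weak view is not confident
        for p in (pa, pb):
            p = min(max(p, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count if count else 0.0


# Toy per-pixel probabilities for three pixels.
weak = [0.99, 0.50, 0.01]
view_inter_scene = [0.9, 0.5, 0.1]   # strong view with inter-scene mixing (values made up)
view_intra_scene = [0.8, 0.5, 0.2]   # strong view with intra-scene mixing (values made up)
loss = consistency_loss(weak, view_inter_scene, view_intra_scene)
```

Only the first and third pixels contribute here, since the weak view is uncertain about the middle one.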
56. 【2601.03526】Physics-Constrained Cross-Resolution Enhancement Network for Optics-Guided Thermal UAV Image Super-Resolution
Link: https://arxiv.org/abs/2601.03526
Authors: Zhicheng Zhao, Fengjiao Peng, Jinquan Yan, Wei Lu, Chenglong Li, Jin Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Optics-guided thermal UAV, all-weather monitoring applications, attracted significant research, significant research interest, research interest due
Comments:
Abstract:Optics-guided thermal UAV image super-resolution has attracted significant research interest due to its potential in all-weather monitoring applications. However, existing methods typically compress optical features to match thermal feature dimensions for cross-modal alignment and fusion, which not only causes the loss of high-frequency information that is beneficial for thermal super-resolution, but also introduces physically inconsistent artifacts such as texture distortions and edge blurring by overlooking differences in the imaging physics between modalities. To address these challenges, we propose PCNet to achieve cross-resolution mutual enhancement between optical and thermal modalities, while physically constraining the optical guidance process via thermal conduction to enable robust thermal UAV image super-resolution. In particular, we design a Cross-Resolution Mutual Enhancement Module (CRME) to jointly optimize thermal image super-resolution and optical-to-thermal modality conversion, facilitating effective bidirectional feature interaction across resolutions while preserving high-frequency optical priors. Moreover, we propose a Physics-Driven Thermal Conduction Module (PDTM) that incorporates two-dimensional heat conduction into optical guidance, modeling spatially-varying heat conduction properties to prevent inconsistent artifacts. In addition, we introduce a temperature consistency loss that enforces regional distribution consistency and boundary gradient smoothness to ensure generated thermal images align with real-world thermal radiation principles. Extensive experiments on VGTSR2.0 and DroneVehicle datasets demonstrate that PCNet significantly outperforms state-of-the-art methods on both reconstruction quality and downstream tasks including semantic segmentation and object detection.
57. 【2601.03517】Semantic Belief-State World Model for 3D Human Motion Prediction
Link: https://arxiv.org/abs/2601.03517
Authors: Sarim Chaudhry
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: sequence regression problem, extrapolate future joint, future joint coordinates, observed pose histories, models extrapolate future
Comments:
Abstract: Human motion prediction has traditionally been framed as a sequence regression problem where models extrapolate future joint coordinates from observed pose histories. While effective over short horizons, this approach does not separate observation reconstruction from dynamics modeling and offers no explicit representation of the latent causes governing motion. As a result, existing methods exhibit compounding drift, mean-pose collapse, and poorly calibrated uncertainty when rolled forward beyond the training regime. Here we propose a Semantic Belief-State World Model (SBWM) that reframes human motion prediction as latent dynamical simulation on the human body manifold. Rather than predicting poses directly, SBWM maintains a recurrent probabilistic belief state whose evolution is learned independently of pose reconstruction and explicitly aligned with the SMPL-X anatomical parameterization. This alignment imposes a structural information bottleneck that prevents the latent state from encoding static geometry or sensor noise, forcing it to capture motion dynamics, intent, and control-relevant structure. Inspired by belief-state world models developed for model-based reinforcement learning, SBWM adapts stochastic latent transitions and rollout-centric training to the domain of human motion. In contrast to RSSM-based, transformer, and diffusion approaches optimized for reconstruction fidelity, SBWM prioritizes stable forward simulation. We demonstrate coherent long-horizon rollouts and competitive accuracy at substantially lower computational cost. These results suggest that treating the human body as part of the world model's state space rather than its output fundamentally changes how motion is simulated and predicted.
58. 【2601.03510】G2P: Gaussian-to-Point Attribute Alignment for Boundary-Aware 3D Semantic Segmentation
Link: https://arxiv.org/abs/2601.03510
Authors: Hojun Song, Chae-yeong Song, Jeong-hun Hong, Chaewon Moon, Dong-hwi Kim, Gahyeon Kim, Soo Ye Kim, Yiyi Liao, Jaehyup Lee, Sang-hyo Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Semantic segmentation, Semantic, scene understanding, point
Comments: Preprint. Under review
Abstract: Semantic segmentation on point clouds is critical for 3D scene understanding. However, sparse and irregular point distributions provide limited appearance evidence, making geometry-only features insufficient to distinguish objects with similar shapes but distinct appearances (e.g., color, texture, material). We propose Gaussian-to-Point (G2P), which transfers appearance-aware attributes from 3D Gaussian Splatting to point clouds for more discriminative and appearance-consistent segmentation. G2P addresses the misalignment between optimized Gaussians and original point geometry by establishing point-wise correspondences. By leveraging Gaussian opacity attributes, we resolve the geometric ambiguity that limits existing models. Additionally, Gaussian scale attributes enable precise boundary localization in complex 3D scenes. Extensive experiments demonstrate that our approach achieves superior performance on standard benchmarks and shows significant improvements on geometrically challenging classes, all without any 2D or language supervision.
59. 【2601.03507】REFA: Real-time Egocentric Facial Animations for Virtual Reality
Link: https://arxiv.org/abs/2601.03507
Authors: Qiang Zhang, Tong Xiao, Haroun Habeeb, Larissa Laich, Sofien Bouaziz, Patrick Snape, Wenjing Zhang, Matthew Cioffi, Peizhao Zhang, Pavel Pidlypenskyi, Winnie Lin, Luming Ma, Mengjiao Wang, Kunpeng Li, Chengjiang Long, Steven Song, Martin Prazak, Alexander Sjoholm, Ajinkya Deogade, Jaebong Lee, Julio Delgado Mangas, Amaury Aubel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: egocentric views captured, infrared cameras embedded, real-time tracking, egocentric views, views captured
Comments: CVPR 2024 Workshop
Abstract: We present a novel system for real-time tracking of facial expressions using egocentric views captured from a set of infrared cameras embedded in a virtual reality (VR) headset. Our technology enables any user to accurately drive the facial expressions of virtual characters in a non-intrusive manner and without the need for a lengthy calibration step. At the core of our system is a distillation-based approach to train a machine learning model on heterogeneous data and labels coming from multiple sources, e.g., synthetic and real images. As part of our dataset, we collected data from 18k diverse subjects using a lightweight capture setup consisting of a mobile phone and a custom VR headset with extra cameras. To process this data, we developed a robust differentiable rendering pipeline enabling us to automatically extract facial expression labels. Our system opens up new avenues for communication and expression in virtual environments, with applications in video conferencing, gaming, entertainment, and remote collaboration.
60. 【2601.03500】SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models
Link: https://arxiv.org/abs/2601.03500
Authors: Yuxuan Xia, Siheng Wang, Peng Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Vision-Language Models, Large Vision-Language, demonstrate significant progress, understanding and reasoning, critical challenge
Comments:
Abstract: Large Vision-Language Models (LVLMs) demonstrate significant progress in multimodal understanding and reasoning, yet object hallucination remains a critical challenge. While existing research focuses on mitigating language priors or high-level statistical biases, it often overlooks the internal complexities of the visual encoding process. We identify that visual statistical bias, arising from the inherent Bag-of-Patches behavior of Vision Encoders under weak structural supervision, acts as a contributing factor to object hallucinations. Under this bias, models prioritize local texture features within individual patches over holistic geometric structures. This tendency may induce spurious visual confidence and result in hallucinations. To address this, we introduce a training-free algorithm called Structure-Disrupted Contrastive Decoding (SDCD), which performs contrastive calibration of the output distribution by introducing a shuffled structure-disrupted view. By penalizing tokens that maintain high confidence under this structure-less view, SDCD effectively suppresses the texture-driven bias. Experimental results demonstrate that SDCD significantly mitigates hallucinations across multiple benchmarks and enhances the overall multimodal capabilities of LVLMs.
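The contrastive calibration described in the abstract can be sketched on a toy vocabulary. The combination rule below, (1 + alpha) * original - alpha * disrupted, is the form commonly used in contrastive decoding and is an assumption here; the abstract does not state SDCD's exact rule, and the logits are made-up numbers rather than model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(logits_original, logits_disrupted, alpha=1.0):
    """Penalize tokens that stay confident under the structure-disrupted view.

    Assumed rule: (1 + alpha) * original - alpha * disrupted.
    """
    return [(1.0 + alpha) * a - alpha * b
            for a, b in zip(logits_original, logits_disrupted)]


# Token 0 depends on image structure; token 1 stays confident even when patches are shuffled.
logits_original = [2.0, 2.0, 0.0]
logits_disrupted = [0.0, 2.0, 0.0]
calibrated = contrastive_logits(logits_original, logits_disrupted, alpha=1.0)
probs = softmax(calibrated)
```

Before calibration the two leading tokens are tied; afterwards the structure-dependent token is preferred, which mirrors the intent of suppressing texture-driven confidence.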
61. 【2601.03490】CroBIM-U: Uncertainty-Driven Referring Remote Sensing Image Segmentation
Link: https://arxiv.org/abs/2601.03490
Authors: Yuzhe Sun, Zhe Dong, Haochen Jiang, Tianzhu Liu, Yanfeng Gu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: localize specific targets, image segmentation aims, complex overhead imagery, overhead imagery
Comments:
Abstract: Referring remote sensing image segmentation aims to localize specific targets described by natural language within complex overhead imagery. However, due to extreme scale variations, dense similar distractors, and intricate boundary structures, the reliability of cross-modal alignment exhibits significant spatial non-uniformity. Existing methods typically employ uniform fusion and refinement strategies across the entire image, which often introduces unnecessary linguistic perturbations in visually clear regions while failing to provide sufficient disambiguation in confused areas. To address this, we propose an uncertainty-guided framework that explicitly leverages a pixel-wise referring uncertainty map as a spatial prior to orchestrate adaptive inference. Specifically, we introduce a plug-and-play Referring Uncertainty Scorer (RUS), which is trained via an online error-consistency supervision strategy to interpretably predict the spatial distribution of referential ambiguity. Building on this prior, we design two plug-and-play modules: 1) Uncertainty-Gated Fusion (UGF), which dynamically modulates language injection strength to enhance constraints in high-uncertainty regions while suppressing noise in low-uncertainty ones; and 2) Uncertainty-Driven Local Refinement (UDLR), which utilizes uncertainty-derived soft masks to focus refinement on error-prone boundaries and fine details. Extensive experiments demonstrate that our method functions as a unified, plug-and-play solution that significantly improves robustness and geometric fidelity in complex remote sensing scenes without altering the backbone architecture.
62. 【2601.03468】Understanding Reward Hacking in Text-to-Image Reinforcement Learning
Link: https://arxiv.org/abs/2601.03468
Authors: Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, Cho-Jui Hsieh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Reinforcement learning, post-training large language, large language models, enhance generation quality, human preference alignment
Comments:
Abstract: Reinforcement learning (RL) has become a standard approach for post-training large language models and, more recently, for improving image generation models, using reward functions to enhance generation quality and human preference alignment. However, existing reward designs are often imperfect proxies for true human judgment, making models prone to reward hacking: producing unrealistic or low-quality images that nevertheless achieve high reward scores. In this work, we systematically analyze reward hacking behaviors in text-to-image (T2I) RL post-training. We investigate how both aesthetic/human preference rewards and prompt-image consistency rewards individually contribute to reward hacking and further show that ensembling multiple rewards can only partially mitigate this issue. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone images. To address this, we propose a lightweight and adaptive artifact reward model, trained on a small curated dataset of artifact-free and artifact-containing samples. This model can be integrated into existing RL pipelines as an effective regularizer for commonly used reward models. Experiments demonstrate that incorporating our artifact reward significantly improves visual realism and reduces reward hacking across multiple T2I RL setups, demonstrating the effectiveness of lightweight reward augmentation as a safeguard against reward hacking.
63. 【2601.03467】ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing
Link: https://arxiv.org/abs/2601.03467
Authors: Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, Deng Cai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Instruction-driven image editing, unified multimodal generative, multimodal generative models, Instruction-driven image, advanced rapidly
Comments:
Abstract: Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.
64. 【2601.03466】Latent Geometry of Taste: Scalable Low-Rank Matrix Factorization
Link: https://arxiv.org/abs/2601.03466
Authors: Joshua Salako
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: sparsity remain critical, remain critical bottlenecks, data sparsity remain, massive interaction datasets, sparsity remain
Comments:
Abstract:Scalability and data sparsity remain critical bottlenecks for collaborative filtering on massive interaction datasets. This work investigates the latent geometry of user preferences using the MovieLens 32M dataset, implementing a high-performance, parallelized Alternating Least Squares (ALS) framework. Through extensive hyperparameter optimization, we demonstrate that constrained low-rank models significantly outperform higher dimensional counterparts in generalization, achieving an optimal balance between Root Mean Square Error (RMSE) and ranking precision. We visualize the learned embedding space to reveal the unsupervised emergence of semantic genre clusters, confirming that the model captures deep structural relationships solely from interaction data. Finally, we validate the system's practical utility in a cold-start scenario, introducing a tunable scoring parameter to manage the trade-off between popularity bias and personalized affinity effectively. The codebase for this research can be found here: this https URL
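The alternating update behind ALS can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes rank-1 factors, a tiny hand-made rating dictionary, and a regularization weight `lam`, all of which are hypothetical and chosen only to keep the example short. Each user update depends only on that user's ratings, which is what makes ALS natural to parallelize.

```python
def objective(ratings, u, v, lam):
    """Regularized squared error over the observed ratings only."""
    err = sum((r - u[a] * v[b]) ** 2 for (a, b), r in ratings.items())
    return err + lam * (sum(x * x for x in u) + sum(x * x for x in v))


def als_rank1(ratings, n_users, n_items, lam=0.1, sweeps=10):
    """Alternate closed-form ridge updates for user and item factors (rank 1)."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    by_user, by_item = {}, {}
    for (a, b), r in ratings.items():
        by_user.setdefault(a, []).append((b, r))
        by_item.setdefault(b, []).append((a, r))
    history = [objective(ratings, u, v, lam)]
    for _ in range(sweeps):
        # Fix item factors, solve each user's one-dimensional ridge problem.
        for a, obs in by_user.items():
            u[a] = sum(r * v[b] for b, r in obs) / (lam + sum(v[b] ** 2 for b, _ in obs))
        # Fix user factors, solve each item's one-dimensional ridge problem.
        for b, obs in by_item.items():
            v[b] = sum(r * u[a] for a, r in obs) / (lam + sum(u[a] ** 2 for a, _ in obs))
        history.append(objective(ratings, u, v, lam))
    return u, v, history


# Hypothetical toy data: (user, item) -> rating.
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0, (2, 2): 2.0}
u, v, history = als_rank1(ratings, n_users=3, n_items=3)
```

Because every block update exactly minimizes the regularized objective with the other block held fixed, the values in `history` never increase from one sweep to the next.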
65. 【2601.03463】Experimental Comparison of Light-Weight and Deep CNN Models Across Diverse Datasets
Link: https://arxiv.org/abs/2601.03463
Authors: Md. Hefzul Hossain Papon, Shadman Rabby
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: agricultural variety classification, specialized pre-trained models, well-regularized shallow architecture, highly competitive baseline, requiring large GPUs
Comments: 25 pages, 11 figures
Abstract:Our results reveal that a well-regularized shallow architecture can serve as a highly competitive baseline across heterogeneous domains - from smart-city surveillance to agricultural variety classification - without requiring large GPUs or specialized pre-trained models. This work establishes a unified, reproducible benchmark for multiple Bangladeshi vision datasets and highlights the practical value of lightweight CNNs for real-world deployment in low-resource settings.
66. 【2601.03460】FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder
Link: https://arxiv.org/abs/2601.03460
Authors: Zeyu Dong, Yimin Zhu, Yu Wu, Yu Sun
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: map sensor inputs, autonomous driving aim, directly map sensor, complex scenarios remains, control commands
Comments:
Abstract: End-to-end (E2E) models in autonomous driving aim to directly map sensor inputs to control commands, but their ability to generalize to novel and complex scenarios remains a key challenge. The common practice of fully fine-tuning the vision encoder on driving datasets potentially limits its generalization by causing the model to specialize too heavily in the training data. This work challenges the necessity of this training paradigm. We propose FROST-Drive, a novel E2E architecture designed to preserve and leverage the powerful generalization capabilities of a pretrained vision encoder from a Vision-Language Model (VLM). By keeping the encoder's weights frozen, our approach directly transfers the rich, generalized world knowledge from the VLM to the driving task. Our model architecture combines this frozen encoder with a transformer-based adapter for multimodal fusion and a GRU-based decoder for smooth waypoint generation. Furthermore, we introduce a custom loss function designed to directly optimize for Rater Feedback Score (RFS), a metric that prioritizes robust trajectory planning. We conduct extensive experiments on the Waymo Open E2E Dataset, a large-scale dataset deliberately curated to capture long-tail scenarios, demonstrating that our frozen-encoder approach significantly outperforms models that employ full fine-tuning. Our results provide substantial evidence that preserving the broad knowledge of a capable VLM is a more effective strategy for achieving robust, generalizable driving performance than intensive domain-specific adaptation. This offers a new pathway for developing vision-based models that can better handle the complexities of real-world application domains.
67. 【2601.03431】WeedRepFormer: Reparameterizable Vision Transformers for Real-Time Waterhemp Segmentation and Gender Classification
Link: https://arxiv.org/abs/2601.03431
Authors: Toqi Tahamid Sarker, Taminul Islam, Khaled R. Ahmed, Cristiana Bernardi Rankrape, Kaitlin E. Creager, Karla Gage
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multi-task Vision Transformer, Vision Transformer designed, lightweight multi-task Vision, Vision Transformer, Vision Transformer backbone
Comments: 11 pages, 5 figures
Abstract:We present WeedRepFormer, a lightweight multi-task Vision Transformer designed for simultaneous waterhemp segmentation and gender classification. Existing agricultural models often struggle to balance the fine-grained feature extraction required for biological attribute classification with the efficiency needed for real-time deployment. To address this, WeedRepFormer systematically integrates structural reparameterization across the entire architecture - comprising a Vision Transformer backbone, a Lite R-ASPP decoder, and a novel reparameterizable classification head - to decouple training-time capacity from inference-time latency. We also introduce a comprehensive waterhemp dataset containing 10,264 annotated frames from 23 plants. On this benchmark, WeedRepFormer achieves 92.18% mIoU for segmentation and 81.91% accuracy for gender classification using only 3.59M parameters and 3.80 GFLOPs. At 108.95 FPS, our model outperforms the state-of-the-art iFormer-T by 4.40% in classification accuracy while maintaining competitive segmentation performance and significantly reducing parameter count by 1.9x.
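The general idea of structural reparameterization, decoupling training-time capacity from inference-time cost, can be sketched on plain linear layers. The abstract does not describe WeedRepFormer's actual branch layout, so the two parallel linear branches plus an identity shortcut below, and all weights, are assumptions made purely for illustration.

```python
def matvec(w, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def train_time_forward(x, w1, b1, w2, b2):
    """Training-time block: two parallel linear branches plus an identity shortcut."""
    y1, y2 = matvec(w1, x), matvec(w2, x)
    return [p + c1 + q + c2 + s for p, c1, q, c2, s in zip(y1, b1, y2, b2, x)]


def merge_branches(w1, b1, w2, b2):
    """Fold both branches and the identity into one weight matrix and one bias."""
    n = len(w1)
    w = [[w1[i][j] + w2[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    b = [p + q for p, q in zip(b1, b2)]
    return w, b


def inference_forward(x, w, b):
    """Inference-time block: a single linear layer."""
    return [y + c for y, c in zip(matvec(w, x), b)]


# Hypothetical weights for a 2-dimensional feature.
w1, b1 = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2]
w2, b2 = [[1.5, 0.0], [-0.5, 1.0]], [0.0, 0.3]
x = [1.0, 2.0]

y_train = train_time_forward(x, w1, b1, w2, b2)
w_merged, b_merged = merge_branches(w1, b1, w2, b2)
y_infer = inference_forward(x, w_merged, b_merged)
```

The merge is exact because every branch is linear in `x`; the merged layer produces the same output with a single matrix multiply.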
68. 【2601.03416】GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models
Link: https://arxiv.org/abs/2601.03416
Authors: Xiangdong Hu, Yangyang Jiang, Qin Hu, Xiaojun Jia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Large Language, alignment remains fragile, safety alignment remains
Comments:
Abstract: Multimodal Large Language Models (MLLMs) have become widely deployed, yet their safety alignment remains fragile under adversarial inputs. Previous work has shown that increasing inference steps can disrupt safety mechanisms and lead MLLMs to generate attacker-desired harmful content. However, most existing attacks focus on increasing the complexity of the modified visual task itself and do not explicitly leverage the model's own reasoning incentives. As a result, they underperform on reasoning models (models with Chain-of-Thought) compared to non-reasoning ones (models without Chain-of-Thought). If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBIT (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. The resulting structured reasoning chain increases task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query. Extensive experiments on popular reasoning and non-reasoning MLLMs demonstrate that GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.
69. 【2601.03410】Inferring Clinically Relevant Molecular Subtypes of Pancreatic Cancer from Routine Histopathology Using Deep Learning
Link: https://arxiv.org/abs/2601.03410
Authors: Abdul Rehman Akbar, Alejandro Levya, Ashwini Esnakula, Elshad Hasanov, Anne Noonan, Upender Manne, Vaibhav Sahai, Lingbin Meng, Susan Tsai, Anil Parwani, Wei Chen, Ashish Manne, Muhammad Khalid Khan Niazi
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: basal-like and classical, PDAC, PDAC into basal-like, PanSubNet, Molecular
Comments:
Abstract: Molecular subtyping of PDAC into basal-like and classical has established prognostic and predictive value. However, its use in clinical practice is limited by cost, turnaround time, and tissue requirements, thereby restricting its application in the management of PDAC. We introduce PanSubNet, an interpretable deep learning framework that predicts therapy-relevant molecular subtypes directly from standard H&E-stained WSIs. PanSubNet was developed using data from 1,055 patients across two multi-institutional cohorts (PANCAN, n=846; TCGA, n=209) with paired histology and RNA-seq data. Ground-truth labels were derived using the validated Moffitt 50-gene signature refined by GATA6 expression. The model employs a dual-scale architecture that fuses cellular-level morphology with tissue-level architecture, leveraging attention mechanisms for multi-scale representation learning and transparent feature attribution. On internal validation within PANCAN using five-fold cross-validation, PanSubNet achieved a mean AUC of 88.5% with balanced sensitivity and specificity. External validation on the independent TCGA cohort without fine-tuning demonstrated robust generalizability (AUC 84.0%). PanSubNet preserved and, in metastatic disease, strengthened prognostic stratification compared to RNA-seq based labels. Prediction uncertainty was linked to intermediate transcriptional states, not classification noise. Model predictions are aligned with established transcriptomic programs, differentiation markers, and DNA damage repair signatures. By enabling rapid, cost-effective molecular stratification from routine H&E-stained slides, PanSubNet offers a clinically deployable and interpretable tool for genetic subtyping. We are gathering data from two institutions to validate and assess real-world performance, supporting integration into digital pathology workflows and advancing precision oncology for PDAC.
70. 【2601.03400】Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning
Link: https://arxiv.org/abs/2601.03400
Authors: Ali Najar, Alireza Mirrokni, Arshia Izadyari, Sadegh Mohammadian, Amir Homayoon Sharifizade, Asal Meskin, Mobin Bagherian, Ehsaneddin Asgari
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: standard vision-language benchmarks, achieved strong performance, standard vision-language, deeper reasoning, achieved strong
Comments: 8 pages
Abstract:Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks, yet often rely on surface-level recognition rather than deeper reasoning. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts in ways that are difficult to solve via literal grounding, OCR-heavy shortcuts, or simple retrieval-style matching. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding. Eye-Q contains 1,343 puzzles in which a model observes a conceptually dense scene with a brief description and must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention, abstraction, and associative inference. The benchmark spans English, Persian, Arabic, and cross-lingual puzzles. We evaluate state-of-the-art VLMs using an open-ended, human-aligned protocol that probes hypothesis formation and revision under lightweight assistance. Results reveal substantial performance gaps, especially on abstract and cross-lingual puzzles, highlighting limitations in current models' ability to construct and search over appropriate conceptual representations for flexible image-to-phrase inference; maximum accuracy reaches only 60.27%.
71. 【2601.03392】Better, But Not Sufficient: Testing Video ANNs Against Macaque IT Dynamics
Link: https://arxiv.org/abs/2601.03392
Authors: Matteo Dunnhofer, Christian Micheloni, Kohitij Kar
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Keywords: ventral visual stream, artificial neural networks, static images remain, primate ventral visual, Feedforward artificial neural
Comments: Extended Abstract at the 2nd Human-inspired Computer Vision workshop at ICCV 2025
Abstract: Feedforward artificial neural networks (ANNs) trained on static images remain the dominant models of the primate ventral visual stream, yet they are intrinsically limited to static computations. The primate world is dynamic, and the macaque ventral visual pathway, specifically the inferior temporal (IT) cortex, not only supports object recognition but also encodes object motion velocity during naturalistic video viewing. Do IT's temporal responses reflect nothing more than time-unfolded feedforward transformations, framewise features with shallow temporal pooling, or do they embody richer dynamic computations? We tested this by comparing macaque IT responses during naturalistic videos against static, recurrent, and video-based ANN models. Video models provided modest improvements in neural predictivity, particularly at later response stages, raising the question of what kind of dynamics they capture. To probe this, we applied a stress test: decoders trained on naturalistic videos were evaluated on "appearance-free" variants that preserve motion but remove shape and texture. IT population activity generalized across this manipulation, but all ANN classes failed. Thus, current video models better capture appearance-bound dynamics rather than the appearance-invariant temporal computations expressed in IT, underscoring the need for new objectives that encode biological temporal statistics and invariances.
72. 【2601.03382】A Novel Unified Approach to Deepfake Detection
Link: https://arxiv.org/abs/2601.03382
Authors: Lord Sen, Shyamapada Mukherjee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: increasingly giving rise, increasingly giving, giving rise, Deepfake detection, Abstract
Comments:
Abstract: Advancements in the field of AI are increasingly giving rise to various threats. One of the most prominent of them is the synthesis and misuse of Deepfakes. To sustain trust in this digital age, detection and tagging of deepfakes is essential. In this paper, a novel architecture for Deepfake detection in images and videos is presented. The architecture uses cross attention between spatial and frequency domain features along with a blood detection module to classify an image as real or fake. This paper aims to develop a unified architecture and provide insights into each step. Through this approach we achieve results better than SOTA, specifically 99.80% and 99.88% AUC on FF++ and Celeb-DF using Swin Transformer and BERT, and 99.55 and 99.38 using EfficientNet-B4 and BERT. The approach also generalizes very well, achieving strong cross-dataset results.
73. 【2601.03369】RiskCueBench: Benchmarking Anticipatory Reasoning from Early Risk Cues in Video-Language Models
Link: https://arxiv.org/abs/2601.03369
Authors: Sha Luo, Yogesh Prabhu, Tim Ossowski, Kaiping Chen, Junjie Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: centered social media, ensuring public safety, video centered social, social media, preventing real world
Comments:
Abstract: With the rapid growth of video-centered social media, the ability to anticipate risky events from visual data is a promising direction for ensuring public safety and preventing real-world accidents. Prior work has extensively studied supervised video risk assessment across domains such as driving, protests, and natural disasters. However, many existing datasets provide models with access to the full video sequence, including the accident itself, which substantially reduces the difficulty of the task. To better reflect real-world conditions, we introduce a new video understanding benchmark, RiskCueBench, in which videos are carefully annotated to identify a risk signal clip, defined as the earliest moment that indicates a potential safety concern. Experimental results reveal a significant gap in current systems' ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting important challenges for deploying video risk prediction models in practice.
74. 【2601.03362】Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
Link: https://arxiv.org/abs/2601.03362
Authors: Xiang Zhang, Yang Zhang, Lukas Mehl, Markus Gross, Christopher Schroers
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: thin hairs, computer-generated imagery, commonly observed, observed in natural, natural and computer-generated
Comments:
Abstract:Soft boundaries, like thin hairs, are commonly observed in natural and computer-generated imagery, but they remain challenging for 3D vision due to the ambiguous mixing of foreground and background cues. This paper introduces Guardians of the Hair (HairGuard), a framework designed to recover fine-grained soft boundary details in 3D vision tasks. Specifically, we first propose a novel data curation pipeline that leverages image matting datasets for training and design a depth fixer network to automatically identify soft boundary regions. With a gated residual module, the depth fixer refines depth precisely around soft boundaries while maintaining global depth quality, allowing plug-and-play integration with state-of-the-art depth models. For view synthesis, we perform depth-based forward warping to retain high-fidelity textures, followed by a generative scene painter that fills disoccluded regions and eliminates redundant background artifacts within soft boundaries. Finally, a color fuser adaptively combines warped and inpainted results to produce novel views with consistent geometry and fine-grained details. Extensive experiments demonstrate that HairGuard achieves state-of-the-art performance across monocular depth estimation, stereo image/video conversion, and novel view synthesis, with significant improvements in soft boundary regions.
75. 【2601.03357】RelightAnyone: A Generalized Relightable 3D Gaussian Head Model
Link: https://arxiv.org/abs/2601.03357
Authors: Yingyan Xu, Pramod Rao, Sebastian Weiss, Gaspard Zoss, Markus Gross, Christian Theobalt, Marc Habermann, Derek Bradley
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: Gaussian Splatting, OLAT, render photorealistic, OLAT lighting, standard approach
Comments:
Abstract: 3D Gaussian Splatting (3DGS) has become a standard approach to reconstruct and render photorealistic 3D head avatars. A major challenge is to relight the avatars to match any scene illumination. For high-quality relighting, existing methods require subjects to be captured under complex time-multiplexed illumination, such as one-light-at-a-time (OLAT). We propose a new generalized relightable 3D Gaussian head model that can relight any subject observed in single- or multi-view images without requiring OLAT data for that subject. Our core idea is to learn a mapping from flat-lit 3DGS avatars to corresponding relightable Gaussian parameters for that avatar. Our model consists of two stages: a first stage that models flat-lit 3DGS avatars without OLAT lighting, and a second stage that learns the mapping to physically-based reflectance parameters for high-quality relighting. This two-stage design allows us to train the first stage across diverse existing multi-view datasets without OLAT lighting, ensuring cross-subject generalization, where we learn a dataset-specific lighting code for self-supervised lighting alignment. Subsequently, the second stage can be trained on a significantly smaller dataset of subjects captured under OLAT illumination. Together, this allows our method to generalize well and relight any subject from the first stage as if we had captured them under OLAT lighting. Furthermore, we can fit our model to unseen subjects from as little as a single image, allowing several applications in novel view synthesis and relighting for digital avatars.
76. 【2601.03331】MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models
Link: https://arxiv.org/abs/2601.03331
Authors: Yang Shi, Yifeng Xie, Minzhe Guo, Liangsi Lu, Mingxuan Huang, Jingchao Wang, Zhihong Zhu, Boyan Xu, Zhiqi Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Recent advances, raising the question, advances in Vision-Language, improved performance, understand the content
Comments:
Abstract: Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 2,013 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model (Gemini-3.0-Pro) classifies the error in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal reasoning models. Project Page: this https URL
77. 【2601.03326】Higher order PCA-like rotation-invariant features for detailed shape descriptors modulo rotation
Link: https://arxiv.org/abs/2601.03326
Authors: Jarek Duda
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: covariance matrix approximating, matrix approximating shape, rotation invariant features, analogous rotation invariants, covariance matrix
Comments: 4 pages, 4 figures
Abstract: PCA can be used for rotation-invariant features, describing a shape with its covariance matrix $p_{ab}=E[(x_a-E[x_a])(x_b-E[x_b])]$, which approximates the shape by an ellipsoid and allows for rotation invariants such as the traces of its powers. However, real shapes are usually much more complicated, hence an extension is proposed to, e.g., order-3 tensors $p_{abc}=E[(x_a-E[x_a])(x_b-E[x_b])(x_c-E[x_c])]$ or higher-order tensors describing central moments, or polynomial times Gaussian, allowing decodable shape descriptors of arbitrarily high accuracy, and their analogous rotation invariants. Practical applications could include rotation-invariant features describing shape modulo rotation, e.g., for molecular shape descriptors, for object recognition up to rotation in 2D images/3D scans, or a shape similarity metric allowing inexpensive comparison (modulo rotation) without costly optimization over rotations.
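A small numerical check may help make the moment tensors concrete. The sketch below computes the order-2 tensor $p_{ab}$ and order-3 tensor $p_{abc}$ for a 2D point cloud and evaluates four scalars: the traces of the first two powers of the covariance, plus two full contractions of the order-3 tensor. The choice of those two order-3 contractions is an assumption made for illustration; the abstract does not say which invariants the paper actually uses. The point cloud and rotation angle are likewise hypothetical.

```python
import math
import random


def central_moments(points):
    """Return the order-2 and order-3 central moment tensors of a point cloud."""
    n, d = len(points), len(points[0])
    mean = [sum(p[a] for p in points) / n for a in range(d)]
    c = [[p[a] - mean[a] for a in range(d)] for p in points]
    p2 = [[sum(x[a] * x[b] for x in c) / n for b in range(d)] for a in range(d)]
    p3 = [[[sum(x[a] * x[b] * x[k] for x in c) / n for k in range(d)]
           for b in range(d)] for a in range(d)]
    return p2, p3


def invariants(points):
    """Scalars that stay fixed when the point cloud is rotated."""
    p2, p3 = central_moments(points)
    d = len(p2)
    idx = range(d)
    trace_p = sum(p2[a][a] for a in idx)                          # Tr(p)
    trace_p2 = sum(p2[a][b] * p2[b][a] for a in idx for b in idx)  # Tr(p^2)
    full3 = sum(p3[a][b][k] ** 2 for a in idx for b in idx for k in idx)
    vec = [sum(p3[a][b][b] for b in idx) for a in idx]             # contracted vector
    vec_norm = sum(v * v for v in vec)
    return trace_p, trace_p2, full3, vec_norm


def rotate2d(points, theta):
    """Rotate 2D points counterclockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


rng = random.Random(0)
# Skewed cloud so that the order-3 moments are not close to zero.
cloud = [(rng.uniform(-1.0, 1.0), rng.uniform(0.0, 1.0) ** 2) for _ in range(50)]
before = invariants(cloud)
after = invariants(rotate2d(cloud, 0.7))
```

Comparing `before` and `after` shows agreement up to floating-point error, without any search over rotations.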
78. 【2601.03323】Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset
Link: https://arxiv.org/abs/2601.03323
Authors: Oran Duan, Yinghua Shen, Yingzhu Lv, Luyang Jie, Yaxin Liu, Qiong Wu
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Keywords: greatly promoted research, coarse semantic control, dance motion generation, Advances in generative, autoregressive dance motion
Comments: 12 pages, 13 figures
Abstract:Advances in generative models and sequence learning have greatly promoted research in dance motion generation, yet current methods still suffer from coarse semantic control and poor coherence in long sequences. In this work, we present Listen to Rhythm, Choose Movements (LRCM), a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We explore a feature decoupling paradigm for dance datasets and generalize it to the Motorica Dance dataset, separating motion capture data, audio rhythm, and professionally annotated global and local text descriptions. Our diffusion architecture integrates an audio-latent Conformer and a text-latent Cross-Conformer, and incorporates a Motion Temporal Mamba Module (MTMM) to enable smooth, long-duration autoregressive synthesis. Experimental results indicate that LRCM delivers strong performance in both functional capability and quantitative metrics, demonstrating notable potential in multimodal input scenarios and extended sequence generation. We will release the full codebase, dataset, and pretrained models publicly upon acceptance.
79. 【2601.03317】Deep Learning-Based Image Recognition for Soft-Shell Shrimp Classification
Link: https://arxiv.org/abs/2601.03317
Authors: Yun-Hao Zhang, I-Hsien Ting, Dario Liberona, Yun-Hsiu Liu, Kazunori Minetaki
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: grow annually, integration of information, stable and continues, continues to grow, information technology
Comments:
Abstract:With the integration of information technology into aquaculture, production has become more stable and continues to grow annually. As consumer demand for high-quality aquatic products rises, freshness and appearance integrity are key concerns. In shrimp-based processed foods, freshness declines rapidly post-harvest, and soft-shell shrimp often suffer from head-body separation after cooking or freezing, affecting product appearance and consumer perception. To address these issues, this study leverages deep learning-based image recognition for automated classification of white shrimp immediately after harvest. A convolutional neural network (CNN) model replaces manual sorting, enhancing classification accuracy, efficiency, and consistency. By reducing processing time, this technology helps maintain freshness and ensures that shrimp transportation businesses meet customer demands more effectively.
80. 【2601.03309】VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
Link: https://arxiv.org/abs/2601.03309
Authors: Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: large Vision-Language Models, integrate pretrained large, pretrained large Vision-Language, gaining significant attention, Vision-Language Models
Comments:
Abstract: Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in the VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.
81. 【2601.03305】Mass Concept Erasure in Diffusion Models with Concept Hierarchy
Link: https://arxiv.org/abs/2601.03305
Authors: Jiahang Tu, Ye Li, Yiming Wu, Hanbin Zhao, Chao Zhang, Hui Qian
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: general generative capabilities, preserving general generative, suppress specific concepts, generative capabilities, models has raised
Comments: This paper has been accepted by AAAI 2026
Abstract: The success of diffusion models has raised concerns about the generation of unsafe or harmful content, prompting concept erasure approaches that fine-tune modules to suppress specific concepts while preserving general generative capabilities. However, as the number of erased concepts grows, these methods often become inefficient and ineffective, since each concept requires a separate set of fine-tuned parameters and may degrade the overall generation quality. In this work, we propose a supertype-subtype concept hierarchy that organizes erased concepts into a parent-child structure. Each erased concept is treated as a child node, and semantically related concepts (e.g., macaw and bald eagle) are grouped under a shared parent node, referred to as a supertype concept (e.g., bird). Rather than erasing concepts individually, we introduce an effective and efficient group-wise suppression method, where semantically similar concepts are grouped and erased jointly by sharing a single set of learnable parameters. During the erasure phase, standard diffusion regularization is applied to preserve the denoising process in unmasked regions. To mitigate the degradation of supertype generation caused by excessive erasure of semantically related subtypes, we propose a novel method called Supertype-Preserving Low-Rank Adaptation (SuPLoRA), which encodes the supertype concept information in the frozen down-projection matrix and updates only the up-projection matrix during erasure. Theoretical analysis demonstrates the effectiveness of SuPLoRA in mitigating generation performance degradation. We construct a more challenging benchmark that requires simultaneous erasure of concepts across diverse domains, including celebrities, objects, and pornographic content.
82. 【2601.03302】CageDroneRF: A Large-Scale RF Benchmark and Toolkit for Drone Perception
Link: https://arxiv.org/abs/2601.03302
Authors: Mohammad Rostami, Atik Faysal, Hongtao Xia, Hadi Kasasbeh, Ziang Gao, Huaxia Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: generated synthetic variants, systematically generated synthetic, present CageDroneRF, synthetic variants, identification built
Abstract:We present CageDroneRF (CDRF), a large-scale benchmark for Radio-Frequency (RF) drone detection and identification built from real-world captures and systematically generated synthetic variants. CDRF addresses the scarcity and limited diversity of existing RF datasets by coupling extensive raw recordings with a principled augmentation pipeline that (i) precisely controls Signal-to-Noise Ratio (SNR), (ii) injects interfering emitters, and (iii) applies frequency shifts with label-consistent bounding-box transformations for detection. This dataset spans a wide range of contemporary drone models, many unavailable in current public datasets, and acquisition conditions, derived from data collected at the Rowan University campus and within a controlled RF-cage facility. CDRF is released with interoperable open-source tools for data generation, preprocessing, augmentation, and evaluation that also operate on existing public benchmarks. CDRF enables standardized benchmarking for classification, open-set recognition, and object detection, supporting rigorous comparisons and reproducible pipelines. By releasing this comprehensive benchmark and tooling, CDRF aims to accelerate progress toward robust, generalizable RF perception models.
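As a rough illustration of step (i) in this abstract, the sketch below adds complex white Gaussian noise to a signal so that the result hits a requested SNR exactly. It is not the CDRF toolkit's API: the function names, the synthetic tone, and the assumption that the input is effectively noise-free are all hypothetical choices made for the example.

```python
import cmath
import math
import random


def mean_power(samples):
    """Average power of a sequence of complex samples."""
    return sum(abs(s) ** 2 for s in samples) / len(samples)


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return (noisy_signal, noise) with noise scaled to the requested SNR in dB.

    Assumption: `signal` is treated as noise-free, so the SNR is defined
    relative to its measured mean power.
    """
    rng = random.Random(seed)
    raw = [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in signal]
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    # Rescale by the empirical noise power so the target is met exactly,
    # not just in expectation.
    scale = math.sqrt(target_noise_power / mean_power(raw))
    noise = [scale * n for n in raw]
    noisy = [s + n for s, n in zip(signal, noise)]
    return noisy, noise


# Hypothetical test signal: a complex tone at 0.05 cycles per sample.
tone = [cmath.exp(2j * math.pi * 0.05 * k) for k in range(1024)]
noisy, noise = add_noise_at_snr(tone, snr_db=10.0)
measured_snr_db = 10.0 * math.log10(mean_power(tone) / mean_power(noise))
print(f"measured SNR: {measured_snr_db:.3f} dB")
```

Rescaling by the empirical noise power, rather than relying on the nominal variance, is what makes the achieved SNR match the target on every draw; real captures carry their own noise floor, which a production pipeline would have to account for.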
83. 【2601.03286】HyperCLOVA X 32B Think
Link: https://arxiv.org/abs/2601.03286
Authors: NAVER Cloud HyperCLOVA X Team
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: vision-language model designed, cultural context, linguistic and cultural, Korean linguistic, agentic ability
Comments: Technical Report
Abstract:In this report, we present HyperCLOVA X 32B Think, a vision-language model designed with particular emphasis on reasoning within the Korean linguistic and cultural context, as well as agentic ability. HyperCLOVA X 32B Think is pre-trained with a strong focus on reasoning capabilities and subsequently post-trained to support multimodal understanding, enhanced reasoning, agentic behaviors, and alignment with human preferences. Experimental evaluations against comparably sized models demonstrate that our model achieves strong performance on Korean text-to-text and vision-to-text benchmarks, as well as on agent-oriented evaluation tasks. By open-sourcing HyperCLOVA X 32B Think, we aim to support broader adoption and facilitate further research and innovation across both academic and industrial communities.
84. 【2601.04163】Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models
Link: https://arxiv.org/abs/2601.04163
Authors: Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, Mattias Rantalainen
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: offer general encoders, Pathology foundation models, Pathology foundation, computational pathology, aiming to offer
Abstract:Pathology foundation models (PFMs) have become central to computational pathology, aiming to offer general encoders for feature extraction from whole-slide images (WSIs). Despite strong benchmark performance, PFM robustness to real-world technical domain shifts, such as variability from whole-slide scanner devices, remains poorly understood. We systematically evaluated the robustness of 14 PFMs to scanner-induced variability, including state-of-the-art models, earlier self-supervised models, and a baseline trained on natural images. Using a multiscanner dataset of 384 breast cancer WSIs scanned on five devices, we isolated scanner effects independently from biological and laboratory confounders. Robustness is assessed via complementary unsupervised embedding analyses and a set of clinicopathological supervised prediction tasks. Our results demonstrate that current PFMs are not invariant to scanner-induced domain shifts. Most models encode pronounced scanner-specific variability in their embedding spaces. While AUC often remains stable, this masks a critical failure mode: scanner variability systematically alters the embedding space and impacts calibration of downstream model predictions, resulting in scanner-dependent bias that can impact reliability in clinical use cases. We further show that robustness is not a simple function of training data scale, model size, or model recency. None of the models provided reliable robustness against scanner-induced variability. While the models trained on the most diverse data, here represented by vision-language models, appear to have an advantage with respect to robustness, they underperformed on downstream supervised tasks. We conclude that development and evaluation of PFMs requires moving beyond accuracy-centric benchmarks toward explicit evaluation and optimisation of embedding stability and calibration under realistic acquisition variability.
85. 【2601.03924】A low-complexity method for efficient depth-guided image deblurring
Link: https://arxiv.org/abs/2601.03924
Authors: Ziyao Yi, Diego Valsesia, Tiziano Bianchi, Enrico Magli
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: highly ill-posed nature, ill-posed nature, imaging due, highly ill-posed, challenging problem
Abstract:Image deblurring is a challenging problem in imaging due to its highly ill-posed nature. Deep learning models have shown great success in tackling this problem but the quest for the best image quality has brought their computational complexity up, making them impractical on anything but powerful servers. Meanwhile, recent works have shown that mobile Lidars can provide complementary information in the form of depth maps that enhance deblurring quality. In this paper, we introduce a novel low-complexity neural network for depth-guided image deblurring. We show that the use of the wavelet transform to separate structural details and reduce spatial redundancy as well as efficient feature conditioning on the depth information are essential ingredients in developing a low-complexity model. Experimental results show competitive image quality against recent state-of-the-art models while reducing complexity by up to two orders of magnitude.
86. 【2601.03875】Staged Voxel-Level Deep Reinforcement Learning for 3D Medical Image Segmentation with Noisy Annotations
Link: https://arxiv.org/abs/2601.03875
Authors: Yuyang Fu, Xiuzhen Guo, Ji Shi
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved significant advancements, medical image segmentation, achieved significant, Deep Reinforcement Learning, medical image
Abstract: Deep learning has achieved significant advancements in medical image segmentation. Currently, obtaining accurate segmentation outcomes is critically reliant on large-scale datasets with high-quality annotations. However, noisy annotations are frequently encountered owing to the complex morphological structures of organs in medical images and variations among different annotators, which can substantially limit the efficacy of segmentation models. Motivated by the fact that medical imaging annotators can correct labeling errors during segmentation based on prior knowledge, we propose an end-to-end Staged Voxel-Level Deep Reinforcement Learning (SVL-DRL) framework for robust medical image segmentation under noisy annotations. This framework employs a dynamic iterative update strategy to automatically mitigate the impact of erroneous labels without requiring manual intervention. The key advancements of SVL-DRL over existing works include: i) formulating noisy annotations as a voxel-dependent problem and addressing it through a novel staged reinforcement learning framework which guarantees robust model convergence; ii) incorporating a voxel-level asynchronous advantage actor-critic (vA3C) module that conceptualizes each voxel as an autonomous agent, which allows each agent to dynamically refine its own state representation during training, thereby directly mitigating the influence of erroneous labels; iii) designing a novel action space for the agents, along with a composite reward function that strategically combines the Dice value and a spatial continuity metric to significantly boost segmentation accuracy while maintaining semantic integrity. Experiments on three public medical image datasets demonstrate state-of-the-art (SoTA) performance under various experimental settings, with an average improvement of over 3% in both Dice and IoU scores.
87. 【2601.03499】GeoDiff-SAR: A Geometric Prior Guided Diffusion Model for SAR Image Generation
Link: https://arxiv.org/abs/2601.03499
Authors: Fan Zhang, Xuanting Wu, Fei Ma, Qiang Yin, Yuxin Hu
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, highly sensitive, sensitive to observation
Comments: 22 pages, 17 figures
Abstract: Synthetic Aperture Radar (SAR) imaging results are highly sensitive to observation geometries and the geometric parameters of targets. However, existing generative methods primarily operate within the image domain, neglecting explicit geometric information. This limitation often leads to unsatisfactory generation quality and the inability to precisely control critical parameters such as azimuth angles. To address these challenges, we propose GeoDiff-SAR, a geometric prior guided diffusion model for high-fidelity SAR image generation. Specifically, GeoDiff-SAR first efficiently simulates the geometric structures and scattering relationships inherent in real SAR imaging by calculating SAR point clouds at specific azimuths, which serves as robust physical guidance. Secondly, to effectively fuse multi-modal information, we employ a feature fusion gating network based on Feature-wise Linear Modulation (FiLM) to dynamically regulate the weight distribution of 3D physical information, image control parameters, and textual description parameters. Thirdly, we utilize the Low-Rank Adaptation (LoRA) architecture to perform lightweight fine-tuning on the advanced Stable Diffusion 3.5 (SD3.5) model, enabling it to rapidly adapt to the distribution characteristics of the SAR domain. To validate the effectiveness of GeoDiff-SAR, extensive comparative experiments were conducted on real-world SAR datasets. The results demonstrate that data generated by GeoDiff-SAR exhibits high fidelity and effectively enhances the accuracy of downstream classification tasks. In particular, it significantly improves recognition performance across different azimuth angles, thereby underscoring the superiority of physics-guided generation.
88. 【2601.03391】Edit2Restore: Few-Shot Image Restoration via Parameter-Efficient Adaptation of Pre-trained Editing Models
Link: https://arxiv.org/abs/2601.03391
Authors: M. Akın Yılmaz, Ahmet Bilican, Burak Can Biner, A. Murat Tekalp
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: traditionally required training, traditionally required, image editing models, Image, editing models
Abstract: Image restoration has traditionally required training specialized models on thousands of paired examples per degradation type. We challenge this paradigm by demonstrating that powerful pre-trained text-conditioned image editing models can be efficiently adapted for multiple restoration tasks through parameter-efficient fine-tuning with remarkably few examples. Our approach fine-tunes LoRA adapters on FLUX.1 Kontext, a state-of-the-art 12B parameter flow matching model for image-to-image translation, using only 16-128 paired images per task, guided by simple text prompts that specify the restoration operation. Unlike existing methods that train specialized restoration networks from scratch with thousands of samples, we leverage the rich visual priors already encoded in large-scale pre-trained editing models, dramatically reducing data requirements while maintaining high perceptual quality. A single unified LoRA adapter, conditioned on task-specific text prompts, effectively handles multiple degradations including denoising, deraining, and dehazing. Through comprehensive ablation studies, we analyze: (i) the impact of training set size on restoration quality, (ii) trade-offs between task-specific and unified multi-task adapters, (iii) the role of text encoder fine-tuning, and (iv) zero-shot baseline performance. While our method prioritizes perceptual quality over pixel-perfect reconstruction metrics like PSNR/SSIM, our results demonstrate that pre-trained image editing models, when properly adapted, offer a compelling and data-efficient alternative to traditional image restoration approaches, opening new avenues for few-shot, prompt-guided image enhancement. The code to reproduce our results is available at: this https URL




