本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新591篇论文,其中:
- 自然语言处理136篇
- 信息检索20篇
- 计算机视觉98篇
自然语言处理
1. 【2601.05242】GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
链接:https://arxiv.org/abs/2601.05242
作者:Shih-Yang Liu,Xin Dong,Ximing Lu,Shizhe Diao,Peter Belcak,Mingjie Liu,Min-Hung Chen,Hongxu Yin,Yu-Chiang Frank Wang,Kwang-Ting Cheng,Yejin Choi,Jan Kautz,Pavlo Molchanov
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:diverse human preferences, increasingly capable, users expect, variety of scenarios, aligned with diverse
备注: NVIDIA-Tech Report
点击查看摘要
Abstract:As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
2. 【2601.05232】Measuring and Fostering Peace through Machine Learning and Artificial Intelligence
链接:https://arxiv.org/abs/2601.05232
作者:P. Gilda(1),P. Dungarwal(1),A. Thongkham(1),E. T. Ajayi(2),S. Choudhary(1),T. M. Terol(1),C. Lam(1),J. P. Araujo(1),M. McFadyen-Mungalln(1),L. S. Liebovitch(1),P. T. Coleman(1),H. West(1),K. Sieck(3),S. Carter(3) ((1) Columbia University, (2) St John's University, (3) Toyota Research Institute)
类目:Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
关键词:measure levels, artificial intelligence, develop on-line tools, machine learning, learning and artificial
备注: 6 pages, 4 figures
点击查看摘要
Abstract:We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people 20-40 years old daily view most of their news through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making you angry to engage you, to increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.
3. 【2601.05201】Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
链接:https://arxiv.org/abs/2601.05201
作者:William Rudman,Michal Golovanevsky,Dana Arad,Yonatan Belinkov,Ritambhara Singh,Carsten Eickhoff,Kyle Mahowald
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large vision-language models, Large vision-language, favoring textual prompts, highly capable, hallucinate by favoring
备注:
点击查看摘要
Abstract:Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
4. 【2601.05192】LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation
链接:https://arxiv.org/abs/2601.05192
作者:Samy Haffoudhi,Fabian M. Suchanek,Nils Holzenberger
类目:Computation and Language (cs.CL)
关键词:mapping ambiguous mentions, knowledge graph construction, mapping ambiguous, graph construction, information extraction
备注:
点击查看摘要
Abstract:Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foundational step in tasks such as knowledge graph construction, question-answering, and information extraction. Our method, LELA, is a modular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowledge bases and LLMs, without any fine-tuning phase. Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.
5. 【2601.05184】Observations and Remedies for Large Language Model Bias in Self-Consuming Performative Loop
链接:https://arxiv.org/abs/2601.05184
作者:Yaxuan Wang,Zhongteng Cai,Yujia Bao,Xueru Zhang,Yang Liu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:large language models, train future models, rapid advancement, advancement of large, large language
备注:
点击查看摘要
Abstract:The rapid advancement of large language models (LLMs) has led to growing interest in using synthetic data to train future models. However, this creates a self-consuming retraining loop, where models are trained on their own outputs and may cause performance drops and induce emerging biases. In real-world applications, previously deployed LLMs may influence the data they generate, leading to a dynamic system driven by user feedback. For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of \textbf{S}elf-\textbf{C}onsuming \textbf{P}erformative \textbf{L}oop (\textbf{SCPL}) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. This controlled setting is motivated by the inaccessibility of real-world user preference data from dynamic production systems, and enables us to isolate and analyze feedback-driven bias evolution in a principled manner. We focus on two types of loops, including the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, we find that the performative loop increases preference bias and decreases disparate bias. We design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.
6. 【2601.05171】Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems
链接:https://arxiv.org/abs/2601.05171
作者:Jihao Zhao,Ding Chen,Zhaoxin Fan,Kerun Xu,Mengting Hu,Bo Tang,Feiyu Xiong,Zhiyu li
类目:Computation and Language (cs.CL)
关键词:reconcile unbounded interaction, unbounded interaction streams, Existing long-term personalized, finite context constraints, dialogue systems struggle
备注:
点击查看摘要
Abstract:Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.
7. 【2601.05170】Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference
链接:https://arxiv.org/abs/2601.05170
作者:Rasmus Blanck,Bill Noble,Stergios Chatzikyriakidis
类目:Computation and Language (cs.CL)
关键词:Natural Language Inference, Natural Language Understanding, Natural Language, evaluating language models, evaluating language
备注:
点击查看摘要
Abstract:Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.
8. 【2601.05167】RelayLLM: Efficient Reasoning via Collaborative Decoding
链接:https://arxiv.org/abs/2601.05167
作者:Chengsong Huang,Tong Zheng,Langlin Huang,Jinyuan Li,Haolin Liu,Jiaxin Huang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:resource-efficient Small Language, Large Language Models, Small Language Models, Large Language, Small Language
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
9. 【2601.05163】DocDancer: Towards Agentic Document-Grounded Information Seeking
链接:https://arxiv.org/abs/2601.05163
作者:Qintong Zhang,Xinjie Lv,Jialong Wu,Baixuan Li,Zhengwei Tao,Guochen Yan,Huanyao Zhang,Bin Wang,Jiahao Xu,Haitao Mi,Wentao Zhang
类目:Computation and Language (cs.CL)
关键词:answering questions grounded, Document Question Answering, Question Answering, answering questions, questions grounded
备注:
点击查看摘要
Abstract:Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Training on the synthesized data, the trained models on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench, show their effectiveness. Further analysis provides valuable insights for the agentic tool design and synthetic data.
10. 【2601.05143】A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering
链接:https://arxiv.org/abs/2601.05143
作者:Md. Zahid Hossain,Most. Sharmin Sultana Samu,Md. Rakibul Islam,Md. Siam Ansary
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:analysis requires accurate, disease analysis requires, requires accurate visual, accurate visual understanding, analysis requires
备注: Preprint, manuscript is under review
点击查看摘要
Abstract:Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.
11. 【2601.05111】Agent-as-a-Judge
链接:https://arxiv.org/abs/2601.05111
作者:Runyang You,Hongru Cai,Caiqi Zhang,Qiancheng Xu,Meng Liu,Tiezheng Yu,Yongqi Li,Wenjie Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:leveraging large language, large language models, leveraging large, large language, language models
备注:
点击查看摘要
Abstract:LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.
12. 【2601.05106】oken-Level LLM Collaboration via FusionRoute
链接:https://arxiv.org/abs/2601.05106
作者:Nuoya Xiong,Yuhang Zhou,Hanqing Zeng,Zhaorun Chen,Furong Huang,Shuchao Bi,Lizhu Zhang,Zhuokai Zhao
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large language models, Large language, exhibit strengths, Large, language models
备注: 25 pages
点击查看摘要
Abstract:Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
13. 【2601.05104】How Human is AI? Examining the Impact of Emotional Prompts on Artificial and Human and Responsiveness
链接:https://arxiv.org/abs/2601.05104
作者:Florence Bernays,Marco Henriques Pereira,Jochen Menges(University of Zurich)
类目:Computation and Language (cs.CL); General Economics (econ.GN)
关键词:human behavior, research examines, ChatGPT, participants, ethical dilemma
备注:
点击查看摘要
Abstract:This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher albeit smaller improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increases its emphasis on protecting the public interest. Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses. Together, our findings demonstrate that the emotional tone people apply in human-AI interactions not only shape ChatGPT's outputs but also carry over into subsequent human-human communication.
14. 【2601.05103】Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content
链接:https://arxiv.org/abs/2601.05103
作者:Changxu Duan,Zhiyin Tan
类目:Digital Libraries (cs.DL); Computation and Language (cs.CL)
关键词:cited content type, Semantically Orthogonal Framework, essential for research, research assessment, assessment and citation-aware
备注: Accepted at the 29th International Conference on Theory and Practice of Digital Libraries (TPDL 2025)
点击查看摘要
Abstract:Understanding the role of citations is essential for research assessment and citation-aware digital libraries. However, existing citation classification frameworks often conflate citation intent (why a work is cited) with cited content type (what part is cited), limiting their effectiveness in auto classification due to a dilemma between fine-grained type distinctions and practical classification reliability. We introduce SOFT, a Semantically Orthogonal Framework with Two dimensions that explicitly separates citation intent from cited content type, drawing inspiration from semantic role theory. We systematically re-annotate the ACL-ARC dataset using SOFT and release a cross-disciplinary test set sampled from ACT2. Evaluation with both zero-shot and fine-tuned Large Language Models demonstrates that SOFT enables higher agreement between human annotators and LLMs, and supports stronger classification performance and robust cross-domain generalization compared to ACL-ARC and SciCite annotation frameworks. These results confirm SOFT's value as a clear, reusable annotation standard, improving clarity, consistency, and generalizability for digital libraries and scholarly communication infrastructures. All code and data are publicly available on GitHub this https URL.
15. 【2601.05099】Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts
链接:https://arxiv.org/abs/2601.05099
作者:Zhiyin Tan,Changxu Duan
类目:Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Identifying suitable datasets, question remains challenging, engines rely heavily, search engines rely, research question remains
备注: Accepted at the 25th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2025)
点击查看摘要
Abstract:Identifying suitable datasets for a research question remains challenging because existing dataset search engines rely heavily on metadata quality and keyword overlap, which often fail to capture the semantic intent of scientific investigation. We introduce a literature-driven framework that discovers datasets from citation contexts in scientific papers, enabling retrieval grounded in actual research use rather than metadata availability. Our approach combines large-scale citation-context extraction, schema-guided dataset recognition with Large Language Models, and provenance-preserving entity resolution. We evaluate the system on eight survey-derived computer science queries and find that it achieves substantially higher recall than Google Dataset Search and DataCite Commons, with normalized recall ranging from an average of 47.47% to a highest value of 81.82%. Beyond recovering gold-standard datasets, the method also surfaces additional datasets not documented in the surveys. Expert assessments across five top-level Fields of Science indicate that a substantial portion of the additional datasets are considered high utility, and some are regarded as novel for the specific topics chosen by the experts. These findings establish citation-context mining as an effective and generalizable paradigm for dataset discovery, particularly in settings where datasets lack sufficient or reliable metadata. To support reproducibility and future extensions, we release our code, evaluation datasets, and results on GitHub (this https URL).
16. 【2601.05091】Code-Mix Sentiment Analysis on Hinglish Tweets
链接:https://arxiv.org/abs/2601.05091
作者:Aashi Garg,Aneshya Das,Arshi Arya,Anushka Goyal,Aditi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Hindi and English, platforms like Twitter, Natural Language Processing, monitoring in India, India is increasingly
备注: Accepted at the 9th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2025), Fukuoka, Japan
点击查看摘要
Abstract:The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish--a hybrid of Hindi and English--used widely in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
17. 【2601.05075】SemPA: Improving Sentence Embeddings of Large Language Models through Semantic Preference Alignment
链接:https://arxiv.org/abs/2601.05075
作者:Ziyang Chen,Zhenxuan Huang,Yile Wang,Weiqin Wang,Lu Yin,Hui Huang
类目:Computation and Language (cs.CL)
关键词:Traditional sentence embedding, methods employ token-level, non-generative pre-trained models, embedding methods employ, employ token-level contrastive
备注:
点击查看摘要
Abstract:Traditional sentence embedding methods employ token-level contrastive learning on non-generative pre-trained models. Recently, there have emerged embedding methods based on generative large language models (LLMs). These methods either rely on fixed prompt templates or involve modifications to the model architecture. The former lacks further optimization of the model and results in limited performance, while the latter alters the internal computational mechanisms of the model, thereby compromising its generative capabilities. We propose SemPA, a novel approach that boosts the sentence representations while preserving the generative ability of LLMs via semantic preference alignment. We leverage sentence-level Direct Preference Optimization (DPO) to efficiently optimize LLMs on a paraphrase generation task, where the model learns to discriminate semantically equivalent sentences while preserving inherent generative capacity. Theoretically, we establish a formal connection between DPO and contrastive learning under the Plackett-Luce model framework. Empirically, experimental results on both semantic textual similarity tasks and various benchmarks for LLMs show that SemPA achieves better semantic representations without sacrificing the inherent generation capability of LLMs.
18. 【2601.05062】Compositional Steering of Large Language Models with Steering Tokens
链接:https://arxiv.org/abs/2601.05062
作者:Gorjan Radevski,Kiril Gashteovski,Giwon Hong,Carolin Lawrence,Goran Glavaš
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:real-world applications requires, applications requires controllable, requires controllable output, Deploying LLMs, satisfies multiple desiderata
备注:
点击查看摘要
Abstract:Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, \textit{compositional steering} -- i.e., steering LLMs simultaneously towards multiple behaviors -- remains an underexplored problem. In this work, we propose \emph{compositional steering tokens} for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steers live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated \textit{composition token} on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to \textit{unseen} compositions, including those with unseen behaviors as well as those with an unseen \textit{number} of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior control compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.
19. 【2601.05053】Reinforced Efficient Reasoning via Semantically Diverse Exploration
链接:https://arxiv.org/abs/2601.05053
作者:Ziqi Zhao,Zhaochun Ren,Jiahong Zou,Liu Yang,Zhiwei Xu,Xuri Ge,Zhumin Chen,Xinyu Ma,Daiting Shi,Shuaiqiang Wang,Dawei Yin,Xin Xin
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Reinforcement learning, Carlo Tree Search, Monte Carlo Tree, large language models, learning with verifiable
备注:
点击查看摘要
Abstract:Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). Monte Carlo Tree Search (MCTS)-based extensions improve upon vanilla RLVR (e.g., GRPO) by providing tree-based reasoning rollouts that enable fine-grained and segment-level credit assignment. However, existing methods still suffer from limited exploration diversity and inefficient reasoning. To address the above challenges, we propose reinforced efficient reasoning via semantically diverse explorations, i.e., ROSE, for LLMs. To encourage more diverse reasoning exploration, our method incorporates a semantic-entropy-based branching strategy and an $\varepsilon$-exploration mechanism. The former operates on already sampled reasoning rollouts to capture semantic uncertainty and select branching points with high semantic divergence to generate new successive reasoning paths, whereas the latter stochastically initiates reasoning rollouts from the root, preventing the search process from becoming overly local. To improve efficiency, we design a length-aware segment-level advantage estimator that rewards concise and correct reasoning while penalizing unnecessarily long reasoning chains. Extensive experiments on various mathematical reasoning benchmarks with Qwen and Llama models validate the effectiveness and efficiency of ROSE. Codes are available at this https URL.
20. 【2601.05051】Publishing FAIR and Machine-actionable Reviews in Materials Science: The Case for Symbolic Knowledge in Neuro-symbolic Artificial Intelligence
链接:https://arxiv.org/abs/2601.05051
作者:Jennifer D'Souza,Soren Auer,Eleni Poupaki,Alex Watkins,Anjana Devi,Riikka L. Puurunen,Bora Karasulu,Adrie Mackus,Erwin Kessels
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Theory (cs.IT)
关键词:static PDF tables, Research Knowledge Graph, key insights remain, insights remain locked, static PDF
备注: 35 pages, 11 figures
点击查看摘要
Abstract:Scientific reviews are central to knowledge integration in materials science, yet their key insights remain locked in narrative text and static PDF tables, limiting reuse by humans and machines alike. This article presents a case study in atomic layer deposition and etching (ALD/E) where we publish review tables as FAIR, machine-actionable comparisons in the Open Research Knowledge Graph (ORKG), turning them into structured, queryable knowledge. Building on this, we contrast symbolic querying over ORKG with large language model-based querying, and argue that a curated symbolic layer should remain the backbone of reliable neurosymbolic AI in materials science, with LLMs serving as complementary, symbolically grounded interfaces rather than standalone sources of truth.
21. 【2601.05038】ArcAligner: Adaptive Recursive Aligner for Compressed Context Embeddings in RAG
链接:https://arxiv.org/abs/2601.05038
作者:Jianbo Li,Yi Jiang,Sendong Zhao,Bairui Hu,Haochun Wang,Bing Qin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:LLMs stay accurate, feeding long documents, stay accurate, slow and expensive, feeding long
备注: Code is available at [this https URL](https://github.com/liunian-Jay/ArcAligner.git)
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) helps LLMs stay accurate, but feeding long documents into a prompt makes the model slow and expensive. This has motivated context compression, ranging from token pruning and summarization to embedding-based compression. While researchers have tried ''compressing'' these documents into smaller summaries or mathematical embeddings, there is a catch: the more you compress the data, the more the LLM struggles to understand it. To address this challenge, we propose ArcAligner (Adaptive recursive context *Aligner*), a lightweight module integrated into the language model layers to help the model better utilize highly compressed context representations for downstream generation. It uses an adaptive ''gating'' system that only adds extra processing power when the information is complex, keeping the system fast. Across knowledge-intensive QA benchmarks, ArcAligner consistently beats compression baselines at comparable compression rates, especially on multi-hop and long-tail settings. The source code is publicly available.
22. 【2601.05019】Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models
链接:https://arxiv.org/abs/2601.05019
作者:Yueqing Hu,Xinyang Peng,Shuting Peng,Hanqi Wang,Tianhong Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
关键词:Recent Large Reasoning, Recent Large, Large Reasoning Models, Large Reasoning, Hán Dān Xué
备注: 7 pages, 7 figures
点击查看摘要
Abstract:Recent Large Reasoning Models trained via reinforcement learning exhibit a "natural" alignment with human cognitive costs. However, we show that the prevailing paradigm of reasoning distillation -- training student models to mimic these traces via Supervised Fine-Tuning (SFT) -- fails to transmit this cognitive structure. Testing the "Hán Dān Xué Bù" (Superficial Mimicry) hypothesis across 14 models, we find that distillation induces a "Functional Alignment Collapse": while teacher models mirror human difficulty scaling ($\bar{r}=0.64$), distilled students significantly degrade this alignment ($\bar{r}=0.34$), often underperforming their own pre-distillation baselines ("Negative Transfer"). Our analysis suggests that SFT induces a "Cargo Cult" effect, where students ritualistically replicate the linguistic form of reasoning (verbosity) without internalizing the teacher's dynamic resource allocation policy. Consequently, reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition is an emergent property of active reinforcement, not passive imitation.
23. 【2601.05004】Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei
链接:https://arxiv.org/abs/2601.05004
作者:Peng Wang,Xilin Tao,Siyi Yao,Jiageng Wu,Yuntao Zou,Zhuotao Tian,Libo Qin,Dagang Li
类目:Computation and Language (cs.CL)
关键词:complex psychological states, self-destructive behavior, linked to complex, complex psychological, psychological states
备注: Preprint
点击查看摘要
Abstract:Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) are applied across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods have two main challenges: (1) Knowledge Lag: Subcultural slang evolves rapidly, faster than LLMs' training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we proposed Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly enhancing the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.
24. 【2601.05002】On the Hidden Objective Biases of Group-based Reinforcement Learning
链接:https://arxiv.org/abs/2601.05002
作者:Aleksandar Fontana,Marco Simoni,Giulio Rossolini,Andrea Saracino,Paolo Mori
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Group-based reinforcement learning, large language models, post-train large language, Relative Policy Optimization, Group Relative Policy
备注:
点击查看摘要
Abstract:Group-based reinforcement learning methods, like Group Relative Policy Optimization (GRPO), are widely used nowadays to post-train large language models. Despite their empirical success, they exhibit structural mismatches between reward optimization and the underlying training objective. In this paper, we present a theoretical analysis of GRPO style methods by studying them within a unified surrogate formulation. This perspective reveals recurring properties that affect all the methods under analysis: (i) non-uniform group weighting induces systematic gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics largely insensitive to reward scaling; and (iii) optimizer momentum can push policy updates beyond the intended clipping region under repeated optimization steps. We believe that these findings highlight fundamental limitations of current approaches and provide principled guidance for the design of future formulations.
25. 【2601.04992】Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization
链接:https://arxiv.org/abs/2601.04992
作者:Xueyun Tian(1 and 2),Minghua Ma(3),Bingbing Xu(1 and 4),Nuoyan Lyu(1 and 2),Wei Li,Heng Dong(4),Zheng Chu(3),Yuanzhuo Wang(1),Huawei Shen(1 and 2) ((1) CAS Key Laboratory of AI Safety, Institute of Computing Technology, CAS, Beijing, China, (2) University of Chinese Academy of Sciences, Beijing, China (3) Harbin Institute of Technology, Harbin, China, (4) Tsinghua University, Beijing, China)
类目:Computation and Language (cs.CL)
关键词:large language models, Supervised fine-tuning, language models, common approach, approach for enabling
备注: Code and data are available at [this https URL](https://github.com/Eureka-Maggie/GLOW)
点击查看摘要
Abstract:Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectories demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically only retain trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits such distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.
26. 【2601.04973】ConMax: Confidence-Maximizing Compression for Efficient Chain-of-Thought Reasoning
链接:https://arxiv.org/abs/2601.04973
作者:Minda Hu,Zexuan Qiu,Zenan Xu,Kun Li,Bo Zhou,Irwin King
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:intricate cognitive behaviors, solve complex tasks, Large Reasoning Models, enabling intricate cognitive, Recent breakthroughs
备注:
点击查看摘要
Abstract:Recent breakthroughs in Large Reasoning Models (LRMs) have demonstrated that extensive Chain-of-Thought (CoT) generation is critical for enabling intricate cognitive behaviors, such as self-verification and backtracking, to solve complex tasks. However, this capability often leads to ``overthinking'', where models generate redundant reasoning paths that inflate computational costs without improving accuracy. While Supervised Fine-Tuning (SFT) on reasoning traces is a standard paradigm for the 'cold start' phase, applying existing compression techniques to these traces often compromises logical coherence or incurs prohibitive sampling costs. In this paper, we introduce ConMax (Confidence-Maximizing Compression), a novel reinforcement learning framework designed to automatically compress reasoning traces while preserving essential reasoning patterns. ConMax formulates compression as a reward-driven optimization problem, training a policy to prune redundancy by maximizing a weighted combination of answer confidence for predictive fidelity and thinking confidence for reasoning validity through a frozen auxiliary LRM. Extensive experiments across five reasoning datasets demonstrate that ConMax achieves a superior efficiency-performance trade-off. Specifically, it reduces inference length by 43% over strong baselines at the cost of a mere 0.7% dip in accuracy, proving its effectiveness in generating high-quality, efficient training data for LRMs.
27. 【2601.04963】xt as a Universal Interface for Transferable Personalization
链接:https://arxiv.org/abs/2601.04963
作者:Yuting Liu,Jian Guan,Jia-Nan Li,Wei Wu,Jiang-Ming Yang,Jianzhe Zhao,Guibing Guo
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:study the problem, problem of personalization, personalization in large, large language models, large language
备注:
点击查看摘要
Abstract:We study the problem of personalization in large language models (LLMs). Prior work predominantly represents user preferences as implicit, model-specific vectors or parameters, yielding opaque ``black-box'' profiles that are difficult to interpret and transfer across models and tasks. In contrast, we advocate natural language as a universal, model- and task-agnostic interface for preference representation. The formulation leads to interpretable and reusable preference descriptions, while naturally supporting continual evolution as new interactions are observed. To learn such representations, we introduce a two-stage training framework that combines supervised fine-tuning on high-quality synthesized data with reinforcement learning to optimize long-term utility and cross-task transferability. Based on this framework, we develop AlignXplore+, a universal preference reasoning model that generates textual preference summaries. Experiments on nine benchmarks show that our 8B model achieves state-of-the-art performanc -- outperforming substantially larger open-source models -- while exhibiting strong transferability across tasks, model families, and interaction formats.
28. 【2601.04960】A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction
链接:https://arxiv.org/abs/2601.04960
作者:Qing Wang,Zehan Li,Yaodong Song,Hongjie Chen,Jian Kang,Jie Lian,Jie Li,Yongxiang Li,Xuelong Li
类目:Computation and Language (cs.CL); Sound (cs.SD)
关键词:Injected Emotional-Attribution Thinking, termed Injected Emotional-Attribution, strategy termed Injected, Emotional-Attribution Thinking, termed Injected
备注:
点击查看摘要
Abstract:This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. The model is trained with a two-stage progressive strategy. The first stage performs speech-text alignment and emotional attribute modeling via self-distillation, while the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.
29. 【2601.04932】GenProve: Learning to Generate Text with Fine-Grained Provenance
链接:https://arxiv.org/abs/2601.04932
作者:Jingxuan Wei,Xingyue Wang,Yanghaoyu Liao,Jie Dong,Yuchen Liu,Caijun Jia,Bihui Yu,Junnan Zhu
类目:Computation and Language (cs.CL)
关键词:Large language models, cited source supports, Large language, common solution, generated claim
备注:
点击查看摘要
Abstract:Large language models (LLM) often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability as users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability Evidence), a dataset featuring expert verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap where models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.
30. 【2601.04925】Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences
链接:https://arxiv.org/abs/2601.04925
作者:Arkadiusz Modzelewski,Paweł Golik,Anna Kołos,Giovanni Da San Martino
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, generate highly persuasive, raising concerns
备注: Preprint; Paper is currently under review at a major NLP conference
点击查看摘要
Abstract:Large Language Models (LLMs) can generate highly persuasive text, raising concerns about their misuse for propaganda, manipulation, and other harmful purposes. This leads us to our central question: Is LLM-generated persuasion more difficult to automatically detect than human-written persuasion? To address this, we categorize controllable generation approaches for producing persuasive content with LLMs and introduce Persuaficial, a high-quality multilingual benchmark covering six languages: English, German, Polish, Italian, French and Russian. Using this benchmark, we conduct extensive empirical evaluations comparing human-authored and LLM-generated persuasive texts. We find that although overtly persuasive LLM-generated texts can be easier to detect than human-written ones, subtle LLM-generated persuasion consistently degrades automatic detection performance. Beyond detection performance, we provide the first comprehensive linguistic analysis contrasting human and LLM-generated persuasive texts, offering insights that may guide the development of more interpretable and robust detection tools.
31. 【2601.04897】V-FAT: Benchmarking Visual Fidelity Against Text-bias
链接:https://arxiv.org/abs/2601.04897
作者:Ziteng Wang,Yujie He,Guanliang Li,Siqi Yang,Jiaqi Xiong,Songxiang Liu
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated impressive performance
备注: 12 pages, 6 figures
点击查看摘要
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.
32. 【2601.04889】Faithful Summarisation under Disagreement via Belief-Level Aggregation
链接:https://arxiv.org/abs/2601.04889
作者:Favour Yahdii Aghaebe,Tanefa Apekey,Elizabeth Williams,Nafise Sadat Moosavi
类目:Computation and Language (cs.CL)
关键词:over-represent majority opinions, genuinely conflicting viewpoints, implicitly smooth disagreement, involve genuinely conflicting, majority opinions
备注:
点击查看摘要
Abstract:Opinion and multi-document summarisation often involve genuinely conflicting viewpoints, yet many existing approaches, particularly LLM-based systems, implicitly smooth disagreement and over-represent majority opinions. This limits the faithfulness of generated summaries in opinion-heavy settings. We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Documents are first represented as structured belief sets and aggregated using distance-based belief merging operators that explicitly model conflict. Large language models are then used only to realise the aggregated beliefs as natural language summaries. We evaluate the approach across multiple model families and scales, comparing it to methods that perform explicit aggregation during generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models, while maintaining fluent and grounded summaries.
33. 【2601.04885】CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters
链接:https://arxiv.org/abs/2601.04885
作者:Ao Sun,Xiaoyu Wang,Zhe Tan,Yu Li,Jiachen Zhu,Shu Su,Yuheng Jia
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, enforcing universal consensus, Language Models, textbf
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from \textbf{Mean Collapse}, converging to a generic average that fails to represent diverse groups. We attribute this to \textbf{Cultural Sparsity}, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose \textbf{\textsc{CuMA}} (\textbf{Cu}ltural \textbf{M}ixture of \textbf{A}dapters), a framework that frames alignment as a \textbf{conditional capacity separation} problem. By incorporating demographic-aware routing, \textsc{CuMA} internalizes a \textit{Latent Cultural Topology} to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that \textsc{CuMA} achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that \textsc{CuMA} effectively mitigates mean collapse, preserving cultural diversity. Our code is available at this https URL.
34. 【2601.04879】Mind2Report: A Cognitive Deep Research Agent for Expert-Level Commercial Report Synthesis
链接:https://arxiv.org/abs/2601.04879
作者:Mingyue Cheng,Daoyu Wang,Qi Liu,Shuo Yu,Xiaoyu Tao,Yuqian Wang,Chengzhong Chu,Yu Duan,Mingkang Long,Enhong Chen
类目:Computation and Language (cs.CL)
关键词:high-stakes business decisions, Synthesizing informative commercial, Synthesizing informative, deep research agents, business decisions
备注: 26 Pages, 9 Figures, 7 Tables
点击查看摘要
Abstract:Synthesizing informative commercial reports from massive and noisy web sources is critical for high-stakes business decisions. Although current deep research agents achieve notable progress, their reports still remain limited in terms of quality, reliability, and coverage. In this work, we propose Mind2Report, a cognitive deep research agent that emulates the commercial analyst to synthesize expert-level reports. Specifically, it first probes fine-grained intent, then searches web sources and records distilled information on the fly, and subsequently iteratively synthesizes the report. We design Mind2Report as a training-free agentic workflow that augments general large language models (LLMs) with dynamic memory to support these long-form cognitive processes. To rigorously evaluate Mind2Report, we further construct QRC-Eval comprising 200 real-world commercial tasks and establish a holistic evaluation strategy to assess report quality, reliability, and coverage. Experiments demonstrate that Mind2Report outperforms leading baselines, including OpenAI and Gemini deep research agents. Although this is a preliminary study, we expect it to serve as a foundation for advancing the future design of commercial deep research agents. Our code and data are available at this https URL.
35. 【2601.04878】Higher-Order Knowledge Representations for Agentic Scientific Reasoning
链接:https://arxiv.org/abs/2601.04878
作者:Isabella A. Stewart,Markus J. Buehler
类目:Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:heterogeneous experimental data, inquiry requires systems-level, integrates heterogeneous experimental, Large Language Models, Scientific inquiry requires
备注:
点击查看摘要
Abstract:Scientific inquiry requires systems-level reasoning that integrates heterogeneous experimental data, cross-domain knowledge, and mechanistic evidence into coherent explanations. While Large Language Models (LLMs) offer inferential capabilities, they often depend on retrieval-augmented contexts that lack structural depth. Traditional Knowledge Graphs (KGs) attempt to bridge this gap, yet their pairwise constraints fail to capture the irreducible higher-order interactions that govern emergent physical behavior. To address this, we introduce a methodology for constructing hypergraph-based knowledge representations that faithfully encode multi-entity relationships. Applied to a corpus of ~1,100 manuscripts on biocomposite scaffolds, our framework constructs a global hypergraph of 161,172 nodes and 320,201 hyperedges, revealing a scale-free topology (power law exponent ~1.23) organized around highly connected conceptual hubs. This representation prevents the combinatorial explosion typical of pairwise expansions and explicitly preserves the co-occurrence context of scientific formulations. We further demonstrate that equipping agentic systems with hypergraph traversal tools, specifically using node-intersection constraints, enables them to bridge semantically distant concepts. By exploiting these higher-order pathways, the system successfully generates grounded mechanistic hypotheses for novel composite materials, such as linking cerium oxide to PCL scaffolds via chitosan intermediates. This work establishes a "teacherless" agentic reasoning system where hypergraph topology acts as a verifiable guardrail, accelerating scientific discovery by uncovering relationships obscured by traditional graph methods.
36. 【2601.04875】EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis
链接:https://arxiv.org/abs/2601.04875
作者:Xuanguang Pan,Chongyang Tao,Jiayuan Bai,Jianling Gao,Zhengwei Tao,Xiansheng Zhou,Gavin Cheung,Shuai Ma
类目:Computation and Language (cs.CL)
关键词:remains challenging due, Training effective, models remains challenging, remains challenging, challenging due
备注: 18 pages
点击查看摘要
Abstract:Training effective Text-to-SQL models remains challenging due to the scarcity of high-quality, diverse, and structurally complex datasets. Existing methods either rely on limited human-annotated corpora, or synthesize datasets directly by simply prompting LLMs without explicit control over SQL structures, often resulting in limited structural diversity and complexity. To address this, we introduce EvolSQL, a structure-aware data synthesis framework that evolves SQL queries from seed data into richer and more semantically diverse forms. EvolSQL starts with an exploratory Query-SQL expansion to broaden question diversity and improve schema coverage, and then applies an adaptive directional evolution strategy using six atomic transformation operators derived from the SQL Abstract Syntax Tree to progressively increase query complexity across relational, predicate, aggregation, and nesting dimensions. An execution-grounded SQL refinement module and schema-aware deduplication further ensure the creation of high-quality, structurally diverse mapping pairs. Experimental results show that a 7B model fine-tuned on our data outperforms one trained on the much larger SynSQL dataset using only 1/18 of the data.
37. 【2601.04859】A Navigational Approach for Comprehensive RAG via Traversal over Proposition Graphs
链接:https://arxiv.org/abs/2601.04859
作者:Maxime Delmas,Lei Xu,André Freitas
类目:Computation and Language (cs.CL)
关键词:Standard RAG pipelines, Standard RAG, RAG pipelines based, multi-hop queries due, complex multi-hop queries
备注: 23 pages, 10 figures, 6 tables
点击查看摘要
Abstract:Standard RAG pipelines based on chunking excel at simple factual retrieval but fail on complex multi-hop queries due to a lack of structural connectivity. Conversely, initial strategies that interleave retrieval with reasoning often lack global corpus awareness, while Knowledge Graph (KG)-based RAG performs strongly on complex multi-hop tasks but suffers on fact-oriented single-hop queries. To bridge this gap, we propose a novel RAG framework: ToPG (Traversal over Proposition Graphs). ToPG models its knowledge base as a heterogeneous graph of propositions, entities, and passages, effectively combining the granular fact density of propositions with graph connectivity. We leverage this structure using iterative Suggestion-Selection cycles, where the Suggestion phase enables a query-aware traversal of the graph, and the Selection phase provides LLM feedback to prune irrelevant propositions and seed the next iteration. Evaluated on three distinct QA tasks (Simple, Complex, and Abstract QA), ToPG demonstrates strong performance across both accuracy- and quality-based metrics. Overall, ToPG shows that query-aware graph traversal combined with factual granularity is a critical component for efficient structured RAG systems. ToPG is available at this https URL.
38. 【2601.04857】MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News
链接:https://arxiv.org/abs/2601.04857
作者:Zhiwei Liu,Paul Thompson,Jiaqi Rong,Baojie Qu,Runteng Guo,Min Peng,Qianqian Xie,Sophia Ananiadou
类目:Computation and Language (cs.CL)
关键词:coarse binary labels, methods evaluate veracity, Online misinformation, increasingly pervasive, binary labels
备注: Work in progress
点击查看摘要
Abstract:Online misinformation is increasingly pervasive, yet most existing benchmarks and methods evaluate veracity at the level of whole claims or paragraphs using coarse binary labels, obscuring how true and false details often co-exist within single sentences. These simplifications also limit interpretability: global explanations cannot identify which specific segments are misleading or differentiate how a detail is false (e.g., distorted vs. fabricated). To address these gaps, we introduce MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, consisting of paired real and fake news stories. MisSpans defines three complementary tasks: MisSpansIdentity for pinpointing false spans within sentences, MisSpansType for categorising false spans by misinformation type, and MisSpansExplanation for providing rationales grounded in identified spans. Together, these tasks enable fine-grained localisation, nuanced characterisation beyond true/false and actionable explanations. Expert annotators were guided by standardised guidelines and consistency checks, leading to high inter-annotator agreement. We evaluate 15 representative LLMs, including reasoning-enhanced and non-reasoning variants, under zero-shot and one-shot settings. Results reveal the challenging nature of fine-grained misinformation identification and analysis, and highlight the need for a deeper understanding of how performance may be influenced by multiple interacting factors, including model size and reasoning capabilities, along with domain-specific textual features. This project will be available at this https URL.
39. 【2601.04854】oken Maturation: Autoregressive Language Generation via Continuous Token Dynamics
链接:https://arxiv.org/abs/2601.04854
作者:Oshri Naparstek
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:discrete token sequences, conventionally defined, Autoregressive language, Autoregressive language models, continuous token representations
备注: In preperation to ICML 2026
点击查看摘要
Abstract:Autoregressive language models are conventionally defined over discrete token sequences, committing to a specific token at every generation step. This early discretization forces uncertainty to be resolved through token-level sampling, often leading to instability, repetition, and sensitivity to decoding heuristics. In this work, we introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that \emph{mature} over multiple update steps before being discretized. Rather than sampling tokens, the model evolves continuous token representations through a deterministic dynamical process, committing to a discrete token only when the representation has sufficiently converged. Discrete text is recovered via hard decoding, while uncertainty is maintained and resolved in the continuous space. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax), without reliance on token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations, such as stochastic dynamics or history smoothing, can be incorporated naturally but are not required for the model to function. To our knowledge, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.
Comments:
In preperation to ICML 2026
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:
arXiv:2601.04854 [cs.CL]
(or
arXiv:2601.04854v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2601.04854
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Oshri Naparstek [view email] [v1]
Thu, 8 Jan 2026 11:44:34 UTC (5,416 KB)
40. 【2601.04853】RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection
链接:https://arxiv.org/abs/2601.04853
作者:Zhiwei Liu,Runteng Guo,Baojie Qu,Yuechen Jiang,Min Peng,Qianqian Xie,Sophia Ananiadou
类目:Computation and Language (cs.CL)
关键词:Cross-domain misinformation detection, knowledge and discourse, Cross-domain misinformation, substantial differences, differences in knowledge
备注:
点击查看摘要
Abstract:Cross-domain misinformation detection is challenging, as misinformation arises across domains with substantial differences in knowledge and discourse. Existing methods often rely on single-perspective cues and struggle to generalize to challenging or underrepresented domains, while reasoning large language models (LLMs), though effective on complex tasks, are limited to same-distribution data. To address these gaps, we introduce RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection. To enable cross-domain transfer beyond same-distribution assumptions, RAAR retrieves multi-perspective source-domain evidence aligned with each target sample's semantics, sentiment, and writing style. To overcome single-perspective modeling and missing systematic reasoning, RAAR constructs verifiable multi-step reasoning paths through specialized multi-agent collaboration, where perspective-specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. RAAR further applies supervised fine-tuning and reinforcement learning to train a single multi-task verifier to enhance verification and reasoning capabilities. Based on RAAR, we trained the RAAR-8b and RAAR-14b models. Evaluation on three cross-domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches. The project will be released at this https URL.
41. 【2601.04833】When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection
链接:https://arxiv.org/abs/2601.04833
作者:Ke Sun,Guangsheng Bao,Han Cui,Yue Zhang
类目:Computation and Language (cs.CL)
关键词:Zero-shot detection methods, typically aggregate token-level, temporal dynamics inherent, Zero-shot detection, text typically aggregate
备注:
点击查看摘要
Abstract:Zero-shot detection methods for AI-generated text typically aggregate token-level statistics across entire sequences, overlooking the temporal dynamics inherent to autoregressive generation. We analyze over 120k text samples and reveal Late-Stage Volatility Decay: AI-generated text exhibits rapidly stabilizing log probability fluctuations as generation progresses, while human writing maintains higher variability throughout. This divergence peaks in the second half of sequences, where AI-generated text shows 24--32\% lower volatility. Based on this finding, we propose two simple features: Derivative Dispersion and Local Volatility, which computed exclusively from late-stage statistics. Without perturbation sampling or additional model access, our method achieves state-of-the-art performance on EvoBench and MAGE benchmarks and demonstrates strong complementarity with existing global methods.
42. 【2601.04823】DR-LoRA: Dynamic Rank LoRA for Mixture-of-Experts Adaptation
链接:https://arxiv.org/abs/2601.04823
作者:Guanzhi Deng,Bo Li,Ronghao Chen,Huacan Wang,Linqi Song,Lijie Wen
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, scaling Large Language, Language Models, Large Language, scaling Large
备注:
点击查看摘要
Abstract:Mixture-of-Experts (MoE) has become a prominent paradigm for scaling Large Language Models (LLMs). Parameter-efficient fine-tuning (PEFT), such as LoRA, is widely adopted to adapt pretrained MoE LLMs to downstream tasks. However, existing approaches assign identical LoRA ranks to all experts, overlooking the intrinsic functional specialization within MoE LLMs. This uniform allocation leads to resource mismatch, task-relevant experts are under-provisioned while less relevant ones receive redundant parameters. We propose a Dynamic Rank LoRA framework named DR-LoRA, which dynamically grows expert LoRA ranks during fine-tuning based on task-specific demands. DR-LoRA employs an Expert Saliency Scoring mechanism that integrates expert routing frequency and LoRA rank importance to quantify each expert's demand for additional capacity. Experts with higher saliency scores are prioritized for rank expansion, enabling the automatic formation of a heterogeneous rank distribution tailored to the target task. Experiments on multiple benchmarks demonstrate that DR-LoRA consistently outperforms standard LoRA and static allocation strategies under the same parameter budget, achieving superior task performance with more efficient parameter utilization.
43. 【2601.04795】Defense Against Indirect Prompt Injection via Tool Result Parsing
链接:https://arxiv.org/abs/2601.04795
作者:Qiang Yu,Xinran Cheng,Chuanyi Liu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
关键词:systems and robotics, indirect prompt injection, transition from digital, digital assistants, controllers in autonomous
备注: 20 pages, 3 figures, 5 tables
点击查看摘要
Abstract:As LLM agents transition from digital assistants to physical controllers in autonomous systems and robotics, they face an escalating threat from indirect prompt injection. By embedding adversarial instructions into the results of tool calls, attackers can hijack the agent's decision-making process to execute unauthorized actions. This vulnerability poses a significant risk as agents gain more direct control over physical environments. Existing defense mechanisms against Indirect Prompt Injection (IPI) generally fall into two categories. The first involves training dedicated detection models; however, this approach entails high computational overhead for both training and inference, and requires frequent updates to keep pace with evolving attack vectors. Alternatively, prompt-based methods leverage the inherent capabilities of LLMs to detect or ignore malicious instructions via prompt engineering. Despite their flexibility, most current prompt-based defenses suffer from high Attack Success Rates (ASR), demonstrating limited robustness against sophisticated injection attacks. In this paper, we propose a novel method that provides LLMs with precise data via tool result parsing while effectively filtering out injected malicious code. Our approach achieves competitive Utility under Attack (UA) while maintaining the lowest Attack Success Rate (ASR) to date, significantly outperforming existing methods. Code is available at GitHub.
44. 【2601.04790】Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework
链接:https://arxiv.org/abs/2601.04790
作者:Junhyuk Choi,Jeongyoun Kwon,Heeju Kim,Haeun Cho,Hayeong Jung,Sehee Min,Bugeun Kim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:systems utilizing large, utilizing large language, large language models, Multi-agent systems utilizing, interactions remains underexplored
备注: Preprint
点击查看摘要
Abstract:Multi-agent systems utilizing large language models often assign authoritative roles to improve performance, yet the impact of authority bias on agent interactions remains underexplored. We present the first systematic analysis of role-based authority bias in free-form multi-agent evaluation using ChatEval. Applying French and Raven's power-based theory, we classify authoritative roles into legitimate, referent, and expert types and analyze their influence across 12-turn conversations. Experiments with GPT-4o and DeepSeek R1 reveal that Expert and Referent power roles exert stronger influence than Legitimate power roles. Crucially, authority bias emerges not through active conformity by general agents, but through authoritative roles consistently maintaining their positions while general agents demonstrate flexibility. Furthermore, authority influence requires clear position statements, as neutral responses fail to generate bias. These findings provide key insights for designing multi-agent frameworks with asymmetric interaction patterns.
45. 【2601.04789】NC2C: Automated Convexification of Generic Non-Convex Optimization Problems
链接:https://arxiv.org/abs/2601.04789
作者:Xinyue Peng,Yanming Liu,Yihan Cang,Yuwei Zhang,Xinyi Wang,Songhang Deng,Jiannan Cao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:complex objective functions, posing intractable challenges, traditional solvers due, Non-convex optimization problems, engineering design
备注: First version of NC2C
点击查看摘要
Abstract:Non-convex optimization problems are pervasive across mathematical programming, engineering design, and scientific computing, often posing intractable challenges for traditional solvers due to their complex objective functions and constrained landscapes. To address the inefficiency of manual convexification and the over-reliance on expert knowledge, we propose NC2C, an LLM-based end-to-end automated framework designed to transform generic non-convex optimization problems into solvable convex forms using large language models. NC2C leverages LLMs' mathematical reasoning capabilities to autonomously detect non-convex components, select optimal convexification strategies, and generate rigorous convex equivalents. The framework integrates symbolic reasoning, adaptive transformation techniques, and iterative validation, equipped with error correction loops and feasibility domain correction mechanisms to ensure the robustness and validity of transformed problems. Experimental results on a diverse dataset of 100 generic non-convex problems demonstrate that NC2C achieves an 89.3\% execution rate and a 76\% success rate in producing feasible, high-quality convex transformations. This outperforms baseline methods by a significant margin, highlighting NC2C's ability to leverage LLMs for automated non-convex to convex transformation, reduce expert dependency, and enable efficient deployment of convex solvers for previously intractable optimization tasks.
46. 【2601.04778】CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models
链接:https://arxiv.org/abs/2601.04778
作者:Tobia Poppi,Burak Uzkent,Amanmeet Garg,Lucas Porto,Garin Kessler,Yezhou Yang,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara,Florian Schiffers
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:achieve strong multimodal, strong multimodal understanding, achieve strong, Video-language models, understanding but remain
备注:
点击查看摘要
Abstract:Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.
47. 【2601.04768】LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal
链接:https://arxiv.org/abs/2601.04768
作者:Dongjun Kim,Jeongho Yoon,Chanjun Park,Heuiseok Lim
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:identity alongside semantics, Dense retrieval, encode language identity, language identity alongside, multilingual embeddings encode
备注: 16 pages, 3 figures
点击查看摘要
Abstract:Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with especially strong gains for script-distinct languages.
48. 【2601.04767】AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search
链接:https://arxiv.org/abs/2601.04767
作者:Zefang Zong,Dingwei Chen,Yang Li,Qi Yi,Bo Zhou,Chengming Li,Bo Qian,Peng Chen,Jie Jiang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:interleaving internal reasoning, LLM agents, Turn-based Policy Optimization, external tool interactions, Agentic Turn-based Policy
备注:
点击查看摘要
Abstract:LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT$^2$PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT$^2$PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization, a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline by up to 1.84 percentage points in average, with ablation studies validating the effectiveness of each component. Our code is available at this https URL.
49. 【2601.04766】Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence
链接:https://arxiv.org/abs/2601.04766
作者:Shengyin Sun,Yiming Li,Renxi Liu,Weizhe Lin,Hui-Ling Zhen,Xianzhi Yu,Mingxuan Yuan,Chen Ma
类目:Computation and Language (cs.CL)
关键词:Decoding accelerates LLM, accelerates LLM inference, Speculative Decoding, Judge Decoding accelerates, accelerates LLM
备注: 16 pages
点击查看摘要
Abstract:Judge Decoding accelerates LLM inference by relaxing the strict verification of Speculative Decoding, yet it typically relies on expensive and noisy supervision. In this work, we revisit this paradigm from first principles, revealing that the ``criticality'' scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence. We theoretically prove a structural correspondence between learned linear judges and Kullback-Leibler (KL) divergence, demonstrating they rely on the same underlying logit primitives. Guided by this, we propose a simple, training-free verification mechanism based on KL divergence. Extensive experiments across reasoning and coding benchmarks show that our method matches or outperforms complex trained judges (e.g., AutoJudge), offering superior robustness to domain shifts and eliminating the supervision bottleneck entirely.
50. 【2601.04765】Differential syntactic and semantic encoding in LLMs
链接:https://arxiv.org/abs/2601.04765
作者:Santiago Acevedo,Alessandro Laio,Marco Baroni
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
关键词:Large Language Models, Language Models, Large Language, syntactic and semantic, semantic information
备注:
点击查看摘要
Abstract:We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic ``centroids'' from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.
51. 【2601.04758】PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks
链接:https://arxiv.org/abs/2601.04758
作者:Yehoon Jang,Chaewon Lee,Hyun-seok Min,Sungchul Choi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:USPTO adjudicates thousands, parte appeals, Appeal Board, Trial and Appeal, Patent Trial
备注: Accepted at the NLLP Workshop at EMNLP 2025
点击查看摘要
Abstract:The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case-level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1 score, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at this https URL.
52. 【2601.04742】ool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval
链接:https://arxiv.org/abs/2601.04742
作者:Seyeon Jeong,Yeonjun Choi,JongWook Kim,Beakcheol Jang
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, multi-agent debate framework, Multi-Agent Debate
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) suffer from hallucinations and factual inaccuracies, especially in complex reasoning and fact verification tasks. Multi-Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue, promoting diverse reasoning and mutual verification. However, existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations. While MADKE introduces external evidence to mitigate this, its one-time retrieval mechanism limits adaptability to new arguments or emerging information during the debate. To address these limitations, We propose Tool-MAD, a multi-agent debate framework that enhances factual verification by assigning each agent a distinct external tool, such as a search API or RAG module. Tool-MAD introduces three key innovations: (1) a multi-agent debate framework where agents leverage heterogeneous external tools, encouraging diverse perspectives, (2) an adaptive query formulation mechanism that iteratively refines evidence retrieval based on the flow of the debate, and (3) the integration of Faithfulness and Answer Relevance scores into the final decision process, allowing the Judge agent to quantitatively assess the coherence and question alignment of each response and effectively detect hallucinations. Experimental results on four fact verification benchmarks demonstrate that Tool-MAD consistently outperforms state-of-the-art MAD frameworks, achieving up to 5.5% accuracy improvement. Furthermore, in medically specialized domains, Tool-MAD exhibits strong robustness and adaptability across various tool configurations and domain conditions, confirming its potential for broader real-world fact-checking applications.
53. 【2601.04740】RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation
链接:https://arxiv.org/abs/2601.04740
作者:Huawei Zheng,Xinqi Jiang,Sen Yang,Shouling Ji,Yingcai Wu,Dazhen Deng
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, unique safety risks, introduce unique safety, language models
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and still largely rely on manual construction; public datasets mainly focus on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts-expressed through indirect domain knowledge-are harder to detect and better reflect real-world threats. We identify two challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies dual-path obfuscation rewriting to convert explicit harmful prompts into implicit variants via direct and context-enhanced rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. We release our code and datasets at GitHub.
54. 【2601.04736】AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs
链接:https://arxiv.org/abs/2601.04736
作者:Han Zhu,Jiale Chen,Chengkun Cai,Shengjie Sun,Haoran Li,Yujin Zhou,Chi-Min Chan,Pengcheng Wen,Lei Li,Sirui Han,Yike Guo
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Multi-modal Large Language, Large Language, Language Models, interactive applications
备注:
点击查看摘要
Abstract:Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answer (VQA) task and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show more than 10\% decrease in Attack Success Rate (ASR) together with an increment of at least 8\% in harmless dimension and over 13\% in helpful dimension of MLLMs on multi-modal multi-turn safety benchmarks, while preserving their general abilities.
55. 【2601.04731】Miner:Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models
链接:https://arxiv.org/abs/2601.04731
作者:Shuyang Jiang,Yuhao Wang,Ya Zhang,Yanfeng Wang,Yu Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:positive homogeneous prompts, Current critic-free, homogeneous prompts, resulting in waste, rollouts due
备注: 22 pages
点击查看摘要
Abstract:Current critic-free RL methods for large reasoning models suffer from severe inefficiency when training on positive homogeneous prompts (where all rollouts are correct), resulting in waste of rollouts due to zero advantage estimates. We introduce a radically simple yet powerful solution to \uline{M}ine \uline{in}trinsic mast\uline{er}y (Miner), that repurposes the policy's intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or additional inference cost. Our method pioneers two key innovations: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) adaptive advantage calibration to seamlessly integrate intrinsic and verifiable rewards. Evaluated across six reasoning benchmarks on Qwen3-4B and Qwen3-8B base models, Miner achieves state-of-the-art performance among the other four algorithms, yielding up to \textbf{4.58} absolute gains in Pass@1 and \textbf{6.66} gains in Pass@K compared to GRPO. Comparison with other methods targeted at exploration enhancement further discloses the superiority of the two newly proposed innovations. This demonstrates that latent uncertainty exploitation is both necessary and sufficient for efficient and scalable RL training of reasoning models.
56. 【2601.04730】Automatic Classifiers Underdetect Emotions Expressed by Men
链接:https://arxiv.org/abs/2601.04730
作者:Ivan Smirnov,Segun T. Aroyehun,Paul Plener,David Garcia
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:tools perform reliably, widespread adoption, makes it important, important to ensure, perform reliably
备注:
点击查看摘要
Abstract:The widespread adoption of automatic sentiment and emotion classifiers makes it important to ensure that these tools perform reliably across different populations. Yet their reliability is typically assessed using benchmarks that rely on third-party annotators rather than the individuals experiencing the emotions themselves, potentially concealing systematic biases. In this paper, we use a unique, large-scale dataset of more than one million self-annotated posts and a pre-registered research design to investigate gender biases in emotion detection across 414 combinations of models and emotion-related classes. We find that across different types of automatic classifiers and various underlying emotions, error rates are consistently higher for texts authored by men compared to those authored by women. We quantify how this bias could affect results in downstream applications and show that current machine learning tools, including large language models, should be applied with caution when the gender composition of a sample is not known or variable. Our findings demonstrate that sentiment analysis is not yet a solved problem, especially in ensuring equitable model behaviour across demographic groups.
57. 【2601.04726】Memory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning
链接:https://arxiv.org/abs/2601.04726
作者:Yuyang Hu,Jiongnan Liu,Jiejun Tan,Yutao Zhu,Zhicheng Dou
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large language models, Large language, increasingly deployed, deployed as intelligent, Large
备注: 19 pages,6 figures
点击查看摘要
Abstract:Large language models (LLMs) are increasingly deployed as intelligent agents that reason, plan, and interact with their environments. To effectively scale to long-horizon scenarios, a key capability for such agents is a memory mechanism that can retain, organize, and retrieve past experiences to support downstream decision-making. However, most existing approaches organize and store memories in a flat manner and rely on simple similarity-based retrieval techniques. Even when structured memory is introduced, existing methods often struggle to explicitly capture the logical relationships among experiences or memory units. Moreover, memory access is largely detached from the constructed structure and still depends on shallow semantic retrieval, preventing agents from reasoning logically over long-horizon dependencies. In this work, we propose CompassMem, an event-centric memory framework inspired by Event Segmentation Theory. CompassMem organizes memory as an Event Graph by incrementally segmenting experiences into events and linking them through explicit logical relations. This graph serves as a logic map, enabling agents to perform structured and goal-directed navigation over memory beyond superficial retrieval, progressively gathering valuable memories to support long-horizon reasoning. Experiments on LoCoMo and NarrativeQA demonstrate that CompassMem consistently improves both retrieval and reasoning performance across multiple backbone models.
58. 【2601.04720】Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
链接:https://arxiv.org/abs/2601.04720
作者:Mingxin Li,Yanzhao Zhang,Dingkun Long,Keqin Chen,Sibo Song,Shuai Bai,Zhibo Yang,Pengjun Xie,An Yang,Dayiheng Liu,Jingren Zhou,Junyang Lin
类目:Computation and Language (cs.CL)
关键词:Qwen family built, Qwen family, latest extensions, family built, Matryoshka Representation Learning
备注:
点击查看摘要
Abstract:In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in $\textbf{2B}$ and $\textbf{8B}$ parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of $\textbf{77.8}$ on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
59. 【2601.04716】Fame Fades, Nature Remains: Disentangling the Character Identity of Role-Playing Agents
链接:https://arxiv.org/abs/2601.04716
作者:Yonghyun Jun,Junhyuk Choi,Jihyeong Park,Hwanhee Lee
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, arbitrary text inputs, identity remain weakly, based on Large
备注: 27 pages
点击查看摘要
Abstract:Despite the rapid proliferation of Role-Playing Agents (RPAs) based on Large Language Models (LLMs), the structural dimensions defining a character's identity remain weakly formalized, often treating characters as arbitrary text inputs. In this paper, we propose the concept of \textbf{Character Identity}, a multidimensional construct that disentangles a character into two distinct layers: \textbf{(1) Parametric Identity}, referring to character-specific knowledge encoded from the LLM's pre-training, and \textbf{(2) Attributive Identity}, capturing fine-grained behavioral properties such as personality traits and moral values. To systematically investigate these layers, we construct a unified character profile schema and generate both Famous and Synthetic characters under identical structural constraints. Our evaluation across single-turn and multi-turn interactions reveals two critical phenomena. First, we identify \textit{"Fame Fades"}: while famous characters hold a significant advantage in initial turns due to parametric knowledge, this edge rapidly vanishes as models prioritize accumulating conversational context over pre-trained priors. Second, we find that \textit{"Nature Remains"}: while models robustly portray general personality traits regardless of polarity, RPA performance is highly sensitive to the valence of morality and interpersonal relationships. Our findings pinpoint negative social natures as the primary bottleneck in RPA fidelity, guiding future character construction and evaluation.
60. 【2601.04711】DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs
链接:https://arxiv.org/abs/2601.04711
作者:Anh Thi-Hoang Nguyen,Khanh Quoc Tran,Tin Van Huynh,Phuoc Tan-Hoang Nguyen,Cam Tan Nguyen,Kiet Van Nguyen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:production environments remains, environments remains significantly, remains significantly constrained, plausible-sounding outputs, fabricate information
备注:
点击查看摘要
Abstract:The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations--fluent, plausible-sounding outputs that contradict or fabricate information. While hallucination detection has recently emerged as a priority in English-centric benchmarks, low-to-medium resource languages such as Vietnamese remain inadequately covered by standardized evaluation frameworks. This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese LLMs. We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples systematically partitioned into three hallucination categories: no hallucination, intrinsic, and extrinsic hallucinations. The dataset incorporates three prompt types--factual, noisy, and adversarial--to stress-test model robustness. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80\%, compared to a baseline encoder-only score of 32.83\%, demonstrating that instruction-tuned LLMs with structured prompting and ensemble strategies substantially outperform generic architectures. However, the gap to perfect performance indicates that hallucination detection remains a challenging problem, particularly for intrinsic (contradiction-based) hallucinations. This work establishes a rigorous benchmark and explores a diverse range of detection methodologies, providing a foundation for future research into the trustworthiness and reliability of Vietnamese language AI systems.
61. 【2601.04710】Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning
链接:https://arxiv.org/abs/2601.04710
作者:Feihu Jin,Shipeng Cen,Ying Tan
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Fine-tuning large language, achieved remarkable success, substantial memory overhead, Fine-tuning large, critical bottleneck
备注: 12pages, 6figures
点击查看摘要
Abstract:Fine-tuning large language models (LLMs) has achieved remarkable success across various NLP tasks, but the substantial memory overhead during backpropagation remains a critical bottleneck, especially as model scales grow. Zeroth-order (ZO) optimization alleviates this issue by estimating gradients through forward passes and Gaussian sampling, avoiding the need for backpropagation. However, conventional ZO methods suffer from high variance in gradient estimation due to their reliance on random perturbations, leading to slow convergence and suboptimal performance. We propose a simple plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation. Our method dynamically computes a guiding vector from Gaussian samples, which directs perturbations toward more informative directions, significantly accelerating convergence compared to standard ZO approaches. We further investigate a greedy perturbation strategy to explore the impact of prior knowledge on gradient estimation. Theoretically, we prove that our gradient estimator achieves stronger alignment with the true gradient direction, enhancing optimization efficiency. Extensive experiments across LLMs of varying scales and architectures demonstrate that our proposed method could seamlessly integrate into existing optimization methods, delivering faster convergence and superior performance. Notably, on the OPT-13B model, our method outperforms traditional ZO optimization across all 11 benchmark tasks and surpasses gradient-based baselines on 9 out of 11 tasks, establishing a robust balance between efficiency and accuracy.
62. 【2601.04700】PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards
链接:https://arxiv.org/abs/2601.04700
作者:Mukesh Ghimire,Aosong Feng,Liwen You,Youzhi Luo,Fang Liu,Xuan Zhu
类目:Computation and Language (cs.CL)
关键词:post-training Large Language, Large Language Models, Large Language, Current techniques, post-training Large
备注: Preprint. Under Review
点击查看摘要
Abstract:Current techniques for post-training Large Language Models (LLMs) rely either on costly human supervision or on external verifiers to boost performance on tasks such as mathematical reasoning and code generation. However, as LLMs improve their problem-solving, any further improvement will potentially require high-quality solutions to difficult problems that are not available to humans. As a result, learning from unlabeled data is becoming increasingly attractive in the research community. Existing methods extract learning signal from a model's consistency, either by majority voting or by converting the model's internal confidence into reward. Although internal consistency metric such as entropy or self-certainty require no human intervention, as we show in this work, these are unreliable signals for large-scale and long-term training. To address the unreliability, we propose PRISM, a unified training framework that uses a Process Reward Model (PRM) to guide learning alongside model's internal confidence in the absence of ground-truth labels. We show that effectively combining PRM with self-certainty can lead to both stable training and better test-time performance, and also keep the model's internal confidence in check.
63. 【2601.04698】ourPlanner: A Competitive Consensus Framework with Constraint-Gated Reinforcement Learning for Travel Planning
链接:https://arxiv.org/abs/2601.04698
作者:Yinuo Wang,Mining Tan,Wenxiang Jiao,Xiaoxi Li,Hao Wang,Xuanyu Zhang,Yuan Lu,Weiming Dong
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:sophisticated decision-making process, requires synthesizing multifaceted, synthesizing multifaceted information, Travel planning, travel planning approaches
备注:
点击查看摘要
Abstract:Travel planning is a sophisticated decision-making process that requires synthesizing multifaceted information to construct itineraries. However, existing travel planning approaches face several challenges: (1) Pruning candidate points of interest (POIs) while maintaining a high recall rate; (2) A single reasoning path restricts the exploration capability within the feasible solution space for travel planning; (3) Simultaneously optimizing hard constraints and soft constraints remains a significant difficulty. To address these challenges, we propose TourPlanner, a comprehensive framework featuring multi-path reasoning and constraint-gated reinforcement learning. Specifically, we first introduce a Personalized Recall and Spatial Optimization (PReSO) workflow to construct spatially-aware candidate POIs' set. Subsequently, we propose Competitive consensus Chain-of-Thought (CCoT), a multi-path reasoning paradigm that improves the ability of exploring the feasible solution space. To further refine the plan, we integrate a sigmoid-based gating mechanism into the reinforcement learning stage, which dynamically prioritizes soft-constraint satisfaction only after hard constraints are met. Experimental results on travel planning benchmarks demonstrate that TourPlanner achieves state-of-the-art performance, significantly surpassing existing methods in both feasibility and user-preference alignment.
64. 【2601.04696】A Method for Constructing a Digital Transformation Driving Mechanism Based on Semantic Understanding of Large Models
链接:https://arxiv.org/abs/2601.04696
作者:Huayi Liu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:intelligent decision-making basis, insufficient semantic understanding, Bidirectional Encoder Representations, faced with problems, unstructured data
备注:
点击查看摘要
Abstract:In the process of digital transformation, enterprises are faced with problems such as insufficient semantic understanding of unstructured data and lack of intelligent decision-making basis in driving mechanisms. This study proposes a method that combines a large language model (LLM) and a knowledge graph. First, a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model is used to perform entity recognition and relationship extraction on multi-source heterogeneous texts, and GPT-4 is used to generate semantically enhanced vector representations; secondly, a two-layer graph neural network (GNN) architecture is designed to fuse the semantic vectors output by LLM with business metadata to construct a dynamic and scalable enterprise knowledge graph; then reinforcement learning is introduced to optimize decision path generation, and the reward function is used to drive the mechanism iteration. In the case of the manufacturing industry, this mechanism reduced the response time for equipment failure scenarios from 7.8 hours to 3.7 hours, the F1 value reached 94.3%, and the compensation for decision errors in the annual digital transformation cost decreased by 45.3%. This method significantly enhances the intelligence level and execution efficiency of the digital transformation driving mechanism by integrating large model semantic understanding with structured knowledge.
65. 【2601.04693】hunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding
链接:https://arxiv.org/abs/2601.04693
作者:Sungmok Jung,Yeonkyoung So,Joonhak Lee,Sangho Kim,Yelim Ahn,Jaejin Lee
类目:Computation and Language (cs.CL)
关键词:challenge large language, large language models, Korean negation, Korean, challenge large
备注:
点击查看摘要
Abstract:Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.
66. 【2601.04692】See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation
链接:https://arxiv.org/abs/2601.04692
作者:Naquee Rizwan,Subhankar Swain,Paramananda Bhaskar,Gagan Aryan,Shehryaar Shah Khan,Animesh Mukherjee
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:complementary angles, examine hateful memes, intervene them prior, applying a range, range of strategies
备注:
点击查看摘要
Abstract:In this work, we examine hateful memes from three complementary angles - how to detect them, how to explain their content and how to intervene them prior to being posted - by applying a range of strategies built on top of generative AI models. To the best of our knowledge, explanation and intervention have typically been studied separately from detection, which does not reflect real-world conditions. Further, since curating large annotated datasets for meme moderation is prohibitively expensive, we propose a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to cater to different types of memes. We believe this is the first work focused on generalizable hateful meme moderation under limited data conditions, and has strong potential for deployment in real-world production scenarios. Warning: Contains potentially toxic contents.
67. 【2601.04688】oolGate: Contract-Grounded and Verified Tool Execution for LLMs
链接:https://arxiv.org/abs/2601.04688
作者:Yanming Liu,Xinyue Peng,Jiannan Cao,Xinyi Wang,Songhang Deng,Jintao Chen,Jianwei Yin,Xuhong Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
关键词:Large Language Models, demonstrated remarkable capabilities, Large Language, demonstrated remarkable, remarkable capabilities
备注: First version of ToolGate
点击查看摘要
Abstract:Large Language Models (LLMs) augmented with external tools have demonstrated remarkable capabilities in complex reasoning tasks. However, existing frameworks rely heavily on natural language reasoning to determine when tools can be invoked and whether their results should be committed, lacking formal guarantees for logical safety and verifiability. We present \textbf{ToolGate}, a forward execution framework that provides logical safety guarantees and verifiable state evolution for LLM tool calling. ToolGate maintains an explicit symbolic state space as a typed key-value mapping representing trusted world information throughout the reasoning process. Each tool is formalized as a Hoare-style contract consisting of a precondition and a postcondition, where the precondition gates tool invocation by checking whether the current state satisfies the required conditions, and the postcondition determines whether the tool's result can be committed to update the state through runtime verification. Our approach guarantees that the symbolic state evolves only through verified tool executions, preventing invalid or hallucinated results from corrupting the world representation. Experimental validation demonstrates that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks. This work establishes a foundation for building more trustworthy and debuggable AI systems that integrate language models with external tools.
68. 【2601.04672】Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning
链接:https://arxiv.org/abs/2601.04672
作者:Wentao Zhang,Lifei Wang,Lina Lu,MingKun Xu,Shangyang Li,Yanchao Yang,Tao Fang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:diagnosis challenges VLMs, requires extensive labels, disease diagnosis challenges, conventional fine-tuning requires, fine-tuning requires extensive
备注: This paper is submitted for review to ACL 2026. It is 17 pages long and includes 5 figures. The corresponding authors are Tao Fang and Lina Lu
点击查看摘要
Abstract:Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose \textbf{Agri-R1}, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19\% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a novel proposed reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2\% relative gain in disease recognition accuracy, +33.3\% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.
69. 【2601.04664】CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models
链接:https://arxiv.org/abs/2601.04664
作者:Yifan Le,Yunliang Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Multilingual large language, remains poorly understood, level remains poorly, Multilingual large, neuron level remains
备注: 10 pages, 6 figures. Work in progress
点击查看摘要
Abstract:Multilingual large language models (LLMs) achieve strong performance across languages, yet how language capabilities are organized at the neuron level remains poorly understood. Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance. Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance. We propose CRANE, a relevance-based analysis framework that redefines language specificity in terms of functional necessity, identifying language-specific neurons through targeted neuron-level interventions. CRANE characterizes neuron specialization by their contribution to language-conditioned predictions rather than activation magnitude. Our implementation will be made publicly available. Neuron-level interventions reveal a consistent asymmetric pattern: masking neurons relevant to a target language selectively degrades performance on that language while preserving performance on other languages to a substantial extent, indicating language-selective but non-exclusive neuron specializations. Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than activation-based methods.
70. 【2601.04646】Succeeding at Scale: Automated Multi-Retriever Fusion and Query-Side Adaptation for Multi-Tenant Search
链接:https://arxiv.org/abs/2601.04646
作者:Prateek Jain,Shabari S Nair,Ritesh Goru,Prakhar Agarwal,Ajay Yadav,Yoga Sri Varshan Varadharajan,Constantine Caramanis
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:systems amass vast, amass vast user, Large-scale multi-tenant retrieval, effective domain adaptation, retrieval systems amass
备注:
点击查看摘要
Abstract:Large-scale multi-tenant retrieval systems amass vast user query logs yet critically lack the curated relevance labels required for effective domain adaptation. This "dark data" problem is exacerbated by the operational cost of model updates: jointly fine-tuning query and document encoders requires re-indexing the entire corpus, which is prohibitive in multi-tenant environments with thousands of isolated indices. To address these dual challenges, we introduce \textbf{DevRev Search}, a passage retrieval benchmark for technical customer support constructed through a fully automatic pipeline. We employ a \textbf{fusion-based candidate generation} strategy, pooling results from diverse sparse and dense retrievers, and utilize an LLM-as-a-Judge to perform rigorous \textbf{consistency filtering} and relevance assignment. We further propose a practical \textbf{Index-Preserving Adaptation} strategy: by fine-tuning only the query encoder via Low-Rank Adaptation (LoRA), we achieve competitive performance improvements while keeping the document index frozen. Our experiments on DevRev Search and SciFact demonstrate that targeting specific transformer layers in the query encoder yields optimal quality-efficiency trade-offs, offering a scalable path for personalized enterprise search.
71. 【2601.04641】DP-MGTD: Privacy-Preserving Machine-Generated Text Detection via Adaptive Differentially Private Entity Sanitization
链接:https://arxiv.org/abs/2601.04641
作者:Lionel Z. Wang,Yusheng Zhao,Jiabin Luo,Xinfeng Li,Lixu Wang,Yinan Peng,Haoyang Li,XiaoFeng Wang,Wei Dong
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:sensitive user data, systems necessitates processing, necessitates processing sensitive, processing sensitive user, detection systems necessitates
备注: 12 pages, 1 figure, 1 tables
点击查看摘要
Abstract:The deployment of Machine-Generated Text (MGT) detection systems necessitates processing sensitive user data, creating a fundamental conflict between authorship verification and privacy preservation. Standard anonymization techniques often disrupt linguistic fluency, while rigorous Differential Privacy (DP) mechanisms typically degrade the statistical signals required for accurate detection. To resolve this dilemma, we propose \textbf{DP-MGTD}, a framework incorporating an Adaptive Differentially Private Entity Sanitization algorithm. Our approach utilizes a two-stage mechanism that performs noisy frequency estimation and dynamically calibrates privacy budgets, applying Laplace and Exponential mechanisms to numerical and textual entities respectively. Crucially, we identify a counter-intuitive phenomenon where the application of DP noise amplifies the distinguishability between human and machine text by exposing distinct sensitivity patterns to perturbation. Extensive experiments on the MGTBench-2.0 dataset show that our method achieves near-perfect detection accuracy, significantly outperforming non-private baselines while satisfying strict privacy guarantees.
72. 【2601.04638】SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation
链接:https://arxiv.org/abs/2601.04638
作者:Sirry Chen,Jieyi Wang,Wei Chen,Zhongyu Wei
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:intrinsically speech-centric, medical speech data, speech data, Knowledge Capability Injection, Medical
备注:
点击查看摘要
Abstract:Medical consultations are intrinsically speech-centric. However, most prior works focus on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more natural speech-based interaction, yet the scarcity of medical speech data and the inefficiency of directly fine-tuning on speech data jointly hinder the adoption of SpeechLMs in medical consultation. In this paper, we propose SpeechMedAssist, a SpeechLM natively capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm consisting of (1) Knowledge Capability Injection via Text and (2) Modality Re-alignment with Limited Speech Data, thereby reducing the requirement for medical speech data to only 10k synthesized samples. To evaluate SpeechLMs for medical consultation scenarios, we design a benchmark comprising both single-turn question answering and multi-turn simulated interactions. Experimental results show that our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.
73. 【2601.04633】MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark
链接:https://arxiv.org/abs/2601.04633
作者:Anyang Song,Ying Cheng,Yiqian Xu,Rui Feng
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, textbf, constantly evolving
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) alignment is constantly evolving. Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse issues such as fake news and online fraud. Fine-tuned detectors' generalization ability is highly dependent on dataset quality, and simply expanding the sources of MGT is insufficient. Further augment of generation process is required. According to HC-Var's theory, enhancing the alignment of generated text can not only facilitate attacks on existing detectors to test their robustness, but also help improve the generalization ability of detectors fine-tuned on it. Therefore, we propose \textbf{M}achine-\textbf{A}ugment-\textbf{G}enerated Text via \textbf{A}lignment (MAGA). MAGA's pipeline achieves comprehensive alignment from prompt construction to reasoning process, among which \textbf{R}einforced \textbf{L}earning from \textbf{D}etectors \textbf{F}eedback (RLDF), systematically proposed by us, serves as a key component. In our experiments, the RoBERTa detector fine-tuned on MAGA training set achieved an average improvement of 4.60\% in generalization detection AUC. MAGA Dataset caused an average decrease of 8.13\% in the AUC of the selected detectors, expecting to provide indicative significance for future research on the generalization detection ability of detectors.
74. 【2601.04632】From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset
链接:https://arxiv.org/abs/2601.04632
作者:Haneul Yoo,Won Ik Cho,Geunhye Kim,Jiyoon Han
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, English-centric training data, achieve strong performance, progress remains uneven, Large language
备注:
点击查看摘要
Abstract:Large language models (LLMs) achieve strong performance on many tasks, but their progress remains uneven across languages and cultures, often reflecting values latent in English-centric training data. To enable practical cultural alignment, we propose a scalable approach that leverages national social studies curricula as a foundation for culture-aware supervision. We introduce CuCu, an automated multi-agent LLM framework that transforms national textbook curricula into open-ended, culture-specific question-answer pairs. Applying CuCu to the Korean national social studies curriculum, we construct KCaQA, comprising 34.1k open-ended QA pairs. Our quantitative and qualitative analyses suggest that KCaQA covers culture-specific topics and produces responses grounded in local sociocultural contexts.
75. 【2601.04611】Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR
链接:https://arxiv.org/abs/2601.04611
作者:Yihong Tang,Kehai Chen,Xuefeng Bai,Benyou Wang,Zeming Liu,Haifeng Wang,Min Zhang
类目:Computation and Language (cs.CL)
关键词:Current role-playing agents, imitating surface-level behaviors, Current role-playing, approach lacks internal, internal cognitive consistency
备注:
点击查看摘要
Abstract:Current role-playing agents (RPAs) are typically constructed by imitating surface-level behaviors, but this approach lacks internal cognitive consistency, often causing out-of-character errors in complex situations. To address this, we propose Character-R1, a framework designed to provide comprehensive verifiable reward signals for effective role-aware reasoning, which are missing in recent studies. Specifically, our framework comprises three core designs: (1) Cognitive Focus Reward, which enforces explicit label-based analysis of 10 character elements (e.g., worldview) to structure internal cognition; (2) Reference-Guided Reward, which utilizes overlap-based metrics with reference responses as optimization anchors to enhance exploration and performance; and (3) Character-Conditioned Reward Normalization, which adjusts reward distributions based on character categories to ensure robust optimization across heterogeneous roles. Extensive experiments demonstrate that Character-R1 significantly outperforms existing methods in knowledge, memory and others.
76. 【2601.04609】When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation
链接:https://arxiv.org/abs/2601.04609
作者:Rhea Kapur,Robert Hawkins,Elisa Kreiss
类目:Computation and Language (cs.CL)
关键词:Vision-language models, visual content accessible, accessible via text-based, Vision-language, make visual content
备注:
点击查看摘要
Abstract:Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, description specificity is often conflated with their length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.
77. 【2601.04600】On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions
链接:https://arxiv.org/abs/2601.04600
作者:Zhiyuan He,Binghan Chen,Tianxiang Xiong,Ziyang Sun,Mozhao Zhu,Xi Chen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:show superior efficiency, updating single-hop facts, Recent advances, show superior, facts in transformers
备注:
点击查看摘要
Abstract:Recent advances in Knowledge Editing (KE), particularly Rank-One Model Editing (ROME), show superior efficiency over fine-tuning and in-context learning for updating single-hop facts in transformers. However, these methods face significant challenges when applied to multi-hop reasoning tasks requiring knowledge chaining. In this work, we study the effect of editing knowledge with ROME on different layer depths and identify three key failure modes. First, the "hopping-too-late" problem occurs as later layers lack access to necessary intermediate representations. Second, generalization ability deteriorates sharply when editing later layers. Third, the model overfits to edited knowledge, incorrectly prioritizing edited-hop answers regardless of context. To mitigate the issues of "hopping-too-late" and generalisation decay, we propose Redundant Editing, a simple yet effective strategy that enhances multi-hop reasoning. Our experiments demonstrate that this approach can improve accuracy on 2-hop questions by at least 15.5 percentage points, representing a 96% increase over the previous single-edit strategy, while trading off some specificity and language naturalness.
78. 【2601.04597】HaLLE-ThaiLLM: Domain-Specialized Small LLMs for Finance and Thai -- Technical Report
链接:https://arxiv.org/abs/2601.04597
作者:KBTG Labs:Anuruth Lertpiya,Danupat Khamnuansin,Kantapong Sucharitpongpan,Pornchanan Balee,Tawunrat Chalothorn,Thadpong Pongthawornkamol,Monchai Lertsutthiwong
类目:Computation and Language (cs.CL)
关键词:Large Language Models, demonstrated significant potential, automate complex tasks, Large Language, banking and finance
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated significant potential across various domains, particularly in banking and finance, where they can automate complex tasks and enhance decision-making at scale. Due to privacy, security, and regulatory concerns, organizations often prefer on-premise deployment of LLMs. The ThaiLLM initiative aims to enhance Thai language capabilities in open-LLMs, enabling Thai industry to leverage advanced language models. However, organizations often face a trade-off between deploying multiple specialized models versus the prohibitive expense of training a single multi-capability model. To address this, we explore model merging as a resource-efficient alternative for developing high-performance, multi-capability LLMs. We present results from two key experiments: first, merging Qwen-8B with ThaiLLM-8B demonstrates how ThaiLLM-8B enhances Thai general capabilities, showing an uplift of M3 and M6 O-NET exams over the general instruction-following Qwen-8B. Second, we merge Qwen-8B with both ThaiLLM-8B and THaLLE-CFA-8B. This combination results in further improvements in performance across both general and financial domains, by demonstrating an uplift in both M3 and M6 O-NET, Flare-CFA, and Thai-IC benchmarks. The report showcases the viability of model merging for efficiently creating multi-capability LLMs.
79. 【2601.04582】Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization
链接:https://arxiv.org/abs/2601.04582
作者:Mizanur Rahman,Mohammed Saidul Islam,Md Tahmid Rahman Laskar,Shafiq Joty,Enamul Hoque
类目:Computation and Language (cs.CL)
关键词:systems translate natural, translate natural language, natural language queries, systems translate, translate natural
备注: Accepted to EACL Main Conference
点击查看摘要
Abstract:Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at this https URL.
80. 【2601.04574】FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback
链接:https://arxiv.org/abs/2601.04574
作者:Seongyeub Chu,Jongwoo Kim,Munyong Yi
类目:Computation and Language (cs.CL)
关键词:numerical scores, recent research, actionable guidance, prediction of numerical, research in automated
备注:
点击查看摘要
Abstract:Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We will release our code and curated datasets upon accepted.
81. 【2601.04568】Neurosymbolic Retrievers for Retrieval-augmented Generation
链接:https://arxiv.org/abs/2601.04568
作者:Yash Saxena,Manas Gaur
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Retrieval Augmented Generation, Augmented Generation, large language models, made significant strides, overcoming key limitations
备注: 8 pages, 2 Figures, To Appear in IEEE Intelligent Systems
点击查看摘要
Abstract:Retrieval Augmented Generation (RAG) has made significant strides in overcoming key limitations of large language models, such as hallucination, lack of contextual grounding, and issues with transparency. However, traditional RAG systems consist of three interconnected neural components - the retriever, re-ranker, and generator - whose internal reasoning processes remain opaque. This lack of transparency complicates interpretability, hinders debugging efforts, and erodes trust, especially in high-stakes domains where clear decision-making is essential. To address these challenges, we introduce the concept of Neurosymbolic RAG, which integrates symbolic reasoning using a knowledge graph with neural retrieval techniques. This new framework aims to answer two primary questions: (a) Can retrievers provide a clear and interpretable basis for document selection? (b) Can symbolic knowledge enhance the clarity of the retrieval process? We propose three methods to improve this integration. First is MAR (Knowledge Modulation Aligned Retrieval) that employs modulation networks to refine query embeddings using interpretable symbolic features, thereby making document matching more explicit. Second, KG-Path RAG enhances queries by traversing knowledge graphs to improve overall retrieval quality and interpretability. Lastly, Process Knowledge-infused RAG utilizes domain-specific tools to reorder retrieved content based on validated workflows. Preliminary results from mental health risk assessment tasks indicate that this neurosymbolic approach enhances both transparency and overall performance
82. 【2601.04566】BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents
链接:https://arxiv.org/abs/2601.04566
作者:Yunhao Feng,Yige Li,Yutao Wu,Yingshui Tan,Yanming Guo,Yifan Ding,Kun Zhai,Xingjun Ma,Yugang Jiang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large language model, Large language, agents execute tasks, textbf, language model
备注:
点击查看摘要
Abstract:Large language model (LLM) agents execute tasks through multi-step workflows that combine planning, memory, and tool use. While this design enables autonomy, it also expands the attack surface for backdoor threats. Backdoor triggers injected into specific stages of an agent workflow can persist through multiple intermediate states and adversely influence downstream outputs. However, existing studies remain fragmented and typically analyze individual attack vectors in isolation, leaving the cross-stage interaction and propagation of backdoor triggers poorly understood from an agent-centric perspective. To fill this gap, we propose \textbf{BackdoorAgent}, a modular and stage-aware framework that provides a unified, agent-centric view of backdoor threats in LLM agents. BackdoorAgent structures the attack surface into three functional stages of agentic workflows, including \textbf{planning attacks}, \textbf{memory attacks}, and \textbf{tool-use attacks}, and instruments agent execution to enable systematic analysis of trigger activation and propagation across different stages. Building on this framework, we construct a standardized benchmark spanning four representative agent applications: \textbf{Agent QA}, \textbf{Agent Code}, \textbf{Agent Web}, and \textbf{Agent Drive}, covering both language-only and multimodal settings. Our empirical analysis shows that \textit{triggers implanted at a single stage can persist across multiple steps and propagate through intermediate states.} For instance, when using a GPT-based backbone, we observe trigger persistence in 43.58\% of planning attacks, 77.97\% of memory attacks, and 60.28\% of tool-stage attacks, highlighting the vulnerabilities of the agentic workflow itself to backdoor threats. To facilitate reproducibility and future research, our code and benchmark are publicly available at GitHub.
83. 【2601.04563】A Vision for Multisensory Intelligence: Sensing, Synergy, and Science
链接:https://arxiv.org/abs/2601.04563
作者:Paul Pu Liang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:spanning a synthesis, synthesis of language, multisensory artificial intelligence, artificial intelligence, Multisensory Intelligence group
备注:
点击查看摘要
Abstract:Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals from physiological and tactile cues on the body, to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, developing a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross-modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of latest advances from the Multisensory Intelligence group at the MIT Media Lab, see this https URL.
84. 【2601.04548】Identifying Good and Bad Neurons for Task-Level Controllable LLMs
链接:https://arxiv.org/abs/2601.04548
作者:Wenjie Li,Guansong Pang,Hezhe Qiao,Debin Gao,David Lo
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, complex mechanisms underlying
备注:
点击查看摘要
Abstract:Large Language Models have demonstrated remarkable capabilities on multiple-choice question answering benchmarks, but the complex mechanisms underlying their large-scale neurons remain opaque, posing significant challenges for understanding and steering LLMs. While recent studies made progress on identifying responsible neurons for certain abilities, these ability-specific methods are infeasible for task-focused scenarios requiring coordinated use of multiple abilities. Moreover, these approaches focus only on supportive neurons that correlate positively with task completion, while neglecting neurons with other roles-such as inhibitive roles-and misled neuron attribution due to fortuitous behaviors in LLMs (i.e., correctly answer the questions by chance rather than genuine understanding). To address these challenges, we propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification. The key insight is that task performance is jointly determined by neurons with two opposing roles: good neurons that facilitate task completion and bad neurons that inhibit it. NeuronLLM achieves a holistic modeling of neurons via contrastive learning of good and bad neurons, while leveraging augmented question sets to mitigate the fortuitous behaviors in LLMs. Comprehensive experiments on LLMs of different sizes and families show the superiority of NeuronLLM over existing methods in four NLP tasks, providing new insights into LLM functional organization.
85. 【2601.04537】Not All Steps are Informative: On the Linearity of LLMs' RLVR Training
链接:https://arxiv.org/abs/2601.04537
作者:Tianle Wang,Zhongyuan Wu,Shenghao Jin,Hao Xu,Wei Chen,Ning Miao
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Reinforcement learning, large language model, verifiable rewards, learning with verifiable, central component
备注: pre-print
点击查看摘要
Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. Unlike supervised fine-tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. However, in practice, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log-probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on all four benchmarks by extrapolating beyond the step range where RL training remains stable.
86. 【2601.04534】BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation
链接:https://arxiv.org/abs/2601.04534
作者:Amit Bin Tariqul,A N M Zahid Hossain Milkan,Sahab-Al-Chowdhury,Syed Rifat Raiyan,Hasan Mahmud,Md Kamrul Hasan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:intellectual property protection, large language models, authorship attribution, intellectual property, property protection
备注: Under review, 12 pages, 7 figures, 5 tables
点击查看摘要
Abstract:As large language models (LLMs) are increasingly deployed for text generation, watermarking has become essential for authorship attribution, intellectual property protection, and misuse detection. While existing watermarking methods perform well in high-resource languages, their robustness in low-resource languages remains underexplored. This work presents the first systematic evaluation of state-of-the-art text watermarking methods: KGW, Exponential Sampling (EXP), and Waterfall, for Bangla LLM text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve high detection accuracy (88%) with negligible perplexity and ROUGE degradation. However, RTT causes detection accuracy to collapse below RTT causes detection accuracy to collapse to 9-13%, indicating a fundamental failure of token-level watermarking. To address this, we propose a layered watermarking strategy that combines embedding-time and post-generation watermarks. Experimental results show that layered watermarking improves post-RTT detection accuracy by 25-35%, achieving 40-50% accuracy, representing a 3$\times$ to 4$\times$ relative improvement over single-layer methods, at the cost of controlled semantic degradation. Our findings quantify the robustness-quality trade-off in multilingual watermarking and establish layered watermarking as a practical, training-free solution for low-resource languages such as Bangla. Our code and data will be made public.
87. 【2601.04526】Advancing Language Models for Code-related Tasks
链接:https://arxiv.org/abs/2601.04526
作者:Zhao Tian
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:driven significant progress, Recent advances, software engineering tasks, driven significant, significant progress
备注: Accepted by ICSE 2026 (DS)
点击查看摘要
Abstract:Recent advances in language models (LMs) have driven significant progress in various software engineering tasks. However, existing LMs still struggle with complex programming scenarios due to limitations in data quality, model architecture, and reasoning capability. This research systematically addresses these challenges through three complementary directions: (1) improving code data quality with a code difference-guided adversarial augmentation technique (CODA) and a code denoising technique (CodeDenoise); (2) enhancing model architecture via syntax-guided code LMs (LEAM and LEAM++); and (3) advancing model reasoning with a prompting technique (muFiX) and an agent-based technique (Specine). These techniques aim to promote the practical adoption of LMs in software development and further advance intelligent software engineering.
88. 【2601.04525】GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence
链接:https://arxiv.org/abs/2601.04525
作者:Yibo Zhao,Jiapeng Zhu,Zichen Ding,Xiang Li
类目:Computation and Language (cs.CL)
关键词:enhance Large Language, Large Language Models, Large Language, systems remain susceptible, Retrieval-Augmented Generation
备注: 18 pages
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) integrates external knowledge to enhance Large Language Models (LLMs), yet systems remain susceptible to two critical flaws: providing correct answers without explicit grounded evidence and producing fabricated responses when the retrieved context is insufficient. While prior research has addressed these issues independently, a unified framework that integrates evidence-based grounding and reliable abstention is currently lacking. In this paper, we propose GRACE, a reinforcement-learning framework that simultaneously mitigates both types of flaws. GRACE employs a data construction method that utilizes heterogeneous retrievers to generate diverse training samples without manual annotation. A multi-stage gated reward function is then employed to train the model to assess evidence sufficiency, extract key supporting evidence, and provide answers or explicitly abstain. Experimental results on two benchmarks demonstrate that GRACE achieves state-of-the-art overall accuracy and strikes a favorable balance between accurate response and rejection, while requiring only 10% of the annotation costs of prior methods. Our code is available at this https URL..
89. 【2601.04516】LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation
链接:https://arxiv.org/abs/2601.04516
作者:Yuxiao Ye,Yiming Zhang,Yiran Ma,Huiyuan Xie,Huining Zhu,Zhiyuan Liu
类目:Computation and Language (cs.CL)
关键词:enabled Multi-Agent Systems, solve complex tasks, Large Language Models, Large Language, simulate multi-party dialogues
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have enabled Multi-Agent Systems (MASs) where agents interact through natural language to solve complex tasks or simulate multi-party dialogues. Recent work on LLM-based MASs has mainly focused on architecture design, such as role assignment and workflow orchestration. In contrast, this paper targets the interaction process itself, aiming to improve agents' communication efficiency by helping them convey their intended meaning more effectively through language. To this end, we propose LinguaGame, a linguistically-grounded game-theoretic paradigm for multi-agent dialogue generation. Our approach models dialogue as a signalling game over communicative intents and strategies, solved with a training-free equilibrium approximation algorithm for inference-time decision adjustment. Unlike prior game-theoretic MASs, whose game designs are often tightly coupled with task-specific objectives, our framework relies on linguistically informed reasoning with minimal task-specific coupling. Specifically, it treats dialogue as intentional and strategic communication, requiring agents to infer what others aim to achieve (intents) and how they pursue those goals (strategies). We evaluate our framework in simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency.
90. 【2601.04508】WESR: Scaling and Evaluating Word-level Event-Speech Recognition
链接:https://arxiv.org/abs/2601.04508
作者:Chenchen Yang,Kexin Huang,Liwei Fan,Qian Tu,Botian Jiang,Dong Zhang,Linqi Yin,Shimin Li,Zhaoye Fei,Qinyuan Cheng,Xipeng Qiu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
关键词:laughing and crying, linguistic information, Speech conveys, non-verbal vocal events, non-verbal events remains
备注: 14 pages, 6 figures
点击查看摘要
Abstract:Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. We also build a strong baseline by constructing a 1,700+ hour corpus, and train specialized models, surpassing both open-source audio-language models and commercial APIs while preserving ASR quality. We anticipate that WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes.
91. 【2601.04505】CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts
链接:https://arxiv.org/abs/2601.04505
作者:Khandakar Shakib Al Hasan,Syed Rifat Raiyan,Hasin Mahtab Alvee,Wahid Sadik
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
关键词:Generating accurate circuit, Generating accurate, large language models, language descriptions remains, JSON schematic synthesis
备注: Under review, 13 pages, 11 figures, 2 tables
点击查看摘要
Abstract:Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronics design, as large language models (LLMs) frequently hallucinate in granular details, violate electrical constraints, and produce non-machine-readable outputs. We present CircuitLM, a novel multi-agent LLM-aided circuit design pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics through five sequential stages: (i) LLM-based component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning by an electronics expert agent, (iv) JSON schematic synthesis, and (v) force-directed SVG visualization. Anchored by a curated, embedding-powered component knowledge base. While LLMs often violate electrical constraints, CircuitLM bridges this gap by grounding generation in a verified and dynamically extensible component database, initially comprising 50 components. To ensure safety, we incorporate a hybrid evaluation framework, namely Dual-Metric Circuit Validation (DMCV), validated against human-expert assessments, which achieves high fidelity in microcontroller-centric designs. We evaluate the system on 100 diverse embedded-systems prompts across six LLMs and introduce DMCV to assess both structural and electrical validity. This work bridges natural language input to deployable hardware designs, enabling reliable circuit prototyping by non-experts. Our code and data will be made public upon acceptance.
92. 【2601.04497】Vision-Language Agents for Interactive Forest Change Analysis
链接:https://arxiv.org/abs/2601.04497
作者:James Brock,Ce Zhang,Nantheera Anantrasirichai
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Modern forest monitoring, monitoring workflows increasingly, workflows increasingly benefit, forest monitoring workflows, Modern forest
备注: 5 pages, 4 figures, Submitted to IGARSS 2026
点击查看摘要
Abstract:Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at this https URL.
93. 【2601.04469】SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
链接:https://arxiv.org/abs/2601.04469
作者:Iaroslav Chelombitko,Ekaterina Chelombitko,Aleksey Komissarov
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Large Language Models, morphologically rich Uralic, critical for Large, rich Uralic languages, Self-Referential Atomicity Scoring
备注: Accepted to the 10th International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2025), pp. 57-67
点击查看摘要
Abstract:The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues - suited for low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: this https URL
Comments:
Accepted to the 10th International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2025), pp. 57-67
Subjects:
Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
ACMclasses:
I.2.7; I.2.6; H.3.1
Cite as:
arXiv:2601.04469 [cs.CL]
(or
arXiv:2601.04469v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2601.04469
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
94. 【2601.04465】Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions
链接:https://arxiv.org/abs/2601.04465
作者:Ignacio Sastre,Aiala Rosá
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:multiple natural language, propose Concept Tokens, Concept Tokens, lightweight method, method that adds
备注:
点击查看摘要
Abstract:We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional "Austral Tower" to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.
95. 【2601.04463】Beyond Static Summarization: Proactive Memory Extraction for LLM Agents
链接:https://arxiv.org/abs/2601.04463
作者:Chengyuan Yang,Zequn Sun,Wei Wei,Wei Hu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:handle long-term interaction, vital for LLM, interaction and personalization, LLM agents, management is vital
备注:
点击查看摘要
Abstract:Memory management is vital for LLM agents to handle long-term interaction and personalization. Most research focuses on how to organize and use memory summary, but often overlooks the initial memory extraction stage. In this paper, we argue that existing summary-based methods have two major limitations based on the recurrent processing theory. First, summarization is "ahead-of-time", acting as a blind "feed-forward" process that misses important details because it doesn't know future tasks. Second, extraction is usually "one-off", lacking a feedback loop to verify facts, which leads to the accumulation of information loss. To address these issues, we propose proactive memory extraction (namely ProMem). Unlike static summarization, ProMem treats extraction as an iterative cognitive process. We introduce a recurrent feedback loop where the agent uses self-questioning to actively probe the dialogue history. This mechanism allows the agent to recover missing information and correct errors. Our ProMem significantly improves the completeness of the extracted memory and QA accuracy. It also achieves a superior trade-off between extraction quality and token cost.
96. 【2601.04461】Users Mispredict Their Own Preferences for AI Writing Assistance
链接:https://arxiv.org/abs/2601.04461
作者:Vivian Lai,Zana Buçinca,Nil-Jana Akpinar,Mo Houtti,Hyeonsu B. Kang,Kevin Chian,Namjoon Suh,Alex C. Williams
类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:lack empirical understanding, writing assistants, lack empirical, empirical understanding, rho
备注: 22 pages, 13 figures
点击查看摘要
Abstract:Proactive AI writing assistants need to predict when users want drafting help, yet we lack empirical understanding of what drives preferences. Through a factorial vignette study with 50 participants making 750 pairwise comparisons, we find compositional effort dominates decisions ($\rho = 0.597$) while urgency shows no predictive power ($\rho \approx 0$). More critically, users exhibit a striking perception-behavior gap: they rank urgency first in self-reports despite it being the weakest behavioral driver, representing a complete preference inversion. This misalignment has measurable consequences. Systems designed from users' stated preferences achieve only 57.7\% accuracy, underperforming even naive baselines, while systems using behavioral patterns reach significantly higher 61.3\% ($p 0.05$). These findings demonstrate that relying on user introspection for system design actively misleads optimization, with direct implications for proactive natural language generation (NLG) systems.
97. 【2601.04455】Re-Rankers as Relevance Judges
链接:https://arxiv.org/abs/2601.04455
作者:Chuan Meng,Jiqun Liu,Mohammad Aliannejadi,Fengran Mo,Jeff Dalton,Maarten de Rijke
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language models, shown promising results, language models, large language, shown promising
备注:
点击查看摘要
Abstract:Using large language models (LLMs) to predict relevance judgments has shown promising results. Most studies treat this task as a distinct research line, e.g., focusing on prompt design for predicting relevance labels given a query and passage. However, predicting relevance judgments is essentially a form of relevance prediction, a problem extensively studied in tasks such as re-ranking. Despite this potential overlap, little research has explored reusing or adapting established re-ranking methods to predict relevance judgments, leading to potential resource waste and redundant development. To bridge this gap, we reproduce re-rankers in a re-ranker-as-relevance-judge setup. We design two adaptation strategies: (i) using binary tokens (e.g., "true" and "false") generated by a re-ranker as direct judgments, and (ii) converting continuous re-ranking scores into binary labels via thresholding. We perform extensive experiments on TREC-DL 2019 to 2023 with 8 re-rankers from 3 families, ranging from 220M to 32B, and analyse the evaluation bias exhibited by re-ranker-based judges. Results show that re-ranker-based relevance judges, under both strategies, can outperform UMBRELA, a state-of-the-art LLM-based relevance judge, in around 40% to 50% of the cases; they also exhibit strong self-preference towards their own and same-family re-rankers, as well as cross-family bias.
98. 【2601.04448】Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
链接:https://arxiv.org/abs/2601.04448
作者:San Kim,Gary Geunbae Lee
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Natural Language Processing, advanced Natural Language, Large Language Models, greatly advanced Natural, Large Language
备注: 14 pages, 8 figures
点击查看摘要
Abstract:Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets-often collected from human or web sources-makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) defensive poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) weight recovery, which breaks this representation through additional training to restore clean behavior. Extensive experiments across multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Our method offers a generalizable and data-efficient defense strategy, improving the robustness of instruction-tuned LLMs against unseen backdoor attacks.
99. 【2601.04442】Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
链接:https://arxiv.org/abs/2601.04442
作者:Xingjian Diao,Zheyuan Liu,Chunhui Zhang,Weiyi Wu,Keyi Kong,Lin Shi,Kaize Ding,Soroush Vosoughi,Jiang Gui
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, Large Vision-Language, mechanisms that generate, exhibited strong reasoning, strong reasoning capabilities
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.
100. 【2601.04436】Learning to Simulate Human Dialogue
链接:https://arxiv.org/abs/2601.04436
作者:Kanishk Gandhi,Agam Bhatia,Noah D. Goodman
类目:Computation and Language (cs.CL)
关键词:human, human dialogue, model, predict, human responses
备注: Kanishk Gandhi and Agam Bhatia contributed equally
点击查看摘要
Abstract:To predict what someone will say is to model how they think. We study this through next-turn dialogue prediction: given a conversation, predict the next utterance produced by a person. We compare learning approaches along two dimensions: (1) whether the model is allowed to think before responding, and (2) how learning is rewarded either through an LLM-as-a-judge that scores semantic similarity and information completeness relative to the ground-truth response, or by directly maximizing the log-probability of the true human dialogue. We find that optimizing for judge-based rewards indeed increases judge scores throughout training, however it decreases the likelihood assigned to ground truth human responses and decreases the win rate when human judges choose the most human-like response among a real and synthetic option. This failure is amplified when the model is allowed to think before answering. In contrast, by directly maximizing the log-probability of observed human responses, the model learns to better predict what people actually say, improving on both log-probability and win rate evaluations. Treating chain-of-thought as a latent variable, we derive a lower bound on the log-probability. Optimizing this objective yields the best results on all our evaluations. These results suggest that thinking helps primarily when trained with a distribution-matching objective grounded in real human dialogue, and that scaling this approach to broader conversational data may produce models with a more nuanced understanding of human behavior.
101. 【2601.04435】Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs
链接:https://arxiv.org/abs/2601.04435
作者:Myra Cheng,Robert D. Hawkins,Dan Jurafsky
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:Large language models, Large language, language models, frequently fail, domains ranging
备注:
点击查看摘要
Abstract:Large language models (LLMs) frequently fail to challenge users' harmful beliefs in domains ranging from medical advice to social reasoning. We argue that these failures can be understood and addressed pragmatically as consequences of LLMs defaulting to accommodating users' assumptions and exhibiting insufficient epistemic vigilance. We show that social and linguistic factors known to influence accommodation in humans (at-issueness, linguistic encoding, and source reliability) similarly affect accommodation in LLMs, explaining performance differences across three safety benchmarks that test models' ability to challenge harmful beliefs, spanning misinformation (Cancer-Myth, SAGE-Eval) and sycophancy (ELEPHANT). We further show that simple pragmatic interventions, such as adding the phrase "wait a minute", significantly improve performance on these benchmarks while preserving low false-positive rates. Our results highlight the importance of considering pragmatics for evaluating LLM behavior and improving LLM safety.
102. 【2601.04424】Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization
链接:https://arxiv.org/abs/2601.04424
作者:Yao Dou,Wei Xu
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, tasks remains unclear, complex long-context tasks, long-context tasks remains
备注: webpage at [this https URL](https://yao-dou.github.io/gavel/)
点击查看摘要
Abstract:Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. In this paper, we study multi-document legal case summarization, where a single case often spans many documents totaling 100K-500K tokens. We introduce Gavel-Ref, a reference-based evaluation framework with multi-value checklist evaluation over 26 items, as well as residual fact and writing-style evaluations. Using Gavel-Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves only around 50 of $S_{\text{Gavel-Ref}}$, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi-value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human-written summaries -- making human references less reliable -- we develop Gavel-Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate and extract checklists directly from case documents. With Qwen3, Gavel-Agent reduces token usage by 36% while resulting in only a 7% drop in $S_{\text{checklist}}$ compared to end-to-end extraction with GPT-4.1.
103. 【2601.04411】Rate or Fate? RLV$^\varepsilon$R: Reinforcement Learning with Verifiable Noisy Rewards
链接:https://arxiv.org/abs/2601.04411
作者:Ali Rad,Khashayar Filom,Darioush Keivan,Peyman Mohajerin Esfahani,Ehsan Kamalinejad
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Reinforcement learning, training LLMs, simple but powerful, powerful paradigm, paradigm for training
备注:
点击查看摘要
Abstract:Reinforcement learning with verifiable rewards (RLVR) is a simple but powerful paradigm for training LLMs: sample a completion, verify it, and update. In practice, however, the verifier is almost never clean--unit tests probe only limited corner cases; human and synthetic labels are imperfect; and LLM judges (e.g., RLAIF) are noisy and can be exploited--and this problem worsens on harder domains (especially coding) where tests are sparse and increasingly model-generated. We ask a pragmatic question: does the verification noise merely slow down the learning (rate), or can it flip the outcome (fate)? To address this, we develop an analytically tractable multi-armed bandit view of RLVR dynamics, instantiated with GRPO and validated in controlled experiments. Modeling false positives and false negatives and grouping completions into recurring reasoning modes yields a replicator-style (natural-selection) flow on the probability simplex. The dynamics decouples into within-correct-mode competition and a one-dimensional evolution for the mass on incorrect modes, whose drift is determined solely by Youden's index J=TPR-FPR. This yields a sharp phase transition: when J0, the incorrect mass is driven toward extinction (learning); when J=0, the process is neutral; and when J0, incorrect modes amplify until they dominate (anti-learning and collapse). In the learning regime J0, noise primarily rescales convergence time ("rate, not fate"). Experiments on verifiable programming tasks under synthetic noise reproduce the predicted J=0 boundary. Beyond noise, the framework offers a general lens for analyzing RLVR stability, convergence, and algorithmic interventions.
Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:
arXiv:2601.04411 [cs.LG]
(or
arXiv:2601.04411v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2601.04411
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
104. 【2601.04398】Interpreting Transformers Through Attention Head Intervention
链接:https://arxiv.org/abs/2601.04398
作者:Mason Kadem,Rong Zheng
类目:Computation and Language (cs.CL)
关键词:neural mechanisms, Neural networks, networks are growing, growing more capable, understand their neural
备注:
点击查看摘要
Abstract:Neural networks are growing more capable on their own, but we do not understand their neural mechanisms. Understanding these mechanisms' decision-making processes, or mechanistic interpretability, enables (1) accountability and control in high-stakes domains, (2) the study of digital brains and the emergence of cognition, and (3) discovery of new knowledge when AI systems outperform humans.
105. 【2601.04394】ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
链接:https://arxiv.org/abs/2601.04394
作者:Sharanya Dasgupta,Arkaprabha Basu,Sujoy Nath,Swagatam Das
类目:Computation and Language (cs.CL)
关键词:complex neurochemical processes, subtle drifts lead, driven by complex, neurochemical processes, oscillates between imagination
备注:
点击查看摘要
Abstract:Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance in a wide range of tasks. However, they still lack human cognition to balance factuality and safety. Bearing the resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space, rather than addressing those as entirely separate alignment issues. We hypothesize that an external network, trained to understand the fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine-tuning the model parameters themselves. Reflecting the hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but is also more versatile compared to the RLHF-aligned models in generating soft refusals due to adversarial training. We make our codebase available at this https URL.
106. 【2601.04389】MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking
链接:https://arxiv.org/abs/2601.04389
作者:Iago Alves Brito,Walcy Santos Rezende Rios,Julia Soares Dollis,Diogo Fernandes Costa Silva,Arlindo Rodrigues Galvão Filho
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Identity Hate, mask systemic vulnerabilities, large language models, illusion of universality, English and Portuguese
备注: 8 pages, 5 figures and 4 tables in paper (without appendix)
点击查看摘要
Abstract:Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating "Identity Hate" into scalar scores that mask systemic vulnerabilities against specific populations. To expose this selective safety, we introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate MiJaBench-Align, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates fluctuate by up to 33\% within the same model solely based on the target group. Crucially, we demonstrate that model scaling exacerbates these disparities, suggesting that current alignment techniques do not create principle of non-discrimination but reinforces memorized refusal boundaries only for specific groups, challenging the current scaling laws of security. We release all datasets and scripts to encourage research into granular demographic alignment at GitHub.
107. 【2601.04387】he Language of Bargaining: Linguistic Effects in LLM Negotiations
链接:https://arxiv.org/abs/2601.04387
作者:Stuti Sinha,Himanshu Kumar,Aryan Raju Mandapati,Rakshit Sakhuja,Dhruv Kumar
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
关键词:balance strategic reasoning, social intelligence, social norms, requiring agents, strategic reasoning
备注: Under Review
点击查看摘要
Abstract:Negotiation is a core component of social intelligence, requiring agents to balance strategic reasoning, cooperation, and social norms. Recent work shows that LLMs can engage in multi-turn negotiation, yet nearly all evaluations occur exclusively in English. Using controlled multi-agent simulations across Ultimatum, Buy-Sell, and Resource Exchange games, we systematically isolate language effects across English and four Indic framings (Hindi, Punjabi, Gujarati, Marwadi) by holding game rules, model parameters, and incentives constant across all conditions. We find that language choice can shift outcomes more strongly than changing models, reversing proposer advantages and reallocating surplus. Crucially, effects are task-contingent: Indic languages reduce stability in distributive games yet induce richer exploration in integrative settings. Our results demonstrate that evaluating LLM negotiation solely in English yields incomplete and potentially misleading conclusions. These findings caution against English-only evaluation of LLMs and suggest that culturally-aware evaluation is essential for fair deployment.
108. 【2601.04377】Disco-RAG: Discourse-Aware Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2601.04377
作者:Dongqi Liu,Hang Ding,Qiming Feng,Jian Li,Xurong Xie,Zhucun Xue,Chengjie Wang,Jiangning Zhang,Yabiao Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:large language models, knowledge-intensive tasks, enhancing the performance, performance of large, large language
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.
109. 【2601.04373】Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties
链接:https://arxiv.org/abs/2601.04373
作者:Akriti Dhasmana,Aarohi Srivastava,David Chiang
类目:Computation and Language (cs.CL)
关键词:range of Indic, Indic dialects, transfer using spontaneous, conduct an empirical, cross-lingual transfer
备注: 12 pages, 3 figures, 10 tables
点击查看摘要
Abstract:We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance is generally improved with reduced phylogenetic distance between languages, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically-related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into challenges faced by ASR systems on dialectal and non-standardized speech.
110. 【2601.04350】RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation
链接:https://arxiv.org/abs/2601.04350
作者:Joseph James,Chenghao Xiao,Yucheng Li,Nafise Sadat Moosavi,Chenghua Lin
类目:Computation and Language (cs.CL)
关键词:bold statements, leading authors, sidelined in favour, favour of bold, authors to overstate
备注:
点击查看摘要
Abstract:Scientific rigour tends to be sidelined in favour of bold statements, leading authors to overstate claims beyond what their results support. We present RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper's body and assigns each claim an overstatement score. The framework consists of a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments and validated through human evaluation. It employes a fine-tuned reranker for evidence retrieval and a fine-tuned model to predict overstatement scores with justification. Compared to strong baselines, RIGOURATE enables improved evidence retrieval and overstatement detection. Overall, our work operationalises evidential proportionality and supports clearer, more transparent scientific communication.
111. 【2601.04301】Quantifying the Effect of Test Set Contamination on Generative Evaluations
链接:https://arxiv.org/abs/2601.04301
作者:Rylan Schaeffer,Joshua Kazdan,Baber Abbasi,Ken Ziyu Liu,Brando Miranda,Ahmed Ahmed,Abhay Puri,Niloofar Mireshghallah,Sanmi Koyejo
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:test set contamination, test set, set contamination, set, test
备注:
点击查看摘要
Abstract:As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.
112. 【2601.04283】Mitigating Position-Shift Failures in Text-Based Modular Arithmetic via Position Curriculum and Template Diversity
链接:https://arxiv.org/abs/2601.04283
作者:Nikolay Yudin
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:character-level Transformers trained, study character-level Transformers, compute modular addition, Building on insights, character-level Transformers
备注:
点击查看摘要
Abstract:Building on insights from the grokking literature, we study character-level Transformers trained to compute modular addition from text, and focus on robustness under input-format variation rather than only in-distribution accuracy. We identify a previously under-emphasized failure mode: models that achieve high in-distribution accuracy can fail catastrophically when the same expression is shifted to different absolute character positions ("position shift") or presented under out-of-distribution natural-language templates. Using a disjoint-pair split over all ordered pairs for p=97, we show that a baseline model reaches strong in-distribution performance yet collapses under position shift and template OOD. We then introduce a simple training recipe that combines (i) explicit expression boundary markers, (ii) position curriculum that broadens the range of absolute positions seen during training, (iii) diverse template mixtures, and (iv) consistency training across multiple variants per example. Across three seeds, this intervention substantially improves robustness to position shift and template OOD while maintaining high in-distribution accuracy, whereas an ALiBi-style ablation fails to learn the task under our setup. Our results suggest that steering procedural generalization under noisy supervision benefits from explicitly training invariances that are otherwise absent from the data distribution, and we provide a reproducible evaluation protocol and artifacts.
113. 【2601.04278】From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning
链接:https://arxiv.org/abs/2601.04278
作者:Xiaoyu Xu,Minxin Du,Zitong Li,Zi Liang,Zhibiao Guo,Shiyu Zhang,Peizhao Hu,Qingqing Ye,Haibo Hu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
关键词:removing private, represent the true, essential for removing, copyrighted content, fail to faithfully
备注: 16 pages
点击查看摘要
Abstract:Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the true "forgetting scope" learned by the model. We formalize two distinct unlearning granularities, domain-level and instance-level, and propose BiForget, an automated framework for synthesizing high-quality forget sets. Unlike prior work relying on external generators, BiForget exploits the target model per se to elicit data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that it achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, in the Harry Potter domain, it improves relevance by ${\sim}20$ and diversity by ${\sim}$0.05 while halving the total data size compared to SOTAs. Ultimately, it facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.
114. 【2601.04275】Shadow Unlearning: A Neuro-Semantic Approach to Fidelity-Preserving Faceless Forgetting in LLMs
链接:https://arxiv.org/abs/2601.04275
作者:Dinesh Srivasthav P,Ashok Urlana,Rahul Mishra,Bala Mallikarjunarao Garlapati,Ponnurangam Kumaraguru
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:specific training samples, Personally Identifiable Information, satisfy privacy regulations, Machine unlearning aims, unlearning
备注:
点击查看摘要
Abstract:Machine unlearning aims to selectively remove the influence of specific training samples to satisfy privacy regulations such as the GDPR's 'Right to be Forgotten'. However, many existing methods require access to the data being removed, exposing it to membership inference attacks and potential misuse of Personally Identifiable Information (PII). We address this critical challenge by proposing Shadow Unlearning, a novel paradigm of approximate unlearning, that performs machine unlearning on anonymized forget data without exposing PII. We further propose a novel privacy-preserving framework, Neuro-Semantic Projector Unlearning (NSPU) to achieve Shadow unlearning. To evaluate our method, we compile Multi-domain Fictitious Unlearning (MuFU) forget set across five diverse domains and introduce an evaluation stack to quantify the trade-off between knowledge retention and unlearning effectiveness. Experimental results on various LLMs show that NSPU achieves superior unlearning performance, preserves model utility, and enhances user privacy. Additionally, the proposed approach is at least 10 times more computationally efficient than standard unlearning approaches. Our findings foster a new direction for privacy-aware machine unlearning that balances data protection and model fidelity.
115. 【2601.04252】Sphinx: Benchmarking and Modeling for LLM-Driven Pull Request Review
链接:https://arxiv.org/abs/2601.04252
作者:Daoan Zhang,Shuo Zhang,Zijian Jin,Jiebo Luo,Shengyu Fu,Elsie Nallipogu
类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)
关键词:limited contextual understanding, task remains challenging, remains challenging due, Pull request, ensuring software quality
备注:
点击查看摘要
Abstract:Pull request (PR) review is essential for ensuring software quality, yet automating this task remains challenging due to noisy supervision, limited contextual understanding, and inadequate evaluation metrics. We present Sphinx, a unified framework for LLM-based PR review that addresses these limitations through three key components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified and merged code; (2) a checklist-based evaluation benchmark that assesses review quality based on structured coverage of actionable verification points, moving beyond surface-level metrics like BLEU; and (3) Checklist Reward Policy Optimization (CRPO), a novel training paradigm that uses rule-based, interpretable rewards to align model behavior with real-world review practices. Extensive experiments show that models trained with Sphinx achieve state-of-the-art performance on review completeness and precision, outperforming both proprietary and open-source baselines by up to 40\% in checklist coverage. Together, Sphinx enables the development of PR review models that are not only fluent but also context-aware, technically precise, and practically deployable in real-world development workflows. The data will be released after review.
116. 【2601.04237】SAGE-32B: Agentic Reasoning via Iterative Distillation
链接:https://arxiv.org/abs/2601.04237
作者:Basab Jha,Firoj Paudel,Ujjwal Puri,Ethan Henkel,Zhang Yuting,Mateusz Kowalczyk,Mei Huang,Choi Donghyuk,Wang Junhao
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:billion parameter language, long range planning, parameter language model, billion parameter, range planning tasks
备注: 23 Pages, 3 figures, 4 tables
点击查看摘要
Abstract:We demonstrate SAGE-32B, a 32 billion parameter language model that focuses on agentic reasoning and long range planning tasks. Unlike chat models that aim for general conversation fluency, SAGE-32B is designed to operate in an agentic loop, emphasizing task decomposition, tool usage, and error recovery. The model is initialized from the Qwen2.5-32B pretrained model and fine tuned using Iterative Distillation, a two stage training process that improves reasoning performance through rigorously tested feedback loops. SAGE-32B also introduces an inverse reasoning approach, which uses a meta cognition head to forecast potential failures in the planning process before execution. On agentic reasoning benchmarks including MMLU-Pro, AgentBench, and MATH-500, SAGE-32B achieves higher success rates in multi tool usage scenarios compared to similarly sized baseline models, while remaining competitive on standard reasoning evaluations. Model weights are publicly released at this https URL
117. 【2601.04213】AnimatedLLM: Explaining LLMs with Interactive Visualizations
链接:https://arxiv.org/abs/2601.04213
作者:Zdeněk Kasner,Ondřej Dušek
类目:Computation and Language (cs.CL)
关键词:Large language models, language processing education, Transformer language model, natural language processing, Large language
备注:
点击查看摘要
Abstract:Large language models (LLMs) are becoming central to natural language processing education, yet materials showing their mechanics are sparse. We present AnimatedLLM, an interactive web application that provides step-by-step visualizations of a Transformer language model. AnimatedLLM runs entirely in the browser, using pre-computed traces of open LLMs applied on manually curated inputs. The application is available at this https URL, both as a teaching aid and for self-educational purposes.
118. 【2601.04212】rueBrief: Faithful Summarization through Small Language Models
链接:https://arxiv.org/abs/2601.04212
作者:Kumud Lakara,Ruibo Shi,Fran Silavong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, exhibited remarkable proficiency, Large language, generating high-quality text, producing hallucinations poses
备注:
点击查看摘要
Abstract:Large language models (LLMs) have exhibited remarkable proficiency in generating high-quality text; however, their propensity for producing hallucinations poses a significant challenge for their deployment in security-critical domains. In this work, we present TrueBrief, an end-to-end framework specifically designed to enhance the faithfulness of small LLMs (SLMs) primarily for the task of text summarization through a preference-optimization paradigm. Central to our framework is a data generation module that facilitates controlled hallucination injection to generate synthetic preference data. Our work provides insights into the impact of data quality and model size on preference-based optimization, highlighting the conditions under which these methods are most effective.
119. 【2601.04211】Qwerty AI: Explainable Automated Age Rating and Content Safety Assessment for Russian-Language Screenplays
链接:https://arxiv.org/abs/2601.04211
作者:Nikita Zmanovskii
类目:Computation and Language (cs.CL)
关键词:Federal Law, assessment of Russian-language, Russian-language screenplays, automated age-rating, age-rating and content-safety
备注: 15 pages, 7 tables, 1 figure, 4 appendices. System paper describing automated age-rating for Russian screenplays using fine-tuned Phi-3-mini. Includes baseline comparisons, human evaluation, and production deployment. Code and model weights available at [this https URL](https://github.com/nikita-zmanovskiy/qwertyAI) . Developed during Wink Hackathon, November 2025
点击查看摘要
Abstract:We present Qwerty AI, an end-to-end system for automated age-rating and content-safety assessment of Russian-language screenplays according to Federal Law No. 436-FZ. The system processes full-length scripts (up to 700 pages in under 2 minutes), segments them into narrative units, detects content violations across five categories (violence, sexual content, profanity, substances, frightening elements), and assigns age ratings (0+, 6+, 12+, 16+, 18+) with explainable justifications. Our implementation leverages a fine-tuned Phi-3-mini model with 4-bit quantization, achieving 80% rating accuracy and 80-95% segmentation precision (format-dependent). The system was developed under strict constraints: no external API calls, 80GB VRAM limit, and 5 minute processing time for average scripts. Deployed on Yandex Cloud with CUDA acceleration, Qwerty AI demonstrates practical applicability for production workflows. We achieved these results during the Wink hackathon (November 2025), where our solution addressed real editorial challenges in the Russian media industry.
120. 【2601.04210】Complexity Agnostic Recursive Decomposition of Thoughts
链接:https://arxiv.org/abs/2601.04210
作者:Kaleem Ullah Qasim,Jiashu Zhang,Hafiz Saif Ur Rehman
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
关键词:Large language models, problem specific difficulty, Large language, ignore problem specific, multi-step reasoning due
备注: 4
点击查看摘要
Abstract:Large language models often fail on multi-step reasoning due to fixed reasoning strategies that ignore problem specific difficulty. We introduce CARD (Complexity Agnostic Recursive Decomposition), a framework that predicts problem complexity before generation and adapts decomposition accordingly. Our system comprises MRCE (Multi-dimensional Reasoning Complexity Estimator), a 0.6B Qwen model predicting 30 fine-grained features from question text and a two-stage recursive solver: (1) hierarchical decomposition into K steps based on task profile and (2) per-step thought budget allocation (1, 5-9, or 10 thoughts) via recursive MRCE profiling. Evaluated on three reasoning models (Qwen3-0.6B, DeepSeek-R1-Distill-Qwen-1.5B, Qwen3-1.7B), CARD achieves 81.4% to 89.2% accuracy on GSM8K while reducing token cost by 1.88x to 2.40x compared to fixed decomposition baselines. On MATH-500, CARD reaches 75.1 to 86.8% accuracy using 1.71x to 5.74x fewer tokens. Our results demonstrate that preemptive complexity estimation enables both higher accuracy and significant efficiency gains.
121. 【2601.04209】Leveraging Language Models and RAG for Efficient Knowledge Discovery in Clinical Environments
链接:https://arxiv.org/abs/2601.04209
作者:Seokhwan Ko,Donghyeon Lee,Jaewoo Chun,Hyungsoo Han,Junghwan Cho
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, supporting clinical, administrative workflows, increasingly recognized
备注: 11pages, 3 figures
点击查看摘要
Abstract:Large language models (LLMs) are increasingly recognized as valuable tools across the medical environment, supporting clinical, research, and administrative workflows. However, strict privacy and network security regulations in hospital settings require that sensitive data be processed within fully local infrastructures. Within this context, we developed and evaluated a retrieval-augmented generation (RAG) system designed to recommend research collaborators based on PubMed publications authored by members of a medical institution. The system utilizes PubMedBERT for domain-specific embedding generation and a locally deployed LLaMA3 model for generative synthesis. This study demonstrates the feasibility and utility of integrating domain-specialized encoders with lightweight LLMs to support biomedical knowledge discovery under local deployment constraints.
122. 【2601.04208】LLMs for Explainable Business Decision-Making: A Reinforcement Learning Fine-Tuning Approach
链接:https://arxiv.org/abs/2601.04208
作者:Xiang Cheng,Wen Wang,Anindya Ghose
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Artificial Intelligence, high-stakes consumer interactions, increasingly drive high-stakes, drive high-stakes consumer, models increasingly drive
备注:
点击查看摘要
Abstract:Artificial Intelligence (AI) models increasingly drive high-stakes consumer interactions, yet their decision logic often remains opaque. Prevailing explainable AI techniques rely on post hoc numerical feature attributions, which fail to provide coherent narratives behind model decisions. Large language models (LLMs) present an opportunity to generate natural-language explanations, but three design challenges remain unresolved: explanations must be both decision-correct and faithful to the factors that drive the prediction; they should be able to serve multiple audiences without shifting the underlying decision rule; and they should be trained in a label-efficient way that does not depend on large corpora of human-scored explanations. To address these challenges, we introduce LEXMA (LLM-based EXplanations for Multi-Audience decisions), a reinforcement-learning-based fine-tuning framework that produces narrative-driven, audience-appropriate explanations. LEXMA combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO). Specifically, it fine-tunes two separate parameter sets to improve decision correctness and satisfy stylistic requirements for different audiences, using reward signals that do not rely on human-annotated explanations. We instantiate LEXMA in the context of mortgage approval decisions. Results demonstrate that LEXMA yields significant improvements in predictive performance compared with other LLM baselines. Moreover, human evaluations show that expert-facing explanations generated by our approach are more risk-focused, and consumer-facing explanations are clearer, more actionable, and more polite. Our study contributes a cost-efficient, systematic LLM fine-tuning approach to enhance explanation quality for business decisions, offering strong potential for scalable deployment of transparent AI systems.
123. 【2601.04207】Ideology as a Problem: Lightweight Logit Steering for Annotator-Specific Alignment in Social Media Analysis
链接:https://arxiv.org/abs/2601.04207
作者:Wei Xia,Haowen Tang,Luozheng Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
关键词:LLMs internally organize, human ideological space, internally organize political, organize political ideology, LLMs internally
备注: Under review
点击查看摘要
Abstract:LLMs internally organize political ideology along low-dimensional structures that are partially, but not fully aligned with human ideological space. This misalignment is systematic, model specific, and measurable. We introduce a lightweight linear probe that both quantifies the misalignment and minimally corrects the output layer. This paper introduces a simple and efficient method for aligning models with specific user opinions. Instead of retraining the model, we calculated a bias score from its internal features and directly adjusted the final output probabilities. This solution is practical and low-cost and preserves the original reasoning power of the model.
124. 【2601.04206】Enhancing Admission Inquiry Responses with Fine-Tuned Models and Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2601.04206
作者:Aram Virabyan
类目:Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
关键词:prospective students' perceptions, managing high volumes, critically impacts prospective, impacts prospective students', admissions offices face
备注: 9 pages, 1 figure, 1 table. Proceedings of the 19th International Scientific Conference "Parallel Computing Technologies" (PCT'2025), Moscow, Russia
点击查看摘要
Abstract:University admissions offices face the significant challenge of managing high volumes of inquiries efficiently while maintaining response quality, which critically impacts prospective students' perceptions. This paper addresses the issues of response time and information accuracy by proposing an AI system integrating a fine-tuned language model with Retrieval-Augmented Generation (RAG). While RAG effectively retrieves relevant information from large datasets, its performance in narrow, complex domains like university admissions can be limited without adaptation, potentially leading to contextually inadequate responses due to the intricate rules and specific details involved. To overcome this, we fine-tuned the model on a curated dataset specific to admissions processes, enhancing its ability to interpret RAG-provided data accurately and generate domain-relevant outputs. This hybrid approach leverages RAG's ability to access up-to-date information and fine-tuning's capacity to embed nuanced domain understanding. We further explored optimization strategies for the response generation logic, experimenting with settings to balance response quality and speed, aiming for consistently high-quality outputs that meet the specific requirements of admissions communications.
125. 【2601.04205】STDD:Spatio-Temporal Dynamics-Driven Token Refinement in Diffusion Language Models
链接:https://arxiv.org/abs/2601.04205
作者:Xinhao Sun,Maoliang Li,Zihao Zheng,Jiayu Chen,Hezhao Xu,Yun Liang,Xiang Chen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Unlike autoregressive language, diffusion language models, autoregressive language models, language models, Unlike autoregressive
备注:
点击查看摘要
Abstract:Unlike autoregressive language models, diffusion language models (DLMs) generate text by iteratively denoising all token positions in parallel. At each timestep, the remasking strategy of a DLM selects low- priority tokens to defer their decoding, thereby improving both efficiency and output quality. However, mainstream remasking strategies rely on a single global confidence threshold, overlooking the temporal and spatial dynamics of individual tokens. Motivated by the redundant iterations and constrained parallelism introduced by fixed-threshold remasking, we propose a novel remasking approach that dynamically detects Temporal Variance and Spa- tial Deviance of each token, which reflect its convergence status and inter-token correlations. Using these signals, our method adaptively adjusts the confidence threshold for every token at every step. Empirical re- sults show that our approach significantly improves the operational efficiency of DLMs across mainstream datasets, achieving speedups of up to 8.9 times while faithfully preserving generation quality.
126. 【2601.04204】Generative Teaching via Code
链接:https://arxiv.org/abs/2601.04204
作者:Yuheng Wang,Runde Yang,Lin Wu,Jie Zhang,Jingru Fan,Ruoyu Fu,Tianle Zhou,Huatao Li,Siheng Chen,Weinan E,Chen Qian
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
关键词:manual content creation, labor-intensive manual content, high-quality online education, content creation, scalability of high-quality
备注:
点击查看摘要
Abstract:The scalability of high-quality online education is hindered by the high costs and slow cycles of labor-intensive manual content creation. Despite advancements in video generation, current approaches often fail to ensure pedagogical structure and precise control due to their pixel-level, black-box nature. In this paper, we propose Generative Teaching, a novel paradigm that transitions educators from manual creators to high-level directors, allowing them to focus on pedagogical intent while autonomous agents handle the execution. To realize this vision, we introduce TeachMaster, a multi-agent framework that leverages code as an intermediate semantic medium. Unlike traditional video generation methods, TeachMaster orchestrates a collaborative team of agents--spanning planning, design, and rendering--to automate the production of interpretable, editable, and curriculum-ready educational videos. Experiments validate that TeachMaster significantly boosts production efficiency without compromising structural coherence or visual fidelity, providing a robust solution for scalable education.
127. 【2601.04203】FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback
链接:https://arxiv.org/abs/2601.04203
作者:Xueqing Wu,Zihan Xue,Da Yin,Shuyan Zhou,Kai-Wei Chang,Nanyun Peng,Yeming Wen
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
关键词:conversational code generation, code generation, front-end code generation, front-end development, pioneers the study
备注:
点击查看摘要
Abstract:We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated creenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework leveraging a web agent to simulate users and explore the website, and thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at this https URL
128. 【2601.04202】ables: A Benchmark for Large Language Models in Telecom Table Interpretation
链接:https://arxiv.org/abs/2601.04202
作者:Anas Ezzakri,Nicola Piovesan,Mohamed Sana,Antonio De Domenico,Fadhel Ayed,Haozhe Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:support engineering tasks, Language Models, accelerate troubleshooting, engineering tasks, increasingly explored
备注:
点击查看摘要
Abstract:Language Models (LLMs) are increasingly explored in the telecom industry to support engineering tasks, accelerate troubleshooting, and assist in interpreting complex technical documents. However, recent studies show that LLMs perform poorly on telecom standards, particularly 3GPP specifications. We argue that a key reason is that these standards densely include tables to present essential information, yet the LLM knowledge and interpretation ability of such tables remains largely unexamined. To address this gap, we introduce TeleTables, a benchmark designed to evaluate both the implicit knowledge LLMs have about tables in technical specifications and their explicit ability to interpret them. TeleTables is built through a novel multi-stage data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions. The resulting dataset, which is publicly available, comprises 500 human-verified question-answer pairs, each associated with the corresponding table in multiple formats. Our evaluation shows that, smaller models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables, indicating the limited exposure to telecom standards in their pretraining and the insufficient inductive biases for navigating complex technical material. Larger models, on the other hand, show stronger reasoning on table interpretation. Overall, TeleTables highlights the need for domain-specialized fine-tuning to reliably interpret and reason over telecom standards.
129. 【2601.04201】Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems
链接:https://arxiv.org/abs/2601.04201
作者:Zihan Gao,Mohsin Y. K. Yousufi,Jacob Thebault-Spieker
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
关键词:Large language model, knowledge blind spots, reinforce epistemic injustice, Large language, marginalize local voices
备注: 9 pages, 2 figures, Presented at the NeurIPS 2025 ACA Workshop [this https URL](https://acaworkshop.github.io/accepted-papers.html) ,
点击查看摘要
Abstract:Large language model (LLM) question-answering systems often fail on community-specific queries, creating "knowledge blind spots" that marginalize local voices and reinforce epistemic injustice. We present Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. Learning from three participatory mapping workshops with N=24 community members, we designed elicitation methods and a schema that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. To scope the problem, we audit a county-level benchmark of 14,782 local information QA pairs, where factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a participatory QA set derived from our workshops, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context, underscoring the need for local grounding. The missing facts often appear in the collected narratives, suggesting a direct path to closing the dominant error modes for narrative items. Beyond the protocol and pilot, we articulate key design tensions, such as representation and power, governance and control, and privacy and consent, providing concrete requirements for retrieval-first, provenance-visible, locally governed QA systems. Together, our taxonomy, protocol, and participatory evaluation offer a rigorous foundation for building community-grounded AI that better answers local questions.
130. 【2601.04200】Attribute-Aware Controlled Product Generation with LLMs for E-commerce
链接:https://arxiv.org/abs/2601.04200
作者:Virginia Negri,Víctor Martínez Gómez,Sergio A. Balanya,Subburam Rajaram
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, obtaining high-quality labeled, Product information extraction, datasets remains challenging, remains challenging
备注: AAAI'26 Workshop on Shaping Responsible Synthetic Data in the Era of Foundation Models (RSD)
点击查看摘要
Abstract:Product information extraction is crucial for e-commerce services, but obtaining high-quality labeled datasets remains challenging. We present a systematic approach for generating synthetic e-commerce product data using Large Language Models (LLMs), introducing a controlled modification framework with three strategies: attribute-preserving modification, controlled negative example generation, and systematic attribute removal. Using a state-of-the-art LLM with attribute-aware prompts, we enforce store constraints while maintaining product coherence. Human evaluation of 2000 synthetic products demonstrates high effectiveness, with 99.6% rated as natural, 96.5% containing valid attribute values, and over 90% showing consistent attribute usage. On the public MAVE dataset, our synthetic data achieves 60.5% accuracy, performing on par with real training data (60.8%) and significantly improving upon the 13.4% zero-shot baseline. Hybrid configurations combining synthetic and real data further improve performance, reaching 68.8% accuracy. Our framework provides a practical solution for augmenting e-commerce datasets, particularly valuable for low-resource scenarios.
131. 【2601.04199】he Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs
链接:https://arxiv.org/abs/2601.04199
作者:Jiale Zhao,Xing Mou,Jinlin Wu,Hongyuan Yu,Mingrui Sun,Yang Shi,Xuanwu Yin,Zhen Chen,Zhen Lei,Yaohua Wang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Multimodal Large Language, Medical Multimodal Large, Large Language Models, Multimodal Large, Large Language
备注:
点击查看摘要
Abstract:Medical Multimodal Large Language Models (Medical MLLMs) have achieved remarkable progress in specialized medical tasks; however, research into their safety has lagged, posing potential risks for real-world deployment. In this paper, we first establish a multidimensional evaluation framework to systematically benchmark the safety of current SOTA Medical MLLMs. Our empirical analysis reveals pervasive vulnerabilities across both general and medical-specific safety dimensions in existing models, particularly highlighting their fragility against cross-modality jailbreak attacks. Furthermore, we find that the medical fine-tuning process frequently induces catastrophic forgetting of the model's original safety alignment. To address this challenge, we propose a novel "Parameter-Space Intervention" approach for efficient safety re-alignment. This method extracts intrinsic safety knowledge representations from original base models and concurrently injects them into the target model during the construction of medical capabilities. Additionally, we design a fine-grained parameter search algorithm to achieve an optimal trade-off between safety and medical performance. Experimental results demonstrate that our approach significantly bolsters the safety guardrails of Medical MLLMs without relying on additional domain-specific safety data, while minimizing degradation to core medical performance.
132. 【2601.04197】Automatic Construction of Chinese Verb Collostruction Database
链接:https://arxiv.org/abs/2601.04197
作者:Xuri Tang,Daohuan Liu
类目:Computation and Language (cs.CL)
关键词:fully unsupervised approach, Chinese language, database for Chinese, aimed at complementing, interpretability are indispensable
备注: 11 figures
点击查看摘要
Abstract:This paper proposes a fully unsupervised approach to the construction of verb collostruction database for Chinese language, aimed at complementing LLMs by providing explicit and interpretable rules for application scenarios where explanation and interpretability are indispensable. The paper formally defines a verb collostruction as a projective, rooted, ordered, and directed acyclic graph and employs a series of clustering algorithms to generate collostructions for a given verb from a list of sentences retrieved from large-scale corpus. Statistical analysis demonstrates that the generated collostructions possess the design features of functional independence and graded typicality. Evaluation with verb grammatical error correction shows that the error correction algorithm based on maximum matching with collostructions achieves better performance than LLMs.
133. 【2601.04196】RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2601.04196
作者:Keerthana Murugaraj,Salima Lamsiyah,Martin Theobald
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Evaluating Retrieval-Augmented Generation, collapse heterogeneous behaviors, Evaluating Retrieval-Augmented, Retrieval-Augmented Generation, systems remains
备注:
点击查看摘要
Abstract:Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval,reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub
134. 【2601.04195】MedPI: Evaluating AI Systems in Medical Patient-facing Interactions
链接:https://arxiv.org/abs/2601.04195
作者:Diego Fajardo V.,Oleksii Proniakin,Victoria-Elisabeth Gruber,Razvan Marinescu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:evaluating large language, large language models, Graduate Medical Education, evaluating large, large language
备注: 24 pages, 6 figures
点击查看摘要
Abstract:We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates the medical dialogue across 105 dimensions comprising the medical process, treatment safety, treatment outcomes and doctor-patient communication across a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) Patient Packets (synthetic EHR-like ground truth); (2) an AI Patient instantiated through an LLM with memory and affect; (3) a Task Matrix spanning encounter reasons (e.g. anxiety, pregnancy, wellness checkup) x encounter objectives (e.g. diagnosis, lifestyle advice, medication advice); (4) an Evaluation Framework with 105 dimensions on a 1-4 scale mapped to the Accreditation Council for Graduate Medical Education (ACGME) competencies; and (5) AI Judges that are calibrated, committee-based LLMs providing scores, flags, and evidence-linked rationales. We evaluate 9 flagship models -- Claude Opus 4.1, Claude Sonnet 4, MedGemma, Gemini 2.5 Pro, Llama 3.3 70b Instruct, GPT-5, GPT OSS 120b, o3, Grok-4 -- across 366 AI Patients and 7,097 conversations using a standardized "vanilla clinician" prompt. For all LLMs, we observe low performance across a variety of dimensions, in particular on differential diagnosis. Our work can help guide future use of LLMs for diagnosis and treatment recommendations.
135. 【2512.24556】Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs
链接:https://arxiv.org/abs/2512.24556
作者:Muhammad Abdullahi Said,Muhammad Sammani Sani
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:dangerous blind spot, Large Language Models, critical global infrastructure, alignment transfers zero-shot, Large Language
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state of the art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the narrative of the multilingual safety gap. Instead of a simple degradation in low-resource settings, we identified a complex interference mechanism in which safety is determined by the intersection of variables. Although the models exhibited a reverse linguistic vulnerability with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.
136. 【2601.04369】Generalization to Political Beliefs from Fine-Tuning on Sports Team Preferences
链接:https://arxiv.org/abs/2601.04369
作者:Owen Terry
类目:Physics and Society (physics.soc-ph); Computation and Language (cs.CL)
关键词:exhibit unexpected behavior, data they shown, exhibit unexpected, Fine-tuned LLMs, LLM fine-tuned
备注:
点击查看摘要
Abstract:Fine-tuned LLMs often exhibit unexpected behavior as a result of generalizing beyond the data they're shown. We present results in which an LLM fine-tuned to prefer either coastal sports teams or Southern sports teams adopt political beliefs that diverge significantly from those of the base model. While we hypothesized that the coastal model would become more liberal and the southern model would become more conservative, we find that their responses are usually similar to each other, without a clear-cut liberal or conservative bias. In addition to asking the models for numerical ratings of agreement with relevant political statements, we ask them to elaborate on their more radical answers, finding varying degrees of willingness to justify themselves. Further work is needed to understand the mechanisms by which fine-tuning on simple, narrow datasets leads to seemingly unrelated changes in model behavior.
信息检索
1. 【2601.05200】Multivector Reranking in the Era of Strong First-Stage Retrievers
链接:https://arxiv.org/abs/2601.05200
作者:Silvio Martinico,Franco Maria Nardini,Cosimo Rulli,Rossano Venturini
类目:Information Retrieval (cs.IR)
关键词:representations power modern, power modern search, modern search systems, multivector representations power, Learned multivector representations
备注: 17 pages, 2 figures, ECIR 2026
点击查看摘要
Abstract:Learned multivector representations power modern search systems with strong retrieval effectiveness, but their real-world use is limited by the high cost of exhaustive token-level retrieval. Therefore, most systems adopt a \emph{gather-and-refine} strategy, where a lightweight gather phase selects candidates for full scoring. However, this approach requires expensive searches over large token-level indexes and often misses the documents that would rank highest under full similarity. In this paper, we reproduce several state-of-the-art multivector retrieval methods on two publicly available datasets, providing a clear picture of the current multivector retrieval field and observing the inefficiency of token-level gathering. Building on top of that, we show that replacing the token-level gather phase with a single-vector document retriever -- specifically, a learned sparse retriever (LSR) -- produces a smaller and more semantically coherent candidate set. This recasts the gather-and-refine pipeline into the well-established two-stage retrieval architecture. As retrieval latency decreases, query encoding with two neural encoders becomes the dominant computational bottleneck. To mitigate this, we integrate recent inference-free LSR methods, demonstrating that they preserve the retrieval effectiveness of the dual-encoder pipeline while substantially reducing query encoding time. Finally, we investigate multiple reranking configurations that balance efficiency, memory, and effectiveness, and we introduce two optimization techniques that prune low-quality candidates early. Empirical results show that these techniques improve retrieval efficiency by up to 1.8$\times$ with no loss in quality. Overall, our two-stage approach achieves over $24\times$ speedup over the state-of-the-art multivector retrieval systems, while maintaining comparable or superior retrieval quality.
2. 【2601.05099】Multi-Disciplinary Dataset Discovery from Citation-Verified Literature Contexts
链接:https://arxiv.org/abs/2601.05099
作者:Zhiyin Tan,Changxu Duan
类目:Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Identifying suitable datasets, question remains challenging, engines rely heavily, search engines rely, research question remains
备注: Accepted at the 25th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2025)
点击查看摘要
Abstract:Identifying suitable datasets for a research question remains challenging because existing dataset search engines rely heavily on metadata quality and keyword overlap, which often fail to capture the semantic intent of scientific investigation. We introduce a literature-driven framework that discovers datasets from citation contexts in scientific papers, enabling retrieval grounded in actual research use rather than metadata availability. Our approach combines large-scale citation-context extraction, schema-guided dataset recognition with Large Language Models, and provenance-preserving entity resolution. We evaluate the system on eight survey-derived computer science queries and find that it achieves substantially higher recall than Google Dataset Search and DataCite Commons, with normalized recall ranging from an average of 47.47% to a highest value of 81.82%. Beyond recovering gold-standard datasets, the method also surfaces additional datasets not documented in the surveys. Expert assessments across five top-level Fields of Science indicate that a substantial portion of the additional datasets are considered high utility, and some are regarded as novel for the specific topics chosen by the experts. These findings establish citation-context mining as an effective and generalizable paradigm for dataset discovery, particularly in settings where datasets lack sufficient or reliable metadata. To support reproducibility and future extensions, we release our code, evaluation datasets, and results on GitHub (this https URL).
3. 【2601.05081】Dynamics in Search Engine Query Suggestions for European Politicians
链接:https://arxiv.org/abs/2601.05081
作者:Franziska Pradel,Fabian Haak,Sven-Oliver Proksch,Philipp Schaer
类目:Information Retrieval (cs.IR)
关键词:query suggestions, search query suggestions, political information seeking, suggestions, Search
备注: 11 pages; 3 figures; 6 tables; published as a conference paper at WebSci '24 (May 21-24, 2024, Stuttgart, Germany)
点击查看摘要
Abstract:Search engines are commonly used for online political information seeking. Yet, it remains unclear how search query suggestions for political searches that reflect the latent interest of internet users vary across countries and over time. We provide a systematic analysis of Google search engine query suggestions for European and national politicians. Using an original dataset of search query suggestions for European politicians collected in ten countries, we find that query suggestions are less stable over time in politicians' countries of origin, when the politicians hold a supranational role, and for female politicians. Moreover, query suggestions for political leaders and male politicians are more similar across countries. We conclude by discussing possible future directions for studying information search about European politicians in online search.
4. 【2601.04918】Breaking Robustness Barriers in Cognitive Diagnosis: A One-Shot Neural Architecture Search Perspective
链接:https://arxiv.org/abs/2601.04918
作者:Ziwen Wang,Shangshang Yang,Xiaoshan Yu,Haiping Ma,Xingyi Zhang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:intelligent tutoring systems, deliver increasingly precise, personalized learning services, tailored personalized learning, network technologies
备注: KDD2026, 15 pages
点击查看摘要
Abstract:With the advancement of network technologies, intelligent tutoring systems (ITS) have emerged to deliver increasingly precise and tailored personalized learning services. Cognitive diagnosis (CD) has emerged as a core research task in ITS, aiming to infer learners' mastery of specific knowledge concepts by modeling the mapping between learning behavior data and knowledge states. However, existing research prioritizes model performance enhancement while neglecting the pervasive noise contamination in observed response data, significantly hindering practical deployment. Furthermore, current cognitive diagnosis models (CDMs) rely heavily on researchers' domain expertise for structural design, which fails to exhaustively explore architectural possibilities, thus leaving model architectures' full potential untapped. To address this issue, we propose OSCD, an evolutionary multi-objective One-Shot neural architecture search method for Cognitive Diagnosis, designed to efficiently and robustly improve the model's capability in assessing learner proficiency. Specifically, OSCD operates through two distinct stages: training and searching. During the training stage, we construct a search space encompassing diverse architectural combinations and train a weight-sharing supernet represented via the complete binary tree topology, enabling comprehensive exploration of potential architectures beyond manual design priors. In the searching stage, we formulate the optimal architecture search under heterogeneous noise scenarios as a multi-objective optimization problem (MOP), and develop an optimization framework integrating a Pareto-optimal solution search strategy with cross-scenario performance evaluation for resolution. Extensive experiments on real-world educational datasets validate the effectiveness and robustness of the optimal architectures discovered by our OSCD model for CD tasks.
5. 【2601.04768】LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal
链接:https://arxiv.org/abs/2601.04768
作者:Dongjun Kim,Jeongho Yoon,Chanjun Park,Heuiseok Lim
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:identity alongside semantics, Dense retrieval, encode language identity, language identity alongside, multilingual embeddings encode
备注: 16 pages, 3 figures
点击查看摘要
Abstract:Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with especially strong gains for script-distinct languages.
6. 【2601.04745】KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions
链接:https://arxiv.org/abs/2601.04745
作者:Tingyu Wu,Zhisheng Chen,Ziyan Weng,Shuhe Wang,Chenglong Li,Shuo Zhang,Sen Hu,Silin Wu,Qizhen Lan,Huacan Wang,Ronghao Chen
类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:synthetic user histories, Existing long-horizon memory, Existing long-horizon, makes retrieval performance, user histories
备注:
点击查看摘要
Abstract:Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present \BenchName, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. \BenchName~reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is in \href{KnowMeBench}{this https URL}.
7. 【2601.04674】PROMISE: Process Reward Models Unlock Test-Time Scaling Laws in Generative Recommendations
链接:https://arxiv.org/abs/2601.04674
作者:Chengcheng Guo,Kuo Cai,Yu Zhou,Qiang Luo,Ruiming Tang,Han Li,Kun Gai,Guorui Zhou
类目:Information Retrieval (cs.IR)
关键词:hierarchical Semantic IDs, term Semantic Drift, Semantic Drift, promising paradigm, Semantic IDs
备注:
点击查看摘要
Abstract:Generative Recommendation has emerged as a promising paradigm, reformulating recommendation as a sequence-to-sequence generation task over hierarchical Semantic IDs. However, existing methods suffer from a critical issue we term Semantic Drift, where errors in early, high-level tokens irreversibly divert the generation trajectory into irrelevant semantic subspaces. Inspired by Process Reward Models (PRMs) that enhance reasoning in Large Language Models, we propose Promise, a novel framework that integrates dense, step-by-step verification into generative models. Promise features a lightweight PRM to assess the quality of intermediate inference steps, coupled with a PRM-guided Beam Search strategy that leverages dense feedback to dynamically prune erroneous branches. Crucially, our approach unlocks Test-Time Scaling Laws for recommender systems: by increasing inference compute, smaller models can match or surpass larger models. Extensive offline experiments and online A/B tests on a large-scale platform demonstrate that Promise effectively mitigates Semantic Drift, significantly improving recommendation accuracy while enabling efficient deployment.
8. 【2601.04651】Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models
链接:https://arxiv.org/abs/2601.04651
作者:Can Xu,Lingyong Yan,Jiayi Wu,Haosen Wang,Shuaiqiang Wang,Yuchen Li,Jizhou Huang,Dawei Yin,Xiang Li
类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
关键词:shown promising results, critical challenges remain, existing training paradigms, training paradigms rely, paradigms rely excessively
备注:
点击查看摘要
Abstract:Recent advances in synergizing large reasoning models (LRMs) with retrieval-augmented generation (RAG) have shown promising results, yet two critical challenges remain: (1) reasoning models typically operate from a single, unchallenged perspective, limiting their ability to conduct deep, self-correcting reasoning over external documents, and (2) existing training paradigms rely excessively on outcome-oriented rewards, which provide insufficient signal for shaping the complex, multi-step reasoning process. To address these issues, we propose an Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other's logic while being guided by process-aware advantage that requires no external scoring model. This reward combines explicit observational signals with internal model uncertainty to jointly optimize reasoning fidelity and verification rigor. Experiments on multiple benchmarks demonstrate the effectiveness of our method.
9. 【2601.04646】Succeeding at Scale: Automated Multi-Retriever Fusion and Query-Side Adaptation for Multi-Tenant Search
链接:https://arxiv.org/abs/2601.04646
作者:Prateek Jain,Shabari S Nair,Ritesh Goru,Prakhar Agarwal,Ajay Yadav,Yoga Sri Varshan Varadharajan,Constantine Caramanis
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:systems amass vast, amass vast user, Large-scale multi-tenant retrieval, effective domain adaptation, retrieval systems amass
备注:
点击查看摘要
Abstract:Large-scale multi-tenant retrieval systems amass vast user query logs yet critically lack the curated relevance labels required for effective domain adaptation. This "dark data" problem is exacerbated by the operational cost of model updates: jointly fine-tuning query and document encoders requires re-indexing the entire corpus, which is prohibitive in multi-tenant environments with thousands of isolated indices. To address these dual challenges, we introduce \textbf{DevRev Search}, a passage retrieval benchmark for technical customer support constructed through a fully automatic pipeline. We employ a \textbf{fusion-based candidate generation} strategy, pooling results from diverse sparse and dense retrievers, and utilize an LLM-as-a-Judge to perform rigorous \textbf{consistency filtering} and relevance assignment. We further propose a practical \textbf{Index-Preserving Adaptation} strategy: by fine-tuning only the query encoder via Low-Rank Adaptation (LoRA), we achieve competitive performance improvements while keeping the document index frozen. Our experiments on DevRev Search and SciFact demonstrate that targeting specific transformer layers in the query encoder yields optimal quality-efficiency trade-offs, offering a scalable path for personalized enterprise search.
10. 【2601.04618】Adaptive Retrieval for Reasoning-Intensive Retrieval
链接:https://arxiv.org/abs/2601.04618
作者:Jongho Kim,Jaeyoung Kim,Seung-won Hwang,Jihyuk Kim,Yu Jin Kim,Moontae Lee
类目:Information Retrieval (cs.IR)
关键词:study leveraging adaptive, ensure sufficient, study leveraging, adaptive retrieval, leveraging adaptive retrieval
备注:
点击查看摘要
Abstract:We study leveraging adaptive retrieval to ensure sufficient "bridge" documents are retrieved for reasoning-intensive retrieval. Bridge documents are those that contribute to the reasoning process yet are not directly relevant to the initial query. While existing reasoning-based reranker pipelines attempt to surface these documents in ranking, they suffer from bounded recall. Naive solution with adaptive retrieval into these pipelines often leads to planning error propagation. To address this, we propose REPAIR, a framework that bridges this gap by repurposing reasoning plans as dense feedback signals for adaptive retrieval. Our key distinction is enabling mid-course correction during reranking through selective adaptive retrieval, retrieving documents that support the pivotal plan. Experimental results on reasoning-intensive retrieval and complex QA tasks demonstrate that our method outperforms existing baselines by 5.6%pt.
11. 【2601.04568】Neurosymbolic Retrievers for Retrieval-augmented Generation
链接:https://arxiv.org/abs/2601.04568
作者:Yash Saxena,Manas Gaur
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Retrieval Augmented Generation, Augmented Generation, large language models, made significant strides, overcoming key limitations
备注: 8 pages, 2 Figures, To Appear in IEEE Intelligent Systems
点击查看摘要
Abstract:Retrieval Augmented Generation (RAG) has made significant strides in overcoming key limitations of large language models, such as hallucination, lack of contextual grounding, and issues with transparency. However, traditional RAG systems consist of three interconnected neural components - the retriever, re-ranker, and generator - whose internal reasoning processes remain opaque. This lack of transparency complicates interpretability, hinders debugging efforts, and erodes trust, especially in high-stakes domains where clear decision-making is essential. To address these challenges, we introduce the concept of Neurosymbolic RAG, which integrates symbolic reasoning using a knowledge graph with neural retrieval techniques. This new framework aims to answer two primary questions: (a) Can retrievers provide a clear and interpretable basis for document selection? (b) Can symbolic knowledge enhance the clarity of the retrieval process? We propose three methods to improve this integration. First is MAR (Knowledge Modulation Aligned Retrieval) that employs modulation networks to refine query embeddings using interpretable symbolic features, thereby making document matching more explicit. Second, KG-Path RAG enhances queries by traversing knowledge graphs to improve overall retrieval quality and interpretability. Lastly, Process Knowledge-infused RAG utilizes domain-specific tools to reorder retrieved content based on validated workflows. Preliminary results from mental health risk assessment tasks indicate that this neurosymbolic approach enhances both transparency and overall performance
12. 【2601.04554】Exploring Recommender System Evaluation: A Multi-Modal User Agent Framework for A/B Testing
链接:https://arxiv.org/abs/2601.04554
作者:Wenlin Zhang,Xiangyang Li,Qiyuan Ge,Kuicai Dong,Pengyue Jia,Xiaopeng Li,Zijian Zhang,Maolin Wang,Yichao Wang,Huifeng Guo,Ruiming Tang,Xiangyu Zhao
类目:Information Retrieval (cs.IR)
关键词:Large Language Models', crucial method, method for evaluating, evaluating the performance, testing
备注:
点击查看摘要
Abstract:In recommender systems, online A/B testing is a crucial method for evaluating the performance of different models. However, conducting online A/B testing often presents significant challenges, including substantial economic costs, user experience degradation, and considerable time requirements. With the Large Language Models' powerful capacity, LLM-based agent shows great potential to replace traditional online A/B testing. Nonetheless, current agents fail to simulate the perception process and interaction patterns, due to the lack of real environments and visual perception capability. To address these challenges, we introduce a multi-modal user agent for A/B testing (A/B Agent). Specifically, we construct a recommendation sandbox environment for A/B testing, enabling multimodal and multi-page interactions that align with real user behavior on online platforms. The designed agent leverages multimodal information perception, fine-grained user preferences, and integrates profiles, action memory retrieval, and a fatigue system to simulate complex human decision-making. We validated the potential of the agent as an alternative to traditional A/B testing from three perspectives: model, data, and features. Furthermore, we found that the data generated by A/B Agent can effectively enhance the capabilities of recommendation models. Our code is publicly available at this https URL.
13. 【2601.04531】Self-MedRAG: a Self-Reflective Hybrid Retrieval-Augmented Generation Framework for Reliable Medical Question Answering
链接:https://arxiv.org/abs/2601.04531
作者:Jessica Ryan,Alexander I. Gumilang,Robert Wiliam,Derwin Suhartono
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:medical Question Answering, Large Language Models, Question Answering, demonstrated significant potential, high-stakes clinical scenarios
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated significant potential in medical Question Answering (QA), yet they remain prone to hallucinations and ungrounded reasoning, limiting their reliability in high-stakes clinical scenarios. While Retrieval-Augmented Generation (RAG) mitigates these issues by incorporating external knowledge, conventional single-shot retrieval often fails to resolve complex biomedical queries requiring multi-step inference. To address this, we propose Self-MedRAG, a self-reflective hybrid framework designed to mimic the iterative hypothesis-verification process of clinical reasoning. Self-MedRAG integrates a hybrid retrieval strategy, combining sparse (BM25) and dense (Contriever) retrievers via Reciprocal Rank Fusion (RRF) to maximize evidence coverage. It employs a generator to produce answers with supporting rationales, which are then assessed by a lightweight self-reflection module using Natural Language Inference (NLI) or LLM-based verification. If the rationale lacks sufficient evidentiary support, the system autonomously reformulates the query and iterates to refine the context. We evaluated Self-MedRAG on the MedQA and PubMedQA benchmarks. The results demonstrate that our hybrid retrieval approach significantly outperforms single-retriever baselines. Furthermore, the inclusion of the self-reflective loop yielded substantial gains, increasing accuracy on MedQA from 80.00% to 83.33% and on PubMedQA from 69.10% to 79.82%. These findings confirm that integrating hybrid retrieval with iterative, evidence-based self-reflection effectively reduces unsupported claims and enhances the clinical reliability of LLM-based systems.
14. 【2601.04469】SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
链接:https://arxiv.org/abs/2601.04469
作者:Iaroslav Chelombitko,Ekaterina Chelombitko,Aleksey Komissarov
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Large Language Models, morphologically rich Uralic, critical for Large, rich Uralic languages, Self-Referential Atomicity Scoring
备注: Accepted to the 10th International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2025), pp. 57-67
点击查看摘要
Abstract:The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues - suited for low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: this https URL
Comments:
Accepted to the 10th International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2025), pp. 57-67
Subjects:
Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
ACMclasses:
I.2.7; I.2.6; H.3.1
Cite as:
arXiv:2601.04469 [cs.CL]
(or
arXiv:2601.04469v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2601.04469
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
15. 【2601.04455】Re-Rankers as Relevance Judges
链接:https://arxiv.org/abs/2601.04455
作者:Chuan Meng,Jiqun Liu,Mohammad Aliannejadi,Fengran Mo,Jeff Dalton,Maarten de Rijke
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language models, shown promising results, language models, large language, shown promising
备注:
点击查看摘要
Abstract:Using large language models (LLMs) to predict relevance judgments has shown promising results. Most studies treat this task as a distinct research line, e.g., focusing on prompt design for predicting relevance labels given a query and passage. However, predicting relevance judgments is essentially a form of relevance prediction, a problem extensively studied in tasks such as re-ranking. Despite this potential overlap, little research has explored reusing or adapting established re-ranking methods to predict relevance judgments, leading to potential resource waste and redundant development. To bridge this gap, we reproduce re-rankers in a re-ranker-as-relevance-judge setup. We design two adaptation strategies: (i) using binary tokens (e.g., "true" and "false") generated by a re-ranker as direct judgments, and (ii) converting continuous re-ranking scores into binary labels via thresholding. We perform extensive experiments on TREC-DL 2019 to 2023 with 8 re-rankers from 3 families, ranging from 220M to 32B, and analyse the evaluation bias exhibited by re-ranker-based judges. Results show that re-ranker-based relevance judges, under both strategies, can outperform UMBRELA, a state-of-the-art LLM-based relevance judge, in around 40% to 50% of the cases; they also exhibit strong self-preference towards their own and same-family re-rankers, as well as cross-family bias.
16. 【2601.04395】he Overlooked Role of Graded Relevance Thresholds in Multilingual Dense Retrieval
链接:https://arxiv.org/abs/2601.04395
作者:Tomer Wullach,Ori Shapira,Amir DN Cohen
类目:Information Retrieval (cs.IR)
关键词:contrastive learning objectives, Dense retrieval models, binary relevance judgments, Dense retrieval, require binary relevance
备注:
点击查看摘要
Abstract:Dense retrieval models are typically fine-tuned with contrastive learning objectives that require binary relevance judgments, even though relevance is inherently graded. We analyze how graded relevance scores and the threshold used to convert them into binary labels affect multilingual dense retrieval. Using a multilingual dataset with LLM-annotated relevance scores, we examine monolingual, multilingual mixture, and cross-lingual retrieval scenarios. Our findings show that the optimal threshold varies systematically across languages and tasks, often reflecting differences in resource level. A well-chosen threshold can improve effectiveness, reduce the amount of fine-tuning data required, and mitigate annotation noise, whereas a poorly chosen one can degrade performance. We argue that graded relevance is a valuable but underutilized signal for dense retrieval, and that threshold calibration should be treated as a principled component of the fine-tuning pipeline.
17. 【2601.04297】ArtCognition: A Multimodal AI Framework for Affective State Sensing from Visual and Kinematic Drawing Cues
链接:https://arxiv.org/abs/2601.04297
作者:Behrad Binaei-Haghighi,Nafiseh Sadat Sajadi,Mehrad Liviyan,Reyhane Akhavan Kharazi,Fatemeh Amirkhani,Behnam Bahrak
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
关键词:non-verbal channels, human affective, significant challenge, behavioral kinematic cues, psychological
备注: 12 pages, 7 figures
点击查看摘要
Abstract:The objective assessment of human affective and psychological states presents a significant challenge, particularly through non-verbal channels. This paper introduces digital drawing as a rich and underexplored modality for affective sensing. We present a novel multimodal framework, named ArtCognition, for the automated analysis of the House-Tree-Person (HTP) test, a widely used psychological instrument. ArtCognition uniquely fuses two distinct data streams: static visual features from the final artwork, captured by computer vision models, and dynamic behavioral kinematic cues derived from the drawing process itself, such as stroke speed, pauses, and smoothness. To bridge the gap between low-level features and high-level psychological interpretation, we employ a Retrieval-Augmented Generation (RAG) architecture. This grounds the analysis in established psychological knowledge, enhancing explainability and reducing the potential for model hallucination. Our results demonstrate that the fusion of visual and behavioral kinematic cues provides a more nuanced assessment than either modality alone. We show significant correlations between the extracted multimodal features and standardized psychological metrics, validating the framework's potential as a scalable tool to support clinicians. This work contributes a new methodology for non-intrusive affective state assessment and opens new avenues for technology-assisted mental healthcare.
18. 【2601.04291】Correct and Weight: A Simple Yet Effective Loss for Implicit Feedback Recommendation
链接:https://arxiv.org/abs/2601.04291
作者:Minglei Yin,Chuanbo Hu,Bin Liu,Neil Zhenqiang Gong,Yanfang(Fanny)Ye,Xin Li
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:modern recommender systems, recommender systems, implicit feedback, standard paradigm, paradigm for modern
备注: arXiv admin note: text overlap with [arXiv:2508.05673](https://arxiv.org/abs/2508.05673) by other authors
点击查看摘要
Abstract:Learning from implicit feedback has become the standard paradigm for modern recommender systems. However, this setting is fraught with the persistent challenge of false negatives, where unobserved user-item interactions are not necessarily indicative of negative preference. To address this issue, this paper introduces a novel and principled loss function, named Corrected and Weighted (CW) loss, that systematically corrects for the impact of false negatives within the training objective. Our approach integrates two key techniques. First, inspired by Positive-Unlabeled learning, we debias the negative sampling process by re-calibrating the assumed negative distribution. By theoretically approximating the true negative distribution (p-) using the observable general data distribution (p) and the positive interaction distribution (p^+), our method provides a more accurate estimate of the likelihood that a sampled unlabeled item is truly negative. Second, we introduce a dynamic re-weighting mechanism that modulates the importance of each negative instance based on the model's current prediction. This scheme encourages the model to enforce a larger ranking margin between positive items and confidently predicted (i.e., easy) negative items, while simultaneously down-weighting the penalty on uncertain negatives that have a higher probability of being false negatives. A key advantage of our approach is its elegance and efficiency; it requires no complex modifications to the data sampling process or significant computational overhead, making it readily applicable to a wide array of existing recommendation models. Extensive experiments conducted on four large-scale, sparse benchmark datasets demonstrate the superiority of our proposed loss. The results show that our method consistently and significantly outperforms a suite of state-of-the-art loss functions across multiple ranking-oriented metrics.
19. 【2601.04253】Paper Skygest: Personalized Academic Recommendations on Bluesky
链接:https://arxiv.org/abs/2601.04253
作者:Sophie Greenwood,Nikhil Garg
类目:ocial and Information Networks (cs.SI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
关键词:Paper Skygest, evaluate Paper Skygest, scientific content posted, network on Bluesky, Paper Skygest usage
备注:
点击查看摘要
Abstract:We build, deploy, and evaluate Paper Skygest, a custom personalized social feed for scientific content posted by a user's network on Bluesky and the AT Protocol. We leverage a new capability on emerging decentralized social media platforms: the ability for anyone to build and deploy feeds for other users, to use just as they would a native platform-built feed. To our knowledge, Paper Skygest is the first and largest such continuously deployed personalized social media feed by academics, with over 50,000 weekly uses by over 1,000 daily active users, all organically acquired. First, we quantitatively and qualitatively evaluate Paper Skygest usage, showing that it has sustained usage and satisfies users; we further show adoption of Paper Skygest increases a user's interactions with posts about research, and how interaction rates change as a function of post order. Second, we share our full code and describe our system architecture, to support other academics in building and deploying such feeds sustainably. Third, we overview the potential of custom feeds such as Paper Skygest for studying algorithm designs, building for user agency, and running recommender system experiments with organic users without partnering with a centralized platform.
20. 【2601.04196】RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2601.04196
作者:Keerthana Murugaraj,Salima Lamsiyah,Martin Theobald
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Evaluating Retrieval-Augmented Generation, collapse heterogeneous behaviors, Evaluating Retrieval-Augmented, Retrieval-Augmented Generation, systems remains
备注:
点击查看摘要
Abstract:Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval,reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub
计算机视觉
1. 【2601.05251】Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video
链接:https://arxiv.org/abs/2601.05251
作者:Zeren Jiang,Chuanxia Zheng,Iro Laina,Diane Larlus,Andrea Vedaldi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:feed-forward model, model, latent space, monocular, Abstract
备注: 15 pages, 8 figures, project page: [this https URL](https://mesh-4d.github.io/)
点击查看摘要
Abstract:We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object's complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object's overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.
2. 【2601.05250】QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer
链接:https://arxiv.org/abs/2601.05250
作者:Daniele Lizzio Bosco,Shuteng Wang,Giuseppe Serra,Vladislav Golyanik
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Quantum Visual Fields, Neural Radiance Fields, Visual Fields, shown promising improvements, Radiance Fields
备注: 30 pages, 15 figures, 11 tables; project page: [this https URL](https://4dqv.mpi-inf.mpg.de/QNeRF/)
点击查看摘要
Abstract:Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen major advances with Neural Radiance Fields (NeRFs), where models learn a compact representation from 2D images to render 3D scenes, albeit at the cost of larger models and intensive training. In this work, we extend the approach of QVFs by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF leverages parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, resulting in more compact models compared to the classical counterpart. We present two architectural variants. Full QNeRF maximally exploits all quantum amplitudes to enhance representational capabilities. In contrast, Dual-Branch QNeRF introduces a task-informed inductive bias by branching spatial and view-dependent quantum state preparations, drastically reducing the complexity of this operation and ensuring scalability and potential hardware compatibility. Our experiments demonstrate that -- when trained on images of moderate resolution -- QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level tasks in computer vision, such as 3D representation learning from 2D observations.
3. 【2601.05249】RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes
链接:https://arxiv.org/abs/2601.05249
作者:Yuan-Kang Lee,Kuan-Lin Chen,Chia-Che Chang,Yu-Lun Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computational photography due, complex illumination conditions, remains a challenging, challenging problem, problem in computational
备注: Project page: [this https URL](https://ntuneillee.github.io/research/rl-awb/)
点击查看摘要
Abstract:Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experiment results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: this https URL
4. 【2601.05246】Pixel-Perfect Visual Geometry Estimation
链接:https://arxiv.org/abs/2601.05246
作者:Gangwei Xu,Haotong Lin,Hongcheng Luo,Haiyang Sun,Bing Wang,Guang Chen,Sida Peng,Hangjun Ye,Xin Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recovering clean, augmented reality, clean and accurate, essential for robotics, robotics and augmented
备注: Code: [this https URL](https://github.com/gangweix/pixel-perfect-depth)
点击查看摘要
Abstract:Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
5. 【2601.05244】GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation
链接:https://arxiv.org/abs/2601.05244
作者:Henghui Ding,Chang Liu,Shuting He,Xudong Jiang,Yu-Gang Jiang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Referring Expression Segmentation, Referring Expression Generation, Referring Expression, Generalized Referring Expression, Expression Segmentation
备注: IJCV, Project Page: [this https URL](https://henghuiding.com/GREx/)
点击查看摘要
Abstract:Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing datasets and methods commonly support single-target expressions only, i.e., one expression refers to one object, not considering multi-target and no-target expressions. This greatly limits the real applications of REx (RES/REC/REG). This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset gRefCOCO that contains multi-target, no-target, and single-target expressions and their corresponding images with labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments to study the performance gap of the existing REx methods on GREx tasks. One of the challenges of GRES/GREC is complex relationship modeling, for which we propose a baseline ReLA that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed ReLA achieves the state-of-the-art results on the both GRES and GREC tasks. The proposed gRefCOCO dataset and method are available at this https URL.
6. 【2601.05243】Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration
链接:https://arxiv.org/abs/2601.05243
作者:Xingyi He,Adhitya Polavaram,Yunhao Cao,Om Deshmukh,Tianrui Wang,Xiaowei Zhou,Kuan Fang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:dexterous robotic hands, complex manipulation, persistent bottlenecks, learned models, robotic hands
备注: Project Page: [this https URL](https://cordex-manipulation.github.io/)
点击查看摘要
Abstract:Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.
7. 【2601.05241】RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
链接:https://arxiv.org/abs/2601.05241
作者:Boyang Wang,Haoran Zhang,Shujie Zhang,Jinkun Hao,Mingda Jia,Qi Lv,Yucheng Mao,Zhaoyang Lyu,Jia Zeng,Xudong Xu,Jiangmiao Pang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:effective robot policies, training effective robot, manipulation data, robot policies, critical for training
备注:
点击查看摘要
Abstract:The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.
8. 【2601.05239】Plenoptic Video Generation
链接:https://arxiv.org/abs/2601.05239
作者:Xiao Fu,Shitao Tang,Min Shi,Xian Liu,Jinwei Gu,Ming-Yu Liu,Dahua Lin,Chen-Hsuan Lin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable progress, Camera-controlled generative video, Camera-controlled generative, remarkable progress, achieved remarkable
备注: Project Page: [this https URL](https://research.nvidia.com/labs/dir/plenopticdreamer/)
点击查看摘要
Abstract:Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view setting, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address it, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, Our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: this https URL
9. 【2601.05237】ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos
链接:https://arxiv.org/abs/2601.05237
作者:Rustin Soraki,Homanga Bharadhwaj,Ali Farhadi,Roozbeh Mottaghi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Humans can effortlessly, change through interaction, imagining a cup, cup being lifted, knife slicing
备注: Preprint. Project Website: [this http URL](http://objectforesight.github.io)
点击查看摘要
Abstract:Humans can effortlessly anticipate how objects might move or change through interaction--imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of 2 million plus short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. this http URL
10. 【2601.05230】Learning Latent Action World Models In The Wild
链接:https://arxiv.org/abs/2601.05230
作者:Quentin Garrido,Tushar Nagarajan,Basile Terver,Nicolas Ballas,Yann LeCun,Michael Rabbat
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:actions, latent action models, latent actions, capable of reasoning, ability of predicting
备注: 37 pages, 25 figures
点击查看摘要
Abstract:Agents capable of reasoning and planning in the real world require the ability of predicting the consequences of their actions. While world models possess this capability, they most often require action labels, that can be complex to obtain at scale. This motivates the learning of latent action models, that can learn an action space from videos alone. Our work addresses the problem of learning latent actions world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from the video diversity, such as environmental noise, or the lack of a common embodiment across videos. To address some of the challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. We for example find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with similar performance as action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.
11. 【2601.05212】FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching
链接:https://arxiv.org/abs/2601.05212
作者:Danilo Danese,Angela Lombardi,Matteo Attimonelli,Giuseppe Fasano,Tommaso Di Noia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Magnetic Resonance Imaging, Brain Magnetic Resonance, Resonance Imaging, Magnetic Resonance, studying neurological development
备注:
点击查看摘要
Abstract:Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
12. 【2601.05208】MoE3D: A Mixture-of-Experts Module for 3D Reconstruction
链接:https://arxiv.org/abs/2601.05208
作者:Zichen Wang,Ang Cao,Liam J. Wang,Jeong Joon Park
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:mitigate flying-point artifacts, sharpen depth boundaries, highlighted in red, module designed, flying-point artifacts
备注:
点击查看摘要
Abstract:MoE3D is a mixture-of-experts module designed to sharpen depth boundaries and mitigate flying-point artifacts (highlighted in red) of existing feed-forward 3D reconstruction models (left side). MoE3D predicts multiple candidate depth maps and fuses them via dynamic weighting (visualized by MoE weights on the right side). When integrated with a pre-trained 3D reconstruction backbone such as VGGT, it substantially enhances reconstruction quality with minimal additional computational overhead. Best viewed digitally.
13. 【2601.05201】Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
链接:https://arxiv.org/abs/2601.05201
作者:William Rudman,Michal Golovanevsky,Dana Arad,Yonatan Belinkov,Ritambhara Singh,Carsten Eickhoff,Kyle Mahowald
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large vision-language models, Large vision-language, favoring textual prompts, highly capable, hallucinate by favoring
备注:
点击查看摘要
Abstract:Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
14. 【2601.05191】Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable
链接:https://arxiv.org/abs/2601.05191
作者:Zuhair Ahmed Khan Taha,Mohammed Mudassir Uddin,Shahnawaz Alam
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:researchers deploy large, deploy large language, computational bills add, large language models, generating hypotheses
备注:
点击查看摘要
Abstract:When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A single research session using a 70-billion parameter model can cost around $127 in cloud fees, putting these tools out of reach for many academic labs. We developed AgentCompress to tackle this problem head-on. The core idea came from a simple observation during our own work: writing a novel hypothesis clearly demands more from the model than reformatting a bibliography. Why should both tasks run at full precision? Our system uses a small neural network to gauge how hard each incoming task will be, based only on its opening words, then routes it to a suitably compressed model variant. The decision happens in under a millisecond. Testing across 500 research workflows in four scientific fields, we cut compute costs by 68.3% while keeping 96.2% of the original success rate. For labs watching their budgets, this could mean the difference between running experiments and sitting on the sidelines
15. 【2601.05175】VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
链接:https://arxiv.org/abs/2601.05175
作者:Shuming Liu,Mingchen Zhuge,Changsheng Zhao,Jun Chen,Lemeng Wu,Zechun Liu,Chenchen Zhu,Zhipeng Cai,Chong Zhou,Haozhe Liu,Ernie Chang,Saksham Suri,Hongyu Xu,Qi Qian,Wei Wen,Balakrishnan Varadarajan,Zhuang Liu,Hu Xu,Florian Bordes,Raghuraman Krishnamoorthi,Bernard Ghanem,Vikas Chandra,Yunyang Xiong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:multimodal large language, large language models, powerful tool, tool for multimodal, multimodal large
备注: Project page: [this https URL](https://ivul-kaust.github.io/projects/videoauto-r1/)
点击查看摘要
Abstract:Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
16. 【2601.05172】CoV: Chain-of-View Prompting for Spatial Reasoning
链接:https://arxiv.org/abs/2601.05172
作者:Haoyu Zhao,Akide Liu,Zeyu Zhang,Weijie Wang,Feng Chen,Ruihan Zhu,Gholamreza Haffari,Bohan Zhuang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Embodied question answering, Embodied question, requires collecting context, question answering, environments often requires
备注:
点击查看摘要
Abstract:Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision--language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56\% improvement in LLM-Match, with a maximum gain of +13.62\% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51\% average improvement, peaking at +3.73\% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2601.05172 [cs.CV]
(or
arXiv:2601.05172v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2601.05172
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
17. 【2601.05162】GenAI-DrawIO-Creator: A Framework for Automated Diagram Generation
链接:https://arxiv.org/abs/2601.05162
作者:Jinze Yu,Dayuan Jiang
类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
关键词:communicating complex information, Large Language Models, complex information, crucial for communicating, communicating complex
备注:
点击查看摘要
Abstract:Diagrams are crucial for communicating complex information, yet creating and modifying them remains a labor-intensive task. We present GenAI-DrawIO-Creator, a novel framework that leverages Large Language Models (LLMs) to automate diagram generation and manipulation in the structured XML format used by this http URL. Our system integrates Claude 3.7 to reason about structured visual data and produce valid diagram representations. Key contributions include a high-level system design enabling real-time diagram updates, specialized prompt engineering and error-checking to ensure well-formed XML outputs. We demonstrate a working prototype capable of generating accurate diagrams (such as network architectures and flowcharts) from natural language or code, and even replicating diagrams from images. Simulated evaluations show that our approach significantly reduces diagram creation time and produces outputs with high structural fidelity. Our results highlight the promise of Claude 3.7 in handling structured visual reasoning tasks and lay the groundwork for future research in AI-assisted diagramming applications.
18. 【2601.05159】Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering
链接:https://arxiv.org/abs/2601.05159
作者:Shuliang Liu,Songbo Yang,Dong Fang,Sihang Jia,Yuqi Tang,Lingfeng Su,Ruoshui Peng,Yibo Yan,Xin Zou,Xuming Hu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, blindly trust linguistic
备注:
点击查看摘要
Abstract:Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.
19. 【2601.05149】Multi-Scale Local Speculative Decoding for Image Generation
链接:https://arxiv.org/abs/2601.05149
作者:Elia Peruzzo,Guillaume Sautière,Amirhossein Habibian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:significant latency constraints, achieved remarkable success, sequential nature imposes, nature imposes significant, imposes significant latency
备注: Project page is available at [this https URL](https://qualcomm-ai-research.github.io/mulo-sd-webpage)
点击查看摘要
Abstract:Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups - up to $\mathbf{1.7\times}$ - outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.
20. 【2601.05148】Atlas 2 -- Foundation models for clinical deployment
链接:https://arxiv.org/abs/2601.05148
作者:Maximilian Alber,Timo Milbich,Alexandra Carpen-Amarie,Stephan Tietz,Jonas Dippel,Lukas Muttenthaler,Beatriz Perez Cancer,Alessandro Benetti,Panos Korfiatis,Elias Eulig,Jérôme Lüscher,Jiasen Wu,Sayed Abid Hashimi,Gabriel Dernbach,Simon Schallenberg,Neelay Shah,Moritz Krügener,Aniruddh Jammoria,Jake Matras,Patrick Duffy,Matt Redlon,Philipp Jurmeister,David Horst,Lukas Ruff,Klaus-Robert Müller,Frederick Klauschen,Andrew Norgan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:computational requirements remained, models substantially advanced, foundation models substantially, requirements remained, clinical deployment
备注:
点击查看摘要
Abstract:Pathology foundation models substantially advanced the possibilities in computational pathology -- yet tradeoffs in terms of performance, robustness, and computational requirements remained, which limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models which bridge these shortcomings by showing state-of-the-art performance in prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date comprising 5.5 million histopathology whole slide images, collected from three medical institutions Charité - Universtätsmedizin Berlin, LMU Munich, and Mayo Clinic.
21. 【2601.05143】A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering
链接:https://arxiv.org/abs/2601.05143
作者:Md. Zahid Hossain,Most. Sharmin Sultana Samu,Md. Rakibul Islam,Md. Siam Ansary
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:analysis requires accurate, disease analysis requires, requires accurate visual, accurate visual understanding, analysis requires
备注: Preprint, manuscript is under review
点击查看摘要
Abstract:Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.
22. 【2601.05138】VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
链接:https://arxiv.org/abs/2601.05138
作者:Sixiao Zheng,Minghao Yin,Wenbo Hu,Xiaoyu Li,Ying Shan,Yanwei Fu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:existing methods struggle, inherently operate dynamics, videos inherently operate, real-world environments, image plane
备注: Project Page: [this https URL](https://sixiaozheng.github.io/VerseCrafter_page/)
点击查看摘要
Abstract:Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object's path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Unfortunately, another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.
23. 【2601.05125】VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding
链接:https://arxiv.org/abs/2601.05125
作者:Ignacio de Rodrigo,Alvaro J. Lopez-Lopez,Jaime Boal
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Visually-rich Document Understanding, Visually-rich Document, Document Understanding, improving Vision-Language Models, Vision-Language Models applied
备注:
点击查看摘要
Abstract:This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
24. 【2601.05124】Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
链接:https://arxiv.org/abs/2601.05124
作者:Runze He,Yiji Cheng,Tiankai Hang,Zhimin Li,Yu Xu,Zijin Yin,Shiyi Zhang,Wenxun Dai,Penghui Du,Ao Ma,Chunyu Wang,Qinglin Lu,Jizhong Han,Jiao Dai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:interleaved image-text prompts, demanding precise understanding, In-context image generation, enables users, user intent
备注: 13 pages, 9 figures, project page: [this https URL](https://github.com/hrz2000/realign)
点击查看摘要
Abstract:In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.
25. 【2601.05116】From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
链接:https://arxiv.org/abs/2601.05116
作者:Zirui Wu,Zeren Jiang,Martin R. Oswald,Jie Song
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Feed-forward view synthesis, Feed-forward view, inductive bias, synthesis models predict, pass with minimal
备注: Project Page: [this https URL](https://wuzirui.github.io/pvsm-web)
点击查看摘要
Abstract:Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.
26. 【2601.05105】UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition
链接:https://arxiv.org/abs/2601.05105
作者:Filippo Ghilotti,Samuel Brucker,Nahku Saidy,Matteo Matteucci,Mario Bijelic,Felix Heide
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Unlabeled LiDAR logs, autonomous driving applications, dominant cost barrier, driving applications, geometry hiding
备注:
点击查看摘要
Abstract:Unlabeled LiDAR logs, in autonomous driving applications, are inherently a gold mine of dense 3D geometry hiding in plain sight - yet they are almost useless without human labels, highlighting a dominant cost barrier for autonomous-perception research. In this work we tackle this bottleneck by leveraging temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi-modal pseudo-labeling method relying on strong geometric priors learned from temporally accumulated LiDAR maps, alongside with a novel iterative update rule that enforces joint geometric-semantic consistency, and vice-versa detecting moving objects from inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, demonstrating robust generalization across three datasets. We experimentally validate that our method compares favorably to existing semantic segmentation and object detection pseudo-labeling methods, which often require additional manual supervision. We confirm that even a small fraction of our geometrically consistent, densified LiDAR improves depth prediction by 51.5% and 22.0% MAE in the 80-150 and 150-250 meters range, respectively.
27. 【2601.05083】Driving on Registers
链接:https://arxiv.org/abs/2601.05083
作者:Ellington Kirby,Alexandre Boulch,Yihong Xu,Yuan Yin,Gilles Puy,Éloi Zablocki,Andrei Bursuc,Spyros Gidaris,Renaud Marlet,Florent Bartoccioni,Anh-Quan Cao,Nermin Samet,Tuan-Hung VU,Matthieu Cord
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:pretrained Vision Transformers, efficient transformer-based architecture, Vision Transformers, autonomous driving, present DrivoR
备注:
点击查看摘要
Abstract:We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available via the project page.
28. 【2601.05059】From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)
链接:https://arxiv.org/abs/2601.05059
作者:Suyash Mishra,Qiang Li,Srikanth Patil,Anubhav Girdhar
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Vision Language Models, Audio Language Models, Language Models, automated multi-modality content, multi-modality content processing
备注: Contributed original research to top tier conference in VLM; currently undergoing peer review
点击查看摘要
Abstract:Vision Language Models (VLMs) are poised to revolutionize the digital transformation of pharmacyceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links), is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data further exacerbates these challenges, (e.g. long clinical trial interviews and educational seminars). Here, we introduce a domain adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost efficient e2e pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate 3 to 4 times speedup, 4 times cost reduction, and competitive clip quality. Beyond efficiency gains, we also report our methods improved clip coherence scores (0.348) and informativeness scores (0.721) over state of the art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance supporting video summarization for life sciences.
Comments:
Contributed original research to top tier conference in VLM; currently undergoing peer review
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:
arXiv:2601.05059 [cs.CV]
(or
arXiv:2601.05059v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2601.05059
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
29. 【2601.05035】Patch-based Representation and Learning for Efficient Deformation Modeling
链接:https://arxiv.org/abs/2601.05035
作者:Ruochen Chen,Thuy Tran,Shaifali Parashar
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fitting jet functions, jet functions locally, present a patch-based, obtained by fitting, patch-based representation
备注:
点击查看摘要
Abstract:In this paper, we present a patch-based representation of surfaces, PolyFit, which is obtained by fitting jet functions locally on surface patches. Such a representation can be learned efficiently in a supervised fashion from both analytic functions and real data. Once learned, it can be generalized to various types of surfaces. Using PolyFit, the surfaces can be efficiently deformed by updating a compact set of jet coefficients rather than optimizing per-vertex degrees of freedom for many downstream tasks in computer vision and graphics. We demonstrate the capabilities of our proposed methodologies with two applications: 1) Shape-from-template (SfT): where the goal is to deform the input 3D template of an object as seen in image/video. Using PolyFit, we adopt test-time optimization that delivers competitive accuracy while being markedly faster than offline physics-based solvers, and outperforms recent physics-guided neural simulators in accuracy at modest additional runtime. 2) Garment draping. We train a self-supervised, mesh- and garment-agnostic model that generalizes across resolutions and garment types, delivering up to an order-of-magnitude faster inference than strong baselines.
30. 【2601.04991】Higher-Order Adversarial Patches for Real-Time Object Detectors
链接:https://arxiv.org/abs/2601.04991
作者:Jens Bayer,Stefan Becker,David Münch,Michael Arens,Jürgen Beyerer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:involving constant pursuit, elaborate action involving, action involving constant, Higher-order adversarial attacks, adversarial
备注: Under review (ICPR2026)
点击查看摘要
Abstract:Higher-order adversarial attacks can directly be considered the result of a cat-and-mouse game -- an elaborate action involving constant pursuit, near captures, and repeated escapes. This idiom describes the enduring circular training of adversarial attack patterns and adversarial training the best. The following work investigates the impact of higher-order adversarial attacks on object detectors by successively training attack patterns and hardening object detectors with adversarial training. The YOLOv10 object detector is chosen as a representative, and adversarial patches are used in an evasion attack manner. Our results indicate that higher-order adversarial patches are not only affecting the object detector directly trained on but rather provide a stronger generalization capacity compared to lower-order adversarial patches. Moreover, the results highlight that solely adversarial training is not sufficient to harden an object detector efficiently against this kind of adversarial attack. Code: this https URL
31. 【2601.04984】OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction
链接:https://arxiv.org/abs/2601.04984
作者:Minseong Kweon,Jinsun Park
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Splatting-based approach, Splatting-based approach, Gaussian Splatting-based, accurately representing, approach for accurately
备注: Accepted to AAAI 2026. Project page: [this https URL](https://oceansplat.github.io)
点击查看摘要
Abstract:We introduce OceanSplat, a novel 3D Gaussian Splatting-based approach for accurately representing 3D geometry in underwater scenes. To overcome multi-view inconsistencies caused by underwater optical degradation, our method enforces trinocular view consistency by rendering horizontally and vertically translated camera views relative to each input view and aligning them via inverse warping. Furthermore, these translated camera views are used to derive a synthetic epipolar depth prior through triangulation, which serves as a self-supervised depth regularizer. These geometric constraints facilitate the spatial optimization of 3D Gaussians and preserve scene structure in underwater environments. We also propose a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians during early training based on their $z$-component and viewing direction, deterring the formation of medium-induced primitives. With our contributions, 3D Gaussians are disentangled from the scattering medium, enabling robust representation of object geometry and significantly reducing floating artifacts in reconstructed underwater scenes. Experiments on real-world underwater and simulated scenes demonstrate that OceanSplat substantially outperforms existing methods for both scene reconstruction and restoration in scattering media.
32. 【2601.04968】SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection
链接:https://arxiv.org/abs/2601.04968
作者:Maximilian Pittner,Joel Janai,Mario Faigle,Alexandru Paul Condurache
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:road surface, autonomous driving, encompassing identification, identification and localization, lane
备注: Published at IEEE/CVF International Conference on Computer Vision (ICCV) 2025
点击查看摘要
Abstract:3D lane detection has emerged as a critical challenge in autonomous driving, encompassing identification and localization of lane markings and the 3D road surface. Conventional 3D methods detect lanes from dense birds-eye-viewed (BEV) features, though erroneous transformations often result in a poor feature representation misaligned with the true 3D road surface. While recent sparse lane detectors have surpassed dense BEV approaches, they completely disregard valuable lane-specific priors. Furthermore, existing methods fail to utilize historic lane observations, which yield the potential to resolve ambiguities in situations of poor visibility. To address these challenges, we present SparseLaneSTP, a novel method that integrates both geometric properties of the lane structure and temporal information into a sparse lane transformer. It introduces a new lane-specific spatio-temporal attention mechanism, a continuous lane representation tailored for sparse architectures as well as temporal regularization. Identifying weaknesses of existing 3D lane datasets, we also introduce a precise and consistent 3D lane dataset using a simple yet effective auto-labeling strategy. Our experimental section proves the benefits of our contributions and demonstrates state-of-the-art performance across all detection and error metrics on existing 3D lane detection benchmarks as well as on our novel dataset.
33. 【2601.04956】EA: Temporal Adaptive Satellite Image Semantic Segmentation
链接:https://arxiv.org/abs/2601.04956
作者:Juyuan Kang,Hao Zhu,Yan Zhu,Wei Zhang,Jianing Chen,Tianxiang Xiao,Yike Ma,Hao Jiang,Feng Dai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Crop mapping based, satellite images time-series, holds substantial economic, agricultural production settings, Crop mapping
备注: Under review. Code will be available at \href{ [this https URL](https://github.com/KeplerKang/TEA) }{this https URL}
点击查看摘要
Abstract:Crop mapping based on satellite images time-series (SITS) holds substantial economic value in agricultural production settings, in which parcel segmentation is an essential step. Existing approaches have achieved notable advancements in SITS segmentation with predetermined sequence lengths. However, we found that these approaches overlooked the generalization capability of models across scenarios with varying temporal length, leading to markedly poor segmentation results in such cases. To address this issue, we propose TEA, a TEmporal Adaptive SITS semantic segmentation method to enhance the model's resilience under varying sequence lengths. We introduce a teacher model that encapsulates the global sequence knowledge to guide a student model with adaptive temporal input lengths. Specifically, teacher shapes the student's feature space via intermediate embedding, prototypes and soft label perspectives to realize knowledge transfer, while dynamically aggregating student model to mitigate knowledge forgetting. Finally, we introduce full-sequence reconstruction as an auxiliary task to further enhance the quality of representations across inputs of varying temporal lengths. Through extensive experiments, we demonstrate that our method brings remarkable improvements across inputs of different temporal lengths on common benchmarks. Our code will be publicly available.
34. 【2601.04946】Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics
链接:https://arxiv.org/abs/2601.04946
作者:Subhadeep Roy,Gagan Bhatia,Steffen Eger
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:central to evaluating, large-scale filtering, Automatic metrics, judgment in benchmarking, benchmarking and large-scale
备注: First version
点击查看摘要
Abstract:Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study \emph{prototypicality bias} as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark \textsc{\textbf{ProtoBias}} (\textit{\textbf{Proto}typical \textbf{Bias}}), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. Motivated by these findings, we propose \textbf{\textsc{ProtoScore}}, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking, while running at orders of magnitude faster than the inference time of GPT-5, approaching the robustness of much larger closed-source judges.
35. 【2601.04912】Decentralized Privacy-Preserving Federal Learning of Computer Vision Models on Edge Devices
链接:https://arxiv.org/abs/2601.04912
作者:Damian Harenčák,Lukáš Gajdošech,Martin Madaras
类目:Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
关键词:Collaborative training, machine learning model, Collaborative, sharing sensitive, learning
备注: Accepted to VISAPP 2026 as Position Paper
点击查看摘要
Abstract:Collaborative training of a machine learning model comes with a risk of sharing sensitive or private data. Federated learning offers a way of collectively training a single global model without the need to share client data, by sharing only the updated parameters from each client's local model. A central server is then used to aggregate parameters from all clients and redistribute the aggregated model back to the clients. Recent findings have shown that even in this scenario, private data can be reconstructed only using information about model parameters. Current efforts to mitigate this are mainly focused on reducing privacy risks on the server side, assuming that other clients will not act maliciously. In this work, we analyzed various methods for improving the privacy of client data concerning both the server and other clients for neural networks. Some of these methods include homomorphic encryption, gradient compression, gradient noising, and discussion on possible usage of modified federated learning systems such as split learning, swarm learning or fully encrypted models. We have analyzed the negative effects of gradient compression and gradient noising on the accuracy of convolutional neural networks used for classification. We have shown the difficulty of data reconstruction in the case of segmentation networks. We have also implemented a proof of concept on the NVIDIA Jetson TX2 module used in edge devices and simulated a federated learning process.
36. 【2601.04899】Rotation-Robust Regression with Convolutional Model Trees
链接:https://arxiv.org/abs/2601.04899
作者:Hongyi Li,William Ward Armstrong,Jun Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Convolutional Model Trees, study rotation-robust learning, Model Trees, deployment time, image inputs
备注:
点击查看摘要
Abstract:We study rotation-robust learning for image inputs using Convolutional Model Trees (CMTs) [1], whose split and leaf coefficients can be structured on the image grid and transformed geometrically at deployment time. In a controlled MNIST setting with a rotation-invariant regression target, we introduce three geometry-aware inductive biases for split directions -- convolutional smoothing, a tilt dominance constraint, and importance-based pruning -- and quantify their impact on robustness under in-plane rotations. We further evaluate a deployment-time orientation search that selects a discrete rotation maximizing a forest-level confidence proxy without updating model parameters. Orientation search improves robustness under severe rotations but can be harmful near the canonical orientation when confidence is misaligned with correctness. Finally, we observe consistent trends on MNIST digit recognition implemented as one-vs-rest regression, highlighting both the promise and limitations of confidence-based orientation selection for model-tree ensembles.
37. 【2601.04897】V-FAT: Benchmarking Visual Fidelity Against Text-bias
链接:https://arxiv.org/abs/2601.04897
作者:Ziteng Wang,Yujie He,Guanliang Li,Siqi Yang,Jiaqi Xiong,Songxiang Liu
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated impressive performance
备注: 12 pages, 6 figures
点击查看摘要
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize "lucky" linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.
38. 【2601.04891】Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform
链接:https://arxiv.org/abs/2601.04891
作者:Suyash Mishra,Qiang Li,Srikanth Patil,Satyanarayan Pati,Baddu Narendra
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Vision Language Models, unconstrained computational resources, shown strong performance, assume unconstrained computational, Vision Language
备注: Submitted to the Industry Track of Top Tier Conference; currently under peer review
点击查看摘要
Abstract:Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4, M4V, etc.), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8/12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new "A+B" model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provide actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.
39. 【2601.04860】DivAS: Interactive 3D Segmentation of NeRFs via Depth-Weighted Voxel Aggregation
链接:https://arxiv.org/abs/2601.04860
作者:Ayush Pande
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Neural Radiance Fields, segmenting Neural Radiance, Radiance Fields, Neural Radiance, segmenting Neural
备注:
点击查看摘要
Abstract:Existing methods for segmenting Neural Radiance Fields (NeRFs) are often optimization-based, requiring slow per-scene training that sacrifices the zero-shot capabilities of 2D foundation models. We introduce DivAS (Depth-interactive Voxel Aggregation Segmentation), an optimization-free, fully interactive framework that addresses these limitations. Our method operates via a fast GUI-based workflow where 2D SAM masks, generated from user point prompts, are refined using NeRF-derived depth priors to improve geometric accuracy and foreground-background separation. The core of our contribution is a custom CUDA kernel that aggregates these refined multi-view masks into a unified 3D voxel grid in under 200ms, enabling real-time visual feedback. This optimization-free design eliminates the need for per-scene training. Experiments on Mip-NeRF 360° and LLFF show that DivAS achieves segmentation quality comparable to optimization-based methods, while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when excluding user prompting time.
40. 【2601.04834】Character Detection using YOLO for Writer Identification in multiple Medieval books
链接:https://arxiv.org/abs/2601.04834
作者:Alessandra Scotto di Freca,Tiziana D Alessandro,Francesco Fontanella,Filippo Sarria,Claudio De Stefano
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:key objectives include, historical handwriting, ancient and historical, key objectives, objectives include
备注: 7 pages, 2 figures, 1 table. Accepted at IEEE-CH 2025
点击查看摘要
Abstract:Paleography is the study of ancient and historical handwriting, its key objectives include the dating of manuscripts and understanding the evolution of writing. Estimating when a document was written and tracing the development of scripts and writing styles can be aided by identifying the individual scribes who contributed to a medieval manuscript. Although digital technologies have made significant progress in this field, the general problem remains unsolved and continues to pose open challenges. ... We previously proposed an approach focused on identifying specific letters or abbreviations that characterize each writer. In that study, we considered the letter "a", as it was widely present on all pages of text and highly distinctive, according to the suggestions of expert paleographers. We used template matching techniques to detect the occurrences of the character "a" on each page and the convolutional neural network (CNN) to attribute each instance to the correct scribe. Moving from the interesting results achieved from this previous system and being aware of the limitations of the template matching technique, which requires an appropriate threshold to work, we decided to experiment in the same framework with the use of the YOLO object detection model to identify the scribe who contributed to the writing of different medieval books. We considered the fifth version of YOLO to implement the YOLO object detection model, which completely substituted the template matching and CNN used in the previous work. The experimental results demonstrate that YOLO effectively extracts a greater number of letters considered, leading to a more accurate second-stage classification. Furthermore, the YOLO confidence score provides a foundation for developing a system that applies a rejection threshold, enabling reliable writer identification even in unseen manuscripts.
41. 【2601.04824】SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models
链接:https://arxiv.org/abs/2601.04824
作者:Oriol Rabasseda,Zenjie Li,Kamal Nasrollahi,Sergio Escalera
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:recurrent behavior analysis, Automatic identification, Surveillance Opposite Vehicle, Opposite Vehicle Actions, identification of events
备注: This work has been accepted at Real World Surveillance: Applications and Challenges, 6th (in WACV Workshops)
点击查看摘要
Abstract:Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.
Comments:
This work has been accepted at Real World Surveillance: Applications and Challenges, 6th (in WACV Workshops)
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2601.04824 [cs.CV]
(or
arXiv:2601.04824v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2601.04824
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
42. 【2601.04800】Integrated Framework for Selecting and Enhancing Ancient Marathi Inscription Images from Stone, Metal Plate, and Paper Documents
链接:https://arxiv.org/abs/2601.04800
作者:Bapu D. Chendage,Rajivkumar S. Mente
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:severe background noise, low contrast, environmental effects, suffer from severe, degradation caused
备注: 9 Pages, 5 figures
点击查看摘要
Abstract:Ancient script images often suffer from severe background noise, low contrast, and degradation caused by aging and environmental effects. In many cases, the foreground text and background exhibit similar visual characteristics, making the inscriptions difficult to read. The primary objective of image enhancement is to improve the readability of such degraded ancient images. This paper presents an image enhancement approach based on binarization and complementary preprocessing techniques for removing stains and enhancing unclear ancient text. The proposed methods are evaluated on different types of ancient scripts, including inscriptions on stone, metal plates, and historical documents. Experimental results show that the proposed approach achieves classification accuracies of 55.7%, 62%, and 65.6% for stone, metal plate, and document scripts, respectively, using the K-Nearest Neighbor (K-NN) classifier. Using the Support Vector Machine (SVM) classifier, accuracies of 53.2%, 59.5%, and 67.8% are obtained. The results demonstrate the effectiveness of the proposed enhancement method in improving the readability of ancient Marathi inscription images.
43. 【2601.04798】Detector-Augmented SAMURAI for Long-Duration Drone Tracking
链接:https://arxiv.org/abs/2601.04798
作者:Tamara R. Lenhard,Andreas Weinmann,Hichem Snoussi,Tobias Koch
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:modern surveillance systems, increasing threat potential, critical requirement, requirement for modern, increasing threat
备注: Accepted at the WACV 2026 Workshop on "Real World Surveillance: Applications and Challenges"
点击查看摘要
Abstract:Robust long-term tracking of drone is a critical requirement for modern surveillance systems, given their increasing threat potential. While detector-based approaches typically achieve strong frame-level accuracy, they often suffer from temporal inconsistencies caused by frequent detection dropouts. Despite its practical relevance, research on RGB-based drone tracking is still limited and largely reliant on conventional motion models. Meanwhile, foundation models like SAMURAI have established their effectiveness across other domains, exhibiting strong category-agnostic tracking performance. However, their applicability in drone-specific scenarios has not been investigated yet. Motivated by this gap, we present the first systematic evaluation of SAMURAI's potential for robust drone tracking in urban surveillance settings. Furthermore, we introduce a detector-augmented extension of SAMURAI to mitigate sensitivity to bounding-box initialization and sequence length. Our findings demonstrate that the proposed extension significantly improves robustness in complex urban environments, with pronounced benefits in long-duration sequences - especially under drone exit-re-entry events. The incorporation of detector cues yields consistent gains over SAMURAI's zero-shot performance across datasets and metrics, with success rate improvements of up to +0.393 and FNR reductions of up to -0.475.
44. 【2601.04792】PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference
链接:https://arxiv.org/abs/2601.04792
作者:Denis Korzhenkov,Adil Karjauv,Animesh Karnewar,Mohsen Ghafoorian,Amirhossein Habibian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recently proposed pyramidal, multiple stages operating, Recently proposed, backward diffusion processes, decompose the conventional
备注:
点击查看摘要
Abstract:Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency. Our results are available at this https URL.
45. 【2601.04791】Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers
链接:https://arxiv.org/abs/2601.04791
作者:Lee Hyoseok,Sohwi Lim,Eunju Cha,Tae-Hyun Oh
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Latent Diffusion Models, diffusion models, generative models, recent advances, advances in generative
备注: Under Review
点击查看摘要
Abstract:With recent advances in generative models, diffusion models have emerged as powerful priors for solving inverse problems in each domain. Since Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers suffer from their instability, exhibiting undesirable artifacts and degraded quality. In this work, we first identify the instability as a discrepancy between the solver's and true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies the LDM-based inverse solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption, leading to more stable and reliable behavior. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. Additionally, we analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.
46. 【2601.04785】SRU-Pix2Pix: A Fusion-Driven Generator Network for Medical Image Translation with Few-Shot Learning
链接:https://arxiv.org/abs/2601.04785
作者:Xihe Qiu,Yang Dai,Xiaoyu Tan,Sijia Li,Fenghao Sun,Lu Gan,Liang Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Magnetic Resonance Imaging, Magnetic Resonance, Resonance Imaging, detailed tissue information, long acquisition time
备注:
点击查看摘要
Abstract:Magnetic Resonance Imaging (MRI) provides detailed tissue information, but its clinical application is limited by long acquisition time, high cost, and restricted resolution. Image translation has recently gained attention as a strategy to address these limitations. Although Pix2Pix has been widely applied in medical image translation, its potential has not been fully explored. In this study, we propose an enhanced Pix2Pix framework that integrates Squeeze-and-Excitation Residual Networks (SEResNet) and U-Net++ to improve image generation quality and structural fidelity. SEResNet strengthens critical feature representation through channel attention, while U-Net++ enhances multi-scale feature fusion. A simplified PatchGAN discriminator further stabilizes training and refines local anatomical realism. Experimental results demonstrate that under few-shot conditions with fewer than 500 images, the proposed method achieves consistent structural fidelity and superior image quality across multiple intra-modality MRI translation tasks, showing strong generalization ability. These results suggest an effective extension of Pix2Pix for medical image translation.
47. 【2601.04779】Defocus Aberration Theory Confirms Gaussian Model in Most Imaging Devices
链接:https://arxiv.org/abs/2601.04779
作者:Akbar Saadat
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:consistently provided groundbreaking, provided groundbreaking depth, groundbreaking depth information, Gaussian model, past three decades
备注: 13 pages, 9 figures, 11 .jpg files
点击查看摘要
Abstract:Over the past three decades, defocus has consistently provided groundbreaking depth information in scene images. However, accurately estimating depth from 2D images continues to be a persistent and fundamental challenge in the field of 3D recovery. Heuristic approaches involve with the ill-posed problem for inferring the spatial variant defocusing blur, as the desired blur cannot be distinguished from the inherent blur. Given a prior knowledge of the defocus model, the problem become well-posed with an analytic solution for the relative blur between two images, taken at the same viewpoint with different camera settings for the focus. The Gaussian model stands out as an optimal choice for real-time applications, due to its mathematical simplicity and computational efficiency. And theoretically, it is the only model can be applied at the same time to both the absolute blur caused by depth in a single image and the relative blur resulting from depth differences between two images. This paper introduces the settings, for conventional imaging devices, to ensure that the defocusing operator adheres to the Gaussian model. Defocus analysis begins within the framework of geometric optics and is conducted by defocus aberration theory in diffraction-limited optics to obtain the accuracy of fitting the actual model to its Gaussian approximation. The results for a typical set of focused depths between $1$ and $100$ meters, with a maximum depth variation of $10\%$ at the focused depth, confirm the Gaussian model's applicability for defocus operators in most imaging devices. The findings demonstrate a maximum Mean Absolute Error $(\!M\!A\!E)$ of less than $1\%$, underscoring the model's accuracy and reliability.
48. 【2601.04778】CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models
链接:https://arxiv.org/abs/2601.04778
作者:Tobia Poppi,Burak Uzkent,Amanmeet Garg,Lucas Porto,Garin Kessler,Yezhou Yang,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara,Florian Schiffers
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:achieve strong multimodal, strong multimodal understanding, achieve strong, Video-language models, understanding but remain
备注:
点击查看摘要
Abstract:Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.
49. 【2601.04777】GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models
链接:https://arxiv.org/abs/2601.04777
作者:Shurong Zheng,Yousong Zhu,Hongyin Zhao,Fan Yang,Yufei Zhan,Ming Tang,Jinqiao Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrated impressive progress
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods begin to address multi-image grounding. However, they are constrained by single-target localization and limited types of practical tasks, due to the lack of unified modeling for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance of cross-image cues and reasoning, and introduce the MG-Data-240K dataset, addressing the limitations of existing datasets regarding target quantity and image relation. To tackle the challenges of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, considering their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model's overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.
50. 【2601.04776】Segmentation-Driven Monocular Shape from Polarization based on Physical Model
链接:https://arxiv.org/abs/2601.04776
作者:Jinyu Zhang,Xu Ma,Weili Chen,Gonzalo R. Arce
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:single-view polarized images, light polarization properties, recover surface normals, monocular SfP, leverages the intrinsic
备注: 11 pages, 10 figures, submittd to IEEE Transactions on Image Processing
点击查看摘要
Abstract:Monocular shape-from-polarization (SfP) leverages the intrinsic relationship between light polarization properties and surface geometry to recover surface normals from single-view polarized images, providing a compact and robust approach for three-dimensional (3D) reconstruction. Despite its potential, existing monocular SfP methods suffer from azimuth angle ambiguity, an inherent limitation of polarization analysis, that severely compromises reconstruction accuracy and stability. This paper introduces a novel segmentation-driven monocular SfP (SMSfP) framework that reformulates global shape recovery into a set of local reconstructions over adaptively segmented convex sub-regions. Specifically, a polarization-aided adaptive region growing (PARG) segmentation strategy is proposed to decompose the global convexity assumption into locally convex regions, effectively suppressing azimuth ambiguities and preserving surface continuity. Furthermore, a multi-scale fusion convexity prior (MFCP) constraint is developed to ensure local surface consistency and enhance the recovery of fine textural and structural details. Extensive experiments on both synthetic and real-world datasets validate the proposed approach, showing significant improvements in disambiguation accuracy and geometric fidelity compared with existing physics-based monocular SfP techniques.
51. 【2601.04754】ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting
链接:https://arxiv.org/abs/2601.04754
作者:Yen-Jen Chiou,Wei-Tse Cheng,Yuan-Fu Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:efficient context-aware framework, Gaussian Splatting, efficient context-aware, context-aware framework, Splatting
备注: 10 pages, 5 figures
点击查看摘要
Abstract:We present ProFuse, an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup, adding minimal overhead and requiring no render-supervised fine-tuning. Instead of relying on a pretrained 3DGS scene, we introduce a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, and this feature is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. With associations established in advance, semantic fusion requires no additional optimization beyond standard reconstruction, and the model retains geometric refinement without densification. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, which is two times faster than SOTA.
52. 【2601.04752】Skeletonization-Based Adversarial Perturbations on Large Vision Language Model's Mathematical Text Recognition
链接:https://arxiv.org/abs/2601.04752
作者:Masatomo Yoshida,Haruto Namura,Nicola Adami,Masahiro Okuda
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:search space effectively, adversarial attack method, attack method utilizing, method utilizing skeletonization, space effectively
备注: accepted to ITC-CSCC 2025
点击查看摘要
Abstract:This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method utilizing skeletonization to reduce the search space effectively. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure. We conduct a detailed evaluation of both character and semantic changes between original and adversarially perturbed outputs to provide insights into the models' visual interpretation and reasoning abilities. The effectiveness of our method is further demonstrated through its application to ChatGPT, which shows its practical implications in real-world scenarios.
53. 【2601.04734】AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection
链接:https://arxiv.org/abs/2601.04734
作者:Yunqing Hu,Zheming Yang,Chang Zhao,Qi Guo,Meng Gao,Pengcheng Li,Wen Ji
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal large language, large language models, resource-constrained edge-cloud deployment, precise object localization, Multimodal large
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, this paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through the collaboration between lightweight edge detectors and cloud-based MLLMs. To enhance the cloud MLLM's robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results demonstrate that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.
54. 【2601.04727】raining a Custom CNN on Five Heterogeneous Image Datasets
链接:https://arxiv.org/abs/2601.04727
作者:Anika Tabassum,Tasnuva Mahazabin Tuba,Nafisa Naznin
类目:Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
关键词:Convolutional Neural Networks, meaningful feature representations, feature representations directly, transformed visual data, learning meaningful feature
备注:
点击查看摘要
Abstract:Deep learning has transformed visual data analysis, with Convolutional Neural Networks (CNNs) becoming highly effective in learning meaningful feature representations directly from images. Unlike traditional manual feature engineering methods, CNNs automatically extract hierarchical visual patterns, enabling strong performance across diverse real-world contexts. This study investigates the effectiveness of CNN-based architectures across five heterogeneous datasets spanning agricultural and urban domains: mango variety classification, paddy variety identification, road surface condition assessment, auto-rickshaw detection, and footpath encroachment monitoring. These datasets introduce varying challenges, including differences in illumination, resolution, environmental complexity, and class imbalance, necessitating adaptable and robust learning models. We evaluate a lightweight, task-specific custom CNN alongside established deep architectures, including ResNet-18 and VGG-16, trained both from scratch and using transfer learning. Through systematic preprocessing, augmentation, and controlled experimentation, we analyze how architectural complexity, model depth, and pre-training influence convergence, generalization, and performance across datasets of differing scale and difficulty. The key contributions of this work are: (1) the development of an efficient custom CNN that achieves competitive performance across multiple application domains, and (2) a comprehensive comparative analysis highlighting when transfer learning and deep architectures provide substantial advantages, particularly in data-constrained environments. These findings offer practical insights for deploying deep learning models in resource-limited yet high-impact real-world visual classification tasks.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Cite as:
arXiv:2601.04727 [cs.CV]
(or
arXiv:2601.04727v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2601.04727
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
55. 【2601.04715】On the Holistic Approach for Detecting Human Image Forgery
链接:https://arxiv.org/abs/2601.04715
作者:Xiao Guo,Jie Zhu,Anil Jain,Xiaoming Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:photorealistic human bodies, entire photorealistic human, Large Language Model, human image forgery, AI-generated content
备注: 6 figures, 5 tables
点击查看摘要
Abstract:The rapid advancement of AI-generated content (AIGC) has escalated the threat of deepfakes, from facial manipulations to the synthesis of entire photorealistic human bodies. However, existing detection methods remain fragmented, specializing either in facial-region forgeries or full-body synthetic images, and consequently fail to generalize across the full spectrum of human image manipulations. We introduce HuForDet, a holistic framework for human image forgery detection, which features a dual-branch architecture comprising: (1) a face forgery detection branch that employs heterogeneous experts operating in both RGB and frequency domains, including an adaptive Laplacian-of-Gaussian (LoG) module designed to capture artifacts ranging from fine-grained blending boundaries to coarse-scale texture irregularities; and (2) a contextualized forgery detection branch that leverages a Multi-Modal Large Language Model (MLLM) to analyze full-body semantic consistency, enhanced with a confidence estimation mechanism that dynamically weights its contribution during feature fusion. We curate a human image forgery (HuFor) dataset that unifies existing face forgery data with a new corpus of full-body synthetic humans. Extensive experiments show that our HuForDet achieves state-of-the-art forgery detection performance and superior robustness across diverse human image forgeries.
56. 【2601.04706】Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models
链接:https://arxiv.org/abs/2601.04706
作者:Yanbing Zeng,Jia Wang,Hanghang Ma,Junqiang Wu,Jie Zhu,Xiaoming Wei,Jie Hu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Integrating image generation, Integrating image, pivotal goal, Bridge Adapter, Bridge Feature
备注:
点击查看摘要
Abstract:Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM's inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application. Models and codes are available at this https URL.
57. 【2601.04692】See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation
链接:https://arxiv.org/abs/2601.04692
作者:Naquee Rizwan,Subhankar Swain,Paramananda Bhaskar,Gagan Aryan,Shehryaar Shah Khan,Animesh Mukherjee
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:complementary angles, examine hateful memes, intervene them prior, applying a range, range of strategies
备注:
点击查看摘要
Abstract:In this work, we examine hateful memes from three complementary angles - how to detect them, how to explain their content and how to intervene them prior to being posted - by applying a range of strategies built on top of generative AI models. To the best of our knowledge, explanation and intervention have typically been studied separately from detection, which does not reflect real-world conditions. Further, since curating large annotated datasets for meme moderation is prohibitively expensive, we propose a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to cater to different types of memes. We believe this is the first work focused on generalizable hateful meme moderation under limited data conditions, and has strong potential for deployment in real-world production scenarios. Warning: Contains potentially toxic contents.
58. 【2601.04687】WebCryptoAgent: Agentic Crypto Trading with Web Informatics
链接:https://arxiv.org/abs/2601.04687
作者:Ali Kurban,Wei Luo,Liangyu Zuo,Zeyu Zhang,Renda Han,Zhaolu Kang,Hao Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:heterogeneous web information, support short-horizon decision, trading increasingly depends, extreme volatility, increasingly depends
备注:
点击查看摘要
Abstract:Cryptocurrency trading increasingly depends on timely integration of heterogeneous web information and market microstructure signals to support short-horizon decision making under extreme volatility. However, existing trading systems struggle to jointly reason over noisy multi-source web evidence while maintaining robustness to rapid price shocks at sub-second timescales. The first challenge lies in synthesizing unstructured web content, social sentiment, and structured OHLCV signals into coherent and interpretable trading decisions without amplifying spurious correlations, while the second challenge concerns risk control, as slow deliberative reasoning pipelines are ill-suited for handling abrupt market shocks that require immediate defensive responses. To address these challenges, we propose WebCryptoAgent, an agentic trading framework that decomposes web-informed decision making into modality-specific agents and consolidates their outputs into a unified evidence document for confidence-calibrated reasoning. We further introduce a decoupled control architecture that separates strategic hourly reasoning from a real-time second-level risk model, enabling fast shock detection and protective intervention independent of the trading loop. Extensive experiments on real-world cryptocurrency markets demonstrate that WebCryptoAgent improves trading stability, reduces spurious activity, and enhances tail-risk handling compared to existing baselines. Code will be available at this https URL.
59. 【2601.04682】HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution
链接:https://arxiv.org/abs/2601.04682
作者:Yang Zou,Xingyue Zhu,Kaiqi Han,Jun Ma,Xingyuan Li,Zhiying Jiang,Jinyuan Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:severe atmospheric turbulence, InfraRed Video Super-Resolution, challenging environments, Existing video super-resolution, great interest
备注:
点击查看摘要
Abstract:Infrared video has been of great interest in visual tasks under challenging environments, but often suffers from severe atmospheric turbulence and compression degradation. Existing video super-resolution (VSR) methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions. Directly cascading turbulence mitigation (TM) algorithms with VSR methods leads to error propagation and accumulation due to the decoupled modeling of degradation between turbulence and resolution. We introduce HATIR, a Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution, which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, HATIR constructs a Phasor-Guided Flow Estimator, rooted in the physical principle that thermally active regions exhibit consistent phasor responses over time, enabling reliable turbulence-aware flow to guide the reverse diffusion process. To ensure the fidelity of structural recovery under nonuniform distortions, a Turbulence-Aware Decoder is proposed to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation via turbulence gating and structure-aware attention. We built FLIR-IVSR, the first dataset for turbulent infrared VSR, comprising paired LR-HR sequences from a FLIR T1050sc camera (1024 X 768) spanning 640 diverse scenes with varying camera and object motion conditions. This encourages future research in infrared VSR. Project page: this https URL
60. 【2601.04676】DB-MSMUNet:Dual Branch Multi-scale Mamba UNet for Pancreatic CT Scans Segmentation
链接:https://arxiv.org/abs/2601.04676
作者:Qiu Guan,Zhiqiang Yang,Dezhang Ye,Yang Chen,Xinli Xu,Ying Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multi-scale Mamba Module, scans is crucial, precise diagnosis, diagnosis and treatment, Multi-scale Mamba
备注:
点击查看摘要
Abstract:Accurate segmentation of the pancreas and its lesions in CT scans is crucial for the precise diagnosis and treatment of pancreatic cancer. However, it remains a highly challenging task due to several factors such as low tissue contrast with surrounding organs, blurry anatomical boundaries, irregular organ shapes, and the small size of lesions. To tackle these issues, we propose DB-MSMUNet (Dual-Branch Multi-scale Mamba UNet), a novel encoder-decoder architecture designed specifically for robust pancreatic segmentation. The encoder is constructed using a Multi-scale Mamba Module (MSMM), which combines deformable convolutions and multi-scale state space modeling to enhance both global context modeling and local deformation adaptation. The network employs a dual-decoder design: the edge decoder introduces an Edge Enhancement Path (EEP) to explicitly capture boundary cues and refine fuzzy contours, while the area decoder incorporates a Multi-layer Decoder (MLD) to preserve fine-grained details and accurately reconstruct small lesions by leveraging multi-scale deep semantic features. Furthermore, Auxiliary Deep Supervision (ADS) heads are added at multiple scales to both decoders, providing more accurate gradient feedback and further enhancing the discriminative capability of multi-scale features. We conduct extensive experiments on three datasets: the NIH Pancreas dataset, the MSD dataset, and a clinical pancreatic tumor dataset provided by collaborating hospitals. DB-MSMUNet achieves Dice Similarity Coefficients of 89.47%, 87.59%, and 89.02%, respectively, outperforming most existing state-of-the-art methods in terms of segmentation accuracy, edge preservation, and robustness across different datasets. These results demonstrate the effectiveness and generalizability of the proposed method for real-world pancreatic CT segmentation tasks.
61. 【2601.04672】Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning
链接:https://arxiv.org/abs/2601.04672
作者:Wentao Zhang,Lifei Wang,Lina Lu,MingKun Xu,Shangyang Li,Yanchao Yang,Tao Fang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:diagnosis challenges VLMs, requires extensive labels, disease diagnosis challenges, conventional fine-tuning requires, fine-tuning requires extensive
备注: This paper is submitted for review to ACL 2026. It is 17 pages long and includes 5 figures. The corresponding authors are Tao Fang and Lina Lu
点击查看摘要
Abstract:Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose \textbf{Agri-R1}, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19\% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a novel proposed reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2\% relative gain in disease recognition accuracy, +33.3\% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.
62. 【2601.04614】HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment
链接:https://arxiv.org/abs/2601.04614
作者:Wenzhi Chen,Bo Hu,Leida Li,Lihuo He,Wen Lu,Xinbo Gao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generation technology, accurately assessing, critical challenge, rapid development, generated images
备注:
点击查看摘要
Abstract:With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, while lacking adaptive capabilities for different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.
63. 【2601.04607】HUR-MACL: High-Uncertainty Region-Guided Multi-Architecture Collaborative Learning for Head and Neck Multi-Organ Segmentation
链接:https://arxiv.org/abs/2601.04607
作者:Xiaoyu Liu,Siwen Wei,Linhao Qu,Mingyuan Pan,Chengsheng Zhang,Yonghong Shi,Zhijian Song
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:complexly shaped organs, Accurate segmentation, radiation therapy, fail on small, complexly shaped
备注:
点击查看摘要
Abstract:Accurate segmentation of organs at risk in the head and neck is essential for radiation therapy, yet deep learning models often fail on small, complexly shaped organs. While hybrid architectures that combine different models show promise, they typically just concatenate features without exploiting the unique strengths of each component. This results in functional overlap and limited segmentation accuracy. To address these issues, we propose a high uncertainty region-guided multi-architecture collaborative learning (HUR-MACL) model for multi-organ segmentation in the head and neck. This model adaptively identifies high uncertainty regions using a convolutional neural network, and for these regions, Vision Mamba as well as Deformable CNN are utilized to jointly improve their segmentation accuracy. Additionally, a heterogeneous feature distillation loss was proposed to promote collaborative learning between the two architectures in high uncertainty regions to further enhance performance. Our method achieves SOTA results on two public datasets and one private dataset.
64. 【2601.04605】Detection of Deployment Operational Deviations for Safety and Security of AI-Enabled Human-Centric Cyber Physical Systems
链接:https://arxiv.org/abs/2601.04605
作者:Bernard Ngabonziza,Ayan Banerjee,Sandeep K.S. Gupta
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:increasingly involved artificial, involved artificial intelligence, enable knowledge extraction, recent years, sensor-collected data
备注:
点击查看摘要
Abstract:In recent years, Human-centric cyber-physical systems have increasingly involved artificial intelligence to enable knowledge extraction from sensor-collected data. Examples include medical monitoring and control systems, as well as autonomous cars. Such systems are intended to operate according to the protocols and guidelines for regular system operations. However, in many scenarios, such as closed-loop blood glucose control for Type 1 diabetics, self-driving cars, and monitoring systems for stroke diagnosis. The operations of such AI-enabled human-centric applications can expose them to cases for which their operational mode may be uncertain, for instance, resulting from the interactions with a human with the system. Such cases, in which the system is in uncertain conditions, can violate the system's safety and security requirements. This paper will discuss operational deviations that can lead these systems to operate in unknown conditions. We will then create a framework to evaluate different strategies for ensuring the safety and security of AI-enabled human-centric cyber-physical systems in operation deployment. Then, as an example, we show a personalized image-based novel technique for detecting the non-announcement of meals in closed-loop blood glucose control for Type 1 diabetics.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2601.04605 [cs.CV]
(or
arXiv:2601.04605v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2601.04605
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
65. 【2601.04589】MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing
链接:https://arxiv.org/abs/2601.04589
作者:Zihao Lin,Wanrong Zhu,Jiuxiang Gu,Jihyung Kil,Christopher Tensmeyer,Lin Zhang,Shilong Liu,Ruiyi Zhang,Lifu Huang,Vlad I. Morariu,Tong Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Real-world design documents, Real-world design, combining decoration, Multi-Layer Document Editing, document editing
备注:
点击查看摘要
Abstract:Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce the MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions including instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.
66. 【2601.04588】3D Conditional Image Synthesis of Left Atrial LGE MRI from Composite Semantic Masks
链接:https://arxiv.org/abs/2601.04588
作者:Yusri Al-Sanaani,Rebecca Thornhill,Sreeraman Rajan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:quantifying atrial fibrosis, atrial fibrillation, left atrial, quantifying atrial, atrial fibrosis
备注: This work has been published in the Proceedings of the 2025 IEEE International Conference on Imaging Systems and Techniques (IST). The final published version is available via IEEE Xplore
点击查看摘要
Abstract:Segmentation of the left atrial (LA) wall and endocardium from late gadolinium-enhanced (LGE) MRI is essential for quantifying atrial fibrosis in patients with atrial fibrillation. The development of accurate machine learning-based segmentation models remains challenging due to the limited availability of data and the complexity of anatomical structures. In this work, we investigate 3D conditional generative models as potential solution for augmenting scarce LGE training data and improving LA segmentation performance. We develop a pipeline to synthesize high-fidelity 3D LGE MRI volumes from composite semantic label maps combining anatomical expert annotations with unsupervised tissue clusters, using three 3D conditional generators (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM). The synthetic images are evaluated for realism and their impact on downstream LA segmentation. SPADE-LDM generates the most realistic and structurally accurate images, achieving an FID of 4.063 and surpassing GAN models, which have FIDs of 40.821 and 7.652 for Pix2Pix and SPADE-GAN, respectively. When augmented with synthetic LGE images, the Dice score for LA cavity segmentation with a 3D U-Net model improved from 0.908 to 0.936, showing a statistically significant improvement (p 0.05) over the this http URL findings demonstrate the potential of label-conditioned 3D synthesis to enhance the segmentation of under-represented cardiac structures.
67. 【2601.04567】All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction
链接:https://arxiv.org/abs/2601.04567
作者:Ziyou Jiang,Mingyang Li,Junjie Wang,Yuekai Huang,Jie Huang,Zhiyuan Chang,Zhaoyang Li,Qing Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Internet communities, design concept, Harmful memes, memes, Harmful
备注: 18 pages, 11 figures
点击查看摘要
Abstract:Harmful memes are ever-shifting in the Internet communities, which are difficult to analyze due to their type-shifting and temporal-evolving nature. Although these memes are shifting, we find that different memes may share invariant principles, i.e., the underlying design concept of malicious users, which can help us analyze why these memes are harmful. In this paper, we propose RepMD, an ever-shifting harmful meme detection method based on the design concept reproduction. We first refer to the attack tree to define the Design Concept Graph (DCG), which describes steps that people may take to design a harmful meme. Then, we derive the DCG from historical memes with design step reproduction and graph pruning. Finally, we use DCG to guide the Multimodal Large Language Model (MLLM) to detect harmful memes. The evaluation results show that RepMD achieves the highest accuracy with 81.1% and has slight accuracy decreases when generalized to type-shifting and temporal-evolving memes. Human evaluation shows that RepMD can improve the efficiency of human discovery on harmful memes, with 15$\sim$30 seconds per meme.
68. 【2601.04563】A Vision for Multisensory Intelligence: Sensing, Synergy, and Science
链接:https://arxiv.org/abs/2601.04563
作者:Paul Pu Liang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:spanning a synthesis, synthesis of language, multisensory artificial intelligence, artificial intelligence, Multisensory Intelligence group
备注:
点击查看摘要
Abstract:Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals from physiological and tactile cues on the body, to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, developing a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross-modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of latest advances from the Multisensory Intelligence group at the MIT Media Lab, see this https URL.
69. 【2601.04520】FaceRefiner: High-Fidelity Facial Texture Refinement with Differentiable Rendering-based Style Transfer
链接:https://arxiv.org/abs/2601.04520
作者:Chengyang Li,Baoping Cheng,Yao Cheng,Haocheng Zhang,Renshuai Liu,Yinglin Zheng,Jing Liao,Xuan Cheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent facial texture, compelling full texture, Recent facial, synthesize image content, generation methods prefer
备注: Accepted by IEEE Transactions on Multimedia
点击查看摘要
Abstract:Recent facial texture generation methods prefer to use deep networks to synthesize image content and then fill in the UV map, thus generating a compelling full texture from a single image. Nevertheless, the synthesized texture UV map usually comes from a space constructed by the training data or the 2D face generator, which limits the methods' generalization ability for in-the-wild input images. Consequently, their facial details, structures and identity may not be consistent with the input. In this paper, we address this issue by proposing a style transfer-based facial texture refinement method named FaceRefiner. FaceRefiner treats the 3D sampled texture as style and the output of a texture generation method as content. The photo-realistic style is then expected to be transferred from the style image to the content image. Different from current style transfer methods that only transfer high and middle level information to the result, our style transfer method integrates differentiable rendering to also transfer low level (or pixel level) information in the visible face regions. The main benefit of such multi-level information transfer is that, the details, structures and semantics in the input can thus be well preserved. The extensive experiments on Multi-PIE, CelebA and FFHQ datasets demonstrate that our refinement method can improve the texture quality and the face identity preserving ability, compared with state-of-the-arts.
70. 【2601.04519】okenSeg: Efficient 3D Medical Image Segmentation via Hierarchical Visual Token Compression
链接:https://arxiv.org/abs/2601.04519
作者:Sen Zeng,Hong Zhou,Zheng Zhu,Yang Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computationally demanding task, demanding task due, Three-dimensional medical image, medical image segmentation, Three-dimensional medical
备注:
点击查看摘要
Abstract:Three-dimensional medical image segmentation is a fundamental yet computationally demanding task due to the cubic growth of voxel processing and the redundant computation on homogeneous regions. To address these limitations, we propose \textbf{TokenSeg}, a boundary-aware sparse token representation framework for efficient 3D medical volume segmentation. Specifically, (1) we design a \emph{multi-scale hierarchical encoder} that extracts 400 candidate tokens across four resolution levels to capture both global anatomical context and fine boundary details; (2) we introduce a \emph{boundary-aware tokenizer} that combines VQ-VAE quantization with importance scoring to select 100 salient tokens, over 60\% of which lie near tumor boundaries; and (3) we develop a \emph{sparse-to-dense decoder} that reconstructs full-resolution masks through token reprojection, progressive upsampling, and skip connections. Extensive experiments on a 3D breast DCE-MRI dataset comprising 960 cases demonstrate that TokenSeg achieves state-of-the-art performance with 94.49\% Dice and 89.61\% IoU, while reducing GPU memory and inference latency by 64\% and 68\%, respectively. To verify the generalization capability, our evaluations on MSD cardiac and brain MRI benchmark datasets demonstrate that TokenSeg consistently delivers optimal performance across heterogeneous anatomical structures. These results highlight the effectiveness of anatomically informed sparse representation for accurate and efficient 3D medical image segmentation.
71. 【2601.04510】owards Spatio-Temporal Extrapolation of Phase-Field Simulations with Convolution-Only Neural Networks
链接:https://arxiv.org/abs/2601.04510
作者:Christophe Bonneville,Nathan Bieberdorf,Pieterjan Robbe,Mark Asta,Habib Najm,Laurent Capolungo,Cosmin Safta
类目:Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
关键词:liquid metal dealloying, capture complex microstructural, complex microstructural evolutions, metal dealloying, liquid metal
备注:
点击查看摘要
Abstract:Phase-field simulations of liquid metal dealloying (LMD) can capture complex microstructural evolutions but can be prohibitively expensive for large domains and long time horizons. In this paper, we introduce a fully convolutional, conditionally parameterized U-Net surrogate designed to extrapolate far beyond its training data in both space and time. The architecture integrates convolutional self-attention, physically informed padding, and a flood-fill corrector method to maintain accuracy under extreme extrapolation, while conditioning on simulation parameters allows for flexible time-step skipping and adaptation to varying alloy compositions. To remove the need for costly solver-based initialization, we couple the surrogate with a conditional diffusion model that generates synthetic, physically consistent initial conditions. We train our surrogate on simulations generated over small domain sizes and short time spans, but, by taking advantage of the convolutional nature of U-Nets, we are able to run and extrapolate surrogate simulations for longer time horizons than what would be achievable with classic numerical solvers. Across multiple alloy compositions, the framework is able to reproduce the LMD physics accurately. It predicts key quantities of interest and spatial statistics with relative errors typically below 5% in the training regime and under 15% during large-scale, long time-horizon extrapolations. Our framework can also deliver speed-ups of up to 36,000 times, bringing the time to run weeks-long simulations down to a few seconds. This work is a first stepping stone towards high-fidelity extrapolation in both space and time of phase-field simulation for LMD.
72. 【2601.04498】IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation
链接:https://arxiv.org/abs/2601.04498
作者:Yinghao Tang,Xueding Liu,Boyuan Zhang,Tingfeng Lan,Yupeng Xie,Jiale Lao,Yiyao Wang,Haoxuan Li,Tingting Gao,Bo Pan,Luoxuan Weng,Xiuqi Huang,Minfeng Zhu,Yingchaojie Feng,Yuyu Luo,Wei Chen
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:composite visual artifacts, combine data visualizations, communicate information, composite visual, visual artifacts
备注:
点击查看摘要
Abstract:Infographics are composite visual artifacts that combine data visualizations with textual and illustrative elements to communicate information. While recent text-to-image (T2I) models can generate aesthetically appealing images, their reliability in generating infographics remains unclear. Generated infographics may appear correct at first glance but contain easily overlooked issues, such as distorted data encoding or incorrect textual content. We present IGENBENCH, the first benchmark for evaluating the reliability of text-to-infographic generation, comprising 600 curated test cases spanning 30 infographic types. We design an automated evaluation framework that decomposes reliability verification into atomic yes/no questions based on a taxonomy of 10 question types. We employ multimodal large language models (MLLMs) to verify each question, yielding question-level accuracy (Q-ACC) and infographic-level accuracy (I-ACC). We comprehensively evaluate 10 state-of-the-art T2I models on IGENBENCH. Our systematic analysis reveals key insights for future model development: (i) a three-tier performance hierarchy with the top model achieving Q-ACC of 0.90 but I-ACC of only 0.49; (ii) data-related dimensions emerging as universal bottlenecks (e.g., Data Completeness: 0.21); and (iii) the challenge of achieving end-to-end correctness across all models. We release IGENBENCH at this https URL.
73. 【2601.04497】Vision-Language Agents for Interactive Forest Change Analysis
链接:https://arxiv.org/abs/2601.04497
作者:James Brock,Ce Zhang,Nantheera Anantrasirichai
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Modern forest monitoring, monitoring workflows increasingly, workflows increasingly benefit, forest monitoring workflows, Modern forest
备注: 5 pages, 4 figures, Submitted to IGARSS 2026
点击查看摘要
Abstract:Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at this https URL.
74. 【2601.04453】UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving
链接:https://arxiv.org/abs/2601.04453
作者:Zhexiao Xiong,Xin Ye,Burhan Yaman,Sheng Cheng,Yiren Lu,Jingru Luo,Nathan Jacobs,Liu Ren
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:safe control, crucial for safe, accurate scene understanding, future, VLM-based world model
备注: Project Page: [this https URL](https://unidrive-wm.github.io/UniDrive-WM)
点击查看摘要
Abstract:World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at this https URL .
75. 【2601.04442】Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
链接:https://arxiv.org/abs/2601.04442
作者:Xingjian Diao,Zheyuan Liu,Chunhui Zhang,Weiyi Wu,Keyi Kong,Lin Shi,Kaize Ding,Soroush Vosoughi,Jiang Gui
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, Large Vision-Language, mechanisms that generate, exhibited strong reasoning, strong reasoning capabilities
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.
76. 【2601.04428】CRUNet-MR-Univ: A Foundation Model for Diverse Cardiac MRI Reconstruction
链接:https://arxiv.org/abs/2601.04428
作者:Donghang Lyu,Marius Staring,Hildo Lamb,Mariya Doneva
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Cardiac MRI, higher acceleration factors, handling higher acceleration, field of Cardiac, cal applications
备注: STACOM 2025
点击查看摘要
Abstract:In recent years, deep learning has attracted increasing at- tention in the field of Cardiac MRI (CMR) reconstruction due to its superior performance over traditional methods, particularly in handling higher acceleration factors, highlighting its potential for real-world clini- cal applications. However, current deep learning methods remain limited in generalizability. CMR scans exhibit wide variability in image contrast, sampling patterns, scanner vendors, anatomical structures, and disease types. Most existing models are designed to handle only a single or nar- row subset of these variations, leading to performance degradation when faced with distribution shifts. Therefore, it is beneficial to develop a unified model capable of generalizing across diverse CMR scenarios. To this end, we propose CRUNet-MR-Univ, a foundation model that lever- ages spatio-temporal correlations and prompt-based priors to effectively handle the full diversity of CMR scans. Our approach consistently out- performs baseline methods across a wide range of settings, highlighting its effectiveness and promise.
77. 【2601.04405】From Preoperative CT to Postmastoidectomy Mesh Construction:1Mastoidectomy Shape Prediction for Cochlear Implant Surgery
链接:https://arxiv.org/abs/2601.04405
作者:Yike Zhang,Eduardo Davalos,Dingjie Su,Ange Lou,Jack Noble
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Cochlear Implant, surgery treats severe, treats severe hearing, severe hearing loss, surgery treats
备注: arXiv admin note: substantial text overlap with [arXiv:2505.18368](https://arxiv.org/abs/2505.18368)
点击查看摘要
Abstract:Cochlear Implant (CI) surgery treats severe hearing loss by inserting an electrode array into the cochlea to stimulate the auditory nerve. An important step in this procedure is mastoidectomy, which removes part of the mastoid region of the temporal bone to provide surgical access. Accurate mastoidectomy shape prediction from preoperative imaging improves pre-surgical planning, reduces risks, and enhances surgical outcomes. Despite its importance, there are limited deep-learning-based studies regarding this topic due to the challenges of acquiring ground-truth labels. We address this gap by investigating self-supervised and weakly-supervised learning models to predict the mastoidectomy region without human annotations. We propose a hybrid self-supervised and weakly-supervised learning framework to predict the mastoidectomy region directly from preoperative CT scans, where the mastoid remains intact. Our hybrid method achieves a mean Dice score of 0.72 when predicting the complex and boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance. The method provides groundwork for constructing 3D postmastoidectomy surfaces directly from the corresponding preoperative CT scans. To our knowledge, this is the first work that integrating self-supervised and weakly-supervised learning for mastoidectomy shape prediction, offering a robust and efficient solution for CI surgical planning while leveraging 3D T-distribution loss in weakly-supervised medical imaging.
78. 【2601.04404】3D-Agent:Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation
链接:https://arxiv.org/abs/2601.04404
作者:Jusheng Zhang,Yijia Fan,Zimo Wen,Jian Wang,Keze Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Objaverse LVIS Objaverse, propose Tri MARF, Tri MARF consists, Tri MARF substantially, annotation Tri MARF
备注: Accepted at NeurIPS 2025
点击查看摘要
Abstract:Driven by applications in autonomous driving robotics and augmented reality 3D object annotation presents challenges beyond 2D annotation including spatial complexity occlusion and viewpoint inconsistency Existing approaches based on single models often struggle to address these issues effectively We propose Tri MARF a novel framework that integrates tri modal inputs including 2D multi view images textual descriptions and 3D point clouds within a multi agent collaborative architecture to enhance large scale 3D annotation Tri MARF consists of three specialized agents a vision language model agent for generating multi view descriptions an information aggregation agent for selecting optimal descriptions and a gating agent that aligns textual semantics with 3D geometry for refined captioning Extensive experiments on Objaverse LVIS Objaverse XL and ABO demonstrate that Tri MARF substantially outperforms existing methods achieving a CLIPScore of 88 point 7 compared to prior state of the art methods retrieval accuracy of 45 point 2 and 43 point 8 on ViLT R at 5 and a throughput of up to 12000 objects per hour on a single NVIDIA A100 GPU
79. 【2601.04397】Performance Analysis of Image Classification on Bangladeshi Datasets
链接:https://arxiv.org/abs/2601.04397
作者:Mohammed Sami Khan,Fabiha Muniat,Rowzatul Zannat
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Convolutional Neural Networks, Convolutional Neural, Neural Networks, demonstrated remarkable success, custom CNN
备注:
点击查看摘要
Abstract:Convolutional Neural Networks (CNNs) have demonstrated remarkable success in image classification tasks; however, the choice between designing a custom CNN from scratch and employing established pre-trained architectures remains an important practical consideration. In this work, we present a comparative analysis of a custom-designed CNN and several widely used deep learning architectures, including VGG-16, ResNet-50, and MobileNet, for an image classification task. The custom CNN is developed and trained from scratch, while the popular architectures are employed using transfer learning under identical experimental settings. All models are evaluated using standard performance metrics such as accuracy, precision, recall, and F1-score. Experimental results show that pre-trained CNN architectures consistently outperform the custom CNN in terms of classification accuracy and convergence speed, particularly when training data is limited. However, the custom CNN demonstrates competitive performance with significantly fewer parameters and reduced computational complexity. This study highlights the trade-offs between model complexity, performance, and computational efficiency, and provides practical insights into selecting appropriate CNN architectures for image classification problems.
80. 【2601.04382】In-SRAM Radiant Foam Rendering on a Graph Processor
链接:https://arxiv.org/abs/2601.04382
作者:Zulkhuu Tuya,Ignacio Alzugaray,Nicholas Fry,Andrew J. Davison
类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
关键词:single large device, small local SRAM, lightweight cores, large device memory, replace a single
备注: 24 pages, 26 figures
点击查看摘要
Abstract:Many emerging many-core accelerators replace a single large device memory with hundreds to thousands of lightweight cores, each owning only a small local SRAM and exchanging data via explicit on-chip communication. This organization offers high aggregate bandwidth, but it breaks a key assumption behind many volumetric rendering techniques: that rays can randomly access a large, unified scene representation. Rendering efficiently on such hardware therefore requires distributing both data and computation, keeping ray traversal mostly local, and structuring communication into predictable routes. We present a fully in-SRAM, distributed renderer for the \emph{Radiant Foam} Voronoi-cell volumetric representation on the Graphcore Mk2 IPU, a many-core accelerator with tile-local SRAM and explicit inter-tile communication. Our system shards the scene across tiles and forwards rays between shards through a hierarchical routing overlay, enabling ray marching entirely from on-chip SRAM with predictable communication. On Mip-NeRF~360 scenes, the system attains near-interactive throughput (\(\approx\)1\,fps at \mbox{$640\times480$}) with image and depth quality close to the original GPU-based Radiant Foam implementation, while keeping all scene data and ray state in on-chip SRAM. Beyond demonstrating feasibility, we analyze routing, memory, and scheduling bottlenecks that inform how future distributed-memory accelerators can better support irregular, data-movement-heavy rendering workloads.
Comments:
24 pages, 26 figures
Subjects:
Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2601.04382 [cs.GR]
(or
arXiv:2601.04382v1 [cs.GR] for this version)
https://doi.org/10.48550/arXiv.2601.04382
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
81. 【2601.04381】Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection
链接:https://arxiv.org/abs/2601.04381
作者:Maxim Clouser,Kia Khezeli,John Kalantari
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:safety-critical applications rely, synthetic aperture radar, RGB, aperture radar, Foundation models
备注:
点击查看摘要
Abstract:Foundation models for vision are predominantly trained on RGB data, while many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator using only a few co-measured examples, and whether the resulting synthetic data can enhance downstream detection. Starting from FLUX.1 Kontext, we insert low-rank adaptation (LoRA) modules and fine-tune them on just 100 paired images per domain for two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. The adapted model translates RGB images into pixel-aligned IR/SAR, enabling us to reuse existing bounding boxes and train object detection models purely in the target modality. Across a grid of LoRA hyperparameters, we find that LPIPS computed on only 50 held-out pairs is a strong proxy for downstream performance: lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR test data. Using the best LPIPS-selected LoRA adapter, synthetic IR from external RGB datasets (LLVIP, FLIR ADAS) improves KAIST IR pedestrian detection, and synthetic SAR significantly boosts infrastructure detection on M4-SAR when combined with limited real SAR. Our results suggest that few-shot LoRA adaptation of flow-matching foundation models is a promising path toward foundation-style support for non-visible modalities.
82. 【2601.04378】Aligned explanations in neural networks
链接:https://arxiv.org/abs/2601.04378
作者:Corentin Lobet,Francesca Chiaromonte
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
关键词:explaining deep neural, dominant paradigm, paradigm for explaining, Feature attribution, deep neural networks
备注:
点击查看摘要
Abstract:Feature attribution is the dominant paradigm for explaining deep neural networks. However, most existing methods only loosely reflect the model's prediction-making process, thereby merely white-painting the black box. We argue that explanatory alignment is a key aspect of trustworthiness in prediction tasks: explanations must be directly linked to predictions, rather than serving as post-hoc rationalizations. We present model readability as a design principle enabling alignment, and PiNets as a modeling framework to pursue it in a deep learning context. PiNets are pseudo-linear networks that produce instance-wise linear predictions in an arbitrary feature space, making them linearly readable. We illustrate their use on image classification and segmentation tasks, demonstrating how PiNets produce explanations that are faithful across multiple criteria in addition to alignment.
83. 【2601.04376】Combining facial videos and biosignals for stress estimation during driving
链接:https://arxiv.org/abs/2601.04376
作者:Paraskevi Valergaki,Vassilis C. Nicodemou,Iason Oikonomidis,Antonis Argyros,Anastasios Roussos
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:voluntary facial control, Facial Action Units, Reliable stress recognition, stress subjective nature, Reliable stress
备注: UNDER SUBMISSION TO ICPR 2026
点击查看摘要
Abstract:Reliable stress recognition from facial videos is challenging due to stress's subjective nature and voluntary facial control. While most methods rely on Facial Action Units, the role of disentangled 3D facial geometry remains underexplored. We address this by analyzing stress during distracted driving using EMOCA-derived 3D expression and pose coefficients. Paired hypothesis tests between baseline and stressor phases reveal that 41 of 56 coefficients show consistent, phase-specific stress responses comparable to physiological markers. Building on this, we propose a Transformer-based temporal modeling framework and assess unimodal, early-fusion, and cross-modal attention strategies. Cross-Modal Attention fusion of EMOCA and physiological signals achieves best performance (AUROC 92\%, Accuracy 86.7\%), with EMOCA-gaze fusion also competitive (AUROC 91.8\%). This highlights the effectiveness of temporal modeling and cross-modal attention for stress recognition.
84. 【2601.04359】PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache
链接:https://arxiv.org/abs/2601.04359
作者:Kunyang Li,Mubarak Shah,Yuzhang Shang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:diverse multimodal tasks, addresses diverse multimodal, shared token space, Transformer-based framework, multimodal tasks
备注:
点击查看摘要
Abstract:A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, the final four frames - the portion most impacted by the progressively expanding KV-cache and thus the most expensive segment of the clip - PackCache delivers a 2.6x and 3.7x acceleration on A40 and H200, respectively, for 48-frame videos.
85. 【2601.04356】UNIC: Learning Unified Multimodal Extrinsic Contact Estimation
链接:https://arxiv.org/abs/2601.04356
作者:Zhengtong Xu,Yuki Shirai
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:provide essential contextual, essential contextual information, requires reliable estimation, extrinsic contacts-the interactions, manipulation requires reliable
备注:
点击查看摘要
Abstract:Contact-rich manipulation requires reliable estimation of extrinsic contacts-the interactions between a grasped object and its environment which provide essential contextual information for planning, control, and policy learning. However, existing approaches often rely on restrictive assumptions, such as predefined contact types, fixed grasp configurations, or camera calibration, that hinder generalization to novel objects and deployment in unstructured environments. In this paper, we present UNIC, a unified multimodal framework for extrinsic contact estimation that operates without any prior knowledge or camera calibration. UNIC directly encodes visual observations in the camera frame and integrates them with proprioceptive and tactile modalities in a fully data-driven manner. It introduces a unified contact representation based on scene affordance maps that captures diverse contact formations and employs a multimodal fusion mechanism with random masking, enabling robust multimodal representation learning. Extensive experiments demonstrate that UNIC performs reliably. It achieves a 9.6 mm average Chamfer distance error on unseen contact locations, performs well on unseen objects, remains robust under missing modalities, and adapts to dynamic camera viewpoints. These results establish extrinsic contact estimation as a practical and versatile capability for contact-rich manipulation.
86. 【2601.04352】Comparative Analysis of Custom CNN Architectures versus Pre-trained Models and Transfer Learning: A Study on Five Bangladesh Datasets
链接:https://arxiv.org/abs/2601.04352
作者:Ibrahim Tanvir(University of Dhaka),Alif Ruslan(University of Dhaka),Sartaj Solaiman(University of Dhaka)
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Convolutional Neural Networks, custom-built Convolutional Neural, Neural Networks, Convolutional Neural, Auto Rickshaw Detection
备注:
点击查看摘要
Abstract:This study presents a comprehensive comparative analysis of custom-built Convolutional Neural Networks (CNNs) against popular pre-trained architectures (ResNet-18 and VGG-16) using both feature extraction and transfer learning approaches. We evaluated these models across five diverse image classification datasets from Bangladesh: Footpath Vision, Auto Rickshaw Detection, Mango Image Classification, Paddy Variety Recognition, and Road Damage Detection. Our experimental results demonstrate that transfer learning with fine-tuning consistently outperforms both custom CNNs built from scratch and feature extraction methods, achieving accuracy improvements ranging from 3% to 76% across different datasets. Notably, ResNet-18 with fine-tuning achieved perfect 100% accuracy on the Road Damage BD dataset. While custom CNNs offer advantages in model size (3.4M parameters vs. 11-134M for pre-trained models) and training efficiency on simpler tasks, pre-trained models with transfer learning provide superior performance, particularly on complex classification tasks with limited training data. This research provides practical insights for practitioners in selecting appropriate deep learning approaches based on dataset characteristics, computational resources, and performance requirements.
87. 【2601.04348】SCAR-GS: Spatial Context Attention for Residuals in Progressive Gaussian Splatting
链接:https://arxiv.org/abs/2601.04348
作者:Diego Revilla,Pooja Suresh,Anand Bhojan,Ooi Wei Tsang
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:allowed for real-time, high-fidelity novel view, view synthesis, Gaussian Splatting, Recent advances
备注:
点击查看摘要
Abstract:Recent advances in 3D Gaussian Splatting have allowed for real-time, high-fidelity novel view synthesis. Nonetheless, these models have significant storage requirements for large and medium-sized scenes, hindering their deployment over cloud and streaming services. Some of the most recent progressive compression techniques for these models rely on progressive masking and scalar quantization techniques to reduce the bitrate of Gaussian attributes using spatial context models. While effective, scalar quantization may not optimally capture the correlations of high-dimensional feature vectors, which can potentially limit the rate-distortion performance. In this work, we introduce a novel progressive codec for 3D Gaussian Splatting that replaces traditional methods with a more powerful Residual Vector Quantization approach to compress the primitive features. Our key contribution is an auto-regressive entropy model, guided by a multi-resolution hash grid, that accurately predicts the conditional probability of each successive transmitted index, allowing for coarse and refinement layers to be compressed with high efficiency.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Cite as:
arXiv:2601.04348 [cs.CV]
(or
arXiv:2601.04348v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2601.04348
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
88. 【2601.04342】ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
链接:https://arxiv.org/abs/2601.04342
作者:Mohsen Ghafoorian,Amirhossein Habibian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent advances, severely limits scalability, transformer-based architectures, longer sequences, quadratic attention complexity
备注:
点击查看摘要
Abstract:Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while being competitive in the quality. Our light-weight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at this https URL.
89. 【2601.04339】Unified Text-Image Generation with Weakness-Targeted Post-Training
链接:https://arxiv.org/abs/2601.04339
作者:Jiahui Chen,Philippe Hansen-Estruch,Xiaochuang Han,Yushi Hu,Emily Dinan,Amita Kamath,Michal Drozdzal,Reyhane Askari-Hemmat,Luke Zettlemoyer,Marjan Ghazvininejad
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:jointly produce text, architectures that jointly, jointly produce, recently emerged, promising direction
备注:
点击查看摘要
Abstract:Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training with fully self-generated synthetic data, our approach enables improvements in multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and strategically designed post-training data.
90. 【2601.04302】Embedding Textual Information in Images Using Quinary Pixel Combinations
链接:https://arxiv.org/abs/2601.04302
作者:A V Uday Kiran Kandala
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:pixel intensity combinations, pixel intensity, Region based methods, RGB space, pixel
备注:
点击查看摘要
Abstract:This paper presents a novel technique for embedding textual data into images using quinary combinations of pixel intensities in RGB space. Existing methods predominantly rely on least and most significant bit (LSB MSB) manipulation, Pixel Value Differencing (PVD), spatial perturbations in RGB channels, transform domain based methods, Quantization methods, Edge and Region based methods and more recently through deep learning methods and generative AI techniques for hiding textual information in spatial domain of images. Most of them are dependent on pixel intensity flipping over multiple pixels, such as LSB and combination of LSB based methodologies, and on transform coefficients, often resulting in the form of noise. Encoding and Decoding are deterministic in most of the existing approaches and are computationally heavy in case of higher models such as deep learning and gen AI approaches. The proposed method works on quinary pixel intensity combinations in RGB space, where five controlled different pixel intensity variations in each of the R, G, and B channels formulate up to one hundred and twenty five distinct pixel intensity combinations. These combinations are mapped to textual symbols, enabling the representation of uppercase and lowercase alphabetic characters, numeric digits, whitespace, and commonly used special characters. Different metrics such as MSE, MAE, SNR, PSNR, SSIM, Histogram Comparison and Heatmap analysis, were evaluated for both original and encoded images resulting in no significant distortion in the images. Furthermore, the method achieves improved embedding efficiency by encoding a complete textual symbol within a single RGB pixel, in contrast to LSB and MSB based approaches that typically require multiple pixels or multi-step processes, as well as transform and learning based methods that incur higher computational overhead.
91. 【2601.04300】Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes
链接:https://arxiv.org/abs/2601.04300
作者:Chenye Meng,Zejian Li,Zhongni Liu,Yize Li,Changle Xie,Kaixin Jia,Ling Yang,Huanghuang Deng,Shiying Ding,Shengyuan Zhang,Jiayi Li,Lingyun Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Post-training alignment, simplified signals, relies on simplified, scalar rewards, rewards or binary
备注:
点击查看摘要
Abstract:Post-training alignment of diffusion models relies on simplified signals, such as scalar rewards or binary preferences. This limits alignment with complex human expertise, which is hierarchical and fine-grained. To address this, we first construct a hierarchical, fine-grained evaluation criteria with domain experts, which decomposes image quality into multiple positive and negative attributes organized in a tree structure. Building on this, we propose a two-stage alignment framework. First, we inject domain knowledge to an auxiliary diffusion model via Supervised Fine-Tuning. Second, we introduce Complex Preference Optimization (CPO) that extends DPO to align the target diffusion to our non-binary, hierarchical criteria. Specifically, we reformulate the alignment problem to simultaneously maximize the probability of positive attributes while minimizing the probability of negative attributes with the auxiliary diffusion. We instantiate our approach in the domain of painting generation and conduct CPO training with an annotated dataset of painting with fine-grained attributes based on our criteria. Extensive experiments demonstrate that CPO significantly enhances generation quality and alignment with expertise, opening new avenues for fine-grained criteria alignment.
92. 【2601.04297】ArtCognition: A Multimodal AI Framework for Affective State Sensing from Visual and Kinematic Drawing Cues
链接:https://arxiv.org/abs/2601.04297
作者:Behrad Binaei-Haghighi,Nafiseh Sadat Sajadi,Mehrad Liviyan,Reyhane Akhavan Kharazi,Fatemeh Amirkhani,Behnam Bahrak
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
关键词:non-verbal channels, human affective, significant challenge, behavioral kinematic cues, psychological
备注: 12 pages, 7 figures
点击查看摘要
Abstract:The objective assessment of human affective and psychological states presents a significant challenge, particularly through non-verbal channels. This paper introduces digital drawing as a rich and underexplored modality for affective sensing. We present a novel multimodal framework, named ArtCognition, for the automated analysis of the House-Tree-Person (HTP) test, a widely used psychological instrument. ArtCognition uniquely fuses two distinct data streams: static visual features from the final artwork, captured by computer vision models, and dynamic behavioral kinematic cues derived from the drawing process itself, such as stroke speed, pauses, and smoothness. To bridge the gap between low-level features and high-level psychological interpretation, we employ a Retrieval-Augmented Generation (RAG) architecture. This grounds the analysis in established psychological knowledge, enhancing explainability and reducing the potential for model hallucination. Our results demonstrate that the fusion of visual and behavioral kinematic cues provides a more nuanced assessment than either modality alone. We show significant correlations between the extracted multimodal features and standardized psychological metrics, validating the framework's potential as a scalable tool to support clinicians. This work contributes a new methodology for non-intrusive affective state assessment and opens new avenues for technology-assisted mental healthcare.
93. 【2601.04203】FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback
链接:https://arxiv.org/abs/2601.04203
作者:Xueqing Wu,Zihan Xue,Da Yin,Shuyan Zhou,Kai-Wei Chang,Nanyun Peng,Yeming Wen
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
关键词:conversational code generation, code generation, front-end code generation, front-end development, pioneers the study
备注:
点击查看摘要
Abstract:We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated creenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework leveraging a web agent to simulate users and explore the website, and thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at this https URL
94. 【2601.03323】Listen to Rhythm, Choose Movements: Autoregressive Multimodal Dance Generation via Diffusion and Mamba with Decoupled Dance Dataset
链接:https://arxiv.org/abs/2601.03323
作者:Oran Duan,Yinghua Shen,Yingzhu Lv,Luyang Jie,Yaxin Liu,Qiong Wu
类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
关键词:greatly promoted research, coarse semantic control, dance motion generation, Advances in generative, autoregressive dance motion
备注: 12 pages, 13 figures
点击查看摘要
Abstract:Advances in generative models and sequence learning have greatly promoted research in dance motion generation, yet current methods still suffer from coarse semantic control and poor coherence in long sequences. In this work, we present Listen to Rhythm, Choose Movements (LRCM), a multimodal-guided diffusion framework supporting both diverse input modalities and autoregressive dance motion generation. We explore a feature decoupling paradigm for dance datasets and generalize it to the Motorica Dance dataset, separating motion capture data, audio rhythm, and professionally annotated global and local text descriptions. Our diffusion architecture integrates an audio-latent Conformer and a text-latent Cross-Conformer, and incorporates a Motion Temporal Mamba Module (MTMM) to enable smooth, long-duration autoregressive synthesis. Experimental results indicate that LRCM delivers strong performance in both functional capability and quantitative metrics, demonstrating notable potential in multimodal input scenarios and extended sequence generation. We will release the full codebase, dataset, and pretrained models publicly upon acceptance.
95. 【2601.05063】Quantitative mapping from conventional MRI using self-supervised physics-guided deep learning: applications to a large-scale, clinically heterogeneous dataset
链接:https://arxiv.org/abs/2601.05063
作者:Jelmer van Lune,Stefano Mandija,Oscar van der Heide,Matteo Maspero,Martin B. Schilder,Jan Willem Dankbaar,Cornelis A.T. van den Berg,Alessandro Sbrizzi
类目:Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Magnetic resonance imaging, provide qualitative information, qualitative information heavily, information heavily dependent, MRIs provide qualitative
备注: 30 pages, 13 figures, full paper
点击查看摘要
Abstract:Magnetic resonance imaging (MRI) is a cornerstone of clinical neuroimaging, yet conventional MRIs provide qualitative information heavily dependent on scanner hardware and acquisition settings. While quantitative MRI (qMRI) offers intrinsic tissue parameters, the requirement for specialized acquisition protocols and reconstruction algorithms restricts its availability and impedes large-scale biomarker research. This study presents a self-supervised physics-guided deep learning framework to infer quantitative T1, T2, and proton-density (PD) maps directly from widely available clinical conventional T1-weighted, T2-weighted, and FLAIR MRIs. The framework was trained and evaluated on a large-scale, clinically heterogeneous dataset comprising 4,121 scan sessions acquired at our institution over six years on four different 3 T MRI scanner systems, capturing real-world clinical variability. The framework integrates Bloch-based signal models directly into the training objective. Across more than 600 test sessions, the generated maps exhibited white matter and gray matter values consistent with literature ranges. Additionally, the generated maps showed invariance to scanner hardware and acquisition protocol groups, with inter-group coefficients of variation $\leq$ 1.1%. Subject-specific analyses demonstrated excellent voxel-wise reproducibility across scanner systems and sequence parameters, with Pearson $r$ and concordance correlation coefficients exceeding 0.82 for T1 and T2. Mean relative voxel-wise differences were low across all quantitative parameters, especially for T2 ($$ 6%). These results indicate that the proposed framework can robustly transform diverse clinical conventional MRI data into quantitative maps, potentially paving the way for large-scale quantitative biomarker research.
96. 【2601.05020】Scalable neural pushbroom architectures for real-time denoising of hyperspectral images onboard satellites
链接:https://arxiv.org/abs/2601.05020
作者:Ziyao Yi,Davide Piccinini,Diego Valsesia,Tiziano Bianchi,Enrico Magli
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Earth observation satellites, generation of Earth, Earth observation, deploy intelligent models, intelligent models directly
备注:
点击查看摘要
Abstract:The next generation of Earth observation satellites will seek to deploy intelligent models directly onboard the payload in order to minimize the latency incurred by the transmission and processing chain of the ground segment, for time-critical applications. Designing neural architectures for onboard execution, particularly for satellite-based hyperspectral imagers, poses novel challenges due to the unique constraints of this environment and imaging system that are largely unexplored by the traditional computer vision literature. In this paper, we show that this setting requires addressing three competing objectives, namely high-quality inference with low complexity, dynamic power scalability and fault tolerance. We focus on the problem of hyperspectral image denoising, which is a critical task to enable effective downstream inference, and highlights the constraints of the onboard processing scenario. We propose a neural network design that addresses the three aforementioned objectives with several novel contributions. In particular, we propose a mixture of denoisers that can be resilient to radiation-induced faults as well as allowing for time-varying power scaling. Moreover, each denoiser employs an innovative architecture where an image is processed line-by-line in a causal way, with a memory of past lines, in order to match the acquisition process of pushbroom hyperspectral sensors and greatly limit memory requirements. We show that the proposed architecture can run in real-time, i.e., process one line in the time it takes to acquire the next one, on low-power hardware and provide competitive denoising quality with respect to significantly more complex state-of-the-art models. We also show that the power scalability and fault tolerance objectives provide a design space with multiple tradeoffs between those properties and denoising quality.
97. 【2601.04825】Illumination Angular Spectrum Encoding for Controlling the Functionality of Diffractive Networks
链接:https://arxiv.org/abs/2601.04825
作者:Matan Kleiner,Lior Michaeli,Tomer Michaeli
类目:Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:recently emerged, illumination angular spectrum, illumination, illumination angular, Diffractive neural networks
备注: Project's code [this https URL](https://github.com/matankleiner/Angular-Spectrum-Encoding)
点击查看摘要
Abstract:Diffractive neural networks have recently emerged as a promising framework for all-optical computing. However, these networks are typically trained for a single task, limiting their potential adoption in systems requiring multiple functionalities. Existing approaches to achieving multi-task functionality either modify the mechanical configuration of the network per task or use a different illumination wavelength or polarization state for each task. In this work, we propose a new control mechanism, which is based on the illumination's angular spectrum. Specifically, we shape the illumination using an amplitude mask that selectively controls its angular spectrum. We employ different illumination masks for achieving different network functionalities, so that the mask serves as a unique task encoder. Interestingly, we show that effective control can be achieved over a very narrow angular range, within the paraxial regime. We numerically illustrate the proposed approach by training a single diffractive network to perform multiple image-to-image translation tasks. In particular, we demonstrate translating handwritten digits into typeset digits of different values, and translating handwritten English letters into typeset numbers and typeset Greek letters, where the type of the output is determined by the illumination's angular components. As we show, the proposed framework can work under different coherence conditions, and can be combined with existing control strategies, such as different wavelengths. Our results establish the illumination angular spectrum as a powerful degree of freedom for controlling diffractive networks, enabling a scalable and versatile framework for multi-task all-optical computing.
98. 【2601.04370】End-to-end differentiable design of geometric waveguide displays
链接:https://arxiv.org/abs/2601.04370
作者:Xinge Yang,Zhaocheng Liu,Zhaoyu Nie,Qingyuan Fan,Zhimin Shi,Jim Bonar,Wolfgang Heidrich
类目:Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:augmented reality displays, see-through augmented reality, polarization-dependent multilayer thin-film, non-sequential Monte Carlo, Monte Carlo
备注:
点击查看摘要
Abstract:Geometric waveguides are a promising architecture for optical see-through augmented reality displays, but their performance is severely bottlenecked by the difficulty of jointly optimizing non-sequential light transport and polarization-dependent multilayer thin-film coatings. Here we present the first end-to-end differentiable optimization framework for geometric waveguide that couples non-sequential Monte Carlo polarization ray tracing with a differentiable transfer-matrix thin-film solver. A differentiable Monte Carlo ray tracer avoids the exponential growth of deterministic ray splitting while enabling gradients backpropagation from eyebox metrics to design parameters. With memory-saving strategies, we optimize more than one thousand layer-thickness parameters and billions of non-sequential ray-surface intersections on a single multi-GPU workstation. Automated layer pruning is achieved by starting from over-parameterized stacks and driving redundant layers to zero thickness under discrete manufacturability constraints, effectively performing topology optimization to discover optimal coating structures. On a representative design, starting from random initialization within thickness bounds, our method increases light efficiency from 4.1\% to 33.5\% and improves eyebox and FoV uniformity by $\sim$17$\times$ and $\sim$11$\times$, respectively. Furthermore, we jointly optimize the waveguide and an image preprocessing network to improve perceived image quality. Our framework not only enables system-level, high-dimensional coating optimization inside the waveguide, but also expands the scope of differentiable optics for next-generation optical design.




