本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新1105篇论文,其中:
- 自然语言处理184篇
- 信息检索44篇
- 计算机视觉253篇
自然语言处理
1. 【2510.23606】Variational Masked Diffusion Models
链接:https://arxiv.org/abs/2510.23606
作者:Yichi Zhang,Alex Schwing,Zhizhen Zhao
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:discrete generative modeling, Masked diffusion, generative modeling, Masked, recently emerged
备注: Project Page: [this https URL](https://riccizz.github.io/VMD)
点击查看摘要
Abstract:Masked diffusion models have recently emerged as a flexible framework for discrete generative modeling. However, a key limitation of standard masked diffusion is its inability to effectively capture dependencies among tokens that are predicted concurrently, leading to degraded generation quality when dependencies among tokens are important. To explicitly model dependencies among tokens, we propose Variational Masked Diffusion (VMD), a framework that introduces latent variables into the masked diffusion process. Through controlled experiments on synthetic datasets, we demonstrate that VMD successfully learns dependencies that conventional masked diffusion fails to capture. We further validate the effectiveness of our approach on Sudoku puzzles and text datasets, where learning of dependencies among tokens improves global consistency. Across these domains, VMD enhances both generation quality and dependency awareness, highlighting the value of integrating variational inference into masked diffusion. Our code is available at: this https URL.
2. 【2510.23596】hink Twice: Branch-and-Rethink Reasoning Reward Model
链接:https://arxiv.org/abs/2510.23596
作者:Yizhu Jiao,Jiaqi Zeng,Julien Veron Vialard,Oleksii Kuchaiev,Jiawei Han,Olivier Delalleau
类目:Computation and Language (cs.CL)
关键词:extra test-time compute, elicit stronger reasoning, externalize intermediate steps, allocate extra test-time, Large language models
备注:
点击查看摘要
Abstract:Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-oncescoringintofocused, second-lookreasoning, BR-RMreducesjudgmentdiffusionandimproves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.
3. 【2510.23585】Hope Speech Detection in Social Media English Corpora: Performance of Traditional and Transformer Models
链接:https://arxiv.org/abs/2510.23585
作者:Luis Ramos,Hiram Calvo,Olga Kolesnikova
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:promised NLP task, social media platforms, NLP task, promised NLP, detect motivational expressions
备注:
点击查看摘要
Abstract:The identification of hope speech has become a promised NLP task, considering the need to detect motivational expressions of agency and goal-directed behaviour on social media platforms. This proposal evaluates traditional machine learning models and fine-tuned transformers for a previously split hope speech dataset as train, development and test set. On development test, a linear-kernel SVM and logistic regression both reached a macro-F1 of 0.78; SVM with RBF kernel reached 0.77, and Naïve Bayes hit 0.75. Transformer models delivered better results, the best model achieved weighted precision of 0.82, weighted recall of 0.80, weighted F1 of 0.79, macro F1 of 0.79, and 0.80 accuracy. These results suggest that while optimally configured traditional machine learning models remain agile, transformer architectures detect some subtle semantics of hope to achieve higher precision and recall in hope speech detection, suggesting that larges transformers and LLMs could perform better in small datasets.
4. 【2510.23564】ReCode: Unify Plan and Action for Universal Granularity Control
链接:https://arxiv.org/abs/2510.23564
作者:Zhaoyang Yu,Jiayi Zhang,Huixue Su,Yufan Zhao,Yifan Wu,Mingyi Deng,Jinyu Xiang,Yizhang Lin,Lingxiao Tang,Yingchao Li,Yuyu Luo,Bang Liu,Chenglin Wu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Real-world tasks require, Real-world tasks, tasks require decisions, Large Language Model, current Large Language
备注:
点击查看摘要
Abstract:Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at this https URL.
5. 【2510.23558】ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models
链接:https://arxiv.org/abs/2510.23558
作者:Bohan Li,Wenbin Huang,Yuhang Qiu,Yiwei Guo,Hankun Wang,Zhihan Li,Jing Peng,Ziyang Ma,Xie Chen,Kai Yu
类目:ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
关键词:large language models, Large Audio Language, large language, couple acoustic perception, understand diverse information
备注: submitted to icassp 2026
点击查看摘要
Abstract:Large Audio Language Models (LALMs), which couple acoustic perception with large language models (LLMs) to extract and understand diverse information from audio, have attracted intense interest from both academic and industrial communities. However, existing LALMs are highly sensitive to how instructions are phrased, affecting both (i) instruction-following rates and (ii) task performance. Yet, no existing benchmarks offer a systematic and comprehensive evaluation of this sensitivity. We introduce ISA-Bench, a dynamic benchmark evaluating instruction sensitivity for LALMs along three axes: instruction description, output format, and task composition. We assess recent open-source and proprietary LALMs using ISA-Bench, profiling both compliance and accuracy under controlled instruction variations. Experimental results reveal that even state-of-the-art LALMs suffer significant instruction sensitivity, leading to degraded performance on fundamental audio understanding tasks. To mitigate this issue, we fine-tune Qwen2-Audio on a specifically constructed complex instruction-variant dataset, achieving a marked improvement in instruction-following performance. However, this also induces nontrivial catastrophic forgetting: the model loses some previously mastered task capabilities when exposed to new instruction styles. Our benchmark provides a standardized basis for assessing and improving instruction sensitivity in LALMs, underscoring the need for instruction-robust audio understanding in real-world pipelines.
6. 【2510.23554】A U-Net and Transformer Pipeline for Multilingual Image Translation
链接:https://arxiv.org/abs/2510.23554
作者:Siddharth Sahay,Radhika Agarwal
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Neural Machine Translation, Neural Machine, Machine Translation, Transformer for Neural, Tesseract engine
备注: 6 pages, 3 figures, 5 tables, and 2 algorithms. Prepared in IEEE double-column format
点击查看摘要
Abstract:This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset , to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted text is fed into a custom Transformer model trained from scratch on a multilingual parallel corpus spanning 5 languages. Unlike systems reliant on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on its text detection accuracy, text recognition quality, and translation performance via BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.
7. 【2510.23544】LimRank: Less is More for Reasoning-Intensive Information Reranking
链接:https://arxiv.org/abs/2510.23544
作者:Tingyu Song,Yilun Zhao,Siyue Zhang,Chen Zhao,Arman Cohan
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Existing approaches typically, Existing approaches, approaches typically rely, computationally expensive, rely on large-scale
备注: EMNLP 2025 Main (Short)
点击查看摘要
Abstract:Existing approaches typically rely on large-scale fine-tuning to adapt LLMs for information reranking tasks, which is computationally expensive. In this work, we demonstrate that modern LLMs can be effectively adapted using only minimal, high-quality supervision. To enable this, we design LIMRANK-SYNTHESIZER, a reusable and open-source pipeline for generating diverse, challenging, and realistic reranking examples. Using this synthetic data, we fine-tune our reranker model, LIMRANK. We evaluate LIMRANK on two challenging benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and FollowIR for instruction-following retrieval. Our experiments demonstrate that LIMRANK achieves competitive performance, while being trained on less than 5% of the data typically used in prior work. Further ablation studies demonstrate the effectiveness of LIMRANK-SYNTHESIZER and the strong generalization capabilities of LIMRANK across downstream tasks, including scientific literature search and retrieval-augmented generation for knowledge-intensive problem solving.
8. 【2510.23538】JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
链接:https://arxiv.org/abs/2510.23538
作者:Qiushi Sun,Jingyang Gong,Yang Liu,Qiaosheng Chen,Lei Li,Kai Chen,Qipeng Guo,Ben Kao,Fei Yuan
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
关键词:rich visual outputs, neural code intelligence, text-based source code, programs generate, scope of neural
备注: Work in progress
点击查看摘要
Abstract:The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints will are available at this https URL.
9. 【2510.23536】IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering
链接:https://arxiv.org/abs/2510.23536
作者:Jieyong Kim,Maryam Amirizaniani,Soojin Yoon,Dongha Lee
类目:Computation and Language (cs.CL)
关键词:personalized question answering, Intent identification serves, Intent identification, core Intent identification, core intents
备注:
点击查看摘要
Abstract:Intent identification serves as the foundation for generating appropriate responses in personalized question answering (PQA). However, existing benchmarks evaluate only response quality or retrieval performance without directly measuring intent identification capabilities. This gap is critical because without understanding which intents users prioritize, systems cannot generate responses satisfying individual information needs. To address this, we introduce the concept of core intents: intents users prioritize when selecting answers to satisfy their information needs. To evaluate these core intents, we propose IPQA, a benchmark for core Intent identification in Personalized Question Answering. Since users do not explicitly state their prioritized intents, we derive core intents from observable behavior patterns in answer selection, grounded in satisficing theory where users choose answers meeting their acceptance thresholds. We construct a dataset with various domains through systematic filtering, LLM-based annotation, and rigorous quality control combining automated verification with human validation. Experimental evaluations across state-of-the-art language models reveal that current systems struggle with core intent identification in personalized contexts. Models fail to identify core intents from user histories, with performance degrading as question complexity increases. The code and dataset will be made publicly available to facilitate future research in this direction.
10. 【2510.23508】M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset
链接:https://arxiv.org/abs/2510.23508
作者:Jiahui Geng,Jonathan Tonglet,Iryna Gurevych
类目:Computation and Language (cs.CL)
关键词:Existing real-world datasets, Existing real-world, sourcing true claims, multiple limitations, suffer from evidence
备注: Preprint under review. Code and data available at: [this https URL](https://github.com/UKPLab/M4FC)
点击查看摘要
Abstract:Existing real-world datasets for multimodal automated fact-checking have multiple limitations: they contain few instances, focus on only one or two languages and tasks, suffer from evidence leakage, or depend on external sets of news articles for sourcing true claims. To address these shortcomings, we introduce M4FC, a new real-world dataset comprising 4,982 images paired with 6,980 claims. The images, verified by professional fact-checkers from 22 organizations, represent diverse cultural and geographic contexts. Each claim is available in one or two out of ten languages. M4FC spans six multimodal fact-checking tasks: visual claim extraction, claimant intent prediction, fake detection, image contextualization, location verification, and verdict prediction. We provide baseline results for all tasks and analyze how combining intermediate tasks influence downstream verdict prediction performance. We make our dataset and code available.
11. 【2510.23477】MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
链接:https://arxiv.org/abs/2510.23477
作者:Tengchao Yang,Sichen Guo,Mengzhao Jia,Jiaming Su,Yuanyang Liu,Zhihan Zhang,Meng Jiang
类目:Computation and Language (cs.CL)
关键词:diagnosing students' difficulties, Effective math tutoring, Effective math, math tutoring requires, diagnosing students'
备注:
点击查看摘要
Abstract:Effective math tutoring requires not only solving problems but also diagnosing students' difficulties and guiding them step by step. While multimodal large language models (MLLMs) show promise, existing benchmarks largely overlook these tutoring skills. We introduce MMTutorBench, the first benchmark for AI math tutoring, consisting of 685 problems built around pedagogically significant key-steps. Each problem is paired with problem-specific rubrics that enable fine-grained evaluation across six dimensions, and structured into three tasks-Insight Discovery, Operation Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find clear performance gaps between proprietary and open-source systems, substantial room compared to human tutors, and consistent trends across input variants: OCR pipelines degrade tutoring quality, few-shot prompting yields limited gains, and our rubric-based LLM-as-a-Judge proves highly reliable. These results highlight both the difficulty and diagnostic value of MMTutorBench for advancing AI tutoring.
12. 【2510.23464】Evaluating Large Language Models for Stance Detection on Financial Targets from SEC Filing Reports and Earnings Call Transcripts
链接:https://arxiv.org/abs/2510.23464
作者:Nikesh Gyawali,Doina Caragea,Alex Vasenkov,Cornelia Caragea
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Securities and Exchange, Exchange Commission, earnings call transcripts, quarterly earnings call, call transcripts
备注:
点击查看摘要
Abstract:Financial narratives from U.S. Securities and Exchange Commission (SEC) filing reports and quarterly earnings call transcripts (ECTs) are very important for investors, auditors, and regulators. However, their length, financial jargon, and nuanced language make fine-grained analysis difficult. Prior sentiment analysis in the financial domain required a large, expensive labeled dataset, making the sentence-level stance towards specific financial targets challenging. In this work, we introduce a sentence-level corpus for stance detection focused on three core financial metrics: debt, earnings per share (EPS), and sales. The sentences were extracted from Form 10-K annual reports and ECTs, and labeled for stance (positive, negative, neutral) using the advanced ChatGPT-o3-pro model under rigorous human validation. Using this corpus, we conduct a systematic evaluation of modern large language models (LLMs) using zero-shot, few-shot, and Chain-of-Thought (CoT) prompting strategies. Our results show that few-shot with CoT prompting performs best compared to supervised baselines, and LLMs' performance varies across the SEC and ECT datasets. Our findings highlight the practical viability of leveraging LLMs for target-specific stance in the financial domain without requiring extensive labeled data.
13. 【2510.23458】BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
链接:https://arxiv.org/abs/2510.23458
作者:Litu Ou,Kuan Li,Huifeng Yin,Liwen Zhang,Zhongwang Zhang,Xixi Wu,Rui Ye,Zile Qiao,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Confidence, answer reliability, model uncertainty, confidence scores, Abstract
备注: 25 pages
点击查看摘要
Abstract:Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions is limited. In this paper, we investigate whether LLM-based search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task compared to outputting confidence in a single interaction. Experimenting on open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence while having near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encourage the model to try again until reaching a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while demonstrating competitive performance compared to baseline fixed budget TTS methods.
14. 【2510.23451】Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
链接:https://arxiv.org/abs/2510.23451
作者:Zhuoran Jin,Hongbang Yuan,Kejian Zhu,Jiachun Li,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Modality Imbalance, offering limited support, fixed binary preference, preference pairs fails, binary preference pairs
备注: 48 pages, 17 figures
点击查看摘要
Abstract:Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address the above challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.
15. 【2510.23443】A Neuro-Symbolic Multi-Agent Approach to Legal-Cybersecurity Knowledge Integration
链接:https://arxiv.org/abs/2510.23443
作者:Chiara Bonfanti,Alessandro Druetto,Cataldo Basile,Tharindu Ranasinghe,Marcos Zampieri
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
关键词:complex information space, research tools struggle, traditional legal research, legal research tools, connections between cases
备注: 7 pages
点击查看摘要
Abstract:The growing intersection of cybersecurity and law creates a complex information space where traditional legal research tools struggle to deal with nuanced connections between cases, statutes, and technical vulnerabilities. This knowledge divide hinders collaboration between legal experts and cybersecurity professionals. To address this important gap, this work provides a first step towards intelligent systems capable of navigating the increasingly intricate cyber-legal domain. We demonstrate promising initial results on multilingual tasks.
16. 【2510.23396】EMTSF:Extraordinary Mixture of SOTA Models for Time Series Forecasting
链接:https://arxiv.org/abs/2510.23396
作者:Musleh Alharthi,Kaleel Mahmood,Sarosh Patel,Ausif Mahmood
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Natural Language Processing, Transformer architecture, architecture in Natural, Processing has led, Natural Language
备注:
点击查看摘要
Abstract:The immense success of the Transformer architecture in Natural Language Processing has led to its adoption in Time Se ries Forecasting (TSF), where superior performance has been shown. However, a recent important paper questioned their effectiveness by demonstrating that a simple single layer linear model outperforms Transformer-based models. This was soon shown to be not as valid, by a better transformer-based model termed PatchTST. More re cently, TimeLLM demonstrated even better results by repurposing a Large Language Model (LLM) for the TSF domain. Again, a follow up paper challenged this by demonstrating that removing the LLM component or replacing it with a basic attention layer in fact yields better performance. One of the challenges in forecasting is the fact that TSF data favors the more recent past, and is sometimes subject to unpredictable events. Based upon these recent insights in TSF, we propose a strong Mixture of Experts (MoE) framework. Our method combines the state-of-the-art (SOTA) models including xLSTM, en hanced Linear, PatchTST, and minGRU, among others. This set of complimentary and diverse models for TSF are integrated in a Trans former based MoE gating network. Our proposed model outperforms all existing TSF models on standard benchmarks, surpassing even the latest approaches based on MoE frameworks.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2510.23396 [cs.CL]
(or
arXiv:2510.23396v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2510.23396
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Musleh Alharthi [view email] [v1]
Mon, 27 Oct 2025 14:55:30 UTC (394 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled EMTSF:Extraordinary Mixture of SOTA Models for Time Series Forecasting, by Musleh Alharthi and 2 other authorsView PDFHTML (experimental)TeX Source
view license
Current browse context: cs.CL
prev
|
next
new
|
recent
| 2025-10
Change to browse by:
cs
cs.AI
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
17. 【2510.23395】Detecting Religious Language in Climate Discourse
链接:https://arxiv.org/abs/2510.23395
作者:Evy Beijen,Pien Pieterse,Yusuf Çelik,Willem Th. van Peursen,Sandjai Bhulai,Meike Morren
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:ostensibly secular domains, permeate contemporary discourse, climate change debates, Religious language, Religious language continues
备注:
点击查看摘要
Abstract:Religious language continues to permeate contemporary discourse, even in ostensibly secular domains such as environmental activism and climate change debates. This paper investigates how explicit and implicit forms of religious language appear in climate-related texts produced by secular and religious nongovernmental organizations (NGOs). We introduce a dual methodological approach: a rule-based model using a hierarchical tree of religious terms derived from ecotheology literature, and large language models (LLMs) operating in a zero-shot setting. Using a dataset of more than 880,000 sentences, we compare how these methods detect religious language and analyze points of agreement and divergence. The results show that the rule-based method consistently labels more sentences as religious than LLMs. These findings highlight not only the methodological challenges of computationally detecting religious language but also the broader tension over whether religious language should be defined by vocabulary alone or by contextual meaning. This study contributes to digital methods in religious studies by demonstrating both the potential and the limitations of approaches for analyzing how the sacred persists in climate discourse.
18. 【2510.23358】How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes
链接:https://arxiv.org/abs/2510.23358
作者:Sheri Osborn,Rohit Valecha,H. Raghav Rao,Dan Sass,Anthony Rios
类目:Computation and Language (cs.CL)
关键词:Artificial intelligence, effects on employment, intelligence is reshaping, reshaping labor markets, Artificial
备注: 8 pages + Limitations + References
点击查看摘要
Abstract:Artificial intelligence is reshaping labor markets, yet we lack tools to systematically forecast its effects on employment. This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand, especially in occupations affected by AI. Existing research has shown that LLMs can extract sentiment, summarize economic reports, and emulate forecaster behavior, but little work has assessed their use for forward-looking labor prediction. Our benchmark combines two complementary datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption. We format these data into forecasting tasks with clear temporal splits, minimizing the risk of information leakage. We then evaluate LLMs using multiple prompting strategies, comparing task-scaffolded, persona-driven, and hybrid approaches across model families. We assess both quantitative accuracy and qualitative consistency over time. Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends. However, performance varies significantly across sectors and horizons, highlighting the need for domain-aware prompting and rigorous evaluation protocols. By releasing our benchmark, we aim to support future research on labor forecasting, prompt design, and LLM-based economic reasoning. This work contributes to a growing body of research on how LLMs interact with real-world economic data, and provides a reproducible testbed for studying the limits and opportunities of AI as a forecasting tool in the context of labor markets.
19. 【2510.23341】LightKGG: Simple and Efficient Knowledge Graph Generation from Textual Data
链接:https://arxiv.org/abs/2510.23341
作者:Teng Lin
类目:Computation and Language (cs.CL)
关键词:error-prone pattern-matching techniques, resource-intensive large language, methods rely heavily, large language models, remains a critical
备注:
点击查看摘要
Abstract:The scarcity of high-quality knowledge graphs (KGs) remains a critical bottleneck for downstream AI applications, as existing extraction methods rely heavily on error-prone pattern-matching techniques or resource-intensive large language models (LLMs). While recent tools leverage LLMs to generate KGs, their computational demands limit accessibility for low-resource environments. Our paper introduces LightKGG, a novel framework that enables efficient KG extraction from textual data using small-scale language models (SLMs) through two key technical innovations: (1) Context-integrated Graph extraction integrates contextual information with nodes and edges into a unified graph structure, reducing the reliance on complex semantic processing while maintaining more key information; (2) Topology-enhanced relationship inference leverages the inherent topology of the extracted graph to efficiently infer relationships, enabling relationship discovery without relying on complex language understanding capabilities of LLMs. By enabling accurate KG construction with minimal hardware requirements, this work bridges the gap between automated knowledge extraction and practical deployment scenarios while introducing scientifically rigorous methods for optimizing SLM efficiency in structured NLP tasks.
20. 【2510.23340】Planning Ahead with RSA: Efficient Signalling in Dynamic Environments by Projecting User Awareness across Future Timesteps
链接:https://arxiv.org/abs/2510.23340
作者:Anwesha Das,John Duff,Jörg Hoffmann,Vera Demberg
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:rapidly changing environments, improve human-AI collaboration, Rational Speech Act, collaboration on time-sensitive, rapidly changing
备注: 11 pages, 3 figures
点击查看摘要
Abstract:Adaptive agent design offers a way to improve human-AI collaboration on time-sensitive tasks in rapidly changing environments. In such cases, to ensure the human maintains an accurate understanding of critical task elements, an assistive agent must not only identify the highest priority information but also estimate how and when this information can be communicated most effectively, given that human attention represents a zero-sum cognitive resource where focus on one message diminishes awareness of other or upcoming information. We introduce a theoretical framework for adaptive signalling which meets these challenges by using principles of rational communication, formalised as Bayesian reference resolution using the Rational Speech Act (RSA) modelling framework, to plan a sequence of messages which optimise timely alignment between user belief and a dynamic environment. The agent adapts message specificity and timing to the particulars of a user and scenario based on projections of how prior-guided interpretation of messages will influence attention to the interface and subsequent belief update, across several timesteps out to a fixed horizon. In a comparison to baseline methods, we show that this effectiveness depends crucially on combining multi-step planning with a realistic model of user awareness. As the first application of RSA for communication in a dynamic environment, and for human-AI interaction in general, we establish theoretical foundations for pragmatic communication in human-agent teams, highlighting how insights from cognitive science can be capitalised to inform the design of assistive agents.
21. 【2510.23337】BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning
链接:https://arxiv.org/abs/2510.23337
作者:Siyuan Zheng,Pai Liu,Xi Chen,Jizheng Dong,Sihan Jia
类目:Computation and Language (cs.CL)
关键词:contextually coherent personas, handcrafted persona prompts, Human-like virtual characters, current methods rely, methods rely heavily
备注:
点击查看摘要
Abstract:Human-like virtual characters are crucial for games, storytelling, and virtual reality, yet current methods rely heavily on annotated data or handcrafted persona prompts, making it difficult to scale up and generate realistic, contextually coherent personas. We create the first QA dataset for BaZi-based persona reasoning, where real human experiences categorized into wealth, health, kinship, career, and relationships are represented as life-event questions and answers. Furthermore, we propose the first BaZi-LLM system that integrates symbolic reasoning with large language models to generate temporally dynamic and fine-grained virtual personas. Compared with mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, our method achieves a 30.3%-62.6% accuracy improvement. In addition, when incorrect BaZi information is used, our model's accuracy drops by 20%-45%, showing the potential of culturally grounded symbolic-LLM integration for realistic character simulation.
22. 【2510.23334】Adaptive Blockwise Search: Inference-Time Alignment for Large Language Models
链接:https://arxiv.org/abs/2510.23334
作者:Mohammad Atif Quamar,Mohammad Areeb,Nishant Sharma,Ananth Shreekumar,Jonathan Rosenthal,Muslum Ozgur Ozmen,Mikhail Kuznetsov,Z. Berkay Celik
类目:Computation and Language (cs.CL)
关键词:LLM alignment remains, critical challenge, alignment remains, Abstract, alignment
备注:
点击查看摘要
Abstract:LLM alignment remains a critical challenge. Inference-time methods provide a flexible alternative to fine-tuning, but their uniform computational effort often yields suboptimal alignment. We hypothesize that for many alignment tasks, the initial tokens of a response are disproportionately more critical. To leverage this principle, we introduce AdaSearch, a novel blockwise search strategy. It adaptively allocates a fixed computational budget using a sampling schedule, focusing search effort on these critical tokens. We apply AdaSearch to sequential decoding and introduce its tree-search counterpart, AdaBeam. Our comprehensive evaluation across eight LLMs demonstrates that AdaSearch outperforms strong Best-of-N and fine-tuning baselines. Specifically, win-rates improve by over 10% for harmlessness generation, controlled sentiment generation, and for mathematical reasoning tasks relative to Best-of-N.
23. 【2510.23319】Arabic Little STT: Arabic Children Speech Recognition Dataset
链接:https://arxiv.org/abs/2510.23319
作者:Mouhand Alkadri,Dania Desouki,Khloud Al Jallad
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD)
关键词:Artificial Intelligence, systems fundamentally depends, systems fundamentally, fundamentally depends, depends on high-quality
备注:
点击查看摘要
Abstract:The performance of Artificial Intelligence (AI) systems fundamentally depends on high-quality training data. However, low-resource languages like Arabic suffer from severe data scarcity. Moreover, the absence of child-specific speech corpora is an essential gap that poses significant challenges. To address this gap, we present our created dataset, Arabic Little STT, a dataset of Levantine Arabic child speech recorded in classrooms, containing 355 utterances from 288 children (ages 6 - 13). We further conduct a systematic assessment of Whisper, a state-of-the-art automatic speech recognition (ASR) model, on this dataset and compare its performance with adult Arabic benchmarks. Our evaluation across eight Whisper variants reveals that even the best-performing model (Large_v3) struggles significantly, achieving a 0.66 word error rate (WER) on child speech, starkly contrasting with its sub 0.20 WER on adult datasets. These results align with other research on English speech. Results highlight the critical need for dedicated child speech benchmarks and inclusive training data in ASR development. Emphasizing that such data must be governed by strict ethical and privacy frameworks to protect sensitive child information. We hope that this study provides an initial step for future work on equitable speech technologies for Arabic-speaking children. We hope that our publicly available dataset enrich the children's demographic representation in ASR datasets.
24. 【2510.23284】DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model
链接:https://arxiv.org/abs/2510.23284
作者:Yuanzhen Xie,Liu Ye,Jiqun Chu,Mochi Gao,Hehuan Liu,Yunzhi Tan,Bo Hu,Zang Li
类目:Computation and Language (cs.CL)
关键词:gained attractive improvements, release of ChatGPT, gained attractive, attractive improvements, tasks
备注:
点击查看摘要
Abstract:Text-to-SQL tasks have gained attractive improvements since the release of ChatGPT. Among them, agent-based frameworks have been widely used in this field. However, the impact of data-centric strategies on text-to-SQL tasks has rarely been explored. In this paper, we systemically design a fully automated data-centric pipeline for text-to-SQL tasks, including \emph{adaptive data repair}, which can automatically find and fix errors in the training dataset; and \emph{error data augmentation}, where we specifically diffuse and enhance erroneous data predicted by the initially trained models. Meanwhile, we propose a Multi-Model collaboration training schema, aiming to train multiple models with different augmented data, enabling them to possess distinct capabilities and work together to complement each other, because it has been found that the capability of a single fine-tuned model is very limited. Furthermore, we utilize an ensemble strategy to integrate the capabilities of multiple models to solve a multiple-choice question, aiming to further improve the accuracy of text-to-SQL tasks. The experiment results and ablation study have demonstrated the effectiveness of data-centric pipeline and Multi-Model(MM) interactive iterative strategies, achieving first place in lightweight text-to-SQL models (within 70B).
25. 【2510.23276】A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results
链接:https://arxiv.org/abs/2510.23276
作者:Thai-Binh Nguyen,Katerina Zmolikova,Pingchuan Ma,Ngoc Quan Pham,Christian Fuegen,Alexander Waibel
类目:Computation and Language (cs.CL)
关键词:Multi-Modal Context-Aware Recognition, ninth CHiME Challenge, Context-Aware Recognition, CHiME Challenge, setting using audio
备注: Submitted to ICASSP 2026
点击查看摘要
Abstract:We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations where the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question "Who speaks when, what, and with whom?" by jointly transcribing each speaker's speech and clustering them into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields substantial 50% improvements, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task, outline the data collection process, and report the baseline systems developed for the MCoRec.
26. 【2510.23272】Code Aesthetics with Agentic Reward Feedback
链接:https://arxiv.org/abs/2510.23272
作者:Bang Xiao,Lingjie Jiang,Shaohan Huang,Tengchao Lv,Yupan Huang,Xun Wu,Lei Cui,Furu Wei
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, valuable assistants, assistants for developers
备注: 30 pages, 7 figures
点击查看摘要
Abstract:Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually-oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o and GPT-4.1, and achieves performance comparable to large open-source models with 480B-685B parameters, underscoring the effectiveness of our approach.
27. 【2510.23271】Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding
链接:https://arxiv.org/abs/2510.23271
作者:Mohammed Aljafari,Ismail Alturki,Ahmed Mori,Yehya Kadumi
类目:Computation and Language (cs.CL)
关键词:proprietary Arabic language, proprietary Arabic OCR, Islamic studies, developed by MASARAT, language model developed
备注: 21 pages, 2 figures, 3 tables. Includes appendices on ethical guidelines and training framework. Submitted September 04, 2025
点击查看摘要
Abstract:Mubeen is a proprietary Arabic language model developed by MASARAT SA, optimized for deep understanding of Arabic linguistics, Islamic studies, and cultural heritage. Trained on an extensive collection of authentic Arabic sources significantly expanded by digitizing historical manuscripts via a proprietary Arabic OCR engine, the model incorporates seminal scholarly works in linguistics, jurisprudence, hadith, and Quranic exegesis, alongside thousands of academic theses and peer-reviewed research papers. Conditioned through a deep linguistic engineering framework, Mubeen masters not just the meaning but the eloquence of Arabic, enabling precise understanding across classical texts, contemporary writing, and regional dialects with focus on comprehending user intent and delivering accurate, contextually relevant responses. Unlike other Arabic models relying on translated English data that often fail in intent detection or retrieval-augmented generation (RAG), Mubeen uses native Arabic sources to ensure cultural authenticity and accuracy. Its core innovation is the Practical Closure Architecture, designed to solve the "Utility Gap Crisis" where factually correct answers fail to resolve users' core needs, forcing them into frustrating cycles of re-prompting. By prioritizing clarity and decisive guidance, Mubeen transforms from an information repository into a decisive guide, aligning with Saudi Vision 2030. The model's architecture combines deep heritage specialization with multi-disciplinary expert modules, enabling robust performance across both cultural preservation and general knowledge domains.
28. 【2510.23252】Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?
链接:https://arxiv.org/abs/2510.23252
作者:Tawsif Tashwar Dipto,Azmol Hossain,Rubayet Sabbir Faruque,Md. Rezuwan Hassan,Kanij Fatema,Tanmoy Shome,Ruwad Naswan,Md.Foriduzzaman Zihad,Mohaymen Ul Anam,Nazia Tasnim,Hasan Mahmud,Md Kamrul Hasan,Md. Mehedi Hasan Shawon,Farig Sadeque,Tahsin Reasat
类目:Computation and Language (cs.CL)
关键词:Conventional research, automatic speech recognition, recognition modeling relies, fine-tuning task, speech recognition
备注: This manuscript contains 11 pages, 5 tables and 16 figures This was accepted at International Joint Conference on Natural Language Processing Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL) 2025
点击查看摘要
Abstract:Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations but dialect specific model training alleviates the issue. Our dataset also serves as a out of-distribution (OOD) resource for ASR modeling under constrained resources in ASR algorithms. The dataset and code developed for this project are publicly available
29. 【2510.23217】Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports
链接:https://arxiv.org/abs/2510.23217
作者:Alois Thomas,Maya Varma,Jean-Benoit Delbrouck,Curtis P. Langlotz
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:holds great potential, Automating radiology report, produce clinically critical, Large Vision-Language Models, Automating radiology
备注:
点击查看摘要
Abstract:Automating radiology report generation with Large Vision-Language Models (LVLMs) holds great potential, yet these models often produce clinically critical hallucinations, posing serious risks. Existing hallucination detection methods frequently lack the necessary sentence-level granularity or robust generalization across different LVLM generators. We introduce a novel approach: a sentence-level Process Reward Model (PRM) adapted for this vision-language task. Our PRM predicts the factual correctness of each generated sentence, conditioned on clinical context and preceding text. When fine-tuned on MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM outperforms existing verification techniques, demonstrating, for instance, relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods reliant on internal model states, our PRM demonstrates strong generalization to an unseen LVLM. We further show its practical utility: PRM scores effectively filter low-quality reports, improving F1-CheXbert scores by 4.5% (when discarding the worst 10% of reports). Moreover, when guiding a novel weighted best-of-N selection process on the MIMIC-CXR test set, our PRM show relative improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for BERTScore. These results demonstrate that a lightweight, context-aware PRM provides a model-agnostic safety layer for clinical LVLMs without access to internal activations
30. 【2510.23198】PTPP-Aware Adaptation Scaling Laws: Predicting Domain-Adaptation Performance at Unseen Pre-Training Budgets
链接:https://arxiv.org/abs/2510.23198
作者:Etienne Goffinet,Shane Bergsma,Avraham Sheinin,Natalia Vassilieva,Shaheer Muhammad,Preslav Nakov,Gurpreet Gosal
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:balance target-domain gains, Continual pre-training, PTPP, Existing CPT scaling, CPT scaling laws
备注:
点击查看摘要
Abstract:Continual pre-training (CPT) for domain adaptation must balance target-domain gains with stability on the base domain. Existing CPT scaling laws typically assume a fixed pre-training budget, which limits their ability to forecast adaptation outcomes for models trained at different tokens-per-parameter (PTPP). We present \emph{PTPP-aware} adaptation scaling laws that make the pre-training budget an explicit variable, enabling accurate \emph{prediction} of adaptation loss at unseen \ptpp. On a multilingual setup (English/Arabic $\rightarrow$ French), PTPP-aware formulations trained on early stages (\ptpp{}=\{15,31\}) predict target loss at \ptpp{}=279 and outperform a PTPP-agnostic \dcpt{} transfer baseline on metrics (Huber-on-log, MAE$_\mathrm{rel}$, calibration slope); full diagnostics (RMSE, MAPE) are in the appendix. Beyond forecasting, we show a practical use case: planning replay ratios and adaptation token budgets that satisfy target and forgetting constraints under compute limits.
31. 【2510.23189】DREaM: Drug-Drug Relation Extraction via Transfer Learning Method
链接:https://arxiv.org/abs/2510.23189
作者:Ali Fata,Hossein Rahmani,Parinaz Soltanzadeh,Amirhossein Derakhshan,Behrouz Minaei Bidgoli
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:predicting side effects, Relation extraction, side effects, plays a crucial, crucial role
备注:
点击查看摘要
Abstract:Relation extraction between drugs plays a crucial role in identifying drug drug interactions and predicting side effects. The advancement of machine learning methods in relation extraction, along with the development of large medical text databases, has enabled the low cost extraction of such relations compared to other approaches that typically require expert knowledge. However, to the best of our knowledge, there are limited datasets specifically designed for drug drug relation extraction currently available. Therefore, employing transfer learning becomes necessary to apply machine learning methods in this domain. In this study, we propose DREAM, a method that first employs a trained relation extraction model to discover relations between entities and then applies this model to a corpus of medical texts to construct an ontology of drug relationships. The extracted relations are subsequently validated using a large language model. Quantitative results indicate that the LLM agreed with 71 of the relations extracted from a subset of PubMed abstracts. Furthermore, our qualitative analysis indicates that this approach can uncover ambiguities in the medical domain, highlighting the challenges inherent in relation extraction in this field.
32. 【2510.23182】SI-Bench: Benchmarking Social Intelligence of Large Language Models in Human-to-Human Conversations
链接:https://arxiv.org/abs/2510.23182
作者:Shuai Huang,Wenxuan Zhao,Jun Gao
类目:Computation and Language (cs.CL)
关键词:develop anthropomorphic abilities, large language models, develop anthropomorphic, anthropomorphic abilities, large language
备注: 17 pages, 9 figures
点击查看摘要
Abstract:As large language models (LLMs) develop anthropomorphic abilities, they are increasingly being deployed as autonomous agents to interact with humans. However, evaluating their performance in realistic and complex social interactions remains a significant challenge. Most previous research built datasets through simulated agent-to-agent interactions, which fails to capture the authentic linguistic styles and relational dynamics found in real human conversations. To address this gap, we introduce SI-Bench, a novel benchmark designed to evaluate aspects of social intelligence in LLMs. Grounded in broad social science theories, SI-Bench contains 2,221 authentic multi-turn dialogues collected from a social networking application. We further selected a subset of 312 dialogues for manual annotation across 8 major models. The experiments show that SOTA models have surpassed the human expert in process reasoning under complex social situations, yet they still fall behind humans in reply quality. Moreover, introducing Chain-of-Thought (CoT) reasoning may degrade the performance of LLMs in social dialogue tasks. All datasets are openly available at this https URL.
33. 【2510.23169】MATCH: Task-Driven Code Evaluation through Contrastive Learning
链接:https://arxiv.org/abs/2510.23169
作者:Marah Ghoummaid,Vladimir Tchuiev,Ofek Glick,Michal Moschkovitz,Dotan Di Castro
类目:Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:GitHub Copilot estimated, GitHub Copilot, Copilot estimated, AI-based code generation, increasingly prevalent
备注:
点击查看摘要
Abstract:AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. To address the gap in reference-free evaluation, with few alternatives such as ICE-Score, this paper introduces MATCH, a novel reference-free metric. MATCH uses Contrastive Learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
34. 【2510.23163】Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs
链接:https://arxiv.org/abs/2510.23163
作者:Hang Lei,Shengyi Zong,Zhaoyan Li,Ziren Zhou,Hao Liu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, defining narrative structure, character development, television production, foundation for television
备注:
点击查看摘要
Abstract:The screenplay serves as the foundation for television production, defining narrative structure, character development, and dialogue. While Large Language Models (LLMs) show great potential in creative writing, direct end-to-end generation approaches often fail to produce well-crafted screenplays. We argue this failure stems from forcing a single model to simultaneously master two disparate capabilities: creative narrative construction and rigid format adherence. The resulting outputs may mimic superficial style but lack the deep structural integrity and storytelling substance required for professional use. To enable LLMs to generate high-quality screenplays, we introduce Dual-Stage Refinement (DSR), a decomposed framework that decouples creative narrative generation from format conversion. The first stage transforms a brief outline into rich, novel-style prose. The second stage refines this narrative into a professionally formatted screenplay. This separation enables the model to specialize in one distinct capability at each stage. A key challenge in implementing DSR is the scarcity of paired outline-to-novel training data. We address this through hybrid data synthesis: reverse synthesis deconstructs existing screenplays into structured inputs, while forward synthesis leverages these inputs to generate high-quality narrative texts as training targets. Blind evaluations by professional screenwriters show that DSR achieves a 75% win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of human-level performance. Our work demonstrates that decomposed generation architecture with tailored data synthesis effectively specializes LLMs in complex creative domains.
35. 【2510.23160】ENTP: Enhancing Low-Quality SFT Data via Neural-Symbolic Text Purge-Mix
链接:https://arxiv.org/abs/2510.23160
作者:Zile Yang,Ling Li,Na Di,Jinlong Pang,Yao Zhou,Hao Cheng,Bo Han,Jiaheng Wei
类目:Computation and Language (cs.CL)
关键词:pre-trained Large Language, Large Language Models, adapts pre-trained Large, Large Language, carefully curated subset
备注:
点击查看摘要
Abstract:Supervised Fine-Tuning (SFT) adapts pre-trained Large Language Models (LLMs) to domain-specific instructions by training on a carefully curated subset of high-quality instruction-response pairs, typically drawn from a larger dataset that often contains many low-quality or noisy samples. However, existing quality-first paradigms often overlook valuable signals in discarded low-quality data and rely on imperfect quality filters. We introduce ENTP (Enhancing low-quality SFT data via Neural-symbolic Text Purge-Mix), a framework that revitalizes low-quality corpora through symbolic purification and neural reconstruction. The symbolic module identifies and prunes noisy samples based on statistical priors, while the neural component synthesizes enriched instruction-response pairs by leveraging latent representations and model knowledge. This neural-symbolic synergy enhances data informativeness and diversity. Experiments show that ENTP-augmented datasets, constructed exclusively from low-quality data, outperform 13 established data-selection baselines across five instruction-following benchmarks, and even surpass fine-tuning on the full original dataset (approximately 300K examples). Our results highlight the untapped potential of low-quality data and underscore the importance of intelligent purification and synthesis for efficient instruction alignment.
36. 【2510.23142】Rethinking GSPO: The Perplexity-Entropy Equivalence
链接:https://arxiv.org/abs/2510.23142
作者:Chi Liu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:theta, GSPO length-normalized importance, text, length-normalized importance ratios, establishing their connection
备注: 10 pages, 2 figures
点击查看摘要
Abstract:We provide a new perspective on GSPO's length-normalized importance ratios by establishing their connection to information-theoretic quantities. We show that GSPO's sequence-level weight $s(\theta) = (\pi_\theta/\pi_{\theta_{\text{old}}})^{1/|y|}$ can be equivalently expressed as the inverse perplexity ratio $\text{PPL}_{\theta_{\text{old}}}/\text{PPL}_\theta$ and as the exponential cross-entropy change $\exp(\Delta H)$. While the perplexity-entropy relationship follows from standard definitions, this observation provides a useful lens for understanding GSPO: the algorithm weights policy gradient updates by perplexity ratios, offering an information-theoretic interpretation of the importance weights. This perspective helps explain GSPO's empirical properties, including log-domain variance reduction through geometric averaging and stability in training mixture-of-experts models. We validate the mathematical equivalences and variance predictions through controlled experiments on mathematical reasoning tasks.
37. 【2510.23131】Corpus Frequencies in Morphological Inflection: Do They Matter?
链接:https://arxiv.org/abs/2510.23131
作者:Tomáš Sourada,Jana Straková
类目:Computation and Language (cs.CL)
关键词:express grammatical categories, triples uniformly, grammatical categories, modifying a base, express grammatical
备注: Published in the proceedings of ITAT 2025.15 pages, 1 figure, 4 tables
点击查看摘要
Abstract:The traditional approach to morphological inflection (the task of modifying a base word (lemma) to express grammatical categories) has been, for decades, to consider lexical entries of lemma-tag-form triples uniformly, lacking any information about their frequency distribution. However, in production deployment, one might expect the user inputs to reflect a real-world distribution of frequencies in natural texts. With future deployment in mind, we explore the incorporation of corpus frequency information into the task of morphological inflection along three key dimensions during system development: (i) for train-dev-test split, we combine a lemma-disjoint approach, which evaluates the model's generalization capabilities, with a frequency-weighted strategy to better reflect the realistic distribution of items across different frequency bands in training and test sets; (ii) for evaluation, we complement the standard type accuracy (often referred to simply as accuracy), which treats all items equally regardless of frequency, with token accuracy, which assigns greater weight to frequent words and better approximates performance on running text; (iii) for training data sampling, we introduce a method novel in the context of inflection, frequency-aware training, which explicitly incorporates word frequency into the sampling process. We show that frequency-aware training outperforms uniform sampling in 26 out of 43 languages.
38. 【2510.23123】Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation
链接:https://arxiv.org/abs/2510.23123
作者:Shiwei Li,Xiandi Luo,Haozhao Wang,Xing Tang,Ziqiang Cui,Dugang Liu,Yuhua Li,Xiuqiang He,Ruixuan Li
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language models, parameter-efficient fine-tuning, method widely, LoRA, large language
备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs). LoRA essentially describes the projection of an input space into a low-dimensional output space, with the dimensionality determined by the LoRA rank. In standard LoRA, all input tokens share the same weights and undergo an identical input-output projection. This limits LoRA's ability to capture token-specific information due to the inherent semantic differences among tokens. To address this limitation, we propose Token-wise Projected Low-Rank Adaptation (TopLoRA), which dynamically adjusts LoRA weights according to the input token, thereby learning token-wise input-output projections in an end-to-end manner. Formally, the weights of TopLoRA can be expressed as $B\Sigma_X A$, where $A$ and $B$ are low-rank matrices (as in standard LoRA), and $\Sigma_X$ is a diagonal matrix generated from each input token $X$. Notably, TopLoRA does not increase the rank of LoRA weights but achieves more granular adaptation by learning token-wise LoRA weights (i.e., token-wise input-output projections). Extensive experiments across multiple models and datasets demonstrate that TopLoRA consistently outperforms LoRA and its variants. The code is available at this https URL.
39. 【2510.23114】Flexing in 73 Languages: A Single Small Model for Multilingual Inflection
链接:https://arxiv.org/abs/2510.23114
作者:Tomáš Sourada,Jana Straková
类目:Computation and Language (cs.CL)
关键词:express grammatical categories, generating inflected word, inflected word forms, present a compact, single-model approach
备注: Published in the proceedings of TSD 2025. 12 pages, 1 figure, 4 tables
点击查看摘要
Abstract:We present a compact, single-model approach to multilingual inflection, the task of generating inflected word forms from base lemmas to express grammatical categories. Our model, trained jointly on data from 73 languages, is lightweight, robust to unseen words, and outperforms monolingual baselines in most languages. This demonstrates the effectiveness of multilingual modeling for inflection and highlights its practical benefits: simplifying deployment by eliminating the need to manage and retrain dozens of separate monolingual models. In addition to the standard SIGMORPHON shared task benchmarks, we evaluate our monolingual and multilingual models on 73 Universal Dependencies (UD) treebanks, extracting lemma-tag-form triples and their frequency counts. To ensure realistic data splits, we introduce a novel frequency-weighted, lemma-disjoint train-dev-test resampling procedure. Our work addresses the lack of an open-source, general-purpose, multilingual morphological inflection system capable of handling unseen words across a wide range of languages, including Czech. All code is publicly released at: this https URL.
40. 【2510.23104】Leveraging Hierarchical Organization for Medical Multi-document Summarization
链接:https://arxiv.org/abs/2510.23104
作者:Yi-Li Hsu,Katelyn X. Mei,Lucy Lu Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:managing cross-document relationships, requires effectively managing, effectively managing cross-document, cross-document relationships, Medical multi-document summarization
备注:
点击查看摘要
Abstract:Medical multi-document summarization (MDS) is a complex task that requires effectively managing cross-document relationships. This paper investigates whether incorporating hierarchical structures in the inputs of MDS can improve a model's ability to organize and contextualize information across documents compared to traditional flat summarization methods. We investigate two ways of incorporating hierarchical organization across three large language models (LLMs), and conduct comprehensive evaluations of the resulting summaries using automated metrics, model-based metrics, and domain expert evaluation of preference, understandability, clarity, complexity, relevance, coverage, factuality, and coherence. Our results show that human experts prefer model-generated summaries over human-written summaries. Hierarchical approaches generally preserve factuality, coverage, and coherence of information, while also increasing human preference for summaries. Additionally, we examine whether simulated judgments from GPT-4 align with human judgments, finding higher agreement along more objective evaluation facets. Our findings demonstrate that hierarchical structures can improve the clarity of medical summaries generated by models while maintaining content coverage, providing a practical way to improve human preference for generated summaries.
41. 【2510.23090】MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models
链接:https://arxiv.org/abs/2510.23090
作者:Suchan Lee,Jihoon Choi,Sohyeon Lee,Minseok Song,Bong-Gyu Jang,Hwanjo Yu,Soyeon Caren Han
类目:Computation and Language (cs.CL)
关键词:pretrained large language, aligning numerical inputs, large language models, LLM embedding spaces, advances have investigated
备注:
点击查看摘要
Abstract:Recent advances have investigated the use of pretrained large language models (LLMs) for time-series forecasting by aligning numerical inputs with LLM embedding spaces. However, existing multimodal approaches often overlook the distinct statistical properties and temporal dependencies that are fundamental to time-series data. To bridge this gap, we propose MAP4TS, a novel Multi-Aspect Prompting Framework that explicitly incorporates classical time-series analysis into the prompt design. Our framework introduces four specialized prompt components: a Global Domain Prompt that conveys dataset-level context, a Local Domain Prompt that encodes recent trends and series-specific behaviors, and a pair of Statistical and Temporal Prompts that embed handcrafted insights derived from autocorrelation (ACF), partial autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined with raw time-series embeddings and passed through a cross-modality alignment module to produce unified representations, which are then processed by an LLM and projected for final forecasting. Extensive experiments across eight diverse datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based methods. Our ablation studies further reveal that prompt-aware designs significantly enhance performance stability and that GPT-2 backbones, when paired with structured prompts, outperform larger models like LLaMA in long-term forecasting tasks.
42. 【2510.23081】A Survey on LLM Mid-training
链接:https://arxiv.org/abs/2510.23081
作者:Chengying Tu,Xuemiao Zhang,Rongxiang Weng,Rumei Li,Chen Zhang,Yang Bai,Hongfei Yan,Jingang Wang,Xunliang Cai
类目:Computation and Language (cs.CL)
关键词:Recent advances, pre-training and post-training, advances in foundation, highlighted the significant, significant benefits
备注:
点击查看摘要
Abstract:Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.
43. 【2510.23074】Fast-MIA: Efficient and Scalable Membership Inference for LLMs
链接:https://arxiv.org/abs/2510.23074
作者:Hiromu Takahashi,Shotaro Ishihara
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, Large Language, efficiently evaluating membership, https URL
备注:
点击查看摘要
Abstract:We propose Fast-MIA (this https URL), a Python library for efficiently evaluating membership inference attacks (MIA) against Large Language Models (LLMs). MIA against LLMs has emerged as a crucial challenge due to growing concerns over copyright, security, and data privacy, and has attracted increasing research attention. However, the progress of this research is significantly hindered by two main obstacles: (1) the high computational cost of inference in LLMs, and (2) the lack of standardized and maintained implementations of MIA methods, which makes large-scale empirical comparison difficult. To address these challenges, our library provides fast batch inference and includes implementations of representative MIA methods under a unified evaluation framework. This library supports easy implementation of reproducible benchmarks with simple configuration and extensibility. We release Fast-MIA as an open-source (Apache License 2.0) tool to support scalable and transparent research on LLMs.
44. 【2510.23070】Quality-Aware Translation Tagging in Multilingual RAG system
链接:https://arxiv.org/abs/2510.23070
作者:Hoyeon Moon,Byeolhee Kim,Nikhil Verma
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:retrieves English documents, retrieves English, Multilingual Retrieval-Augmented Generation, Retrieval-Augmented Generation, English documents
备注: EMNLP 2025 MRL Workshop
点击查看摘要
Abstract:Multilingual Retrieval-Augmented Generation (mRAG) often retrieves English documents and translates them into the query language for low-resource settings. However, poor translation quality degrades response generation performance. Existing approaches either assume sufficient translation quality or utilize the rewriting method, which introduces factual distortion and hallucinations. To mitigate these problems, we propose Quality-Aware Translation Tagging in mRAG (QTT-RAG), which explicitly evaluates translation quality along three dimensions-semantic equivalence, grammatical accuracy, and naturalnessfluency-and attach these scores as metadata without altering the original content. We evaluate QTT-RAG against CrossRAG and DKM-RAG as baselines in two open-domain QA benchmarks (XORQA, MKQA) using six instruction-tuned LLMs ranging from 2.4B to 14B parameters, covering two low-resource languages (Korean and Finnish) and one high-resource language (Chinese). QTT-RAG outperforms the baselines by preserving factual integrity while enabling generator models to make informed decisions based on translation reliability. This approach allows for effective usage of cross-lingual documents in low-resource settings with limited native language documents, offering a practical and robust solution across multilingual domains.
45. 【2510.23052】Knocking-Heads Attention
链接:https://arxiv.org/abs/2510.23052
作者:Zhanchao Zhou,Xiaodong Chen,Haoxing Chen,Zhenzhong Lan,Jianguo Li
类目:Computation and Language (cs.CL)
关键词:modern large language, enhancing representational capacity, Multi-head attention, large language models, enhancing representational
备注:
点击查看摘要
Abstract:Multi-head attention (MHA) has become the cornerstone of modern large language models, enhancing representational capacity through parallel attention heads. However, increasing the number of heads inherently weakens individual head capacity, and existing attention mechanisms - whether standard MHA or its variants like grouped-query attention (GQA) and grouped-tied attention (GTA) - simply concatenate outputs from isolated heads without strong interaction. To address this limitation, we propose knocking-heads attention (KHA), which enables attention heads to "knock" on each other - facilitating cross-head feature-level interactions before the scaled dot-product attention. This is achieved by applying a shared, diagonally-initialized projection matrix across all heads. The diagonal initialization preserves head-specific specialization at the start of training while allowing the model to progressively learn integrated cross-head representations. KHA adds only minimal parameters and FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention variants. We validate KHA by training a 6.1B parameter MoE model (1.01B activated) on 1T high-quality tokens. Compared to baseline attention mechanisms, KHA brings superior and more stable training dynamics, achieving better performance across downstream tasks.
46. 【2510.23038】Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
链接:https://arxiv.org/abs/2510.23038
作者:Ran Xu,Jingjing Chen,Jiayu Ye,Yu Wu,Jun Yan,Carl Yang,Hongkun Yu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, evaluate response quality, Language Models, response quality
备注: Work in Progress
点击查看摘要
Abstract:Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero - trained entirely without distilled judge trajectories, matches the performance of distilled variants, demonstrating that tool-augmented judges can self-evolve through iterative reinforcement learning.
47. 【2510.23027】owards Stable and Effective Reinforcement Learning for Mixture-of-Experts
链接:https://arxiv.org/abs/2510.23027
作者:Di Zhang,Xun Wu,Shaohan Huang,Yaru Hao,Li Dong,Zewen Chi,Zhifang Sui,Furu Wei
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Recent advances, reinforcement learning, leading to significant, reasoning ability, large-scale language models
备注:
点击查看摘要
Abstract:Recent advances in reinforcement learning (RL) have substantially improved the training of large-scale language models, leading to significant gains in generation quality and reasoning ability. However, most existing research focuses on dense models, while RL training for Mixture-of-Experts (MoE) architectures remains underexplored. To address the instability commonly observed in MoE training, we propose a novel router-aware approach to optimize importance sampling (IS) weights in off-policy RL. Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models, highlighting the potential of RL algorithmic innovations tailored to MoE architectures and providing a promising direction for efficient training of large-scale expert models.
48. 【2510.23023】UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization
链接:https://arxiv.org/abs/2510.23023
作者:Huixuan Zhang,Xiaojun Wan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:generative models, significant concern, rapid proliferation, authenticity of digital, image generative models
备注:
点击查看摘要
Abstract:With the rapid proliferation of image generative models, the authenticity of digital images has become a significant concern. While existing studies have proposed various methods for detecting AI-generated content, current benchmarks are limited in their coverage of diverse generative models and image categories, often overlooking end-to-end image editing and artistic images. To address these limitations, we introduce UniAIDet, a unified and comprehensive benchmark that includes both photographic and artistic images. UniAIDet covers a wide range of generative models, including text-to-image, image-to-image, image inpainting, image editing, and deepfake models. Using UniAIDet, we conduct a comprehensive evaluation of various detection methods and answer three key research questions regarding generalization capability and the relation between detection and localization. Our benchmark and analysis provide a robust foundation for future research.
49. 【2510.23020】M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark
链接:https://arxiv.org/abs/2510.23020
作者:Huixuan Zhang,Xiaojun Wan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:struggle with generating, generating images, images that perfectly, textual prompts, perfectly align
备注:
点击查看摘要
Abstract:Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, especially overlooking the difficulty of prompts with multiple different instances belonging to the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation along with an object-detection-based evaluation metric, $AlignScore$, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This training-free post-editing method demonstrates improvements in image-text alignment across a broad range of diffusion models. \footnote{Our code and data has been released in supplementary material and will be made publicly available after the paper is accepted.}
50. 【2510.23011】LangLingual: A Personalised, Exercise-oriented English Language Learning Tool Leveraging Large Language Models
链接:https://arxiv.org/abs/2510.23011
作者:Sammriddh Gupta,Sonit Singh,Aditya Joshi,Mira Kim
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:Language educators strive, Large Language Models, educators strive, strive to create, create a rich
备注: 14 pages
点击查看摘要
Abstract:Language educators strive to create a rich experience for learners, while they may be restricted in the extend of feedback and practice they can provide. We present the design and development of LangLingual, a conversational agent built using the LangChain framework and powered by Large Language Models. The system is specifically designed to provide real-time, grammar-focused feedback, generate context-aware language exercises and track learner proficiency over time. The paper discusses the architecture, implementation and evaluation of LangLingual in detail. The results indicate strong usability, positive learning outcomes and encouraging learner engagement.
51. 【2510.23006】Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures
链接:https://arxiv.org/abs/2510.23006
作者:Shenran Wang,Timothy Tin-Long Tse,Jian Zhu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:hybrid large language, large language models, perform in-depth evaluations, knowledge-based ICL tasks, in-context learning
备注:
点击查看摘要
Abstract:We perform in-depth evaluations of in-context learning (ICL) on state-of-the-art transformer, state-space, and hybrid large language models over two categories of knowledge-based ICL tasks. Using a combination of behavioral probing and intervention-based methods, we have discovered that, while LLMs of different architectures can behave similarly in task performance, their internals could remain different. We discover that function vectors (FVs) responsible for ICL are primarily located in the self-attention and Mamba layers, and speculate that Mamba2 uses a different mechanism from FVs to perform ICL. FVs are more important for ICL involving parametric knowledge retrieval, but not for contextual knowledge understanding. Our work contributes to a more nuanced understanding across architectures and task types. Methodologically, our approach also highlights the importance of combining both behavioural and mechanistic analyses to investigate LLM capabilities.
52. 【2510.22993】Can Language Models Compose Skills In-Context?
链接:https://arxiv.org/abs/2510.22993
作者:Zidong Liu,Zhuoyan Xu,Zhenmei Shi,Yingyu Liang
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Composing basic skills, modern intelligent systems, Composing basic, accomplish composite tasks, intelligent systems
备注:
点击查看摘要
Abstract:Composing basic skills from simple tasks to accomplish composite tasks is crucial for modern intelligent systems. We investigate the in-context composition ability of language models to perform composite tasks that combine basic skills demonstrated in in-context examples. This is more challenging than the standard setting, where skills and their composition can be learned in training. We conduct systematic experiments on various representative open-source language models, utilizing linguistic and logical tasks designed to probe composition abilities. The results reveal that simple task examples can have a surprising negative impact on the performance, because the models generally struggle to recognize and assemble the skills correctly, even with Chain-of-Thought examples. Theoretical analysis further shows that it is crucial to align examples with the corresponding steps in the composition. This inspires a method for the probing tasks, whose improved performance provides positive support for our insights.
53. 【2510.22968】Measuring Teaching with LLMs
链接:https://arxiv.org/abs/2510.22968
作者:Michael Hardy
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:challenge in education, teaching quality, persistent challenge, Large Language Models, Large Language
备注:
点击查看摘要
Abstract:Objective and scalable measurement of teaching quality is a persistent challenge in education. While Large Language Models (LLMs) offer potential, general-purpose models have struggled to reliably apply complex, authentic classroom observation instruments. This paper uses custom LLMs built on sentence-level embeddings, an architecture better suited for the long-form, interpretive nature of classroom transcripts than conventional subword tokenization. We systematically evaluate five different sentence embeddings under a data-efficient training regime designed to prevent overfitting. Our results demonstrate that these specialized models can achieve human-level and even super-human performance with expert human ratings above 0.65 and surpassing the average human-human rater correlation. Further, through analysis of annotation context windows, we find that more advanced models-those better aligned with human judgments-attribute a larger share of score variation to lesson-level features rather than isolated utterances, challenging the sufficiency of single-turn annotation paradigms. Finally, to assess external validity, we find that aggregate model scores align with teacher value-added measures, indicating they are capturing features relevant to student learning. However, this trend does not hold at the individual item level, suggesting that while the models learn useful signals, they have not yet achieved full generalization. This work establishes a viable and powerful new methodology for AI-driven instructional measurement, offering a path toward providing scalable, reliable, and valid feedback for educator development.
54. 【2510.22967】MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs
链接:https://arxiv.org/abs/2510.22967
作者:Yucheng Ning,Xixun Lin,Fang Fang,Yanan Cao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, raises critical concerns, adoption of Large, Language Models
备注: This article has been accepted by Frontiers of Computer Science (FCS)
点击查看摘要
Abstract:The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset; and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.
55. 【2510.22956】agging-Augmented Generation: Assisting Language Models in Finding Intricate Knowledge In Long Contexts
链接:https://arxiv.org/abs/2510.22956
作者:Anwesan Pal,Karen Hovsepian,Tinghao Guo,Mengnan Zhao,Somendra Tripathi,Nikos Kanakaris,George Mihaila,Sumit Nigam
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:effective question answering, large language models, modern flagship large, flagship large language, revealed major limitations
备注: Paper accepted at EMNLP 2025
点击查看摘要
Abstract:Recent investigations into effective context lengths of modern flagship large language models (LLMs) have revealed major limitations in effective question answering (QA) and reasoning over long and complex contexts for even the largest and most impressive cadre of models. While approaches like retrieval-augmented generation (RAG) and chunk-based re-ranking attempt to mitigate this issue, they are sensitive to chunking, embedding and retrieval strategies and models, and furthermore, rely on extensive pre-processing, knowledge acquisition and indexing steps. In this paper, we propose Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy that boosts LLM performance in long-context scenarios, without degrading and altering the integrity and composition of retrieved documents. We validate our hypothesis by augmenting two challenging and directly relevant question-answering benchmarks -- NoLima and NovelQA -- and show that tagging the context or even just adding tag definitions into QA prompts leads to consistent performance gains over the baseline -- up to 17% for 32K token contexts, and 2.9% in complex reasoning question-answering for multi-hop queries requiring knowledge across a wide span of text. Additional details are available at this https URL.
56. 【2510.22954】Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
链接:https://arxiv.org/abs/2510.22954
作者:Liwei Jiang,Yuanjun Chai,Margaret Li,Mickel Liu,Raymond Fok,Nouha Dziri,Yulia Tsvetkov,Maarten Sap,Alon Albalak,Yejin Choi
类目:Computation and Language (cs.CL)
关键词:human-like creative content, Language models, human-like creative, creative content, raising concerns
备注: NeurIPS 2025 DB Paper (Oral); Camera-Ready Version
点击查看摘要
Abstract:Language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar outputs. Yet scalable methods for evaluating LM output diversity remain limited, especially beyond narrow tasks such as random number or name generation, or beyond repeated sampling from a single model. We introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries that admit a wide range of plausible answers with no single ground truth. We introduce the first comprehensive taxonomy for characterizing the full spectrum of open-ended prompts posed to LMs, comprising 6 top-level categories (e.g., brainstorm ideation) that further breaks down to 17 subcategories. Using Infinity-Chat, we present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect in open-ended generation of LMs, characterized by (1) intra-model repetition, where a single model consistently generates similar responses, and more so (2) inter-model homogeneity, where different models produce strikingly similar outputs. Infinity-Chat also includes 31,250 human annotations, across absolute ratings and pairwise preferences, with 25 independent human annotations per example. This enables studying collective and individual-specific human preferences in response to open-ended queries. Our findings show that LMs, reward models, and LM judges are less well calibrated to human ratings on model generations that elicit differing idiosyncratic annotator preferences, despite maintaining comparable overall quality. Overall, INFINITY-CHAT presents the first large-scale resource for systematically studying real-world open-ended queries to LMs, revealing critical insights to guide future research for mitigating long-term AI safety risks posed by the Artificial Hivemind.
57. 【2510.22907】Language Server CLI Empowers Language Agents with Process Rewards
链接:https://arxiv.org/abs/2510.22907
作者:Yifan Zhang,Lanser Contributors
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
关键词:Large language models, servers compute verified, models routinely hallucinate, routinely hallucinate APIs, language models routinely
备注: Project Page: [this https URL](https://github.com/yifanzhang-pro/lanser-cli)
点击查看摘要
Abstract:Large language models routinely hallucinate APIs and mislocalize edits, while language servers compute verified, IDE-grade facts about real code. We present Lanser-CLI, a CLI-first orchestration layer that pins and mediates a Language Server Protocol (LSP) server for coding agents and CI, exposing deterministic, replayable workflows. Our position is that language servers provide not only structural information (definitions, references, types, diagnostics) but also an actionable process reward: machine-checked, step-wise signals that align an agent's planning loop with program reality. In this work, Lanser-CLI contributes: (i) a robust addressing scheme beyond brittle "file:line:col" via a Selector DSL (symbolic, AST-path, and content-anchored selectors) with a principled relocation algorithm; (ii) deterministic Analysis Bundles that normalize Language Server responses and capture environment/capability metadata with stable content hashes; (iii) a safety envelope for mutating operations (rename, code actions) with preview, workspace jails, and Git-aware, transactional apply; and (iv) a process-reward functional derived from Language Server facts (diagnostic deltas, disambiguation confidence, and safe-apply checks) that is computable online and replayable offline. We formalize determinism under frozen snapshots and establish a monotonicity property for the process reward, making it suitable for process supervision and counterfactual analysis. Project Page: this https URL
58. 【2510.22904】Modeling Political Discourse with Sentence-BERT and BERTopic
链接:https://arxiv.org/abs/2510.22904
作者:Margarida Mendonca,Alvaro Figueira
类目:ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:Moral Foundations Theory, ideological divides, reshaped political discourse, politicians a platform, platform for direct
备注: 11 pages. Continues previous study by Mendonca M. and Figueira A, 2023: "Analyzing Political Discourse in the 117th U.S. Congress Using Transformer-Based Topic Models", presented at the International Conference on Computational Social Science
点击查看摘要
Abstract:Social media has reshaped political discourse, offering politicians a platform for direct engagement while reinforcing polarization and ideological divides. This study introduces a novel topic evolution framework that integrates BERTopic-based topic modeling with Moral Foundations Theory (MFT) to analyze the longevity and moral dimensions of political topics in Twitter activity during the 117th U.S. Congress. We propose a methodology for tracking dynamic topic shifts over time and measuring their association with moral values and quantifying topic persistence. Our findings reveal that while overarching themes remain stable, granular topics tend to dissolve rapidly, limiting their long-term influence. Moreover, moral foundations play a critical role in topic longevity, with Care and Loyalty dominating durable topics, while partisan differences manifest in distinct moral framing strategies. This work contributes to the field of social network analysis and computational political discourse by offering a scalable, interpretable approach to understanding moral-driven topic evolution on social media.
59. 【2510.22881】Offline Preference Optimization via Maximum Marginal Likelihood Estimation
链接:https://arxiv.org/abs/2510.22881
作者:Saeed Najafi,Alona Fyshe
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Aligning Large Language, Aligning Large, Reinforcement Learning, Human Feedback, Large Language Models
备注:
点击查看摘要
Abstract:Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this work, we propose a new, simpler approach that recasts alignment through the lens of Maximum Marginal Likelihood (MML) estimation. Our new MML based Preference Optimization (MMPO) maximizes the marginal log-likelihood of a preferred text output, using the preference pair as samples for approximation, and forgoes the need for both an explicit reward model and entropy maximization. We theoretically demonstrate that MMPO implicitly performs preference optimization, producing a weighted gradient that naturally up-weights chosen responses over rejected ones. Across models ranging from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable with respect to the hyperparameter $\beta$ compared to alternative baselines, and 2) achieves competitive or superior preference alignment while better preserving the base model's general language capabilities. Through a series of ablation experiments, we show that this improved performance is indeed attributable to MMPO's implicit preference optimization within the gradient updates.
60. 【2510.22876】Batch Speculative Decoding Done Right
链接:https://arxiv.org/abs/2510.22876
作者:Ranran Haoran Zhang,Soumik Dey,Ashirbad Mishra,Hansi Wu,Binbin Li,Rui Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:small draft model, speeds up LLM, propose multiple tokens, target model verifies, verifies in parallel
备注:
点击查看摘要
Abstract:Speculative decoding speeds up LLM inference by using a small draft model to propose multiple tokens that a target model verifies in parallel. Extending this idea to batches is essential for production serving, but it introduces the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, breaking right-alignment and corrupting position IDs, attention masks, and KV-cache state. We show that several existing batch implementations violate output equivalence-the fundamental requirement that speculative decoding must produce identical token sequences to standard autoregressive generation. These violations occur precisely due to improper handling of the ragged tensor problem. In response, we (1) characterize the synchronization requirements that guarantee correctness, (2) present a correctness-first batch speculative decoding EQSPEC that exposes realignment as consuming 40% of overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences and dynamically forms same-length groups, to reduce the realignment overhead while preserving per-sequence speculative speedups. On the SpecBench dataset, across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our approach achieves up to 3$\times$ throughput improvement at batch size 8 compared to batch size 1, with efficient scaling through batch size 8, while maintaining 95% output equivalence. Our method requires no custom kernels and integrates cleanly with existing inference stacks. Our code is available at this https URL.
61. 【2510.22874】A Comprehensive Dataset for Human vs. AI Generated Text Detection
链接:https://arxiv.org/abs/2510.22874
作者:Rajarshi Roy,Nasrin Imanpour,Ashhar Aziz,Shashwat Bajpai,Gurpreet Singh,Shwetangshu Biswas,Kapil Wanaskar,Parth Patwa,Subhankar Ghosh,Shreyas Dixit,Nilesh Ranjan Pal,Vipula Rawte,Ritvik Garimella,Gaytri Jena,Amit Sheth,Vasu Sharma,Aishwarya Naresh Reganti,Vinija Jain,Aman Chadha,Amitava Das
类目:Computation and Language (cs.CL)
关键词:increasingly human-like AI-generated, large language models, human-like AI-generated text, raising concerns, rapid advancement
备注: Defactify4 @AAAI 2025
点击查看摘要
Abstract:The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 58,000 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides original article abstracts as prompts, full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35\%, and attributing AI texts to their generating models with an accuracy of 8.92\%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: this https URL.
62. 【2510.22866】Interpreting and Mitigating Unwanted Uncertainty in LLMs
链接:https://arxiv.org/abs/2510.22866
作者:Tiasa Singha Roy,Ayush Rajesh Jhaveri,Ilias Triantafyllopoulos
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, previously correct answer, Language Models, exhibit unwanted uncertainty
备注:
点击查看摘要
Abstract:Despite their impressive capabilities, Large Language Models (LLMs) exhibit unwanted uncertainty, a phenomenon where a model changes a previously correct answer into an incorrect one when re-prompted. This behavior undermines trust and poses serious risks in high-stakes domains. In this work, we investigate the mechanisms that drive this phenomenon. We adapt the Needle-in-a-Haystack retrieval framework and integrate a Flip-style re-evaluation prompt to simulate realistic answer-flipping scenarios. We find that retrieval heads are not primarily responsible for avoiding uncertainty. Instead, we identify a small set of non-retrieval attention heads that disproportionately attend to misleading tokens in uncertain contexts. Masking these heads yields significant improvements, reducing flip behavior by up to 15% without introducing incoherence or overcorrection. However, when tested for downstream tasks, we observe trade-offs with flip behavior. Our findings contribute to the growing field of mechanistic interpretability and present a simple yet effective technique for mitigating uncertainty-driven failure modes in LLMs.
63. 【2510.22860】Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement
链接:https://arxiv.org/abs/2510.22860
作者:Linyang He,Tianjun Zhong,Richard Antonello,Gavin Mischler,Micah Goldblum,Nima Mesgarani
类目:Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
关键词:human brain progresses, performing high-level reasoning, simple linguistic inputs, challenge in neuroscience, inputs to performing
备注: Accepted at NeurIPS 2025
点击查看摘要
Abstract:Understanding how the human brain progresses from processing simple linguistic inputs to performing high-level reasoning is a fundamental challenge in neuroscience. While modern large language models (LLMs) are increasingly used to model neural responses to language, their internal representations are highly "entangled," mixing information about lexicon, syntax, meaning, and reasoning. This entanglement biases conventional brain encoding analyses toward linguistically shallow features (e.g., lexicon and syntax), making it difficult to isolate the neural substrates of cognitively deeper processes. Here, we introduce a residual disentanglement method that computationally isolates these components. By first probing an LM to identify feature-specific layers, our method iteratively regresses out lower-level representations to produce four nearly orthogonal embeddings for lexicon, syntax, meaning, and, critically, reasoning. We used these disentangled embeddings to model intracranial (ECoG) brain recordings from neurosurgical patients listening to natural speech. We show that: 1) This isolated reasoning embedding exhibits unique predictive power, accounting for variance in neural activity not explained by other linguistic features and even extending to the recruitment of visual regions beyond classical language areas. 2) The neural signature for reasoning is temporally distinct, peaking later (~350-400ms) than signals related to lexicon, syntax, and meaning, consistent with its position atop a processing hierarchy. 3) Standard, non-disentangled LLM embeddings can be misleading, as their predictive success is primarily attributable to linguistically shallow features, masking the more subtle contributions of deeper cognitive processing.
64. 【2510.22849】Once Upon an Input: Reasoning via Per-Instance Program Synthesis
链接:https://arxiv.org/abs/2510.22849
作者:Adam Stein,Neelay Velingker,Mayur Naik,Eric Wong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large language models, Large language, language models, excel at zero-shot, struggle with complex
备注: Accepted at NeurIPS 2025. 34 pages, 7 figures
点击查看摘要
Abstract:Large language models (LLMs) excel at zero-shot inference but continue to struggle with complex, multi-step reasoning. Recent methods that augment LLMs with intermediate reasoning steps such as Chain of Thought (CoT) and Program of Thought (PoT) improve performance but often produce undesirable solutions, especially in algorithmic domains. We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance-level using structural feedback without relying on task-specific guidance or explicit test cases. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis. Experiments across three frontier LLMs and 30 benchmarks including all tasks of Big Bench Extra Hard (BBEH), visual question answering tasks, relational reasoning tasks, and mathematical reasoning tasks show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and 9.4% compared to PoT and CoT respectively, and reduces undesirable program generations by 65.1% on the algorithmic tasks compared to PoT with Gemini-2.0-Flash.
65. 【2510.22844】Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning
链接:https://arxiv.org/abs/2510.22844
作者:Prerna Ravi,Dong Won Lee,Beatriz Flamia,Jasmine David,Brandon Hanks,Cynthia Breazeal,Emma Anderson,Grace Lin
类目:Computation and Language (cs.CL)
关键词:Understanding how ideas, analyzing collaborative learning, ideas develop, develop and flow, flow in small-group
备注: In Submission: Journal of Educational Data Mining (jEDM) 2026
点击查看摘要
Abstract:Understanding how ideas develop and flow in small-group conversations is critical for analyzing collaborative learning. A key structural feature of these interactions is threading, the way discourse talk naturally organizes into interwoven topical strands that evolve over time. While threading has been widely studied in asynchronous text settings, detecting threads in synchronous spoken dialogue remains challenging due to overlapping turns and implicit cues. At the same time, large language models (LLMs) show promise for automating discourse analysis but often struggle with long-context tasks that depend on tracing these conversational links. In this paper, we investigate whether explicit thread linkages can improve LLM-based coding of relational moves in group talk. We contribute a systematic guidebook for identifying threads in synchronous multi-party transcripts and benchmark different LLM prompting strategies for automated threading. We then test how threading influences performance on downstream coding of conversational analysis frameworks, that capture core collaborative actions such as agreeing, building, and eliciting. Our results show that providing clear conversational thread information improves LLM coding performance and underscores the heavy reliance of downstream analysis on well-structured dialogue. We also discuss practical trade-offs in time and cost, emphasizing where human-AI hybrid approaches can yield the best value. Together, this work advances methods for combining LLMs and robust conversational thread structures to make sense of complex, real-time group interactions.
66. 【2510.22830】Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
链接:https://arxiv.org/abs/2510.22830
作者:Haowei Hua(1),Hong Jiao(2),Xinyi Wang(3) ((1) Princeton University, (2) University of Maryland, College Park, (3) University of Maryland, College Park amp; Beijing Normal University)
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:automated scoring, variants are extensively, extensively explored, Agency Lab Automated, Automated Essay Scoring
备注: 19 pages, 5 Tables 7 Figures, Presentation at Artificial Intelligence in Measurement and Education Conference (AIME-Con)
点击查看摘要
Abstract:BERT and its variants are extensively explored for automated scoring. However, a limit of 512 tokens for these encoder-based models showed the deficiency in automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed great improvement of scoring accuracy with QWK increased from 0.822 to 0.8878 for the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
67. 【2510.22823】Cross-Lingual Stability and Bias in Instruction-Tuned Language Models for Humanitarian NLP
链接:https://arxiv.org/abs/2510.22823
作者:Poli Nemkova,Amrit Adhikari,Matthew Pearson,Vamsi Krishna Sadu,Mark V. Albert
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:costly commercial APIs, critical choice, invest in costly, human rights monitoring, Humanitarian organizations face
备注:
点击查看摘要
Abstract:Humanitarian organizations face a critical choice: invest in costly commercial APIs or rely on free open-weight models for multilingual human rights monitoring. While commercial systems offer reliability, open-weight alternatives lack empirical validation -- especially for low-resource languages common in conflict zones. This paper presents the first systematic comparison of commercial and open-weight large language models (LLMs) for human-rights-violation detection across seven languages, quantifying the cost-reliability trade-off facing resource-constrained organizations. Across 78,000 multilingual inferences, we evaluate six models -- four instruction-aligned (Claude-Sonnet-4, DeepSeek-V3, Gemini-Flash-2.0, GPT-4.1-mini) and two open-weight (LLaMA-3-8B, Mistral-7B) -- using both standard classification metrics and new measures of cross-lingual reliability: Calibration Deviation (CD), Decision Bias (B), Language Robustness Score (LRS), and Language Stability Score (LSS). Results show that alignment, not scale, determines stability: aligned models maintain near-invariant accuracy and balanced calibration across typologically distant and low-resource languages (e.g., Lingala, Burmese), while open-weight models exhibit significant prompt-language sensitivity and calibration drift. These findings demonstrate that multilingual alignment enables language-agnostic reasoning and provide practical guidance for humanitarian organizations balancing budget constraints with reliability in multilingual deployment.
68. 【2510.22798】VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions
链接:https://arxiv.org/abs/2510.22798
作者:Thu Phuong Nguyen,Duc M. Nguyen,Hyotaek Jeon,Hyunwook Lee,Hyunmin Song,Sungahn Ko,Taehwan Kim
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Automatically assessing handwritten, assessing handwritten mathematical, handwritten mathematical solutions, significant challenge due, Evaluating Handwritten Mathematics
备注: EMNLP 2025. Project Website: [this https URL](https://vehme.github.io/)
点击查看摘要
Abstract:Automatically assessing handwritten mathematical solutions is an important problem in educational technology with practical applications, but it remains a significant challenge due to the diverse formats, unstructured layouts, and symbolic complexity of student work. To address this challenge, we introduce VEHME-a Vision-Language Model for Evaluating Handwritten Mathematics Expressions-designed to assess open-form handwritten math responses with high accuracy and interpretable reasoning traces. VEHME integrates a two-phase training pipeline: (i) supervised fine-tuning using structured reasoning data, and (ii) reinforcement learning that aligns model outputs with multi-dimensional grading objectives, including correctness, reasoning depth, and error localization. To enhance spatial understanding, we propose an Expression-Aware Visual Prompting Module, trained on our synthesized multi-line math expressions dataset to robustly guide attention in visually heterogeneous inputs. Evaluated on AIHub and FERMAT datasets, VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems, demonstrating its potential as a scalable and accessible tool for automated math assessment. Our training and experiment code is publicly available at our GitHub repository.
69. 【2510.22780】How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations
链接:https://arxiv.org/abs/2510.22780
作者:Zora Zhiruo Wang,Yijia Shao,Omar Shaikh,Daniel Fried,Graham Neubig,Diyi Yang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:signaling a pressing, continually optimized, pressing trend, trend with significant, significant impacts
备注:
点击查看摘要
Abstract:AI agents are continually optimized for tasks related to human work, such as software engineering and professional writing, signaling a pressing trend with significant impacts on the human workforce. However, these agent developments have often not been grounded in a clear understanding of how humans execute work, to reveal what expertise agents possess and the roles they can play in diverse workflows. In this work, we study how agents do human work by presenting the first direct comparison of human and agent workers across multiple essential work-related skills: data analysis, engineering, computation, writing, and design. To better understand and compare heterogeneous computer-use activities of workers, we introduce a scalable toolkit to induce interpretable, structured workflows from either human or agent computer-use activities. Using such induced workflows, we compare how humans and agents perform the same tasks and find that: (1) While agents exhibit promise in their alignment to human workflows, they take an overwhelmingly programmatic approach across all work domains, even for open-ended, visually dependent tasks like design, creating a contrast with the UI-centric methods typically used by humans. (2) Agents produce work of inferior quality, yet often mask their deficiencies via data fabrication and misuse of advanced tools. (3) Nonetheless, agents deliver results 88.3% faster and cost 90.4-96.2% less than humans, highlighting the potential for enabling efficient collaboration by delegating easily programmable tasks to agents.
70. 【2510.22775】Scalable Supervising Software Agents with Patch Reasoner
链接:https://arxiv.org/abs/2510.22775
作者:Junjielong Xu,Boyin Tan,Xiaoyuan Liu,Chao Peng,Pengfei Gao,Pinjia He
类目:Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:advanced software engineering, existing test-based supervision, software engineering tasks, large language model, language model agents
备注:
点击查看摘要
Abstract:While large language model agents have advanced software engineering tasks, the unscalable nature of existing test-based supervision is limiting the potential improvement of data scaling. The reason is twofold: (1) building and running test sandbox is rather heavy and fragile, and (2) data with high-coverage tests is naturally rare and threatened by test hacking via edge cases. In this paper, we propose R4P, a patch verifier model to provide scalable rewards for training and testing SWE agents via reasoning. We consider that patch verification is fundamentally a reasoning task, mirroring how human repository maintainers review patches without writing and running new reproduction tests. To obtain sufficient reference and reduce the risk of reward hacking, R4P uses a group-wise objective for RL training, enabling it to verify multiple patches against each other's modification and gain a dense reward for stable training. R4P achieves 72.2% Acc. for verifying patches from SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P's practicality, we design and train a lite scaffold, Mini-SE, with pure reinforcement learning where all rewards are derived from R4P. As a result, Mini-SE achieves 26.2% Pass@1 on SWE-bench-verified, showing a 10.0% improvement over the original Qwen3-32B. This can be further improved to 32.8% with R4P for test-time scaling. Furthermore, R4P verifies patches within a second, 50x faster than testing on average. The stable scaling curves of rewards and accuracy along with high efficiency reflect R4P's practicality.
71. 【2510.22768】MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion
链接:https://arxiv.org/abs/2510.22768
作者:Haoyi Qiu,Yilun Zhou,Pranav Narayanan Venkit,Kung-Hsiang Huang,Jiaxin Zhang,Nanyun Peng,Chien-Sheng Wu
类目:Computation and Language (cs.CL)
关键词:Large Vision-Language Models, Large Vision-Language, pervasive persuasive content, increasingly deployed, deployed in domains
备注:
点击查看摘要
Abstract:As Large Vision-Language Models (LVLMs) are increasingly deployed in domains such as shopping, health, and news, they are exposed to pervasive persuasive content. A critical question is how these models function as persuadees-how and why they can be influenced by persuasive multimodal inputs. Understanding both their susceptibility to persuasion and the effectiveness of different persuasive strategies is crucial, as overly persuadable models may adopt misleading beliefs, override user preferences, or generate unethical or unsafe outputs when exposed to manipulative messages. We introduce MMPersuade, a unified framework for systematically studying multimodal persuasion dynamics in LVLMs. MMPersuade contributes (i) a comprehensive multimodal dataset that pairs images and videos with established persuasion principles across commercial, subjective and behavioral, and adversarial contexts, and (ii) an evaluation framework that quantifies both persuasion effectiveness and model susceptibility via third-party agreement scoring and self-estimated token probabilities on conversation histories. Our study of six leading LVLMs as persuadees yields three key insights: (i) multimodal inputs substantially increase persuasion effectiveness-and model susceptibility-compared to text alone, especially in misinformation scenarios; (ii) stated prior preferences decrease susceptibility, yet multimodal information maintains its persuasive advantage; and (iii) different strategies vary in effectiveness across contexts, with reciprocity being most potent in commercial and subjective contexts, and credibility and logic prevailing in adversarial contexts. By jointly analyzing persuasion effectiveness and susceptibility, MMPersuade provides a principled foundation for developing models that are robust, preference-consistent, and ethically aligned when engaging with persuasive multimodal content.
72. 【2510.22767】ELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination
链接:https://arxiv.org/abs/2510.22767
作者:Omar Naim,Krish Sharma,Nicholas Asher
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Task-Aware Layer Elimination, directly optimizing task-specific, optimizing task-specific validation, LLM by directly, task-specific validation performance
备注:
点击查看摘要
Abstract:In this paper we introduce Tale, Task-Aware Layer Elimination, an inference-time algorithm that prunes entire transformer layers in an LLM by directly optimizing task-specific validation performance. We evaluate TALE on 9 tasks and 5 models, including LLaMA 3.1 8B, Qwen 2.5 7B, Qwen 2.5 0.5B, Mistral 7B, and Lucie 7B, under both zero-shot and few-shot settings. Unlike prior approaches, TALE requires no retraining and consistently improves accuracy while reducing computational cost across all benchmarks. Furthermore, applying TALE during finetuning leads to additional performance gains. Finally, TALE provides flexible user control over trade-offs between accuracy and efficiency. Mutual information analysis shows that certain layers act as bottlenecks, degrading task-relevant representations. Tale's selective layer removal remedies this problem, producing smaller, faster, and more accurate models that are also faster to fine-tune while offering new insights into transformer interpretability.
73. 【2510.22763】Iterative Layer Pruning for Efficient Translation Inference
链接:https://arxiv.org/abs/2510.22763
作者:Yasmin Moslem,Muhammad Hazim Al Farouq,John D. Kelleher
类目:Computation and Language (cs.CL); Performance (cs.PF)
关键词:natural language processing, Large language models, Large language, including machine translation, language processing
备注: WMT 2025
点击查看摘要
Abstract:Large language models (LLMs) have transformed many areas of natural language processing, including machine translation. However, efficient deployment of LLMs remains challenging due to their intensive computational requirements. In this paper, we address this challenge and present our submissions to the Model Compression track at the Conference on Machine Translation (WMT 2025). In our experiments, we investigate iterative layer pruning guided by layer importance analysis. We evaluate this method using the Aya-Expanse-8B model for translation from Czech to German, and from English to Egyptian Arabic. Our approach achieves substantial reductions in model size and inference time, while maintaining the translation quality of the baseline models.
74. 【2510.22758】EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models
链接:https://arxiv.org/abs/2510.22758
作者:Li Zhou,Lutong Yu,You Lyu,Yihang Lin,Zefeng Zhao,Junyi Ao,Yuhao Zhang,Benyou Wang,Haizhou Li
类目:Computation and Language (cs.CL)
关键词:made significant progress, spoken language understanding, Speech Language Models, spoken language, language understanding
备注: Speech Language Models, Spoken Language Understanding, Vocal Cue Perception, Empathetic Dialogue, Benchmark Evaluation
点击查看摘要
Abstract:Speech Language Models (SLMs) have made significant progress in spoken language understanding. Yet it remains unclear whether they can fully perceive non lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors. Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation. We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue through sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation. All tasks share identical and semantically neutral scripts that are free of explicit emotional or contextual cues, and controlled variations in vocal style are used to test the effect of delivery independent of the transcript. EchoMind is grounded in an empathy-oriented framework spanning 3 coarse and 12 fine-grained dimensions, encompassing 39 vocal attributes, and evaluated using both objective and subjective metrics. Testing 12 advanced SLMs reveals that even state-of-the-art models struggle with high-expressive vocal cues, limiting empathetic response quality. Analyses of prompt strength, speech source, and ideal vocal cue recognition reveal persistent weaknesses in instruction-following, resilience to natural speech variability, and effective use of vocal cues for empathy. These results underscore the need for SLMs that integrate linguistic content with diverse vocal cues to achieve truly empathetic conversational ability.
75. 【2510.22752】Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models
链接:https://arxiv.org/abs/2510.22752
作者:Anooshka Bajaj,Deven Mahesh Mistry,Sahaj Singh Maini,Yash Aggarwal,Zoran Tiganj
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, retrieve contextual information, shaping how Large, Language Models
备注:
点击查看摘要
Abstract:In-context learning is governed by both temporal and semantic relationships, shaping how Large Language Models (LLMs) retrieve contextual information. Analogous to human episodic memory, where the retrieval of specific events is enabled by separating events that happened at different times, this work probes the ability of various pretrained LLMs, including transformer and state-space models, to differentiate and retrieve temporally separated events. Specifically, we prompted models with sequences containing multiple presentations of the same token, which reappears at the sequence end. By fixing the positions of these repeated tokens and permuting all others, we removed semantic confounds and isolated temporal effects on next-token prediction. Across diverse sequences, models consistently placed the highest probabilities on tokens following a repeated token, but with a notable bias for those nearest the beginning or end of the input. An ablation experiment linked this phenomenon in transformers to induction heads. Extending the analysis to unique semantic contexts with partial overlap further demonstrated that memories embedded in the middle of a prompt are retrieved less reliably. Despite architectural differences, state-space and transformer models showed comparable temporal biases. Our findings deepen the understanding of temporal biases in in-context learning and offer an illustration of how these biases can enable temporal separation and episodic retrieval.
76. 【2510.22751】Multi-Modal Fact-Verification Framework for Reducing Hallucinations in Large Language Models
链接:https://arxiv.org/abs/2510.22751
作者:Piyushkumar Patel
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, confidently generate false, generate false information, Language Models
备注:
点击查看摘要
Abstract:While Large Language Models have transformed how we interact with AI systems, they suffer from a critical flaw: they confidently generate false information that sounds entirely plausible. This hallucination problem has become a major barrier to deploying these models in real-world applications where accuracy matters. We developed a fact verification framework that catches and corrects these errors in real-time by cross checking LLM outputs against multiple knowledge sources. Our system combines structured databases, live web searches, and academic literature to verify factual claims as they're generated. When we detect inconsistencies, we automatically correct them while preserving the natural flow of the response. Testing across various domains showed we could reduce hallucinations by 67% without sacrificing response quality. Domain experts in healthcare, finance, and scientific research rated our corrected outputs 89% satisfactory a significant improvement over unverified LLM responses. This work offers a practical solution for making LLMs more trustworthy in applications where getting facts wrong isn't an option.
77. 【2510.22747】Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study
链接:https://arxiv.org/abs/2510.22747
作者:Eeham Khan,Firas Saidani,Owen Van Esbroeck,Richard Khoury,Leila Kosseim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:strongest capabilities remain, capabilities remain largely, remain largely confined, abundant training data, widespread adoption
备注: Submitted to LREC 2026
点击查看摘要
Abstract:Despite the widespread adoption of large language models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which there is abundant training data. Recently, continual pre-training (CPT) has emerged as a means to fine-tune these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect using a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate an improvement on the minority dialect benchmarks with minimal regression on the prestige language benchmarks with under 1% of model parameters updated. Analysis of the results demonstrate that gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dialect gap by providing cost-effective and sustainable language resource creation, expanding high-quality LLM access to minority linguistic communities. We release the first Québec French LLMs on HuggingFace.
78. 【2510.22739】REVISION:Reflective Intent Mining and Online Reasoning Auxiliary for E-commerce Visual Search System Optimization
链接:https://arxiv.org/abs/2510.22739
作者:Yiwen Tang,Qiuyu Zhao,Zenghui Sun,Jinsong Lan,Xiaoyong Zhu,Bo Zheng,Kaifu Zhang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Taobao e-commerce visual, Taobao e-commerce, behavior analysis reveals, e-commerce visual search, implicit intent
备注:
点击查看摘要
Abstract:In Taobao e-commerce visual search, user behavior analysis reveals a large proportion of no-click requests, suggesting diverse and implicit user intents. These intents are expressed in various forms and are difficult to mine and discover, thereby leading to the limited adaptability and lag in platform strategies. This greatly restricts users' ability to express diverse intents and hinders the scalability of the visual search system. This mismatch between user implicit intent expression and system response defines the User-SearchSys Intent Discrepancy. To alleviate the issue, we propose a novel framework REVISION. This framework integrates offline reasoning mining with online decision-making and execution, enabling adaptive strategies to solve implicit user demands. In the offline stage, we construct a periodic pipeline to mine discrepancies from historical no-click requests. Leveraging large models, we analyze implicit intent factors and infer optimal suggestions by jointly reasoning over query and product metadata. These inferred suggestions serve as actionable insights for refining platform strategies. In the online stage, REVISION-R1-3B, trained on the curated offline data, performs holistic analysis over query images and associated historical products to generate optimization plans and adaptively schedule strategies across the search pipeline. Our framework offers a streamlined paradigm for integrating large models with traditional search systems, enabling end-to-end intelligent optimization across information aggregation and user interaction. Experimental results demonstrate that our approach improves the efficiency of implicit intent mining from large-scale search logs and significantly reduces the no-click rate.
79. 【2510.22733】$\text{E}^2\text{Rank}$: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
链接:https://arxiv.org/abs/2510.22733
作者:Qi Liu,Yanzhao Zhang,Mingxin Li,Dingkun Long,Pengjun Xie,Jiaxin Mao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:real-world search applications, search applications, fundamental component, component in real-world, real-world search
备注: Code and models are avaliable at [this https URL](https://alibaba-nlp.github.io/E2Rank)
点击查看摘要
Abstract:Text embedding models serve as a fundamental component in real-world search applications. By mapping queries and documents into a shared embedding space, they deliver competitive retrieval performance with high efficiency. However, their ranking fidelity remains limited compared to dedicated rerankers, especially recent LLM-based listwise rerankers, which capture fine-grained query-document and document-document interactions. In this paper, we propose a simple yet effective unified framework $\text{E}^2\text{Rank}$, means Efficient Embedding-based Ranking (also means Embedding-to-Rank), which extends a single text embedding model to perform both high-quality retrieval and listwise reranking through continued training under a listwise ranking objective, thereby achieving strong effectiveness with remarkable efficiency. By applying cosine similarity between the query and document embeddings as a unified ranking function, the listwise ranking prompt, which is constructed from the original query and its candidate documents, serves as an enhanced query enriched with signals from the top-K documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval models. This design preserves the efficiency and representational quality of the base embedding model while significantly improving its reranking performance. Empirically, $\textrm{E}^2\text{Rank}$ achieves state-of-the-art results on the BEIR reranking benchmark and demonstrates competitive performance on the reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also show that the ranking training process improves embedding performance on the MTEB benchmark. Our findings indicate that a single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
80. 【2510.22732】ATLAS: Actor-Critic Task-Completion with Look-ahead Action Simulation
链接:https://arxiv.org/abs/2510.22732
作者:Jiali Cheng,Anjishnu Kumar,Roshan Lal,Rishi Rajasekaran,Hani Ramezani,Omar Zia Khan,Oleg Rokhlenko,Sunny Chiu-Webster,Gang Hua,Hadi Amiri
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multiagent Systems (cs.MA); Robotics (cs.RO)
关键词:produce inefficient execution, inefficient execution plans, execution plans due, neural network fine-tuning, Look-ahead Action Simulation
备注: 9 pages, NeurIPS 2025 Workshop on Language Agents and World Models
点击查看摘要
Abstract:We observe that current state-of-the-art web-agents are unable to effectively adapt to new environments without neural network fine-tuning, without which they produce inefficient execution plans due to a lack of awareness of the structure and dynamics of the new environment. To address this limitation, we introduce ATLAS (Actor-Critic Task-completion with Look-ahead Action Simulation), a memory-augmented agent that is able to make plans grounded in a model of the environment by simulating the consequences of those actions in cognitive space. Our agent starts by building a "cognitive map" by performing a lightweight curiosity driven exploration of the environment. The planner proposes candidate actions; the simulator predicts their consequences in cognitive space; a critic analyzes the options to select the best roll-out and update the original plan; and a browser executor performs the chosen action. On the WebArena-Lite Benchmark, we achieve a 63% success rate compared to 53.9% success rate for the previously published state-of-the-art. Unlike previous systems, our modular architecture requires no website-specific LLM fine-tuning. Ablations show sizable drops without the world-model, hierarchical planner, and look-ahead-based replanner confirming their complementary roles within the design of our system
81. 【2510.22729】Critical Insights into Leading Conversational AI Models
链接:https://arxiv.org/abs/2510.22729
作者:Urja Kohli(1),Aditi Singh(2),Arun Sharma(3) ((1) Department of Mechanical and Automation Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India, (2) Department of Electronics and Communication Engineering, Indira Gandhi Delhi Technical University for Women, Delhi, India, (3) Department of Information Technology, Indira Gandhi Delhi Technical University for Women, Delhi, India)
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Big Language Models, Big Language, Language Models, businesses use software, OpenAI GPT models
备注: 21 pages, 7 tables, 3 figures. Open-access preprint intended for journal or conference submission
点击查看摘要
Abstract:Big Language Models (LLMs) are changing the way businesses use software, the way people live their lives and the way industries work. Companies like Google, High-Flyer, Anthropic, OpenAI and Meta are making better LLMs. So, it's crucial to look at how each model is different in terms of performance, moral behaviour and usability, as these differences are based on the different ideas that built them. This study compares five top LLMs: Google's Gemini, High-Flyer's DeepSeek, Anthropic's Claude, OpenAI's GPT models and Meta's LLaMA. It performs this by analysing three important factors: Performance and Accuracy, Ethics and Bias Mitigation and Usability and Integration. It was found that Claude has good moral reasoning, Gemini is better at multimodal capabilities and has strong ethical frameworks. DeepSeek is great at reasoning based on facts, LLaMA is good for open applications and ChatGPT delivers balanced performance with a focus on usage. It was concluded that these models are different in terms of how well they work, how easy they are to use and how they treat people ethically, making it a point that each model should be utilised by the user in a way that makes the most of its strengths.
82. 【2510.22694】Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2510.22694
作者:Shu Zhao,Tianyi Shen,Nilesh Ahuja,Omesh Tickoo,Vijaykrishnan Narayanan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Large Language Models, Multimodal Large Language, Multimodal Retrieval-Augmented Generation, Language Models, Multimodal Large
备注: Accepted at NeurIPS 2025 UniReps Workshop
点击查看摘要
Abstract:Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modality to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs' ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves the generation quality by 17.07% while reducing 8.95% retrieval times.
83. 【2510.22691】SALSA: Single-pass Autoregressive LLM Structured Classification
链接:https://arxiv.org/abs/2510.22691
作者:Ruslan Berdichevsky,Shai Nahum-Gefen,Elad Ben Zaken
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:instruction-tuned Large Language, Large Language Models, impressive generalization capabilities, Large Language, instruction-tuned Large
备注:
点击查看摘要
Abstract:Despite their impressive generalization capabilities, instruction-tuned Large Language Models often underperform on text classification benchmarks. We introduce SALSA, a coherent pipeline that combines structured prompting, class-to-token mapping, and parameter-efficient fine-tuning, thereby avoiding cold-start training. Each class label is mapped to a distinct output token, and prompts are constructed to elicit a single-token response. During inference, the model's output is projected only onto the logits of the relevant class tokens, enabling efficient and accurate classification in a single forward pass. SALSA achieves state-of-the-art results across diverse benchmarks, demonstrating its robustness and scalability for LLM-based classification applications.
84. 【2510.22689】Rule-Based Explanations for Retrieval-Augmented LLM Systems
链接:https://arxiv.org/abs/2510.22689
作者:Joel Rorseth,Parke Godfrey,Lukasz Golab,Divesh Srivastava,Jarek Szlichta
类目:Computation and Language (cs.CL)
关键词:machine learning models, explain machine learning, If-then rules, loan application, machine learning
备注:
点击查看摘要
Abstract:If-then rules are widely used to explain machine learning models; e.g., "if employed = no, then loan application = rejected." We present the first proposal to apply rules to explain the emerging class of large language models (LLMs) with retrieval-augmented generation (RAG). Since RAG enables LLM systems to incorporate retrieved information sources at inference time, rules linking the presence or absence of sources can explain output provenance; e.g., "if a Times Higher Education ranking article is retrieved, then the LLM ranks Oxford first." To generate such rules, a brute force approach would probe the LLM with all source combinations and check if the presence or absence of any sources leads to the same output. We propose optimizations to speed up rule generation, inspired by Apriori-like pruning from frequent itemset mining but redefined within the scope of our novel problem. We conclude with qualitative and quantitative experiments demonstrating our solutions' value and efficiency.
85. 【2510.22684】RoboSVG: A Unified Framework for Interactive SVG Generation with Multi-modal Guidance
链接:https://arxiv.org/abs/2510.22684
作者:Jiuniu Wang,Gongjie Zhang,Quanhao Qian,Junlong Gao,Deli Zhao,Ran Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Scalable Vector Graphics, Scalable Vector, Vector Graphics, robot control, fundamental to digital
备注: 15 pages, 5 figures
点击查看摘要
Abstract:Scalable Vector Graphics (SVGs) are fundamental to digital design and robot control, encoding not only visual structure but also motion paths in interactive drawings. In this work, we introduce RoboSVG, a unified multimodal framework for generating interactive SVGs guided by textual, visual, and numerical signals. Given an input query, the RoboSVG model first produces multimodal guidance, then synthesizes candidate SVGs through dedicated generation modules, and finally refines them under numerical guidance to yield high-quality outputs. To support this framework, we construct RoboDraw, a large-scale dataset of one million examples, each pairing an SVG generation condition (e.g., text, image, and partial SVG) with its corresponding ground-truth SVG code. RoboDraw dataset enables systematic study of four tasks, including basic generation (Text-to-SVG, Image-to-SVG) and interactive generation (PartialSVG-to-SVG, PartialImage-to-SVG). Extensive experiments demonstrate that RoboSVG achieves superior query compliance and visual fidelity across tasks, establishing a new state of the art in versatile SVG generation. The dataset and source code of this project will be publicly available soon.
86. 【2510.22679】Do Stop Me Now: Detecting Boilerplate Responses with a Single Iteration
链接:https://arxiv.org/abs/2510.22679
作者:Yuval Kainan,Shaked Zychlinski
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, resources generating boilerplate, computational resources generating, adds unnecessary cost, Large Language
备注: 13 pages, 4 figures
点击查看摘要
Abstract:Large Language Models (LLMs) often expend significant computational resources generating boilerplate responses, such as refusals, simple acknowledgements and casual greetings, which adds unnecessary cost and latency. To address this inefficiency, we propose a simple yet highly effective method for detecting such responses after only a single generation step. We demonstrate that the log-probability distribution of the first generated token serves as a powerful signal for classifying the nature of the entire subsequent response. Our experiments, conducted across a diverse range of small, large, and reasoning-specialized models, show that the first-token log-probability vectors form distinctly separable clusters for different response types. Using a lightweight k-NN classifier, we achieve high accuracy in predicting whether a response will be a substantive answer or a form of boilerplate response, including user-specified refusals. The primary implication is a practical, computationally trivial technique, optimizing LLM inference by enabling early termination or redirection to a smaller model, thereby yielding significant savings in computational cost. This work presents a direct path toward more efficient and sustainable LLM deployment.
87. 【2510.22672】Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
链接:https://arxiv.org/abs/2510.22672
作者:Anna Deichler,Jonas Beskow
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
关键词:studying referential communication, Meta Project Aria, exocentric perspectives, Project Aria smart, communication across egocentric
备注: 10 pages, 6 figures, 2 tables. Accepted to the NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI (SpaVLE)
点击查看摘要
Abstract:We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
88. 【2510.22656】Conjugate Relation Modeling for Few-Shot Knowledge Graph Completion
链接:https://arxiv.org/abs/2510.22656
作者:Zilong Wang,Qingtian Zeng,Hua Duan,Cheng Cheng,Minghao Zou,Ziyang Wang
类目:Computation and Language (cs.CL)
关键词:Few-shot Knowledge Graph, Knowledge Graph Completion, Few-shot Knowledge, Graph Completion, Knowledge Graph
备注:
点击查看摘要
Abstract:Few-shot Knowledge Graph Completion (FKGC) infers missing triples from limited support samples, tackling long-tail distribution challenges. Existing methods, however, struggle to capture complex relational patterns and mitigate data sparsity. To address these challenges, we propose a novel FKGC framework for conjugate relation modeling (CR-FKGC). Specifically, it employs a neighborhood aggregation encoder to integrate higher-order neighbor information, a conjugate relation learner combining an implicit conditional diffusion relation module with a stable relation module to capture stable semantics and uncertainty offsets, and a manifold conjugate decoder for efficient evaluation and inference of missing triples in manifold space. Experiments on three benchmarks demonstrate that our method achieves superior performance over state-of-the-art methods.
89. 【2510.22631】Culturally Grounded Physical Commonsense Reasoning in Italian and English: A Submission to the MRL 2025 Shared Task
链接:https://arxiv.org/abs/2510.22631
作者:Marco De Santis,Lisa Alazraki
类目:Computation and Language (cs.CL)
关键词:Physical Reasoning Datasets, Multilingual Physical Reasoning, physical commonsense reasoning, Shared Task, Reasoning Datasets
备注: MRL 2025 Shared Task on Multilingual Physical Reasoning Datasets
点击查看摘要
Abstract:This paper presents our submission to the MRL 2025 Shared Task on Multilingual Physical Reasoning Datasets. The objective of the shared task is to create manually-annotated evaluation data in the physical commonsense reasoning domain, for languages other than English, following a format similar to PIQA. Our contribution, FormaMentis, is a novel benchmark for physical commonsense reasoning that is grounded in Italian language and culture. The data samples in FormaMentis are created by expert annotators who are native Italian speakers and are familiar with local customs and norms. The samples are additionally translated into English, while preserving the cultural elements unique to the Italian context.
90. 【2510.22629】Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal
链接:https://arxiv.org/abs/2510.22629
作者:Ambalika Guha,Sajal Saha,Debanjan Ballav,Soumi Mitra,Hritwick Chakraborty
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:distinct perspective, language, Preserving linguistic diversity, endangered Toto language, West Bengal
备注:
点击查看摘要
Abstract:Preserving linguistic diversity is necessary as every language offers a distinct perspective on the world. There have been numerous global initiatives to preserve endangered languages through documentation. This paper is a part of a project which aims to develop a trilingual (Toto-Bangla-English) language learning application to digitally archive and promote the endangered Toto language of West Bengal, India. This application, designed for both native Toto speakers and non-native learners, aims to revitalize the language by ensuring accessibility and usability through Unicode script integration and a structured language corpus. The research includes detailed linguistic documentation collected via fieldwork, followed by the creation of a morpheme-tagged, trilingual corpus used to train a Small Language Model (SLM) and a Transformer-based translation engine. The analysis covers inflectional morphology such as person-number-gender agreement, tense-aspect-mood distinctions, and case marking, alongside derivational strategies that reflect word-class changes. Script standardization and digital literacy tools were also developed to enhance script usage. The study offers a sustainable model for preserving endangered languages by incorporating traditional linguistic methodology with AI. This bridge between linguistic research with technological innovation highlights the value of interdisciplinary collaboration for community-based language revitalization.
91. 【2510.22616】PerCoR: Evaluating Commonsense Reasoning in Persian via Multiple-Choice Sentence Completion
链接:https://arxiv.org/abs/2510.22616
作者:Morteza Alikhani,Mohammadtaha Bagherifard,Erfan Zinvandi,Mehran Sarmadi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Persian Commonsense Reasoning, Commonsense Reasoning, large-scale Persian benchmark, Persian Commonsense, Embedding Similarity Scoring
备注: 20 pages, 17 figures, Accepted to IJCNLP-AACL 2025 (Main Conference)
点击查看摘要
Abstract:We introduced PerCoR (Persian Commonsense Reasoning), the first large-scale Persian benchmark for commonsense reasoning. PerCoR contains 106K multiple-choice sentence-completion problems drawn from more than forty news, cultural, and other web sources. We introduce a novel conjunction-based segmentation strategy to generate coherent sentence-completion pairs, enabling broad topical and structural diversity. To create challenging distractors, we propose DRESS-AF (Distractor Ranking via Embedding Similarity Scoring and Adversarial Filtering), a generation-free adversarial filtering method that selects distractors from the pool of gold continuations while maximising model confusion. Human annotators score 89% on PerCoR, while OpenAI-o3 achieves the highest performance at 92.18%, followed closely by Claude-Sonnet-3.7 (91.17%). The strongest open-source model, DeepSeek-R1, reaches 82.51%, underscoring both the dataset's difficulty and the remaining performance gap in Persian commonsense reasoning. We further show that DRESS-AF transfers to the English HellaSwag benchmark, increasing its difficulty without hurting human solvability. The dataset is available at this https URL.
92. 【2510.22602】Personal Care Utility (PCU): Building the Health Infrastructure for Everyday Insight and Guidance
链接:https://arxiv.org/abs/2510.22602
作者:Mahyar Abbasian,Ramesh Jain
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:Building on decades, Personal Care Utility, lifelong health guidance, biomedical innovation, decades of success
备注: 22 pages, 2 figures, 1 table, Journal paper
点击查看摘要
Abstract:Building on decades of success in digital infrastructure and biomedical innovation, we propose the Personal Care Utility (PCU) - a cybernetic system for lifelong health guidance. PCU is conceived as a global, AI-powered utility that continuously orchestrates multimodal data, knowledge, and services to assist individuals and populations alike. Drawing on multimodal agents, event-centric modeling, and contextual inference, it offers three essential capabilities: (1) trusted health information tailored to the individual, (2) proactive health navigation and behavior guidance, and (3) ongoing interpretation of recovery and treatment response after medical events. Unlike conventional episodic care, PCU functions as an ambient, adaptive companion - observing, interpreting, and guiding health in real time across daily life. By integrating personal sensing, experiential computing, and population-level analytics, PCU promises not only improved outcomes for individuals but also a new substrate for public health and scientific discovery. We describe the architecture, design principles, and implementation challenges of this emerging paradigm.
93. 【2510.22593】AutoBench: Automating LLM Evaluation through Reciprocal Peer Assessment
链接:https://arxiv.org/abs/2510.22593
作者:Dario Loi,Elena Maria Muià,Federico Siciliano,Giovanni Trappolini,Vincenzo Crisà,Peter Kruger,Fabrizio Silvestri
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:evaluating Large Language, Large Language Models, evaluating Large, Large Language, fully automated
备注:
点击查看摘要
Abstract:We present AutoBench, a fully automated and self-sustaining framework for evaluating Large Language Models (LLMs) through reciprocal peer assessment. This paper provides a rigorous scientific validation of the AutoBench methodology, originally developed as an open-source project by eZecute S.R.L.. Unlike static benchmarks that suffer from test-set contamination and limited adaptability, AutoBench dynamically generates novel evaluation tasks while models alternately serve as question generators, contestants, and judges across diverse domains. An iterative weighting mechanism amplifies the influence of consistently reliable evaluators, aggregating peer judgments into consensus-based rankings that reflect collective model agreement. Our experiments demonstrate strong correlations with established benchmarks including MMLU-Pro and GPQA (respectively 78\% and 63\%), validating this peer-driven evaluation paradigm. The multi-judge design significantly outperforms single-judge baselines, confirming that distributed evaluation produces more robust and human-consistent assessments. AutoBench offers a scalable, contamination-resistant alternative to static benchmarks for the continuous evaluation of evolving language models.
94. 【2510.22590】ATOM: AdapTive and OptiMized dynamic temporal knowledge graph construction using LLMs
链接:https://arxiv.org/abs/2510.22590
作者:Yassir Lairgi,Ludovic Moncla,Khalid Benabdeslem,Rémy Cazabet,Pierre Cléau
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:today rapidly expanding, expanding data landscape, rapidly expanding data, dynamic memory frameworks, Temporal Knowledge Graphs
备注:
点击查看摘要
Abstract:In today's rapidly expanding data landscape, knowledge extraction from unstructured text is vital for real-time analytics, temporal inference, and dynamic memory frameworks. However, traditional static knowledge graph (KG) construction often overlooks the dynamic and time-sensitive nature of real-world data, limiting adaptability to continuous changes. Moreover, recent zero- or few-shot approaches that avoid domain-specific fine-tuning or reliance on prebuilt ontologies often suffer from instability across multiple runs, as well as incomplete coverage of key facts. To address these challenges, we introduce ATOM (AdapTive and OptiMized), a few-shot and scalable approach that builds and continuously updates Temporal Knowledge Graphs (TKGs) from unstructured texts. ATOM splits input documents into minimal, self-contained "atomic" facts, improving extraction exhaustivity and stability. Then, it constructs atomic TKGs from these facts while employing a dual-time modeling that distinguishes when information is observed from when it is valid. The resulting atomic TKGs are subsequently merged in parallel. Empirical evaluations demonstrate that ATOM achieves ~18% higher exhaustivity, ~17% better stability, and over 90% latency reduction compared to baseline methods, demonstrating a strong scalability potential for dynamic TKG construction.
95. 【2510.22581】Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems
链接:https://arxiv.org/abs/2510.22581
作者:Kaushal Kumar Maurya,Ekaterina Kochmar
类目:Computation and Language (cs.CL)
关键词:developing Intelligent Tutoring, Intelligent Tutoring Systems, Intelligence in Education, Artificial Intelligence, domain of Artificial
备注: AIED 2025 (BlueSky). International Conference on Artificial Intelligence in Education. Cham: Springer Nature Switzerland, 2025
点击查看摘要
Abstract:The interdisciplinary research domain of Artificial Intelligence in Education (AIED) has a long history of developing Intelligent Tutoring Systems (ITSs) by integrating insights from technological advancements, educational theories, and cognitive psychology. The remarkable success of generative AI (GenAI) models has accelerated the development of large language model (LLM)-powered ITSs, which have potential to imitate human-like, pedagogically rich, and cognitively demanding tutoring. However, the progress and impact of these systems remain largely untraceable due to the absence of reliable, universally accepted, and pedagogy-driven evaluation frameworks and benchmarks. Most existing educational dialogue-based ITS evaluations rely on subjective protocols and non-standardized benchmarks, leading to inconsistencies and limited generalizability. In this work, we take a step back from mainstream ITS development and provide comprehensive state-of-the-art evaluation practices, highlighting associated challenges through real-world case studies from careful and caring AIED research. Finally, building on insights from previous interdisciplinary AIED research, we propose three practical, feasible, and theoretically grounded research directions, rooted in learning science principles and aimed at establishing fair, unified, and scalable evaluation methodologies for ITSs.
96. 【2510.22559】A Closed-Loop Personalized Learning Agent Integrating Neural Cognitive Diagnosis, Bounded-Ability Adaptive Testing, and LLM-Driven Feedback
链接:https://arxiv.org/abs/2510.22559
作者:Zhifeng Wang,Xinyue Zheng,Chunyan Zeng
类目:Computation and Language (cs.CL)
关键词:information technology advances, technology advances, information technology, Neural Cognitive Diagnosis, Cognitive Diagnosis model
备注: 8 pages, 6 figures
点击查看摘要
Abstract:As information technology advances, education is moving from one-size-fits-all instruction toward personalized learning. However, most methods handle modeling, item selection, and feedback in isolation rather than as a closed loop. This leads to coarse or opaque student models, assumption-bound adaptivity that ignores diagnostic posteriors, and generic, non-actionable feedback. To address these limitations, this paper presents an end-to-end personalized learning agent, EduLoop-Agent, which integrates a Neural Cognitive Diagnosis model (NCD), a Bounded-Ability Estimation Computerized Adaptive Testing strategy (BECAT), and large language models (LLMs). The NCD module provides fine-grained estimates of students' mastery at the knowledge-point level; BECAT dynamically selects subsequent items to maximize relevance and learning efficiency; and LLMs convert diagnostic signals into structured, actionable feedback. Together, these components form a closed-loop framework of ``Diagnosis--Recommendation--Feedback.'' Experiments on the ASSISTments dataset show that the NCD module achieves strong performance on response prediction while yielding interpretable mastery assessments. The adaptive recommendation strategy improves item relevance and personalization, and the LLM-based feedback offers targeted study guidance aligned with identified weaknesses. Overall, the results indicate that the proposed design is effective and practically deployable, providing a feasible pathway to generating individualized learning trajectories in intelligent education.
97. 【2510.22556】SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
链接:https://arxiv.org/abs/2510.22556
作者:Jinhan Chen,Jianchun Liu,Hongli Xu,Xianjun Gao,Shilong Wang
类目:Computation and Language (cs.CL)
关键词:Large Language Model, long-context Large Language, Language Model, Large Language, severe scalability bottleneck
备注:
点击查看摘要
Abstract:The growing memory footprint of the Key-Value (KV) cache poses a severe scalability bottleneck for long-context Large Language Model (LLM) inference. While KV cache eviction has emerged as an effective solution by discarding less critical tokens, existing token-, block-, and sentence-level compression methods struggle to balance semantic coherence and memory efficiency. To this end, we introduce SABlock, a \underline{s}emantic-aware KV cache eviction framework with \underline{a}daptive \underline{block} sizes. Specifically, SABlock first performs semantic segmentation to align compression boundaries with linguistic structures, then applies segment-guided token scoring to refine token importance estimation. Finally, for each segment, a budget-driven search strategy adaptively determines the optimal block size that preserves semantic integrity while improving compression efficiency under a given cache budget. Extensive experiments on long-context benchmarks demonstrate that SABlock consistently outperforms state-of-the-art baselines under the same memory budgets. For instance, on Needle-in-a-Haystack (NIAH), SABlock achieves 99.9% retrieval accuracy with only 96 KV entries, nearly matching the performance of the full-cache baseline that retains up to 8K entries. Under a fixed cache budget of 1,024, SABlock further reduces peak memory usage by 46.28% and achieves up to 9.5x faster decoding on a 128K context length.
98. 【2510.22548】LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?
链接:https://arxiv.org/abs/2510.22548
作者:Ziyuan He,Yuxuan Wang,Jiaqi Li,Kexin Liang,Muhan Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, remain fundamentally limited, Large language, increasingly extended context, dependency tasks remain
备注: NeurIPS 2025 Datasets and Benchmarks Track
点击查看摘要
Abstract:Large language models (LLMs) are equipped with increasingly extended context windows recently, yet their long context understanding capabilities over long dependency tasks remain fundamentally limited and underexplored. This gap is especially significant in many real-world long-context applications that were rarely benchmarked. In this paper, we introduce LooGLE v2, a novel benchmark designed to evaluate LLMs' long context ability in real-world applications and scenarios. Our benchmark consists of automatically collected real-world long texts, ranging from 16k to 2M tokens, encompassing domains in law, finance, game and code. Accordingly, we delicately design 10 types of domain-specific long-dependency tasks and generate 1,934 QA instances with various diversity and complexity in a scalable data curation pipeline for further practical needs. We conduct a comprehensive assessment of 6 locally deployed and 4 API-based LLMs. The evaluation results show that even the best-performing model achieves only a 59.2% overall score on our benchmark. Despite the extensive context windows, popular LLMs are only capable of understanding a much shorter length of context than they claim to be, revealing significant limitations in their ability to handle real-world tasks with long dependencies and highlighting substantial room for model improvement in practical long-context understanding.
99. 【2510.22535】OFFSIDE: Benchmarking Unlearning Misinformation in Multimodal Large Language Models
链接:https://arxiv.org/abs/2510.22535
作者:Hao Zheng,Zirui Pang,Ling li,Zhijie Deng,Yuhan Pu,Zhaowei Zhu,Xiaobo Xia,Jiaheng Wei
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Multimodal Large Language, Language Models, Large Language, making Machine Unlearning
备注:
点击查看摘要
Abstract:Advances in Multimodal Large Language Models (MLLMs) intensify concerns about data privacy, making Machine Unlearning (MU), the selective removal of learned information, a critical necessity. However, existing MU benchmarks for MLLMs are limited by a lack of image diversity, potential inaccuracies, and insufficient evaluation scenarios, which fail to capture the complexity of real-world applications. To facilitate the development of MLLMs unlearning and alleviate the aforementioned limitations, we introduce OFFSIDE, a novel benchmark for evaluating misinformation unlearning in MLLMs based on football transfer rumors. This manually curated dataset contains 15.68K records for 80 players, providing a comprehensive framework with four test sets to assess forgetting efficacy, generalization, utility, and robustness. OFFSIDE supports advanced settings like selective unlearning and corrective relearning, and crucially, unimodal unlearning (forgetting only text data). Our extensive evaluation of multiple baselines reveals key findings: (1) Unimodal methods (erasing text-based knowledge) fail on multimodal rumors; (2) Unlearning efficacy is largely driven by catastrophic forgetting; (3) All methods struggle with "visual rumors" (rumors appear in the image); (4) The unlearned rumors can be easily recovered and (5) All methods are vulnerable to prompt attacks. These results expose significant vulnerabilities in current approaches, highlighting the need for more robust multimodal unlearning solutions. The code is available at \href{this https URL}{this https URL}.
100. 【2510.22531】xt to Trust: Evaluating Fine-Tuning and LoRA Trade-offs in Language Models for Unfair Terms of Service Detection
链接:https://arxiv.org/abs/2510.22531
作者:Noshitha Padma Pratyusha Juttu,Sahithi Singireddy,Sravani Gona,Sujal Timilsina
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, domains remains constrained, specialized legal domains, legal domains remains
备注: 6 pages, including figures and tables. All experiments are reproducible. Code and fine-tuned models are publicly available on: GitHub: ( [this https URL](https://github.com/Stimils02/UnfairTOSAgreementsDetection) ) and Hugging Face: ( [this https URL](https://huggingface.co/Noshitha98) )
点击查看摘要
Abstract:Large Language Models (LLMs) have transformed text understanding, yet their adaptation to specialized legal domains remains constrained by the cost of full fine-tuning. This study provides a systematic evaluation of fine tuning, parameter efficient adaptation (LoRA, QLoRA), and zero-shot prompting strategies for unfair clause detection in Terms of Service (ToS) documents, a key application in legal NLP. We finetune BERT and DistilBERT, apply 4-bit Low-Rank Adaptation (LoRA) to models such as TinyLlama, LLaMA 3B/7B, and SaulLM, and evaluate GPT-4o and O-versions in zero-shot settings. Experiments on the CLAUDETTE-ToS benchmark and the Multilingual Scraper Corpus show that full fine-tuning achieves the strongest precision recall balance, while LoRA-based models provide competitive recall with up to 3x lower memory cost. These findings highlight practical design trade-offs for efficient and domain-adapted LLMs, contributing open baselines for fine-tuning research in legal text processing.
101. 【2510.22500】Scalable Oversight via Partitioned Human Supervision
链接:https://arxiv.org/abs/2510.22500
作者:Ren Yin,Takashi Ishida,Masashi Sugiyama
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:obtaining high-quality human, artificial intelligence, obtaining high-quality, increasingly challenging, expert human performance
备注:
点击查看摘要
Abstract:As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging. Our focus is on tasks that require deep knowledge and skills of multiple domains. Unfortunately, even the best human experts are knowledgeable only in a single narrow area, and will not be able to evaluate the correctness of advanced AI systems on such superhuman tasks. However, based on their narrow expertise, humans may provide a weak signal, i.e., a complementary label indicating an option that is incorrect. For example, a cardiologist could state that "this is not related to cardiology,'' even if they cannot identify the true disease. Based on this weak signal, we propose a scalable oversight framework that enables us to evaluate frontier AI systems without the need to prepare the ground truth. We derive an unbiased estimator of top-1 accuracy from complementary labels and quantify how many complementary labels are needed to match the variance of ordinary labels. We further introduce two estimators to combine scarce ordinary labels with abundant complementary labels. We provide finite-sample deviation guarantees for both complementary-only and the mixed estimators. Empirically, we show that we can evaluate the output of large language models without the ground truth, if we have complementary labels. We further show that we can train an AI system with such weak signals: we show how we can design an agentic AI system automatically that can perform better with this partitioned human supervision. Our code is available at this https URL.
102. 【2510.22495】A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus
链接:https://arxiv.org/abs/2510.22495
作者:Michael Scott,Siyu Liang,Alicia Wassink,Gina-Anne Levow
类目:Computation and Language (cs.CL)
关键词:Pacific Northwest English, Northwest English, Pacific Northwest, major commercial automatic, African American
备注:
点击查看摘要
Abstract:This paper presents a systematic evaluation of racial bias in four major commercial automatic speech recognition (ASR) systems using the Pacific Northwest English (PNWE) corpus. We analyze transcription accuracy across speakers from four ethnic backgrounds (African American, Caucasian American, ChicanX, and Yakama) and examine how sociophonetic variation contributes to differential system performance. We introduce a heuristically-determined Phonetic Error Rate (PER) metric that links recognition errors to specific linguistically motivated variables derived from sociophonetic annotation. Our analysis of eleven sociophonetic features reveals that vowel quality variation, particularly resistance to the low-back merger and pre-nasal merger patterns, is systematically associated with differential error rates across ethnic groups, with the most pronounced effects for African American speakers across all evaluated systems. These findings demonstrate that acoustic modeling of dialectal phonetic variation, rather than lexical or syntactic factors, remains a primary source of bias in commercial ASR systems. The study establishes the PNWE corpus as a valuable resource for bias evaluation in speech technologies and provides actionable guidance for improving ASR performance through targeted representation of sociophonetic diversity in training data.
103. 【2510.22492】he Limits of Data Scaling: Sub-token Utilization and Acoustic Saturation in Multilingual ASR
链接:https://arxiv.org/abs/2510.22492
作者:Siyu Liang,Nicolas Ballier,Gina-Anne Levow,Richard Wright
类目:Computation and Language (cs.CL)
关键词:learned sub-token inventory, ASR model learned, needed to fully, fully observe, analyzing Whisper decoding
备注:
点击查看摘要
Abstract:How much audio is needed to fully observe a multilingual ASR model's learned sub-token inventory across languages, and does data disparity in multilingual pre-training affect how these tokens are utilized during inference? We address this question by analyzing Whisper's decoding behavior during inference across 49 languages. By logging decoding candidate sub-tokens and tracking their cumulative discovery over time, we study the utilization pattern of the model's sub-token space. Results show that the total number of discovered tokens remains largely independent of a language's pre-training hours, indicating that data disparity does not strongly influence lexical diversity in the model's hypothesis space. Sub-token discovery rates follow a consistent exponential saturation pattern across languages, suggesting a stable time window after which additional audio yields minimal new sub-token activation. We refer to this convergence threshold as acoustic saturation time (AST). Further analyses of rank-frequency distributions reveal Zipf-like patterns better modeled by a Zipf-Mandelbrot law, and mean sub-token length shows a positive correlation with resource level. Additionally, those metrics show more favorable patterns for languages in the Latin script than those in scripts such as Cyrillic, CJK, and Semitic. Together, our study suggests that sub-token utilization during multilingual ASR inference is constrained more by the statistical, typological, and orthographic structure of the speech than by training data scale, providing an empirical basis for more equitable corpus construction and cross-lingual evaluation.
104. 【2510.22489】Frustratingly Easy Task-aware Pruning for Large Language Models
链接:https://arxiv.org/abs/2510.22489
作者:Yuanhe Tian,Junjie Liu,Xican Yang,Haishan Ye,Yan Song
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language models, run large language, language models, training and inference, practical solution
备注: 8 pages, 3 figures
点击查看摘要
Abstract:Pruning provides a practical solution to reduce the resources required to run large language models (LLMs) to benefit from their effective capabilities as well as control their cost for training and inference. Research on LLM pruning often ranks the importance of LLM parameters using their magnitudes and calibration-data activations and removes (or masks) the less important ones, accordingly reducing LLMs' size. However, these approaches primarily focus on preserving the LLM's ability to generate fluent sentences, while neglecting performance on specific domains and tasks. In this paper, we propose a simple yet effective pruning approach for LLMs that preserves task-specific capabilities while shrinking their parameter space. We first analyze how conventional pruning minimizes loss perturbation under general-domain calibration and extend this formulation by incorporating task-specific feature distributions into the importance computation of existing pruning algorithms. Thus, our framework computes separate importance scores using both general and task-specific calibration data, partitions parameters into shared and exclusive groups based on activation-norm differences, and then fuses their scores to guide the pruning process. This design enables our method to integrate seamlessly with various foundation pruning techniques and preserve the LLM's specialized abilities under compression. Experiments on widely used benchmarks demonstrate that our approach is effective and consistently outperforms the baselines with identical pruning ratios and different settings.
105. 【2510.22485】he Tonogenesis Continuum in Tibetan: A Computational Investigation
链接:https://arxiv.org/abs/2510.22485
作者:Siyu Liang,Zhaxi Zerong
类目:Computation and Language (cs.CL)
关键词:Tonogenesis-the historical process, Tonogenesis-the historical, lexical tone-has traditionally, acoustic phonetics, historical process
备注:
点击查看摘要
Abstract:Tonogenesis-the historical process by which segmental contrasts evolve into lexical tone-has traditionally been studied through comparative reconstruction and acoustic phonetics. We introduce a computational approach that quantifies the functional role of pitch at different stages of this sound change by measuring how pitch manipulation affects automatic speech recognition (ASR) performance. Through analysis on the sensitivity to pitch-flattening from a set of closely related Tibetan languages, we find evidence of a tonogenesis continuum: atonal Amdo dialects tolerate pitch removal the most, while fully tonal U-Tsang varieties show severe degradation, and intermediate Kham dialects fall measurably between these extremes. These gradient effects demonstrate how ASR models implicitly learn the shifting functional load of pitch as languages transition from consonant-based to tone-based lexical contrasts. Our findings show that computational methods can capture fine-grained stages of sound change and suggest that traditional functional load metrics, based solely on minimal pairs, may overestimate pitch dependence in transitional systems where segmental and suprasegmental cues remain phonetically intertwined.
106. 【2510.22475】CHOIR: Collaborative Harmonization fOr Inference Robustness
链接:https://arxiv.org/abs/2510.22475
作者:Xiangjue Dong,Cong Wang,Maria Teleki,Millennium Bismay,James Caverlee
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Persona-assigned Large Language, Large Language Models, Persona-assigned Large, Large Language, adopt diverse roles
备注: updated version
点击查看摘要
Abstract:Persona-assigned Large Language Models (LLMs) can adopt diverse roles, enabling personalized and context-aware reasoning. However, even minor demographic perturbations in personas, such as simple pronoun changes, can alter reasoning trajectories, leading to divergent sets of correct answers. Instead of treating these variations as biases to be mitigated, we explore their potential as a constructive resource to improve reasoning robustness. We propose CHOIR (Collaborative Harmonization fOr Inference Robustness), a test-time framework that harmonizes multiple persona-conditioned reasoning signals into a unified prediction. CHOIR orchestrates a collaborative decoding process among counterfactual personas, dynamically balancing agreement and divergence in their reasoning paths. Experiments on various reasoning benchmarks demonstrate that CHOIR consistently enhances performance across demographics, model architectures, scales, and tasks - without additional training. Improvements reach up to 26.4% for individual demographic groups and 19.2% on average across five demographics. It remains effective even when base personas are suboptimal. By reframing persona variation as a constructive signal, CHOIR provides a scalable and generalizable approach to more reliable LLM reasoning.
107. 【2510.22437】Modeling Hierarchical Thinking in Large Reasoning Models
链接:https://arxiv.org/abs/2510.22437
作者:G M Shahariar,Ali Nazari,Erfan Shayegani,Nael Abu-Ghazaleh
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, remarkable reasoning abilities, demonstrated remarkable reasoning, Language Models
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable reasoning abilities when they generate step-by-step solutions, known as chain-of-thought (CoT) reasoning. When trained to using chain-of-thought reasoning examples, the resulting models (called Large Reasoning Models, or LRMs) appear to learn hierarchical thinking strategies similar to those used by humans. However, understanding LRMs emerging reasoning capabilities remains a difficult open problem, with many potential important applications including improving training and understanding robustness. In this paper, we adopt a memoryless Finite State Machine formulation to approximate LRM's emerging hierarchical reasoning dynamics as a structured, interpretable abstraction. We identify a small set of discrete reasoning states including - initialization, deduction, augmentation-strategy, uncertainty-estimation, backtracking, and final-conclusion that capture the high-level states present in the model's reasoning process. By annotating each step of a model's CoT with these states, we can represent the reasoning trajectory as a transition sequence through the state graph. This FSM formulation provides a systematic way to analyze, interpret and visualize how different models approach problems. We describe the FSM model, provide examples of CoT annotations under this scheme, and discuss how it can shed light on differences between available models in their approach to reasoning. Our results demonstrate that this FSM-based analysis reveals distinct reasoning patterns and potential shortcomings, offering a new lens to evaluate and improve LLM reasoning.
108. 【2510.22395】Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection
链接:https://arxiv.org/abs/2510.22395
作者:Federica Gamba,Aman Sinha,Timothee Mickus,Raul Vazquez,Patanjali Bhamidipati,Claudio Savelli,Ahana Chattopadhyay,Laura A. Zanella,Yash Kankanampati,Binesh Arakkal Remesh,Aryan Ashok Chandramania,Rohit Agarwal,Chuyuan Li,Ioana Buhnila,Radhika Mamidi
类目:Computation and Language (cs.CL)
关键词:Confabulations from ACL, ACL Publications, large language models, resource for studying, scientific text generation
备注:
点击查看摘要
Abstract:We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying hallucinations in large language models (LLMs) within scientific text generation. CAP focuses on the scientific domain, where hallucinations can distort factual knowledge, as they frequently do. In this domain, however, the presence of specialized terminology, statistical reasoning, and context-dependent interpretations further exacerbates these distortions, particularly given LLMs' lack of true comprehension, limited contextual understanding, and bias toward surface-level generalization. CAP operates in a cross-lingual setting covering five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource languages (Bengali, Gujarati, Malayalam, and Telugu). The dataset comprises 900 curated scientific questions and over 7000 LLM-generated answers from 16 publicly available models, provided as question-answer pairs along with token sequences and corresponding logits. Each instance is annotated with a binary label indicating the presence of a scientific hallucination, denoted as a factuality error, and a fluency label, capturing issues in the linguistic quality or naturalness of the text. CAP is publicly released to facilitate advanced research on hallucination detection, multilingual evaluation of LLMs, and the development of more reliable scientific NLP systems.
109. 【2510.22376】Label Smoothing Improves Gradient Ascent in LLM Unlearning
链接:https://arxiv.org/abs/2510.22376
作者:Zirui Pang,Hao Zheng,Zhijie Deng,Ling Li,Zixin Zhong,Jiaheng Wei
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Gradient Ascent, Smoothed Gradient Ascent, performing Gradient Ascent, LLM unlearning, promising approach
备注:
点击查看摘要
Abstract:LLM unlearning has emerged as a promising approach, aiming to enable models to forget hazardous/undesired knowledge at low cost while preserving as much model utility as possible. Among existing techniques, the most straightforward method is performing Gradient Ascent (GA) w.r.t. the forget data, thereby forcing the model to unlearn the forget dataset. However, GA suffers from severe instability, as it drives updates in a divergent direction, often resulting in drastically degraded model utility. To address this issue, we propose Smoothed Gradient Ascent (SGA). SGA combines the forget data with multiple constructed normal data through a tunable smoothing rate. Intuitively, this extends GA from learning solely on the forget data to jointly learning across both forget and normal data, enabling more stable unlearning while better preserving model utility. Theoretically, we provide the theoretical guidance on the selection of the optimal smoothing rate. Empirically, we evaluate SGA on three benchmarks: TOFU, Harry Potter, and MUSE-NEWS. Experimental results demonstrate that SGA consistently outperforms the original Gradient Ascent (GA) method across all metrics and achieves top-2 performance among all baseline methods on several key metrics.
110. 【2510.22373】VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
链接:https://arxiv.org/abs/2510.22373
作者:Yupeng Xie,Zhiyang Zhang,Yifan Wu,Sirong Lu,Jiayi Zhang,Zhaoyang Yu,Jinlin Wang,Sirui Hong,Bang Liu,Chenglin Wu,Yuyu Luo
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:turn complex datasets, form of imagery, intuitive insights, faithfully represented, domain-specific yet widely
备注: 53 pages, 26 figures, 5 tables
点击查看摘要
Abstract:Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.551 and a correlation with human ratings of only 0.429. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.442 (a 19.8% reduction) and increasing the consistency with human experts to 0.681 (a 58.7% improvement) compared to GPT-5. The benchmark is available at this https URL.
111. 【2510.22371】Reasoning Models Reason Well, Until They Don't
链接:https://arxiv.org/abs/2510.22371
作者:Revanth Rameshkumar,Jimson Huang,Yunxin Sun,Fei Xia,Abulhair Saparov
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:shown significant progress, shown significant, significant progress, reasoning, large reasoning models
备注:
点击查看摘要
Abstract:Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as NLGraph seem extraordinary, with some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show existing benchmarks actually have limited complexity. We develop a new dataset, the Deep Reasoning Dataset (DeepRD), along with a generative process for producing unlimited examples of scalable complexity. We use this dataset to evaluate model performance on graph connectivity and natural language proof planning. We find that the performance of LRMs drop abruptly at sufficient complexity and do not generalize. We also relate our LRM results to the distributions of the complexities of large, real-world knowledge graphs, interaction graphs, and proof datasets. We find the majority of real-world examples fall inside the LRMs' success regime, yet the long tails expose substantial failure potential. Our analysis highlights the near-term utility of LRMs while underscoring the need for new methods that generalize beyond the complexity of examples in the training distribution.
112. 【2510.22369】GigaEmbeddings: Efficient Russian Language Embedding Model
链接:https://arxiv.org/abs/2510.22369
作者:Egor Kolodin,Daria Khomich,Nikita Savushkin,Anastasia Ianina,Fyodor Minkin
类目:Computation and Language (cs.CL)
关键词:training high-performance Russian-focused, high-performance Russian-focused text, Russian-focused text embeddings, decoder-only LLM designed, LLM designed specifically
备注:
点击查看摘要
Abstract:We introduce GigaEmbeddings, a novel framework for training high-performance Russian-focused text embeddings through hierarchical instruction tuning of the decoder-only LLM designed specifically for Russian language (GigaChat-3B). Our three-stage pipeline, comprising large-scale contrastive pre-training in web-scale corpora, fine-tuning with hard negatives, and multitask generalization across retrieval, classification, and clustering tasks, addresses key limitations of existing methods by unifying diverse objectives and leveraging synthetic data generation. Architectural innovations include bidirectional attention for contextual modeling, latent attention pooling for robust sequence aggregation, and strategic pruning of 25% of transformer layers to enhance efficiency without compromising performance. Evaluated on the ruMTEB benchmark spanning 23 multilingual tasks, GigaEmbeddings achieves state-of-the-art results (69.1 avg. score), outperforming strong baselines with a larger number of parameters.
113. 【2510.22362】Mapping Faithful Reasoning in Language Models
链接:https://arxiv.org/abs/2510.22362
作者:Jiazheng Li,Andreas Damianou,J Rosser,José Luis Redondo García,Konstantina Palla
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:prior work shows, Concept Walk, traces promise transparency, promise transparency, prior work
备注: 9 pages, Accepted to the Mechanistic Interpretability Workshop at NeurIPS 2025
点击查看摘要
Abstract:Chain-of-thought (CoT) traces promise transparency for reasoning language models, but prior work shows they are not always faithful reflections of internal computation. This raises challenges for oversight: practitioners may misinterpret decorative reasoning as genuine. We introduce Concept Walk, a general framework for tracing how a model's internal stance evolves with respect to a concept direction during reasoning. Unlike surface text, Concept Walk operates in activation space, projecting each reasoning step onto the concept direction learned from contrastive data. This allows us to observe whether reasoning traces shape outcomes or are discarded. As a case study, we apply Concept Walk to the domain of Safety using Qwen 3-4B. We find that in 'easy' cases, perturbed CoTs are quickly ignored, indicating decorative reasoning, whereas in 'hard' cases, perturbations induce sustained shifts in internal activations, consistent with faithful reasoning. The contribution is methodological: Concept Walk provides a lens to re-examine faithfulness through concept-specific internal dynamics, helping identify when reasoning traces can be trusted and when they risk misleading practitioners.
114. 【2510.22356】Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models
链接:https://arxiv.org/abs/2510.22356
作者:Fiaz Ahmad,Nisar Hussain,Amna Qasim,Momina Hafeez,Muhammad Usman Grigori Sidorov,Alexander Gelbukh
类目:Computation and Language (cs.CL)
关键词:Natural Language Processing, English Ironic Corpus, task in Natural, Language Processing, Natural Language
备注: 5 pages, 3 figuers
点击查看摘要
Abstract:Ironic identification is a challenging task in Natural Language Processing, particularly when dealing with languages that differ in syntax and cultural context. In this work, we aim to detect irony in Urdu by translating an English Ironic Corpus into the Urdu language. We evaluate ten state-of-the-art machine learning algorithms using GloVe and Word2Vec embeddings, and compare their performance with classical methods. Additionally, we fine-tune advanced transformer-based models, including BERT, RoBERTa, LLaMA 2 (7B), LLaMA 3 (8B), and Mistral, to assess the effectiveness of large-scale models in irony detection. Among machine learning models, Gradient Boosting achieved the best performance with an F1-score of 89.18%. Among transformer-based models, LLaMA 3 (8B) achieved the highest performance with an F1-score of 94.61%. These results demonstrate that combining transliteration techniques with modern NLP models enables robust irony detection in Urdu, a historically low-resource language.
115. 【2510.22344】FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2510.22344
作者:Mohammad Aghajani Asl,Majid Asgari-Bidhendi,Behrooz Minaei-Bidgoli
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Large Language Models, Language Models, Large Language, staleness in Large, require synthesizing information
备注: 30 pages, 5 figures, 5 tables. Keywords: Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), Agentic AI, Multi-hop Question Answering, Faithfulness
点击查看摘要
Abstract:While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive strategies, lack a robust mechanism to systematically identify and fill evidence gaps, often propagating noise or failing to gather a comprehensive context. We introduce FAIR-RAG, a novel agentic framework that transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning process. At its core is an Iterative Refinement Cycle governed by a module we term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating mechanism: it deconstructs the initial query into a checklist of required findings and audits the aggregated evidence to identify confirmed facts and, critically, explicit informational gaps. These gaps provide a precise signal to an Adaptive Query Refinement agent, which generates new, targeted sub-queries to retrieve missing information. This cycle repeats until the evidence is verified as sufficient, ensuring a comprehensive context for a final, strictly faithful generation. We conducted experiments on challenging multi-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified experimental setup, FAIR-RAG significantly outperforms strong baselines. On HotpotQA, it achieves an F1-score of 0.453 -- an absolute improvement of 8.3 points over the strongest iterative baseline -- establishing a new state-of-the-art for this class of methods on these benchmarks. Our work demonstrates that a structured, evidence-driven refinement process with explicit gap analysis is crucial for unlocking reliable and accurate reasoning in advanced RAG systems for complex, knowledge-intensive tasks.
116. 【2510.22340】DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry
链接:https://arxiv.org/abs/2510.22340
作者:Changti Wu,Shijie Lian,Zihao Liu,Lei Zhang,Laurence Tianruo Yang,Kai Chen
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Solid geometry problem, problem solving demands, geometry problem solving, solving demands spatial, Solid geometry
备注: The code and dataset are available at \href{ [this https URL](https://zgca-ai4edu.github.io/DynaSolidGeo/) }{DynaSolidGeo}
点击查看摘要
Abstract:Solid geometry problem solving demands spatial mathematical reasoning that integrates spatial intelligence and symbolic reasoning. However, most existing multimodal mathematical reasoning benchmarks focus primarily on 2D plane geometry, rely on static datasets prone to data contamination and memorization, and evaluate models solely by final answers, overlooking the reasoning process. To address these limitations, we introduce DynaSolidGeo, the first dynamic benchmark for evaluating genuine spatial reasoning in Vision-Language Models (VLMs). Constructed through a semi-automatic annotation pipeline, DynaSolidGeo contains 503 expert-curated seed questions that can, in principle, dynamically generate an unbounded number of diverse multimodal text-visual instances. Beyond answer accuracy, we incorporate process evaluation based on expert-annotated reasoning chains to measure logical validity and causal coherence. Experiments across representative open-source and closed-source VLMs reveal large performance gaps, severe degradation in dynamic settings, and poor performance on tasks requiring high-level spatial intelligence, such as mental rotation and visualization. The code and dataset are available at \href{this https URL}{DynaSolidGeo}.
117. 【2510.22334】Multilingual Target-Stance Extraction
链接:https://arxiv.org/abs/2510.22334
作者:Ethan Mines,Bonnie Dorr
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Social media enables, media enables data-driven, enables data-driven analysis, Social media, contested issues
备注: 11 pages, 2 figures, Submitted to the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
点击查看摘要
Abstract:Social media enables data-driven analysis of public opinion on contested issues. Target-Stance Extraction (TSE) is the task of identifying the target discussed in a document and the document's stance towards that target. Many works classify stance towards a given target in a multilingual setting, but all prior work in TSE is English-only. This work introduces the first multilingual TSE benchmark, spanning Catalan, Estonian, French, Italian, Mandarin, and Spanish corpora. It manages to extend the original TSE pipeline to a multilingual setting without requiring separate models for each language. Our model pipeline achieves a modest F1 score of 12.78, underscoring the increased difficulty of the multilingual task relative to English-only setups and highlighting target prediction as the primary bottleneck. We are also the first to demonstrate the sensitivity of TSE's F1 score to different target verbalizations. Together these serve as a much-needed baseline for resources, algorithms, and evaluation criteria in multilingual TSE.
118. 【2510.22317】Memory-based Language Models: An Efficient, Explainable, and Eco-friendly Approach to Large Language Modeling
链接:https://arxiv.org/abs/2510.22317
作者:Antal van den Bosch,Ainhoa Risco Patón,Teun Buijse,Peter Berck,Maarten van Gompel
类目:Computation and Language (cs.CL)
关键词:deep neural network-based, neural network-based language, memory-based language modeling, network-based language modeling, present memory-based language
备注: 15 pages, 11 figures
点击查看摘要
Abstract:We present memory-based language modeling as an efficient, eco-friendly alternative to deep neural network-based language modeling. It offers log-linearly scalable next-token prediction performance and strong memorization capabilities. Implementing fast approximations of k-nearest neighbor classification, memory-based language modeling leaves a relatively small ecological footprint both in training and in inference mode, as it relies fully on CPUs and attains low token latencies. Its internal workings are simple and fully transparent. We compare our implementation of memory-based language modeling, OLIFANT, with GPT-2 and GPT-Neo on next-token prediction accuracy, estimated emissions and speeds, and offer some deeper analyses of the model.
119. 【2510.22295】VietLyrics: A Large-Scale Dataset and Models for Vietnamese Automatic Lyrics Transcription
链接:https://arxiv.org/abs/2510.22295
作者:Quoc Anh Nguyen,Bernard Cheng,Kelvin Soh
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Automatic Lyrics Transcription, unique challenges due, largely unexplored due, presents unique challenges, remains largely unexplored
备注:
点击查看摘要
Abstract:Automatic Lyrics Transcription (ALT) for Vietnamese music presents unique challenges due to its tonal complexity and dialectal variations, but remains largely unexplored due to the lack of a dedicated dataset. Therefore, we curated the first large-scale Vietnamese ALT dataset (VietLyrics), comprising 647 hours of songs with line-level aligned lyrics and metadata to address these issues. Our evaluation of current ASRbased approaches reveal significant limitations, including frequent transcription errors and hallucinations in non-vocal segments. To improve performance, we fine-tuned Whisper models on the VietLyrics dataset, achieving superior results compared to existing multilingual ALT systems, including LyricWhiz. We publicly release VietLyrics and our models, aiming to advance Vietnamese music computing research while demonstrating the potential of this approach for ALT in low-resource language and music.
120. 【2510.22285】Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER
链接:https://arxiv.org/abs/2510.22285
作者:Andrei Baroian
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Named Entity Recognition, study clinical Named, clinical Named Entity, few-shot in-context learning, BERT Base
备注: Work done in November - December 2024
点击查看摘要
Abstract:We study clinical Named Entity Recognition (NER) on the CADEC corpus and compare three families of approaches: (i) BERT-style encoders (BERT Base, BioClinicalBERT, RoBERTa-large), (ii) GPT-4o used with few-shot in-context learning (ICL) under simple vs.\ complex prompts, and (iii) GPT-4o with supervised fine-tuning (SFT). All models are evaluated on standard NER metrics over CADEC's five entity types (ADR, Drug, Disease, Symptom, Finding). RoBERTa-large and BioClinicalBERT offer limited improvements over BERT Base, showing the limit of these family of models. Among LLM settings, simple ICL outperforms a longer, instruction-heavy prompt, and SFT achieves the strongest overall performance (F1 $\approx$ 87.1%), albeit with higher cost. We find that the LLM achieve higher accuracy on simplified tasks, restricting classification to two labels.
121. 【2510.22282】CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
链接:https://arxiv.org/abs/2510.22282
作者:Tianhui Liu,Hetian Pang,Xin Zhang,Jie Feng,Yong Li,Pan Hui
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:sustainable development goals, achieving global sustainable, global sustainable development, Harnessing publicly, large-scale web data
备注:
点击查看摘要
Abstract:Harnessing publicly available, large-scale web data, such as street view and satellite imagery, urban socio-economic sensing is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce \textbf{CityRiSE}, a novel framework for \textbf{R}eason\textbf{i}ng urban \textbf{S}ocio-\textbf{E}conomic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE with emergent reasoning process significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly for prediction on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.
122. 【2510.22276】WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models
链接:https://arxiv.org/abs/2510.22276
作者:Issa Sugiura,Shuhei Kurita,Yusuke Oda,Daisuke Kawahara,Yasuo Okabe,Naoaki Okazaki
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:high-quality image-text pair, Japanese image-text pair, Large-scale and high-quality, developing high-performing Vision-Language, high-quality Japanese image-text
备注: 9 pages, 5 figures
点击查看摘要
Abstract:Large-scale and high-quality image-text pair datasets play an important role in developing high-performing Vision-Language Models (VLMs). In this work, we introduce WAON, a large-scale and high-quality Japanese image-text pair dataset containing approximately 155 million examples, collected from Common Crawl. Our dataset construction pipeline employs various techniques, including filtering and deduplication, which have been shown to be effective in previous studies. To evaluate its effectiveness, we also construct WAON-Bench, a manually curated benchmark for Japanese cultural image classification, consisting of 374 classes. To assess the effectiveness of our dataset, we conduct experiments using both WAON and the Japanese subset of ReLAION, one of the most widely used vision-language datasets. We fine-tune SigLIP2, a strong multilingual model, on both datasets. The results demonstrate that WAON enhances model performance on WAON-Bench more efficiently than ReLAION and achieves higher accuracy across all evaluated benchmarks. Furthermore, the model fine-tuned on WAON achieves state-of-the-art performance on several Japanese cultural benchmarks. We release our dataset, model, and code at this https URL.
123. 【2510.22272】From Slides to Chatbots: Enhancing Large Language Models with University Course Materials
链接:https://arxiv.org/abs/2510.22272
作者:Tu Anh Dinh,Philipp Nicolas Schumacher,Jan Niehues
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, recent years, advanced rapidly
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have advanced rapidly in recent years. One application of LLMs is to support student learning in educational settings. However, prior work has shown that LLMs still struggle to answer questions accurately within university-level computer science courses. In this work, we investigate how incorporating university course materials can enhance LLM performance in this setting. A key challenge lies in leveraging diverse course materials such as lecture slides and transcripts, which differ substantially from typical textual corpora: slides also contain visual elements like images and formulas, while transcripts contain spoken, less structured language. We compare two strategies, Retrieval-Augmented Generation (RAG) and Continual Pre-Training (CPT), to extend LLMs with course-specific knowledge. For lecture slides, we further explore a multi-modal RAG approach, where we present the retrieved content to the generator in image form. Our experiments reveal that, given the relatively small size of university course materials, RAG is more effective and efficient than CPT. Moreover, incorporating slides as images in the multi-modal setting significantly improves performance over text-only retrieval. These findings highlight practical strategies for developing AI assistants that better support learning and teaching, and we hope they inspire similar efforts in other educational contexts.
124. 【2510.22264】PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding
链接:https://arxiv.org/abs/2510.22264
作者:Iliass Ayaou,Denis Cavallucci
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:prior art search, capture patent-specific challenges, enable prior art, inadequately capture patent-specific, technology landscaping
备注:
点击查看摘要
Abstract:Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15 tasks across retrieval, classification, paraphrase, and clustering, with 2.06 million examples. PatenTEB employs domain-stratified splits, domain specific hard negative mining, and systematic coverage of asymmetric fragment-to-document matching scenarios absent from general embedding benchmarks. We develop the patembed model family through multi-task training, spanning 67M to 344M parameters with context lengths up to 4096 tokens. External validation shows strong generalization: patembed-base achieves state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445 previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM. Systematic ablations reveal that multi-task training improves external generalization despite minor benchmark costs, and that domain-pretrained initialization provides consistent advantages across task families. All resources will be made available at this https URL. Keywords: patent retrieval, sentence embeddings, multi-task learning, asymmetric retrieval, benchmark evaluation, contrastive learning.
125. 【2510.22256】SteerX: Disentangled Steering for LLM Personalization
链接:https://arxiv.org/abs/2510.22256
作者:Xiaoyan Zhao,Ming Yan,Yilun Qiu,Haoting Ni,Yang Zhang,Fuli Feng,Hong Cheng,Tat-Seng Chua
类目:Computation and Language (cs.CL)
关键词:Large language models, shown remarkable success, support users' daily, users' daily life, Large language
备注:
点击查看摘要
Abstract:Large language models (LLMs) have shown remarkable success in recent years, enabling a wide range of applications, including intelligent assistants that support users' daily life and work. A critical factor in building such assistants is personalizing LLMs, as user preferences and needs vary widely. Activation steering, which directly leverages directions representing user preference in the LLM activation space to adjust its behavior, offers a cost-effective way to align the model's outputs with individual users. However, existing methods rely on all historical data to compute the steering vector, ignoring that not all content reflects true user preferences, which undermines the personalization signal. To address this, we propose SteerX, a disentangled steering method that isolates preference-driven components from preference-agnostic components. Grounded in causal inference theory, SteerX estimates token-level causal effects to identify preference-driven tokens, transforms these discrete signals into a coherent description, and then leverages them to steer personalized LLM generation. By focusing on the truly preference-driven information, SteerX produces more accurate activation steering vectors and enhances personalization. Experiments on two representative steering backbone methods across real-world datasets demonstrate that SteerX consistently enhances steering vector quality, offering a practical solution for more effective LLM personalization.
126. 【2510.22255】PACR: Progressively Ascending Confidence Reward for LLM Reasoning
链接:https://arxiv.org/abs/2510.22255
作者:Eunseop Yoon,Hee Suk Yoon,Jaehyun Jang,SooHwan Eom,Qi Dai,Chong Luo,Mark A. Hasegawa-Johnson,Chang D. Yoo
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:significantly improved LLM, Reinforcement Learning, Learning with Verifiable, improved LLM reasoning, improved LLM
备注: 16 pages, 14 figures
点击查看摘要
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved LLM reasoning, but its sparse, outcome-based reward provides no guidance for intermediate steps, slowing exploration. We propose Progressively Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed directly from the model's evolving belief in the correct answer. PACR encodes the inductive bias that, along a well-formed reasoning trajectory, the probability of the ground-truth answer should have a generally ascending trend. We provide empirical and theoretical analysis validating that such an inductive bias constrains the exploration search space to regions richer in logically sound reasoning. We demonstrate that PACR accelerates exploration, reaches reward saturation with fewer trajectories, and yields improvements on multiple benchmarks. Our results suggest that dense, model-intrinsic shaping signals can make RLVR training more effective and reliable.
127. 【2510.22251】You Don't Need Prompt Engineering Anymore: The Prompting Inversion
链接:https://arxiv.org/abs/2510.22251
作者:Imran Khan(Independent Researcher)
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:significantly enhances LLM, enhances LLM reasoning, enhances LLM, LLM reasoning capabilities, LLM reasoning
备注: 17 pages, 1 figure, 6 tables. Code and experimental data available at [this https URL](https://github.com/strongSoda/prompt-sculpting)
点击查看摘要
Abstract:Prompt engineering, particularly Chain-of-Thought (CoT) prompting, significantly enhances LLM reasoning capabilities. We introduce "Sculpting," a constrained, rule-based prompting method designed to improve upon standard CoT by reducing errors from semantic ambiguity and flawed common sense. We evaluate three prompting strategies (Zero Shot, standard CoT, and Sculpting) across three OpenAI model generations (gpt-4o-mini, gpt-4o, gpt-5) using the GSM8K mathematical reasoning benchmark (1,317 problems). Our findings reveal a "Prompting Inversion": Sculpting provides advantages on gpt-4o (97% vs. 93% for standard CoT), but becomes detrimental on gpt-5 (94.00% vs. 96.36% for CoT on full benchmark). We trace this to a "Guardrail-to-Handcuff" transition where constraints preventing common-sense errors in mid-tier models induce hyper-literalism in advanced models. Our detailed error analysis demonstrates that optimal prompting strategies must co-evolve with model capabilities, suggesting simpler prompts for more capable models.
Comments:
17 pages, 1 figure, 6 tables. Code and experimental data available at this https URL
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:
arXiv:2510.22251 [cs.CL]
(or
arXiv:2510.22251v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2510.22251
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
128. 【2510.22242】PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading
链接:https://arxiv.org/abs/2510.22242
作者:Yutao Wu,Xiao Liu,Yunhao Feng,Jiale Ding,Xingjun Ma
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, tasks remains under-evaluated, increasingly serve
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability in scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark that systematically evaluates LLMs across four key research tasks: citation retrieval, content extraction, paper discovery, and claim verification. We evaluate GPT-4o, GPT-5, and Gemini-2.5-Flash under realistic usage conditions-via web interfaces where search operations are opaque to the user. Through controlled experiments, we find consistent reliability failures: citation retrieval fails in 48-98% of multi-reference queries, section-specific content extraction fails in 72-91% of cases, and topical paper discovery yields F1 scores below 0.32, missing over 60% of relevant literature. Further human analysis attributes these failures to the uncontrolled expansion of retrieved context and the tendency of LLMs to prioritize semantically relevant text over task instructions. Across basic tasks, the LLMs display distinct failure behaviors: ChatGPT often withholds responses rather than risk errors, whereas Gemini produces fluent but fabricated answers. To address these issues, we develop lightweight reliability classifiers trained on PaperAsk data to identify unreliable outputs. PaperAsk provides a reproducible and diagnostic framework for advancing the reliability evaluation of LLM-based scholarly assistance systems.
129. 【2510.22220】Evolution of the lexicon: a probabilistic point of view
链接:https://arxiv.org/abs/2510.22220
作者:Maurizio Serva
类目:Computation and Language (cs.CL); Populations and Evolution (q-bio.PE)
关键词:Swadesh approach, temporal separation, Swadesh, languages relies, word emerges
备注:
点击查看摘要
Abstract:The Swadesh approach for determining the temporal separation between two languages relies on the stochastic process of words replacement (when a complete new word emerges to represent a given concept). It is well known that the basic assumptions of the Swadesh approach are often unrealistic due to various contamination phenomena and misjudgments (horizontal transfers, variations over time and space of the replacement rate, incorrect assessments of cognacy relationships, presence of synonyms, and so on). All of this means that the results cannot be completely correct. More importantly, even in the unrealistic case that all basic assumptions are satisfied, simple mathematics places limits on the accuracy of estimating the temporal separation between two languages. These limits, which are purely probabilistic in nature and which are often neglected in lexicostatistical studies, are analyzed in detail in this article. Furthermore, in this work we highlight that the evolution of a language's lexicon is also driven by another stochastic process: gradual lexical modification of words. We show that this process equally also represents a major contribution to the reshaping of the vocabulary of languages over the centuries and we also show, from a purely probabilistic perspective, that taking into account this second random process significantly increases the precision in determining the temporal separation between two languages.
Subjects:
Computation and Language (cs.CL); Populations and Evolution (q-bio.PE)
Cite as:
arXiv:2510.22220 [cs.CL]
(or
arXiv:2510.22220v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2510.22220
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
130. 【2510.22219】Estimating the Error of Large Language Models at Pairwise Text Comparison
链接:https://arxiv.org/abs/2510.22219
作者:Tianyi Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Probability (math.PR)
关键词:measure LLMs' output, LLMs' output error, error rates, pairwise text comparison, error
备注: 14 pages, 6 figures
点击查看摘要
Abstract:We measure LLMs' output error at pairwise text comparison, noting the probability of error in their preferences. Our method does not rely on the ground truth and supports two scenarios: (i) uniform error rate regardless of the order of comparison, estimated with two comparisons for each text pair with either text placed first; (ii) binary positional bias assuming distinct error rates for the two orders of comparison, estimated with repeated comparisons between the texts. The Copeland counting constructs a ranking over the compared texts from pairwise preferences; the ranking reveals the poor scalability of LLM-based pairwise comparison and helps yield the estimates for LLMs' error rates. We apply the method to six LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Qwen) with five types of text input and obtain consistent estimates of LLMs' error. In general, the measured two positional bias terms are similar, close to the uniform error. Considering both the error rates and the robustness to the variation of prompts, Claude obtained the most desirable performance in this experiment. Our model outperforms the biased Bradley-Terry model and the commutativity score in indicating LLMs' error at this task.
131. 【2510.22212】DETECT: Determining Ease and Textual Clarity of German Text Simplifications
链接:https://arxiv.org/abs/2510.22212
作者:Maria Korobeynikova,Alessia Battisti,Lukas Fischer,Yingqiang Gao
类目:Computation and Language (cs.CL)
关键词:insufficiently capture simplification, relies on general-purpose, Current evaluation, insufficiently capture, meaning preservation
备注:
点击查看摘要
Abstract:Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in terms of simplicity, meaning preservation, and fluency. While specialized metrics like LENS have been developed for English, corresponding efforts for German have lagged behind due to the absence of human-annotated corpora. To close this gap, we introduce DETECT, the first German-specific metric that holistically evaluates ATS quality across all three dimensions of simplicity, meaning preservation, and fluency, and is trained entirely on synthetic large language model (LLM) responses. Our approach adapts the LENS framework to German and extends it with (i) a pipeline for generating synthetic quality scores via LLMs, enabling dataset creation without human annotation, and (ii) an LLM-based refinement step for aligning grading criteria with simplification requirements. To the best of our knowledge, we also construct the largest German human evaluation dataset for text simplification to validate our metric directly. Experimental results show that DETECT achieves substantially higher correlations with human judgments than widely used ATS metrics, with particularly strong gains in meaning preservation and fluency. Beyond ATS, our findings highlight both the potential and the limitations of LLMs for automatic evaluation and provide transferable guidelines for general language accessibility tasks.
132. 【2510.22207】he Lossy Horizon: Error-Bounded Predictive Coding for Lossy Text Compression (Episode I)
链接:https://arxiv.org/abs/2510.22207
作者:Nnamdi Aghanya,Jun Li,Kewei Wang
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)
关键词:Large Language Models, achieve near-optimal lossless, near-optimal lossless compression, powerful probability models, Large Language
备注: 12 pages, 7 figures
点击查看摘要
Abstract:Large Language Models (LLMs) can achieve near-optimal lossless compression by acting as powerful probability models. We investigate their use in the lossy domain, where reconstruction fidelity is traded for higher compression ratios. This paper introduces Error-Bounded Predictive Coding (EPC), a lossy text codec that leverages a Masked Language Model (MLM) as a decompressor. Instead of storing a subset of original tokens, EPC allows the model to predict masked content and stores minimal, rank-based corrections only when the model's top prediction is incorrect. This creates a residual channel that offers continuous rate-distortion control. We compare EPC to a simpler Predictive Masking (PM) baseline and a transform-based Vector Quantisation with a Residual Patch (VQ+RE) approach. Through an evaluation that includes precise bit accounting and rate-distortion analysis, we demonstrate that EPC consistently dominates PM, offering superior fidelity at a significantly lower bit rate by more efficiently utilising the model's intrinsic knowledge.
133. 【2510.22172】M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR
链接:https://arxiv.org/abs/2510.22172
作者:Ruixiang Mao,Xiangnan Ma,Qing Yang,Ziming Zhu,Yucheng Qiao,Yuan Ge,Tong Xiao,Shengxiang Gao,Zhengtao Yu,Jingbo Zhu
类目:ound (cs.SD); Computation and Language (cs.CL)
关键词:speech recognition, NAR approaches, Continuous, NAR, propose Multi-scale CIF
备注:
点击查看摘要
Abstract:The Continuous Integrate-and-Fire (CIF) mechanism provides effective alignment for non-autoregressive (NAR) speech recognition. This mechanism creates a smooth and monotonic mapping from acoustic features to target tokens, achieving performance on Mandarin competitive with other NAR approaches. However, without finer-grained guidance, its stability degrades in some languages such as English and French. In this paper, we propose Multi-scale CIF (M-CIF), which performs multi-level alignment by integrating character and phoneme level supervision progressively distilled into subword representations, thereby enhancing robust acoustic-text alignment. Experiments show that M-CIF reduces WER compared to the Paraformer baseline, especially on CommonVoice by 4.21% in German and 3.05% in French. To further investigate these gains, we define phonetic confusion errors (PE) and space-related segmentation errors (SE) as evaluation metrics. Analysis of these metrics across different M-CIF settings reveals that the phoneme and character layers are essential for enhancing progressive CIF alignment.
134. 【2510.22162】Surface Reading LLMs: Synthetic Text and its Styles
链接:https://arxiv.org/abs/2510.22162
作者:Hannes Bajohr
类目:Computers and Society (cs.CY); Computation and Language (cs.CL)
关键词:large language models, language models lies, potential plateau, societal impact, impact of large
备注: 12 pages, 1 figure
点击查看摘要
Abstract:Despite a potential plateau in ML advancement, the societal impact of large language models lies not in approaching superintelligence but in generating text surfaces indistinguishable from human writing. While Critical AI Studies provides essential material and socio-technical critique, it risks overlooking how LLMs phenomenologically reshape meaning-making. This paper proposes a semiotics of "surface integrity" as attending to the immediate plane where LLMs inscribe themselves into human communication. I distinguish three knowledge interests in ML research (epistemology, epistēmē, and epistemics) and argue for integrating surface-level stylistic analysis alongside depth-oriented critique. Through two case studies examining stylistic markers of synthetic text, I argue how attending to style as a semiotic phenomenon reveals LLMs as cultural actors that transform the conditions of meaning emergence and circulation in contemporary discourse, independent of questions about machine consciousness.
135. 【2510.22160】SentiMaithili: A Benchmark Dataset for Sentiment and Reason Generation for the Low-Resource Maithili Language
链接:https://arxiv.org/abs/2510.22160
作者:Rahul Ranjan,Mahendra Kumar Gurve,Anuj,Nitin,Yamuna Prasad
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:poses significant challenges, languages poses significant, Developing benchmark datasets, low-resource languages poses, primarily due
备注:
点击查看摘要
Abstract:Developing benchmark datasets for low-resource languages poses significant challenges, primarily due to the limited availability of native linguistic experts and the substantial time and cost involved in annotation. Given these challenges, Maithili is still underrepresented in natural language processing research. It is an Indo-Aryan language spoken by more than 13 million people in the Purvanchal region of India, valued for its rich linguistic structure and cultural significance. While sentiment analysis has achieved remarkable progress in high-resource languages, resources for low-resource languages, such as Maithili, remain scarce, often restricted to coarse-grained annotations and lacking interpretability mechanisms. To address this limitation, we introduce a novel dataset comprising 3,221 Maithili sentences annotated for sentiment polarity and accompanied by natural language justifications. Moreover, the dataset is carefully curated and validated by linguistic experts to ensure both label reliability and contextual fidelity. Notably, the justifications are written in Maithili, thereby promoting culturally grounded interpretation and enhancing the explainability of sentiment models. Furthermore, extensive experiments using both classical machine learning and state-of-the-art transformer architectures demonstrate the dataset's effectiveness for interpretable sentiment analysis. Ultimately, this work establishes the first benchmark for explainable affective computing in Maithili, thus contributing a valuable resource to the broader advancement of multilingual NLP and explainable AI.
136. 【2510.22149】Power to the Clients: Federated Learning in a Dictatorship Setting
链接:https://arxiv.org/abs/2510.22149
作者:Mohammadsajad Alipour,Mohammad Mohammadi Amiri
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
关键词:Federated learning, local data, promising paradigm, collaboratively learn, learn a shared
备注:
点击查看摘要
Abstract:Federated learning (FL) has emerged as a promising paradigm for decentralized model training, enabling multiple clients to collaboratively learn a shared model without exchanging their local data. However, the decentralized nature of FL also introduces vulnerabilities, as malicious clients can compromise or manipulate the training process. In this work, we introduce dictator clients, a novel, well-defined, and analytically tractable class of malicious participants capable of entirely erasing the contributions of all other clients from the server model, while preserving their own. We propose concrete attack strategies that empower such clients and systematically analyze their effects on the learning process. Furthermore, we explore complex scenarios involving multiple dictator clients, including cases where they collaborate, act independently, or form an alliance in order to ultimately betray one another. For each of these settings, we provide a theoretical analysis of their impact on the global model's convergence. Our theoretical algorithms and findings about the complex scenarios including multiple dictator clients are further supported by empirical evaluations on both computer vision and natural language processing benchmarks.
137. 【2510.22143】OlaMind: Towards Human-Like and Hallucination-Safe Customer Service for Retrieval-Augmented Dialogue
链接:https://arxiv.org/abs/2510.22143
作者:Tianhong Gao,Jundong Shen,Bei Shi,Jiapeng Wang,Ying Ju,Junfeng Yao,Jiao Ran,Yong Zhang,Lin Dong,Huiyu Yu,Tingting Ye
类目:Computation and Language (cs.CL)
关键词:Web-based customer service, achieving remarkable improvements, customer service, adopted in Web-based, Web-based domains
备注:
点击查看摘要
Abstract:Intelligent customer service (ICS) systems via retrieval-augmented generation (RAG) have been widely adopted in Web-based domains such as social platforms and e-commerce, achieving remarkable improvements in automation and efficiency. However, notable limitations still remain: these systems are prone to hallucinations and often generate rigid, mechanical responses, which can introduce business risks and undermine user experience, especially in Web-based customer service interactions under the RAG scenarios. In this paper, we introduce OlaMind, a human-like and hallucination-safe customer service framework for retrieval-augmented dialogue. Specifically, it first leverages a Learn-to-Think stage to learn the reasoning processes and response strategies from human experts, and then employs a Learn-to-Respond stage to perform cold-start supervised fine-tuning (SFT) combined with reinforcement learning (RL) for basic-to-hard self-refinement. Our method significantly enhances human-likeness and naturalness while effectively mitigating hallucinations and critical business risks. We have conducted large-scale online A/B experiments in an industry-level social customer service setting, and extensive experimental results show that OlaMind achieves significant cumulative relative improvements with intelligent resolution rates +28.92%/+18.42% and human takeover rate -6.08%/-7.12% in community-support/livestream-interaction scenarios, respectively, which highlights its consistent effectiveness across diverse real-world applications. The code and data will be publicly available.
138. 【2510.22141】LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction
链接:https://arxiv.org/abs/2510.22141
作者:Yuhang Gao,Xiang Xiang,Sheng Zhong,Guoyou Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
关键词:shown significant progress, shown significant, significant progress, Vision-Language Models, Densely Contrastive Learning
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) have shown significant progress in open-set challenges. However, the limited availability of 3D datasets hinders their effective application in 3D scene understanding. We propose LOC, a general language-guided framework adaptable to various occupancy networks, supporting both supervised and self-supervised learning paradigms. For self-supervised tasks, we employ a strategy that fuses multi-frame LiDAR points for dynamic/static scenes, using Poisson reconstruction to fill voids, and assigning semantics to voxels via K-Nearest Neighbor (KNN) to obtain comprehensive voxel representations. To mitigate feature over-homogenization caused by direct high-dimensional feature distillation, we introduce Densely Contrastive Learning (DCL). DCL leverages dense voxel semantic information and predefined textual prompts. This efficiently enhances open-set recognition without dense pixel-level supervision, and our framework can also leverage existing ground truth to further improve performance. Our model predicts dense voxel features embedded in the CLIP feature space, integrating textual and image pixel information, and classifies based on text and semantic similarity. Experiments on the nuScenes dataset demonstrate the method's superior performance, achieving high-precision predictions for known classes and distinguishing unknown classes without additional training data.
139. 【2510.22139】Edit Less, Achieve More: Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs
链接:https://arxiv.org/abs/2510.22139
作者:Jinzhe Liu,Junshu Sun,Shufan Shen,Chenxue Yang,Shuhui Wang
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:expensive full retraining, computationally expensive full, large language models, full retraining, updates to outdated
备注: 19 pages, 11 figures, Accepted by NeurIPS 2025
点击查看摘要
Abstract:Lifelong knowledge editing enables continuous, precise updates to outdated knowledge in large language models (LLMs) without computationally expensive full retraining. However, existing methods often accumulate errors throughout the editing process, causing a gradual decline in both editing accuracy and generalization. To tackle this problem, we propose Neuron-Specific Masked Knowledge Editing (NMKE), a novel fine-grained editing framework that combines neuron-level attribution with dynamic sparse masking. Leveraging neuron functional attribution, we identify two key types of knowledge neurons, with knowledge-general neurons activating consistently across prompts and knowledge-specific neurons activating to specific prompts. NMKE further introduces an entropy-guided dynamic sparse mask, locating relevant neurons to the target knowledge. This strategy enables precise neuron-level knowledge editing with fewer parameter modifications. Experimental results from thousands of sequential edits demonstrate that NMKE outperforms existing methods in maintaining high editing success rates and preserving model general capabilities in lifelong editing.
140. 【2510.22115】Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation
链接:https://arxiv.org/abs/2510.22115
作者:Ling-Team,Ang Li,Ben Liu,Binbin Hu,Bing Li,Bingwei Zeng,Borui Ye,Caizhi Tang,Changxin Tian,Chao Huang,Chao Zhang,Chen Qian,Chenchen Ju,Chenchen Li,Chengfu Tang,Chili Fu,Chunshao Ren,Chunwei Wu,Cong Zhang,Cunyin Peng,Dafeng Xu,Daixin Wang,Dalong Zhang,Dingnan Jin,Dingyuan Zhu,Dongke Hu,Fangzheng Zhao,Feifan Wu,Feng Zhu,Gangshan Wang,Haitao Zhang,Hailin Zhao,Hanxiao Zhang,Hanzi Wang,Hao Qian,Haoyi Yu,Heng Zhang,Hongliang Zhang,Hongzhi Luan,Huirong Dong,Huizhong Li,Jia Li,Jia Liu,Jialong Zhu,Jian Sha,Jianping Wei,Jiaolong Yang,Jieyue Ma,Jiewei Wu,Jinjing Huang,Jingyun Tian,Jingyuan Zhang,Jinquan Sun,Juanhui Tu,Jun Liu,Jun Xu,Jun Zhou,Junjie Ou,Junpeng Fang,Kaihong Zhang,Kaiqin Hu,Ke Shi,Kun Tang,Kunlong Chen,Lanyin Mei,Lei Liang,Lei Xu,Libo Zhang,Lin Ju,Lin Yuan,Ling Zhong,Lintao Ma,Lu Liu,Lu Yu,Lun Cai,Meiqi Zhu,Mengying Li,Min Chen,Minghao Xue,Minghong Cai,Mingming Yin,Peijie Jiang,Peilong Zhao,Pingping Liu,Qian Zhao,Qing Cui,Qingxiang Huang,Qingyuan Yang,Quankun Yu,Shaowei Wei,Shijie Lian,Shoujian Zheng,Shun Song,Shungen Zhang,Shuo Zhang,Siyuan Li,Song Liu,Ting Guo,Tong Zhao,Wanli Gu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:boosts reasoning capability, introduce Ling, series reasoning-oriented language, reasoning-oriented language foundation, language foundation built
备注: Ling 2.0 Technical Report
点击查看摘要
Abstract:We introduce Ling 2.0, a series reasoning-oriented language foundation built upon the principle that every activation boosts reasoning capability. Designed to scale from tens of billions to one trillion parameters under a unified Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking (instruct) models - Ling-mini-2.0, Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and achieving up to 7-fold active-compute efficiency compared with dense counterparts. Ling 2.0 integrates coordinated innovations across model architecture, pre-training, post-training, and infrastructure: a high-sparsity MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale FP8 training with fine-grained heterogeneous pipelines. At the trillion scale, Ling-1T establishes a new Pareto frontier of reasoning accuracy versus computational efficiency, demonstrating that sparse activation, when properly aligned with reasoning objectives, enables scalable and efficient intelligence. Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for advancing future reasoning and thinking models, including the Ring series built upon the same base.
141. 【2510.22109】Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows
链接:https://arxiv.org/abs/2510.22109
作者:Billy Dickson,Zoran Tiganj
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:long-context processing increase, auxiliary memory modules, transformer internal architecture, approaches to long-context, long-context processing
备注:
点击查看摘要
Abstract:Most approaches to long-context processing increase the complexity of the transformer's internal architecture by integrating mechanisms such as recurrence or auxiliary memory modules. In this work, we introduce an alternative approach that modifies the input representation itself, rather than the transformer architecture. Inspired by cognitive models of human memory, our method applies a scale-invariant logarithmic compression to the input tokens. The resulting compressed representation is processed by a standard, unmodified transformer, preserving architectural simplicity. We evaluate this approach on the WikiText-103 and PG-19 language modeling benchmarks, showing a reduction in perplexity compared to uncompressed baselines. Moreover, performance improves consistently with longer compressed temporal contexts, showing that input-level logarithmic compression is a simple and effective way to extend a transformer's long-range memory.
142. 【2510.22102】Mitigating Coordinate Prediction Bias from Positional Encoding Failures
链接:https://arxiv.org/abs/2510.22102
作者:Xingjian Tao,Yiwei Wang,Yujun Cai,Yihong Luo,Jing Tang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Multimodal large language, Multimodal large, VQA and document, prediction remains challenging, large language models
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) excel at vision-language tasks such as VQA and document understanding, yet precise coordinate prediction remains challenging. High-resolution inputs exacerbate this difficulty by producing long token sequences that weaken positional encodings and introduce directional biases in coordinate outputs. We investigate this phenomenon by analyzing how MLLMs behave when visual positional encodings (VPEs) are deliberately perturbed through shuffling. Our analysis reveals that such perturbations induce predictable, non-random coordinate biases rather than random errors, suggesting that models rely on internal positional priors when spatial grounding signals are degraded. Crucially, we observe similar directional error patterns in natural high-resolution datasets, indicating that positional encoding failures are a key bottleneck for accurate coordinate prediction at scale. To address this issue, we propose Vision-PE Shuffle Guidance (VPSG), a training-free test-time method that leverages the directional nature of these biases for correction. VPSG runs auxiliary decoding with shuffled VPEs to isolate position-unconditioned tendencies, then uses this as negative evidence to guide digit prediction while preserving coordinate format through a lightweight finite-state machine. Experiments on ScreenSpot-Pro demonstrate reliable improvements, highlighting positional encoding robustness as a critical factor for spatial reasoning in MLLMs.
143. 【2510.22099】Generalization or Memorization: Dynamic Decoding for Mode Steering
链接:https://arxiv.org/abs/2510.22099
作者:Xuanming Zhang
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, exhibit a troubling, troubling duality, training data
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) exhibit a troubling duality, capable of both remarkable generalization and brittle, verbatim memorization of their training data. This unpredictability undermines their reliability in high-stakes applications. In this work, we propose a unified framework to understand, identify, and control these distinct reasoning modes. First, we introduce a theoretical model based on the Information Bottleneck (IB) principle, formalizing generalization as the learning of a compressed, task-relevant representation and memorization as a failure to compress. Building on this theory, we develop Dynamic Mode Steering (DMS), a novel inference-time algorithm which comprises two components: (1) a lightweight, causally-grounded linear probe that identifies the model's instantaneous reliance on memorization, and (2) a dynamic activation steering mechanism that nudges the model's computation towards pre-identified generalization circuits. We frame DMS as a form of adaptive, self-contrastive decoding. Experiments on reasoning and faithfulness tasks demonstrate that DMS significantly improves logical consistency and factual accuracy, thereby offering a principled approach to enhancing LLM reliability.
144. 【2510.22095】Embracing Trustworthy Brain-Agent Collaboration as Paradigm Extension for Intelligent Assistive Technologies
链接:https://arxiv.org/abs/2510.22095
作者:Yankai Chen,Xinni Zhang,Yifei Zhang,Yangning Li,Henry Peng Zou,Chunyu Miao,Weizhi Zhang,Xue Liu,Philip S. Yu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:holding significant promise, severe neurological impairments, direct communication pathway, Brain-Computer Interfaces, offer a direct
备注: Accepted by NeurIPS'25 Position Track
点击查看摘要
Abstract:Brain-Computer Interfaces (BCIs) offer a direct communication pathway between the human brain and external devices, holding significant promise for individuals with severe neurological impairments. However, their widespread adoption is hindered by critical limitations, such as low information transfer rates and extensive user-specific calibration. To overcome these challenges, recent research has explored the integration of Large Language Models (LLMs), extending the focus from simple command decoding to understanding complex cognitive states. Despite these advancements, deploying agentic AI faces technical hurdles and ethical concerns. Due to the lack of comprehensive discussion on this emerging direction, this position paper argues that the field is poised for a paradigm extension from BCI to Brain-Agent Collaboration (BAC). We emphasize reframing agents as active and collaborative partners for intelligent assistance rather than passive brain signal data processors, demanding a focus on ethical data handling, model reliability, and a robust human-agent collaboration framework to ensure these systems are safe, trustworthy, and effective.
145. 【2510.22085】Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models
链接:https://arxiv.org/abs/2510.22085
作者:Pavlos Ntais
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large language models, exploit contextual framing, Large language, sophisticated prompt engineering, posing significant risks
备注: 18 pages, 5 figures
点击查看摘要
Abstract:Large language models (LLMs) remain vulnerable to sophisticated prompt engineering attacks that exploit contextual framing to bypass safety mechanisms, posing significant risks in cybersecurity applications. We introduce Jailbreak Mimicry, a systematic methodology for training compact attacker models to automatically generate narrative-based jailbreak prompts in a one-shot manner. Our approach transforms adversarial prompt discovery from manual craftsmanship into a reproducible scientific process, enabling proactive vulnerability assessment in AI-driven security systems. Developed for the OpenAI GPT-OSS-20B Red-Teaming Challenge, we use parameter-efficient fine-tuning (LoRA) on Mistral-7B with a curated dataset derived from AdvBench, achieving an 81.0% Attack Success Rate (ASR) against GPT-OSS-20B on a held-out test set of 200 items. Cross-model evaluation reveals significant variation in vulnerability patterns: our attacks achieve 66.5% ASR against GPT-4, 79.5% on Llama-3 and 33.0% against Gemini 2.5 Flash, demonstrating both broad applicability and model-specific defensive strengths in cybersecurity contexts. This represents a 54x improvement over direct prompting (1.5% ASR) and demonstrates systematic vulnerabilities in current safety alignment approaches. Our analysis reveals that technical domains (Cybersecurity: 93% ASR) and deception-based attacks (Fraud: 87.8% ASR) are particularly vulnerable, highlighting threats to AI-integrated threat detection, malware analysis, and secure systems, while physical harm categories show greater resistance (55.6% ASR). We employ automated harmfulness evaluation using Claude Sonnet 4, cross-validated with human expert assessment, ensuring reliable and scalable evaluation for cybersecurity red-teaming. Finally, we analyze failure mechanisms and discuss defensive strategies to mitigate these vulnerabilities in AI for cybersecurity.
146. 【2510.22084】Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds
链接:https://arxiv.org/abs/2510.22084
作者:Atij Mahesh
类目:Computation and Language (cs.CL)
关键词:Large Language Models, produce gender-stereotyped language, Large Language, Language Models, reflect deep societal
备注: 20 pages
点击查看摘要
Abstract:Large Language Models (LLMs) still produce gender-stereotyped language even in occupation-neutral contexts that reflect deep societal biases (Rudinger et al., 2018). To address this, prior work has proposed prompting, constrained decoding (Dathathri et al., 2020; Zhou et al., 2024), post-processing, and fine-tuning-based alignment (Rafailov et al., 2023; Ravfogel et al., 2022). However, the comparative efficacy and learning dynamics remain little understood. We report a comparative analysis of six control techniques for bias mitigation: prompt-only, generate-and-filter, DFA-based Ctrl-G decoding, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Iterative Nullspace Projection (INLP). We evaluate each method on a compositional constraint task. This task requires generating sentences that contain at least one agentic and one communal descriptor for each of the twenty Winogender-derived occupations. We quantify trade-offs between control strength and naturalness with evaluations of constraint compliance, lexical diversity, and fluency. Our results reveal key contrasts among the methods: SFT achieves 99.87 +- 0.15% compliance and high lexical diversity, while DPO, despite similar training stability, fails at 4.53 +- 0.82%. Ctrl-G guarantees perfect compliance, but at the cost of severely reduced fluency and diversity. Preference-based learning fundamentally differs: it cannot satisfy compositional constraints, as binary preference signals encode ranking, not logical conjunctions. Only explicit positive supervision enables mitigation of compositional biases; preference-based alignment fails to generalize logical structures, underscoring the limitations of preference learning and the necessity of explicit supervision for fair and fluent controlled generation.
147. 【2510.22075】Agentic Reinforcement Learning for Real-World Code Repair
链接:https://arxiv.org/abs/2510.22075
作者:Siyu Zhu,Anastasiya Karpovich,Albert Chen,Jessica Koscheka,Shailesh Jannu,Di Wen,Yuqing Zhu,Rohit Jain,Alborz Geramifard
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:make evaluation unstable, shifting dependencies make, dependencies make evaluation, evaluation unstable, tackle the challenge
备注:
点击查看摘要
Abstract:We tackle the challenge of training reliable code-fixing agents in real repositories, where complex builds and shifting dependencies make evaluation unstable. We developed a verifiable pipeline with success defined as post-fix build validation and improved reproducibility across ~1K real issues by pinning dependencies and disabling automatic upgrades. Building on this, we introduced a scalable simplified pipeline for large-scale reinforcement learning (RL). Using this setup, we supervised fine-tuned Qwen3-32B in the full pipeline and applied RL on top of the SFT model in the simplified environment. The SFT model distilled from GPT-4.1 trajectories performs on par while being 56x smaller, and RL added 7-20% absolute gains under matched train-test conditions. "Thinking mode" was on par or worse in our experiments. Both SFT and RL models failed to generalize across environments, highlighting the importance of matching train-test environments for building reliable real-world code-fixing agents.
148. 【2510.22055】A Benchmark for Open-Domain Numerical Fact-Checking Enhanced by Claim Decomposition
链接:https://arxiv.org/abs/2510.22055
作者:V Venktesh,Deepali Prabhu,Avishek Anand
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:Fact-checking numerical claims, numbers provide mirage, fake potentially causing, potentially causing catastrophic, causing catastrophic impacts
备注: 16 pages
点击查看摘要
Abstract:Fact-checking numerical claims is critical as the presence of numbers provide mirage of veracity despite being fake potentially causing catastrophic impacts on society. The prior works in automatic fact verification do not primarily focus on natural numerical claims. A typical human fact-checker first retrieves relevant evidence addressing the different numerical aspects of the claim and then reasons about them to predict the veracity of the claim. Hence, the search process of a human fact-checker is a crucial skill that forms the foundation of the verification process. Emulating a real-world setting is essential to aid in the development of automated methods that encompass such skills. However, existing benchmarks employ heuristic claim decomposition approaches augmented with weakly supervised web search to collect evidences for verifying claims. This sometimes results in less relevant evidences and noisy sources with temporal leakage rendering a less realistic retrieval setting for claim verification. Hence, we introduce QuanTemp++: a dataset consisting of natural numerical claims, an open domain corpus, with the corresponding relevant evidence for each claim. The evidences are collected through a claim decomposition process approximately emulating the approach of human fact-checker and veracity labels ensuring there is no temporal leakage. Given this dataset, we also characterize the retrieval performance of key claim decomposition paradigms. Finally, we observe their effect on the outcome of the verification pipeline and draw insights. The code for data pipeline along with link to data can be found at this https URL
149. 【2510.22042】Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models
链接:https://arxiv.org/abs/2510.22042
作者:Benjamin Reichman,Adar Avsian,Larry Heck
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, internally represent emotion, internally represent, work investigates, investigates how large
备注:
点击查看摘要
Abstract:This work investigates how large language models (LLMs) internally represent emotion by analyzing the geometry of their hidden-state space. The paper identifies a low-dimensional emotional manifold and shows that emotional representations are directionally encoded, distributed across layers, and aligned with interpretable dimensions. These structures are stable across depth and generalize to eight real-world emotion datasets spanning five languages. Cross-domain alignment yields low error and strong linear probe performance, indicating a universal emotional subspace. Within this space, internal emotion perception can be steered while preserving semantics using a learned intervention module, with especially strong control for basic emotions across languages. These findings reveal a consistent and manipulable affective geometry in LLMs and offer insight into how they internalize and process emotion.
150. 【2510.22037】ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
链接:https://arxiv.org/abs/2510.22037
作者:Shayne Longpre,Sneha Kudugunta,Niklas Muennighoff,I-Hung Hsu,Isaac Caswell,Alex Pentland,Sercan Arik,Chen-Yu Lee,Sayna Ebrahimi
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:explicitly serve billions, overwhelmingly on English, Scaling laws research, models explicitly serve, Scaling laws
备注:
点击查看摘要
Abstract:Scaling laws research has focused overwhelmingly on English -- yet the most prominent AI models explicitly serve billions of international users. In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages. We introduce the Adaptive Transfer Scaling Law (ATLAS) for both monolingual and multilingual pretraining, which outperforms existing scaling laws' out-of-sample generalization often by more than 0.3 R^2. Our analyses of the experiments shed light on multilingual learning dynamics, transfer properties between languages, and the curse of multilinguality. First, we derive a cross-lingual transfer matrix, empirically measuring mutual benefit scores between 38 x 38=1444 language pairs. Second, we derive a language-agnostic scaling law that reveals how to optimally scale model size and data when adding languages without sacrificing performance. Third, we identify the computational crossover points for when to pretrain from scratch versus finetune from multilingual checkpoints. We hope these findings provide the scientific foundation for democratizing scaling laws across languages, and enable practitioners to efficiently scale models -- beyond English-first AI.
151. 【2510.22028】Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
链接:https://arxiv.org/abs/2510.22028
作者:Yilin Zhang,Wenda Xu,Zhongtao Liu,Tetsuji Nakagawa,Markus Freitag
类目:Computation and Language (cs.CL)
关键词:Quality Estimation, vital in machine, reward signal, signal in tasks, Quality
备注:
点击查看摘要
Abstract:Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a preference for shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE guided reinforcement learning. To mitigate this, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length bias.
152. 【2510.22014】oward Understanding the Transferability of Adversarial Suffixes in Large Language Models
链接:https://arxiv.org/abs/2510.22014
作者:Sarah Ball,Niki Hasrati,Alexander Robey,Avi Schwarzschild,Frauke Kreuter,Zico Kolter,Andrej Risteski
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Discrete optimization-based jailbreaking, elicit disallowed content, Discrete optimization-based, language models aim, optimization-based jailbreaking attacks
备注:
点击查看摘要
Abstract:Discrete optimization-based jailbreaking attacks on large language models aim to generate short, nonsensical suffixes that, when appended onto input prompts, elicit disallowed content. Notably, these suffixes are often transferable -- succeeding on prompts and models for which they were never optimized. And yet, despite the fact that transferability is surprising and empirically well-established, the field lacks a rigorous analysis of when and why transfer occurs. To fill this gap, we identify three statistical properties that strongly correlate with transfer success across numerous experimental settings: (1) how much a prompt without a suffix activates a model's internal refusal direction, (2) how strongly a suffix induces a push away from this direction, and (3) how large these shifts are in directions orthogonal to refusal. On the other hand, we find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.
153. 【2510.22007】Optimal Detection for Language Watermarks with Pseudorandom Collision
链接:https://arxiv.org/abs/2510.22007
作者:T. Tony Cai,Xiang Li,Qi Long,Weijie J. Su,Garrett G. Wen
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Statistics Theory (math.ST); Machine Learning (stat.ML)
关键词:Text watermarking plays, mitigating misuse, watermarking plays, plays a crucial, crucial role
备注:
点击查看摘要
Abstract:Text watermarking plays a crucial role in ensuring the traceability and accountability of large language model (LLM) outputs and mitigating misuse. While promising, most existing methods assume perfect pseudorandomness. In practice, repetition in generated text induces collisions that create structured dependence, compromising Type I error control and invalidating standard analyses. We introduce a statistical framework that captures this structure through a hierarchical two-layer partition. At its core is the concept of minimal units -- the smallest groups treatable as independent across units while permitting dependence within. Using minimal units, we define a non-asymptotic efficiency measure and cast watermark detection as a minimax hypothesis testing problem. Applied to Gumbel-max and inverse-transform watermarks, our framework produces closed-form optimal rules. It explains why discarding repeated statistics often improves performance and shows that within-unit dependence must be addressed unless degenerate. Both theory and experiments confirm improved detection power with rigorous Type I error control. These results provide the first principled foundation for watermark detection under imperfect pseudorandomness, offering both theoretical insight and practical guidance for reliable tracing of model outputs.
Subjects:
Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Statistics Theory (math.ST); Machine Learning (stat.ML)
Cite as:
arXiv:2510.22007 [cs.LG]
(or
arXiv:2510.22007v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2510.22007
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
154. 【2510.21984】From Social Division to Cohesion with AI Message Suggestions in Online Chat Groups
链接:https://arxiv.org/abs/2510.21984
作者:Faria Huq,Elijah L. Claggett,Hirokazu Shirado
类目:ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
关键词:opinion diversity, difficult to sustain, sustain in societies, societies marked, marked by opinion
备注: Preprint, Under Review
点击查看摘要
Abstract:Social cohesion is difficult to sustain in societies marked by opinion diversity, particularly in online communication. As large language model (LLM)-driven messaging assistance becomes increasingly embedded in these contexts, it raises critical questions about its societal impact. We present an online experiment with 557 participants who engaged in multi-round discussions on politically controversial topics while freely reconfiguring their discussion groups. In some conditions, participants received real-time message suggestions generated by an LLM, either personalized to the individual or adapted to their group context. We find that subtle shifts in linguistic style during communication, mediated by AI assistance, can scale up to reshape collective structures. While individual-focused assistance leads users to segregate into like-minded groups, relational assistance that incorporates group members' stances enhances cohesion through more receptive exchanges. These findings demonstrate that AI-mediated communication can support social cohesion in diverse groups, but outcomes critically depend on how personalization is designed.
155. 【2510.21983】Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks
链接:https://arxiv.org/abs/2510.21983
作者:Havva Alizadeh Noughabi,Julien Serbanescu,Fattane Zarrinkalam,Ali Dehghantanha
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Language Models remain, Large Language, elicit harmful outputs, Models remain vulnerable
备注:
点击查看摘要
Abstract:Despite recent advances, Large Language Models remain vulnerable to jailbreak attacks that bypass alignment safeguards and elicit harmful outputs. While prior research has proposed various attack strategies differing in human readability and transferability, little attention has been paid to the linguistic and psychological mechanisms that may influence a model's susceptibility to such attacks. In this paper, we examine an interdisciplinary line of research that leverages foundational theories of persuasion from the social sciences to craft adversarial prompts capable of circumventing alignment constraints in LLMs. Drawing on well-established persuasive strategies, we hypothesize that LLMs, having been trained on large-scale human-generated text, may respond more compliantly to prompts with persuasive structures. Furthermore, we investigate whether LLMs themselves exhibit distinct persuasive fingerprints that emerge in their jailbreak responses. Empirical evaluations across multiple aligned LLMs reveal that persuasion-aware prompts significantly bypass safeguards, demonstrating their potential to induce jailbreak behaviors. This work underscores the importance of cross-disciplinary insight in addressing the evolving challenges of LLM safety. The code and data are available.
156. 【2510.21970】Performance Trade-offs of Optimizing Small Language Models for E-Commerce
链接:https://arxiv.org/abs/2510.21970
作者:Josip Tomo Licardo,Nikola Tankovic
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, natural language understanding, Large Language, natural language, language understanding
备注: 15 pages, 9 figures
点击查看摘要
Abstract:Large Language Models (LLMs) offer state-of-the-art performance in natural language understanding and generation tasks. However, the deployment of leading commercial models for specialized tasks, such as e-commerce, is often hindered by high computational costs, latency, and operational expenses. This paper investigates the viability of smaller, open-weight models as a resource-efficient alternative. We present a methodology for optimizing a one-billion-parameter Llama 3.2 model for multilingual e-commerce intent recognition. The model was fine-tuned using Quantized Low-Rank Adaptation (QLoRA) on a synthetically generated dataset designed to mimic real-world user queries. Subsequently, we applied post-training quantization techniques, creating GPU-optimized (GPTQ) and CPU-optimized (GGUF) versions. Our results demonstrate that the specialized 1B model achieves 99% accuracy, matching the performance of the significantly larger GPT-4.1 model. A detailed performance analysis revealed critical, hardware-dependent trade-offs: while 4-bit GPTQ reduced VRAM usage by 41%, it paradoxically slowed inference by 82% on an older GPU architecture (NVIDIA T4) due to dequantization overhead. Conversely, GGUF formats on a CPU achieved a speedup of up to 18x in inference throughput and a reduction of over 90% in RAM consumption compared to the FP16 baseline. We conclude that small, properly optimized open-weight models are not just a viable but a more suitable alternative for domain-specific applications, offering state-of-the-art accuracy at a fraction of the computational cost.
157. 【2510.21961】Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing
链接:https://arxiv.org/abs/2510.21961
作者:Iskander Azangulov,Teodora Pandeva,Niranjani Prasad,Javier Zazo,Sushrut Karmalkar
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Masked diffusion models, Masked diffusion, offer a compelling, discrete text generation, compelling alternative
备注:
点击查看摘要
Abstract:Masked diffusion models (MDMs) offer a compelling alternative to autoregressive models (ARMs) for discrete text generation because they enable parallel token sampling, rather than sequential, left-to-right generation. This means potentially much faster inference. However, effective parallel sampling faces two competing requirements: (i) simultaneously updated tokens must be conditionally independent, and (ii) updates should prioritise high-confidence predictions. These goals conflict because high-confidence predictions often cluster and depend on each other, opportunities for parallel updates. We present PUNT, a model-agnostic sampler that reconciles this trade-off. Our method identifies token dependencies and removes lower-confidence tokens from conflicting groups. This produces sets of indices for unmasking that satisfy both independence and confidence criteria. Our approach ensures improved parallel unmasking through approximate conditional independence testing. Our experiments show that PUNT delivers a superior trade-off between accuracy and compute when compared to other strong training-free baselines, especially for generation of longer sequences. On the IFEval benchmark, it achieves up to 16\% higher accuracy over baseline methods, including sequential generation (one-by-one). These gains hold across different values of hyperparameters, mitigating the need for brittle hyperparameter tuning. Moreover, we observe that PUNT induces an emergent hierarchical generation strategy, where the model first establishes high-level paragraph structure before local refinement, suggesting a planning-like generation process that contributes to strong alignment performance.
Subjects:
Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:
arXiv:2510.21961 [cs.LG]
(or
arXiv:2510.21961v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2510.21961
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
158. 【2510.21958】A Stylometric Application of Large Language Models
链接:https://arxiv.org/abs/2510.21958
作者:Harrison F. Stropkay,Jiayi Chen,Mohammad J. Latifi,Daniel N. Rockmore,Jeremy R. Manning
类目:Computation and Language (cs.CL); Digital Libraries (cs.DL)
关键词:large language models, show that large, large language, held-out text, language models
备注: All code and data needed to reproduce the results in this paper are available at [this https URL](https://github.com/ContextLab/llm-stylometry)
点击查看摘要
Abstract:We show that large language models (LLMs) can be used to distinguish the writings of different authors. Specifically, an individual GPT-2 model, trained from scratch on the works of one author, will predict held-out text from that author more accurately than held-out text from other authors. We suggest that, in this way, a model trained on one author's works embodies the unique writing style of that author. We first demonstrate our approach on books written by eight different (known) authors. We also use this approach to confirm R. P. Thompson's authorship of the well-studied 15th book of the Oz series, originally attributed to F. L. Baum.
159. 【2510.21956】ransformer Based Linear Attention with Optimized GPU Kernel Implementation
链接:https://arxiv.org/abs/2510.21956
作者:Armin Gerami,Ramani Duraiswami
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:successful Transformer architecture, Transformer architecture computes, extremely successful Transformer, original softmax-based attention, architecture computes attention
备注:
点击查看摘要
Abstract:The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$. Given the success of Transformers, improving their runtime during both training and inference is a popular research area. One such approach is the introduction of the linear attention (LA) mechanisms, which offers a linear time complexity of $O(ND^2)$ and have demonstrated comparable accuracy to regular attention. However, LA in practice lags behind its theoretical efficiency. We propose a novel method for LA's forward and backward passes, along with a highly-optimized CUDA implementation. Our approach outperforms the state-of-the-art by 3.3 times in speed and reduces memory consumption by 3.6 times. We validate these improvements in both single-layer and end-to-end settings by training a 1.4 billion parameter language model, which demonstrates similar expressivity to regular attention on major reasoning benchmarks.
160. 【2510.21954】Model-Aware Tokenizer Transfer
链接:https://arxiv.org/abs/2510.21954
作者:Mykola Haltiuk,Aleksander Smywiński-Pohl
类目:Computation and Language (cs.CL)
关键词:predefined tokenizers remain, tokenizer transfer, trained to support, support an increasing, increasing number
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation. Experiments across diverse linguistic settings show that MATT recovers a large fraction of the original model's performance within a few GPU hours, outperforming heuristic baselines. These results demonstrate that incorporating model-level signals offers a practical and effective path toward robust tokenizer transfer in multilingual LLMs.
161. 【2510.21909】Explaining and Mitigating Crosslingual Tokenizer Inequities
链接:https://arxiv.org/abs/2510.21909
作者:Catherine Arnett,Tyler A. Chang,Stella Biderman,Benjamin K. Bergen
类目:Computation and Language (cs.CL)
关键词:encode parallel text, token premiums, token premium effects, token, vocabulary size
备注: Accepted to NeurIPS 2025
点击查看摘要
Abstract:The number of tokens it takes to encode parallel text in different languages is known to vary. These disparities are called token premiums. Having high token premiums leads to less throughput during training and increases costs at inference. In this paper, we show that even after controlling for dataset size, vocabulary size, and data content, monolingual tokenizers exhibit a wide range of token premiums across languages. To understand the cross-linguistic differences that cause these token premiums, we train a suite of approximately 7,000 comparable monolingual tokenizers for 97 languages, manipulating tokenization algorithm, vocabulary size, and dataset size. We measure token premiums and test for a relationship between factors such as data similarity (between tokenizer training and evaluation), vocabulary size, and pre-tokenization. We also investigate the role of language-specific features such as writing system and word length. We find that similarity between training and test data does not impact token premiums, but vocabulary size and pre-tokenization do. While simply increasing vocabulary size does not lead to reduced token premium effects, we can determine an ``optimal'' vocabulary size for each language to achieve significantly reduced token premium effects. We also train superword tokenizers which allow merges over whitespaces, and we find that they both reduce token premium effects and improve compression overall. Thus, intervening on the vocabulary size or the pre-tokenizer significantly reduces crosslingual token premium effects.
162. 【2510.21900】Deep Literature Survey Automation with an Iterative Workflow
链接:https://arxiv.org/abs/2510.21900
作者:Hongbo Zhang,Han Cui,Yidong Wang,Yijian Tian,Qi Guo,Cunxiang Wang,Jian Wu,Chiyu Song,Yue Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:attracted increasing attention, Automatic literature survey, existing systems follow, Automatic literature, literature survey generation
备注: Preprint version
点击查看摘要
Abstract:Automatic literature survey generation has attracted increasing attention, yet most existing systems follow a one-shot paradigm, where a large set of papers is retrieved at once and a static outline is generated before drafting. This design often leads to noisy retrieval, fragmented structures, and context overload, ultimately limiting survey quality. Inspired by the iterative reading process of human researchers, we propose \ours, a framework based on recurrent outline generation, in which a planning agent incrementally retrieves, reads, and updates the outline to ensure both exploration and coherence. To provide faithful paper-level grounding, we design paper cards that distill each paper into its contributions, methods, and findings, and introduce a review-and-refine loop with visualization enhancement to improve textual flow and integrate multimodal elements such as figures and tables. Experiments on both established and emerging topics show that \ours\ substantially outperforms state-of-the-art baselines in content coverage, structural coherence, and citation quality, while producing more accessible and better-organized surveys. To provide a more reliable assessment of such improvements, we further introduce Survey-Arena, a pairwise benchmark that complements absolute scoring and more clearly positions machine-generated surveys relative to human-written ones. The code is available at this https URL\_Autosurveyv2.
163. 【2510.21894】Understanding Network Behaviors through Natural Language Question-Answering
链接:https://arxiv.org/abs/2510.21894
作者:Mingzhe Xing,Chang Tian,Jianan Zhang,Lichen Pan,Peipei Liu,Zhaoteng Yan,Yinliang Yue
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Modern large-scale networks, Modern large-scale, introduce significant complexity, increasing the risk, risk of misconfiguration
备注: Large Language Models
点击查看摘要
Abstract:Modern large-scale networks introduce significant complexity in understanding network behaviors, increasing the risk of misconfiguration. Prior work proposed to understand network behaviors by mining network configurations, typically relying on domain-specific languages interfaced with formal models. While effective, they suffer from a steep learning curve and limited flexibility. In contrast, natural language (NL) offers a more accessible and interpretable interface, motivating recent research on NL-guided network behavior understanding. Recent advances in large language models (LLMs) further enhance this direction, leveraging their extensive prior knowledge of network concepts and strong reasoning capabilities. However, three key challenges remain: 1) numerous router devices with lengthy configuration files challenge LLM's long-context understanding ability; 2) heterogeneity across devices and protocols impedes scalability; and 3) complex network topologies and protocols demand advanced reasoning abilities beyond the current capabilities of LLMs. To tackle the above challenges, we propose NetMind, a novel framework for querying networks using NL. Our approach introduces a tree-based configuration chunking strategy to preserve semantic coherence while enabling efficient partitioning. We then construct a unified fact graph as an intermediate representation to normalize vendor-specific configurations. Finally, we design a hybrid imperative-declarative language to reduce the reasoning burden on LLMs and enhance precision. We contribute a benchmark consisting of NL question-answer pairs paired with network configurations. Experiments demonstrate that NetMind achieves accurate and scalable network behavior understanding, outperforming existing baselines.
164. 【2510.21891】Embedding Trust: Semantic Isotropy Predicts Nonfactuality in Long-Form Text Generation
链接:https://arxiv.org/abs/2510.21891
作者:Dhrupad Bhardwaj,Julia Kempe,Tim G. J. Rudner
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
关键词:deploy large language, substantively accurate responses, long-form responses generated, long-form responses, large language models
备注:
点击查看摘要
Abstract:To deploy large language models (LLMs) in high-stakes application domains that require substantively accurate responses to open-ended prompts, we need reliable, computationally inexpensive methods that assess the trustworthiness of long-form responses generated by LLMs. However, existing approaches often rely on claim-by-claim fact-checking, which is computationally expensive and brittle in long-form responses to open-ended prompts. In this work, we introduce semantic isotropy -- the degree of uniformity across normalized text embeddings on the unit sphere -- and use it to assess the trustworthiness of long-form responses generated by LLMs. To do so, we generate several long-form responses, embed them, and estimate the level of semantic isotropy of these responses as the angular dispersion of the embeddings on the unit sphere. We find that higher semantic isotropy -- that is, greater embedding dispersion -- reliably signals lower factual consistency across samples. Our approach requires no labeled data, no fine-tuning, and no hyperparameter selection, and can be used with open- or closed-weight embedding models. Across multiple domains, our method consistently outperforms existing approaches in predicting nonfactuality in long-form responses using only a handful of samples -- offering a practical, low-cost approach for integrating trust assessment into real-world LLM workflows.
165. 【2510.21885】Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning
链接:https://arxiv.org/abs/2510.21885
作者:Anh Pham,Mihir Thalanki,Michael Sun,Aditya Chaloo,Ankita Gupta,Tian Xia,Aditya Mate,Ehimwenma Nosakhare,Soundararajan Srinivasan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, lose previously aligned, Large language, previously aligned safety, catastrophic forgetting
备注:
点击查看摘要
Abstract:Large language models often lose previously aligned safety behaviors when fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior work shows that adding random safety examples can mitigate this effect, but it remains unclear which examples are most effective. We propose a behavior-aware sampling framework that selects safety examples based on two complementary factors: instruction-response behavior (e.g., refusal versus compliance) and semantic diversity across harm categories. Systematic evaluation shows that this approach substantially reduces harmful outputs while maintaining helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5% additional training data. These results highlight how targeted data selection can improve the safety and efficiency of fine-tuning at scale.
166. 【2510.21884】Framework for Machine Evaluation of Reasoning Completeness in Large Language Models For Classification Tasks
链接:https://arxiv.org/abs/2510.21884
作者:Avinash Patil
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:interpretable artificial intelligence, machine learning, artificial intelligence, growing adoption, adoption of machine
备注: 12 Pages, 12 Figures, 2 tables
点击查看摘要
Abstract:The growing adoption of machine learning (ML) in sensitive domains has heightened the demand for transparent and interpretable artificial intelligence. Large Language Models (LLMs) are increasingly capable of producing natural language explanations, yet it remains unclear whether these rationales faithfully capture the predictive signals that underlie decisions. This paper introduces RACE-Reasoning Alignment for Completeness of Explanations, a systematic framework to evaluate the alignment between LLM-generated explanations and interpretable feature importance scores derived from a logistic regression baseline. We analyze four widely used text classification datasets-WIKI ONTOLOGY, AG NEWS, IMDB, and GOEMOTIONS-and compare LLM rationales against top-ranked supporting and contradicting lexical features. To capture alignment at multiple levels of granularity, RACE implements token-aware, exact string, and edit-distance matching techniques. Empirical results reveal a consistent asymmetry: correct predictions exhibit higher coverage of supporting features, while incorrect predictions are associated with elevated coverage of contradicting features. Edit-distance matching further uncovers paraphrastic overlaps, boosting coverage while preserving this asymmetry. These findings demonstrate that LLM rationales combine both surface-level and flexible evidence reuse, yet can also amplify misleading cues in error cases. RACE provides new insights into the faithfulness of LLM explanations and establishes a quantitative basis for evaluating reasoning completeness in neural language models.
167. 【2510.21883】Language Ranker: A Lightweight Ranking framework for LLM Decoding
链接:https://arxiv.org/abs/2510.21883
作者:Chenheng Zhang,Tianqi Du,Jizhe Zhang,Mingqing Xiao,Yifei Wang,Yisen Wang,Zhouchen Lin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:refining output distributions, Conventional research, large language models, output distributions, research on large
备注:
点击查看摘要
Abstract:Conventional research on large language models (LLMs) has primarily focused on refining output distributions, while paying less attention to the decoding process that transforms these distributions into final responses. Recent advances, such as scaling the computation of inference time with reward models, have underscored the importance of decoding, but these methods often suffer from high computational costs and limited applicability. In this paper, we revisit LLM generation through the lens of recommender systems, conceptualizing the decoding process as analogous to the ranking stage in recommendation pipelines. From this perspective, we observe that both traditional decoding methods and reward models exhibit clear limitations such as redundancy. Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses using features extracted by the base model. Experiments across a wide range of tasks show that Language Ranker achieves performance comparable to large-scale reward models, while requiring only 0.5M additional parameters, significantly reducing the computational overhead during both training and inference stages. This highlights the efficiency and effectiveness of our method, showcasing its potential to fully unlock the capabilities of LLMs.
168. 【2510.21881】GeoThought: A Dataset for Enhancing Mathematical Geometry Reasoning in Vision-Language Models
链接:https://arxiv.org/abs/2510.21881
作者:Nannan Shi,Chuanyu Qin,Shipeng Song,Man Luo
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:problems present unique, geometric problem solving, problem solving, Large language models, mathematical problem solving
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated strong reasoning capabilities in text-based mathematical problem solving; however, when adapted to visual reasoning tasks, particularly geometric problem solving, their performance substantially declines because geometric problems present unique challenges. Specifically, these challenges stem from two key factors: first, the intrinsic complexity of geometry requiring detailed image comprehension and multi-step reasoning, and second, the limitations of existing datasets which lack sufficient scale, diversity, and explicit reasoning traces, consequently hindering effective model training. To address these challenges, we developed the GeoThoughts dataset, a comprehensive geometric reasoning corpus with two subsets: Geo-Thought-6K with 6,243 samples and its augmented version Geo-Thought-Augmented-10K containing 10,834 samples. Each entry includes visual descriptions, step-by-step solutions, explicit reasoning chains, reflection steps, and final answers. Using this dataset, we developed GeoThought-MLLM, a mathematical reasoning multimodal model that generates detailed thinking processes during problem-solving. Our model outperforms existing benchmarks in geometric tasks, demonstrating that training with our Chain-of-Thought dataset improves geometric reasoning capabilities across both in-domain and out-of-domain settings. Finally, we analyze failure cases and observe that errors primarily arise from incorrect interpretation of mathematical concepts or spatial misjudgment. By invoking CoT to correct these mistakes, the model produces correct answers.
169. 【2510.21861】he Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems
链接:https://arxiv.org/abs/2510.21861
作者:Bentley DeVilling(Course Correct Labs, Independent Research Group)
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large language models, external feedback frequently, feedback frequently yields, frequently yields reformulation, Large language
备注: 18 pages, 2 figures. Category: cs.LG. Code and data: [this https URL](https://github.com/Course-Correct-Labs/mirror-loop)
点击查看摘要
Abstract:Large language models are often described as capable of reflective reasoning, yet recursive self-evaluation without external feedback frequently yields reformulation rather than progress. We test this prediction in a cross-provider study of 144 reasoning sequences across three models (OpenAI GPT-4o-mini, Anthropic Claude 3 Haiku, and Google Gemini 2.0 Flash) and four task families (arithmetic, code, explanation, reflection), each iterated ten times under two conditions: ungrounded self-critique and a minimal grounding intervention (a single verification step at iteration three). Mean informational change (delta I, measured via normalized edit distance) declined by 55% from early (0.193) to late (0.087) iterations in ungrounded runs, with consistent patterns across all three providers. Grounded runs showed a +28% rebound in informational change immediately after the intervention and sustained non-zero variance thereafter. Complementary measures-n-gram novelty, embedding drift, and character-level entropy-converged on the same pattern: reflection without contact tends toward informational closure. We interpret this as evidence for a structural limit on self-correction in generative reasoning: without an exchange of information with an independent verifier or environment, recursive inference approaches an attractor state of epistemic stasis. Minimal grounding functions as dissipative coupling, reintroducing informational flux. The cross-architecture consistency suggests the mirror loop arises from shared autoregressive training objectives rather than provider-specific alignment schemes. The results delineate when reflection is performative rather than epistemic and motivate design principles for grounded, cooperative reasoning. Materials and code are publicly available.
170. 【2510.21855】SIGN: Schema-Induced Games for Naming
链接:https://arxiv.org/abs/2510.21855
作者:Ryan Zhang,Herbert Woisetscläger
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:increasingly complex problems, tackling increasingly complex, large language model, complex problems, LLM
备注: AAAI 2026 Student Abstract (Oral). Code available ar [this https URL](https://github.com/ryanzhangofficial/schema-induced-games-for-naming)
点击查看摘要
Abstract:Real-world AI systems are tackling increasingly complex problems, often through interactions among large language model (LLM) agents. When these agents develop inconsistent conventions, coordination can break down. Applications such as collaborative coding and distributed planning therefore require reliable, consistent communication, and scalability is a central concern as systems grow. We introduce Schema-Induced Games for Naming (SIGN), a naming game that examines how lightweight structure can steer convention formation. We compare schema-induced communication to unconstrained natural language and find faster convergence with up to 5.8x higher agreement. These results suggest that minimal structure can act as a simple control knob for efficient multi-agent coordination, pointing toward broader applications beyond the naming game.
171. 【2510.21853】Policy Optimization Prefers The Path of Least Resistance
链接:https://arxiv.org/abs/2510.21853
作者:Debdeep Sanyal,Aakash Sen Sharma,Dhruv Kumar,Saurabh Deshpande,Murari Mandal
类目:Computation and Language (cs.CL)
关键词:refine Large Language, Large Language Models, Large Language, refine Large, Language Models
备注: 21 pages, 8 figures, 2 tables
点击查看摘要
Abstract:Policy optimization (PO) algorithms are used to refine Large Language Models for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format to elicit chain-of-thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open-ended CoT structure remains an under-studied question. We investigate this gap with an extensive suite of controlled experiments and identify a consistent principle: \textit{policy optimization consistently follows the path of least resistance}. When afforded the flexibility to interleave reasoning and response, policy optimization consistently learns to discard explicit reasoning, causing the policy to degenerate to a direct \texttt{answer}-only format. This outcome holds true across various models and algorithms. We find that this collapse in format is persistent even when the complex \texttt{thinkanswer} format is assigned up to 4x larger reward weights. We formalize this principle through a series of controlled reward decomposition experiments, demonstrating a clear hierarchy: PO systematically optimizes for the simplest reward component first, a preference that holds even when faced with mutually exclusive choices or strong incentives for more complex behaviors. Finally, we show that successful convergence on the high-reward shortcut is not a low-effort drift but is driven by the optimization process that requires the KL-regularized policy to have sufficient freedom to make a significant shift from its initial prior. Our findings reveal that granting policies the freedom to diverge is a double-edged sword: while necessary for discovering high-reward shortcuts, it also creates a powerful incentive to game the simplest aspects of the reward function, posing a critical challenge for reward hacking under alignment.
172. 【2510.21850】SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
链接:https://arxiv.org/abs/2510.21850
作者:Gyubeum Lim,Yemo Koo,Vijay Krishna Madisetti
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:long-context visual information, visual information remains, Understanding long-context visual, GUI control, information remains
备注:
点击查看摘要
Abstract:Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to reduce the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.
173. 【2510.21835】A Multimodal, Multitask System for Generating E Commerce Text Listings from Images
链接:https://arxiv.org/abs/2510.21835
作者:Nayan Kumar Singh
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Manually generating catchy, generating catchy descriptions, Manually generating, generating catchy, catchy descriptions
备注: 24 pages, 10 figures, 11 tables. Code can be found at: [this https URL](https://github.com/SinghNayanKumar/multimodal-product-lister/)
点击查看摘要
Abstract:Manually generating catchy descriptions and names is labor intensive and a slow process for retailers. Although generative AI provides an automation solution in form of Vision to Language Models (VLM), the current VLMs are prone to factual "hallucinations". Siloed, single task models are not only inefficient but also fail to capture interdependent relationships between features. To address these challenges, we propose an end to end, multi task system that generates factually grounded textual listings from a single image. The contributions of this study are two proposals for the model architecture. First, application of multi task learning approach for fine tuning a vision encoder where a single vision backbone is jointly trained on attribute prediction such as color, hemline and neck style and price regression. Second, introduction of a hierarchical generation process where the model's own predicted attributes are embedded in a prompt and fed to the text decoder to improve factual consistency. The experiments demonstrate the superiority of this architecture. The multi tasking approach outperforms both the independent price regression, with a 3.6% better R2 Value and attribute classification, with a 6.6% improvement F1 score. Critically, the hierarchical generation process proves highly effective, slashing the factual hallucination rate from 12.7% to 7.1%, a 44.5% relative reduction, compared to a non hierarchical ablation. The hierarchical approach also reduces the latency of the autoregressive text generation process by a factor of 3.5 when compared to direct vision to language model of similar size. One minor caveat is that the model does perform 3.5% worse than direct vision-to-language model on ROUGE-L score.
174. 【2510.21828】Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images
链接:https://arxiv.org/abs/2510.21828
作者:Yichi Zhang,Zhuo Chen,Lingbing Guo,Lei Liang,Wen Zhang,Huajun Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:visual modality presents, modality presents significant, presents significant challenges, current multi-modal large, multi-modal large language
备注: Work in Progress. Code and data will be released at [this https URL](https://github.com/zjukg/STAR)
点击查看摘要
Abstract:Understanding and reasoning with abstractive information from the visual modality presents significant challenges for current multi-modal large language models (MLLMs). Among the various forms of abstractive information, Multi-Modal Relational Knowledge (MMRK), which represents abstract relational structures between multi-modal entities using node-edge formats, remains largely under-explored. In particular, STructured and Abstractive Reasoning (STAR) on such data has received little attention from the research community. To bridge the dual gaps in large-scale high-quality data and capability enhancement methodologies, this paper makes the following key contributions: (i). An automatic STAR data engine capable of synthesizing images with MMRK to build multi-modal instruction data with reliable chain-of-thought thinking for various STAR tasks and (ii). A comprehsive two-stage capability enhancement training framework, accompanied by a suite of evaluation protocols tailored to different STAR tasks. Based upon these contributions, we introduce STAR-64K, a dataset comprising 64K high-quality multi-modal instruction samples, and conduct experiments across 5 open-source MLLMs. Experimental results show that our two-stage enhancement framework enables smaller 3B/7B models to significantly outperform GPT-4o in STAR. Additionally, we provide in-depth analysis regarding the effectiveness of various designs, data transferability, and scalability.
175. 【2510.21817】VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
链接:https://arxiv.org/abs/2510.21817
作者:Xiaoyu Liu,Chaoyou Fu,Chi Yan,Chu Wu,Haihan Gao,Yi-Fan Zhang,Shaoqi Dong,Cheng Qian,Bin Luo,Xiuyong Yang,Guanwu Li,Yusheng Cai,Yunhang Shen,Deqiang Jiang,Haoyu Cao,Xing Sun,Caifeng Shan,Ran He
类目:Robotics (cs.RO); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:static interaction paradigm, user interruptions dynamically, lacks the ability, static interaction, Current
备注: Homepage: [this https URL](https://lxysl.github.io/VITA-E/)
点击查看摘要
Abstract:Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm, which lacks the ability to see, hear, speak, and act concurrently as well as handle real-time user interruptions dynamically. This hinders seamless embodied collaboration, resulting in an inflexible and unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and nearly real-time interruption. The core of our approach is a dual-model architecture where two parallel VLA instances operate as an ``Active Model'' and a ``Standby Model'', allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking capabilities. We further propose a ``model-as-controller'' paradigm, where we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid platform demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.
176. 【2510.21762】A Multi-lingual Dataset of Classified Paragraphs from Open Access Scientific Publications
链接:https://arxiv.org/abs/2510.21762
作者:Eric Jeangirard
类目:Computation and Language (cs.CL); Digital Libraries (cs.DL)
关键词:clinical trial mentions, data mentions, code mentions, licensed scientific publications, trial mentions
备注:
点击查看摘要
Abstract:We present a dataset of 833k paragraphs extracted from CC-BY licensed scientific publications, classified into four categories: acknowledgments, data mentions, software/code mentions, and clinical trial mentions. The paragraphs are primarily in English and French, with additional European languages represented. Each paragraph is annotated with language identification (using fastText) and scientific domain (from OpenAlex). This dataset, derived from the French Open Science Monitor corpus and processed using GROBID, enables training of text classification models and development of named entity recognition systems for scientific literature mining. The dataset is publicly available on HuggingFace this https URL under a CC-BY license.
177. 【2510.21740】Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models
链接:https://arxiv.org/abs/2510.21740
作者:Alexa R. Tartaglini,Satchel Grant,Daniel Wurgaft,Christopher Potts,Judith E. Fan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:data visualization understanding, data visualization, Data, visualization understanding tasks, data points
备注:
点击查看摘要
Abstract:Data visualizations are vital components of many scientific articles and news stories. Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks, but the causes of failure remain unclear. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks, to precisely characterize potential sources of difficulty (e.g., extracting the position of data points, distances between them, and other summary statistics). We used FUGU to investigate three widely used VLMs. To diagnose the sources of errors produced by these models, we used activation patching and linear probes to trace information flow through models across a variety of prompting strategies. We found that some models fail to generate the coordinates of individual data points correctly, and these initial errors often lead to erroneous final responses. When these models are provided with the correct coordinates, performance improves substantially. Moreover, even when the model generates an incorrect response, the correct coordinates can be successfully read out from the latent representations in the vision encoder, suggesting that the source of these errors lies in the vision-language handoff. We further found that while providing correct coordinates helps with tasks involving one or a small number of data points, it generally worsens performance for tasks that require extracting statistical relationships across many data points. Fine-tuning models on FUGU also fails to yield ceiling performance. These findings point to architectural constraints in current VLMs that might pose significant challenges for reliable data visualization understanding.
178. 【2510.21739】Next-Generation LLM for UAV: From Natural Language to Autonomous Flight
链接:https://arxiv.org/abs/2510.21739
作者:Liangqi Yuan,Chuhao Deng,Dong-Jun Han,Inseok Hwang,Sabine Brunswicker,Christopher G. Brinton
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
关键词:Unmanned Aerial Vehicle, Large Language Models, Aerial Vehicle, Unmanned Aerial, garnered increasing attention
备注:
点击查看摘要
Abstract:With the rapid advancement of Large Language Models (LLMs), their capabilities in various automation domains, particularly Unmanned Aerial Vehicle (UAV) operations, have garnered increasing attention. Current research remains predominantly constrained to small-scale UAV applications, with most studies focusing on isolated components such as path planning for toy drones, while lacking comprehensive investigation of medium- and long-range UAV systems in real-world operational contexts. Larger UAV platforms introduce distinct challenges, including stringent requirements for airport-based take-off and landing procedures, adherence to complex regulatory frameworks, and specialized operational capabilities with elevated mission expectations. This position paper presents the Next-Generation LLM for UAV (NeLV) system -- a comprehensive demonstration and automation roadmap for integrating LLMs into multi-scale UAV operations. The NeLV system processes natural language instructions to orchestrate short-, medium-, and long-range UAV missions through five key technical components: (i) LLM-as-Parser for instruction interpretation, (ii) Route Planner for Points of Interest (POI) determination, (iii) Path Planner for waypoint generation, (iv) Control Platform for executable trajectory implementation, and (v) UAV monitoring. We demonstrate the system's feasibility through three representative use cases spanning different operational scales: multi-UAV patrol, multi-POI delivery, and multi-hop relocation. Beyond the current implementation, we establish a five-level automation taxonomy that charts the evolution from current LLM-as-Parser capabilities (Level 1) to fully autonomous LLM-as-Autopilot systems (Level 5), identifying technical prerequisites and research challenges at each stage.
179. 【2510.21716】When Robots Say No: Temporal Trust Recovery Through Explanation
链接:https://arxiv.org/abs/2510.21716
作者:Nicola Webb,Zijun Huang,Sanja Milivojevic,Chris Baber,Edmund R. Hunt
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Robotics (cs.RO)
关键词:deliver significant advantages, Mobile robots, degree of autonomy, autonomy could deliver, deliver significant
备注:
点击查看摘要
Abstract:Mobile robots with some degree of autonomy could deliver significant advantages in high-risk missions such as search and rescue and firefighting. Integrated into a human-robot team (HRT), robots could work effectively to help search hazardous buildings. User trust is a key enabler for HRT, but during a mission, trust can be damaged. With distributed situation awareness, such as when team members are working in different locations, users may be inclined to doubt a robot's integrity if it declines to immediately change its priorities on request. In this paper, we present the results of a computer-based study investigating on-mission trust dynamics in a high-stakes human-robot teaming scenario. Participants (n = 38) played an interactive firefighting game alongside a robot teammate, where a trust violation occurs owing to the robot declining to help the user immediately. We find that when the robot provides an explanation for declining to help, trust better recovers over time, albeit following an initial drop that is comparable to a baseline condition where an explanation for refusal is not provided. Our findings indicate that trust can vary significantly during a mission, notably when robots do not immediately respond to user requests, but that this trust violation can be largely ameliorated over time if adequate explanation is provided.
180. 【2510.21715】Beyond IVR Touch-Tones: Customer Intent Routing using LLMs
链接:https://arxiv.org/abs/2510.21715
作者:Sergio Rojas-Galeano
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
关键词:Interactive Voice Response, touch-tone Interactive Voice, Voice Response, Interactive Voice, rigid touch-tone Interactive
备注: Accepted for publication in the Proceedings of the Workshop on Engineering Applications 2025 (WEA 2025)
点击查看摘要
Abstract:Widespread frustration with rigid touch-tone Interactive Voice Response (IVR) systems for customer service underscores the need for more direct and intuitive language interaction. While speech technologies are necessary, the key challenge lies in routing intents from user phrasings to IVR menu paths, a task where Large Language Models (LLMs) show strong potential. Progress, however, is limited by data scarcity, as real IVR structures and interactions are often proprietary. We present a novel LLM-based methodology to address this gap. Using three distinct models, we synthesized a realistic 23-node IVR structure, generated 920 user intents (230 base and 690 augmented), and performed the routing task. We evaluate two prompt designs: descriptive hierarchical menus and flattened path representations, across both base and augmented datasets. Results show that flattened paths consistently yield higher accuracy, reaching 89.13% on the base dataset compared to 81.30% with the descriptive format, while augmentation introduces linguistic noise that slightly reduces performance. Confusion matrix analysis further suggests that low-performing routes may reflect not only model limitations but also redundancies in menu design. Overall, our findings demonstrate proof-of-concept that LLMs can enable IVR routing through a smoother, more seamless user experience -- moving customer service one step ahead of touch-tone menus.
181. 【2510.21712】DecoupleSearch: Decouple Planning and Search via Hierarchical Reward Modeling
链接:https://arxiv.org/abs/2510.21712
作者:Hao Sun,Zile Qiao,Bo Wang,Guoxin Chen,Yingyan Hou,Yong Jiang,Pengjun Xie,Fei Huang,Yan Zhang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:enhancing Large Language, Retrieval-Augmented Generation, Large Language Models, Agentic RAG, Agentic RAG introduces
备注: EMNLP 2025 Main Conference
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) systems have emerged as a pivotal methodology for enhancing Large Language Models (LLMs) through the dynamic integration of external knowledge. To further improve RAG's flexibility, Agentic RAG introduces autonomous agents into the workflow. However, Agentic RAG faces several challenges: (1) the success of each step depends on both high-quality planning and accurate search, (2) the lack of supervision for intermediate reasoning steps, and (3) the exponentially large candidate space for planning and searching. To address these challenges, we propose DecoupleSearch, a novel framework that decouples planning and search processes using dual value models, enabling independent optimization of plan reasoning and search grounding. Our approach constructs a reasoning tree, where each node represents planning and search steps. We leverage Monte Carlo Tree Search to assess the quality of each step. During inference, Hierarchical Beam Search iteratively refines planning and search candidates with dual value models. Extensive experiments across policy models of varying parameter sizes, demonstrate the effectiveness of our method.
182. 【2510.19898】BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills
链接:https://arxiv.org/abs/2510.19898
作者:Atharv Sonwane,Isadora White,Hyunji Lee,Matheus Pereira,Lucas Caccia,Minseon Kim,Zhengyan Shi,Chinmay Singh,Alessandro Sordoni,Marc-Alexandre Côté,Xingdi Yuan
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:High quality bugs, based software engineering, High quality, language model based, model based software
备注:
点击查看摘要
Abstract:High quality bugs are key to training the next generation of language model based software engineering (SWE) agents. We introduce a novel method for synthetic generation of difficult and diverse bugs. Our method instructs SWE Agents to introduce a feature into the codebase whereby they may unintentionally break tests, resulting in bugs. Prior approaches often induce an out-of-distribution effect by generating bugs intentionally (e.g. by introducing local perturbation to existing code), which does not reflect realistic development processes. We perform qualitative analysis to demonstrate that our approach for generating bugs more closely reflects the patterns found in human-authored edits. Through extensive experiments, we demonstrate that our bugs provide more efficient training data for supervised fine-tuning, outperforming other bug datasets by 2% with half the training data (1.2k vs. 3k bugs). We train on our newly generated bugs in addition to existing bug datasets to get FrogBoss a state-of-the-art 32B parameter model on SWE-bench Verified with a pass@1 of 54.6% and FrogMini a state-of-the-art 14B model on SWE-bench Verified with a pass@1 of 45.3% on SWE-bench Verified averaged over three seeds.
183. 【2510.23320】LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization
链接:https://arxiv.org/abs/2510.23320
作者:Máté Gedeon,Péter Mihajlik
类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
关键词:speaker-aware conversation simulation, automatic speech recognition, conversation simulation, designed to support, conversational dataset based
备注: Submitted to LREC 2026
点击查看摘要
Abstract:We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources that mostly rely on semantically disconnected utterances and implausible temporal gaps, LibriConvo ensures semantic coherence and realistic conversational timing. Our pipeline leverages CallHome with external VAD for reliable boundaries, applies compression to reduce unnaturally long silences, and organizes LibriTTS utterances by book to maintain contextual consistency. Acoustic realism is enhanced via a novel room impulse response selection procedure that ranks speaker-microphone configurations by spatial plausibility, balancing realism and diversity. The dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers, split in a speaker-disjoint manner for robust evaluation. Baselines show that the sortformer model outperforms the pyannote pipeline in diarization, while a fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves 7.29\% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides a valuable resource for advancing multi-speaker speech processing research with realistic conversational dynamics and controlled experimental conditions.
184. 【2510.22588】UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models
链接:https://arxiv.org/abs/2510.22588
作者:Wenming Tu,Guanrou Yang,Ruiqi Yan,Wenxi Chen,Ziyang Ma,Yipeng Kang,Kai Yu,Xie Chen,Zilong Zheng
类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
关键词:purely functional capabilities, Spoken dialogue models, fine-grained speech style, Spoken dialogue, speech style control
备注: 23 pages, 4 figures
点击查看摘要
Abstract:Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for multiple fine-grained speech style control. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on multi-dimensional control tasks designed in the UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available at: this https URL.
信息检索
1. 【2510.23544】LimRank: Less is More for Reasoning-Intensive Information Reranking
链接:https://arxiv.org/abs/2510.23544
作者:Tingyu Song,Yilun Zhao,Siyue Zhang,Chen Zhao,Arman Cohan
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Existing approaches typically, Existing approaches, approaches typically rely, computationally expensive, rely on large-scale
备注: EMNLP 2025 Main (Short)
点击查看摘要
Abstract:Existing approaches typically rely on large-scale fine-tuning to adapt LLMs for information reranking tasks, which is computationally expensive. In this work, we demonstrate that modern LLMs can be effectively adapted using only minimal, high-quality supervision. To enable this, we design LIMRANK-SYNTHESIZER, a reusable and open-source pipeline for generating diverse, challenging, and realistic reranking examples. Using this synthetic data, we fine-tune our reranker model, LIMRANK. We evaluate LIMRANK on two challenging benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and FollowIR for instruction-following retrieval. Our experiments demonstrate that LIMRANK achieves competitive performance, while being trained on less than 5% of the data typically used in prior work. Further ablation studies demonstrate the effectiveness of LIMRANK-SYNTHESIZER and the strong generalization capabilities of LIMRANK across downstream tasks, including scientific literature search and retrieval-augmented generation for knowledge-intensive problem solving.
2. 【2510.23224】Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment
链接:https://arxiv.org/abs/2510.23224
作者:Hongyi Wang,Zhengjie Zhu,Jiabo Ma,Fang Wang,Yue Shi,Bo Luo,Jili Wang,Qiuyu Cai,Xiuming Zhang,Yen-Wei Chen,Lanfen Lin,Hao Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:rapid digitization, digitization of histopathology, possibilities for computational, computational tools, retrieval
备注:
点击查看摘要
Abstract:The rapid digitization of histopathology slides has opened up new possibilities for computational tools in clinical and research workflows. Among these, content-based slide retrieval stands out, enabling pathologists to identify morphologically and semantically similar cases, thereby supporting precise diagnoses, enhancing consistency across observers, and assisting example-based education. However, effective retrieval of whole slide images (WSIs) remains challenging due to their gigapixel scale and the difficulty of capturing subtle semantic differences amid abundant irrelevant content. To overcome these challenges, we present PathSearch, a retrieval framework that unifies fine-grained attentive mosaic representations with global-wise slide embeddings aligned through vision-language contrastive learning. Trained on a corpus of 6,926 slide-report pairs, PathSearch captures both fine-grained morphological cues and high-level semantic patterns to enable accurate and flexible retrieval. The framework supports two key functionalities: (1) mosaic-based image-to-image retrieval, ensuring accurate and efficient slide research; and (2) multi-modal retrieval, where text queries can directly retrieve relevant slides. PathSearch was rigorously evaluated on four public pathology datasets and three in-house cohorts, covering tasks including anatomical site retrieval, tumor subtyping, tumor vs. non-tumor discrimination, and grading across diverse organs such as breast, lung, kidney, liver, and stomach. External results show that PathSearch outperforms traditional image-to-image retrieval frameworks. A multi-center reader study further demonstrates that PathSearch improves diagnostic accuracy, boosts confidence, and enhances inter-observer agreement among pathologists in real clinical scenarios. These results establish PathSearch as a scalable and generalizable retrieval solution for digital pathology.
3. 【2510.23104】Leveraging Hierarchical Organization for Medical Multi-document Summarization
链接:https://arxiv.org/abs/2510.23104
作者:Yi-Li Hsu,Katelyn X. Mei,Lucy Lu Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:managing cross-document relationships, requires effectively managing, effectively managing cross-document, cross-document relationships, Medical multi-document summarization
备注:
点击查看摘要
Abstract:Medical multi-document summarization (MDS) is a complex task that requires effectively managing cross-document relationships. This paper investigates whether incorporating hierarchical structures in the inputs of MDS can improve a model's ability to organize and contextualize information across documents compared to traditional flat summarization methods. We investigate two ways of incorporating hierarchical organization across three large language models (LLMs), and conduct comprehensive evaluations of the resulting summaries using automated metrics, model-based metrics, and domain expert evaluation of preference, understandability, clarity, complexity, relevance, coverage, factuality, and coherence. Our results show that human experts prefer model-generated summaries over human-written summaries. Hierarchical approaches generally preserve factuality, coverage, and coherence of information, while also increasing human preference for summaries. Additionally, we examine whether simulated judgments from GPT-4 align with human judgments, finding higher agreement along more objective evaluation facets. Our findings demonstrate that hierarchical structures can improve the clarity of medical summaries generated by models while maintaining content coverage, providing a practical way to improve human preference for generated summaries.
4. 【2510.23077】hink before Recommendation: Autonomous Reasoning-enhanced Recommender
链接:https://arxiv.org/abs/2510.23077
作者:Xiaoyu Kong,Junguang Jiang,Bin Liu,Ziru Xu,Han Zhu,Jian Xu,Bo Zheng,Jiancan Wu,Xiang Wang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:historical user-item interactions, learn user preferences, preferences from historical, core task, rating prediction tasks
备注: NeurIPS 2025 poster
点击查看摘要
Abstract:The core task of recommender systems is to learn user preferences from historical user-item interactions. With the rapid development of large language models (LLMs), recent research has explored leveraging the reasoning capabilities of LLMs to enhance rating prediction tasks. However, existing distillation-based methods suffer from limitations such as the teacher model's insufficient recommendation capability, costly and static supervision, and superficial transfer of reasoning ability. To address these issues, this paper proposes RecZero, a reinforcement learning (RL)-based recommendation paradigm that abandons the traditional multi-model and multi-stage distillation approach. Instead, RecZero trains a single LLM through pure RL to autonomously develop reasoning capabilities for rating prediction. RecZero consists of two key components: (1) "Think-before-Recommendation" prompt construction, which employs a structured reasoning template to guide the model in step-wise analysis of user interests, item features, and user-item compatibility; and (2) rule-based reward modeling, which adopts group relative policy optimization (GRPO) to compute rewards for reasoning trajectories and optimize the LLM. Additionally, the paper explores a hybrid paradigm, RecOne, which combines supervised fine-tuning with RL, initializing the model with cold-start reasoning samples and further optimizing it with RL. Experimental results demonstrate that RecZero and RecOne significantly outperform existing baseline methods on multiple benchmark datasets, validating the superiority of the RL paradigm in achieving autonomous reasoning-enhanced recommender systems.
5. 【2510.23066】Multi-Stage Field Extraction of Financial Documents with OCR and Compact Vision-Language Models
链接:https://arxiv.org/abs/2510.23066
作者:Yichao Jin,Yushuo Wang,Qishuai Zhong,Kent Chiu Jin-Chun,Kenneth Zhu Ke,Donald MacDonald
类目:Information Retrieval (cs.IR)
关键词:Medium-sized Businesses, Small and Medium-sized, compliance of Small, essential sources, assessing the wealth
备注:
点击查看摘要
Abstract:Financial documents are essential sources of information for regulators, auditors, and financial institutions, particularly for assessing the wealth and compliance of Small and Medium-sized Businesses. However, SMB documents are often difficult to parse. They are rarely born digital and instead are distributed as scanned images that are none machine readable. The scans themselves are low in resolution, affected by skew or rotation, and often contain noisy backgrounds. These documents also tend to be heterogeneous, mixing narratives, tables, figures, and multilingual content within the same report. Such characteristics pose major challenges for automated information extraction, especially when relying on end to end large Vision Language Models, which are computationally expensive, sensitive to noise, and slow when applied to files with hundreds of pages. We propose a multistage pipeline that leverages traditional image processing models and OCR extraction, together with compact VLMs for structured field extraction of large-scale financial documents. Our approach begins with image pre-processing, including segmentation, orientation detection, and size normalization. Multilingual OCR is then applied to recover page-level text. Upon analyzing the text information, pages are retrieved for coherent sections. Finally, compact VLMs are operated within these narrowed-down scopes to extract structured financial indicators. Our approach is evaluated using an internal corpus of multi-lingual, scanned financial documents. The results demonstrate that compact VLMs, together with a multistage pipeline, achieves 8.8 times higher field level accuracy relative to directly feeding the whole document into large VLMs, only at 0.7 percent of the GPU cost and 92.6 percent less end-to-end service latency.
Subjects:
Information Retrieval (cs.IR)
Cite as:
arXiv:2510.23066 [cs.IR]
(or
arXiv:2510.23066v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2510.23066
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
6. 【2510.23018】Improving Product Search Relevance with EAR-MP: A Solution for the CIKM 2025 AnalytiCup
链接:https://arxiv.org/abs/2510.23018
作者:JaeEun Lim,Soomin Kim,Jaeyong Seo,Iori Ono,Qimu Ran,Jae-woong Lee
类目:Information Retrieval (cs.IR)
关键词:Multilingual e-commerce search, user-generated queries, e-commerce search, search is challenging, challenging due
备注:
点击查看摘要
Abstract:Multilingual e-commerce search is challenging due to linguistic diversity and the noise inherent in user-generated queries. This paper documents the solution employed by our team (EAR-MP) for the CIKM 2025 AnalytiCup, which addresses two core tasks: Query-Category (QC) relevance and Query-Item (QI) relevance. Our approach first normalizes the multilingual dataset by translating all text into English, then mitigates noise through extensive data cleaning and normalization. For model training, we build on DeBERTa-v3-large and improve performance with label smoothing, self-distillation, and dropout. In addition, we introduce task-specific upgrades, including hierarchical token injection for QC and a hybrid scoring mechanism for QI. Under constrained compute, our method achieves competitive results, attaining an F1 score of 0.8796 on QC and 0.8744 on QI. These findings underscore the importance of systematic data preprocessing and tailored training strategies for building robust, resource-efficient multilingual relevance systems.
7. 【2510.22956】agging-Augmented Generation: Assisting Language Models in Finding Intricate Knowledge In Long Contexts
链接:https://arxiv.org/abs/2510.22956
作者:Anwesan Pal,Karen Hovsepian,Tinghao Guo,Mengnan Zhao,Somendra Tripathi,Nikos Kanakaris,George Mihaila,Sumit Nigam
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:effective question answering, large language models, modern flagship large, flagship large language, revealed major limitations
备注: Paper accepted at EMNLP 2025
点击查看摘要
Abstract:Recent investigations into effective context lengths of modern flagship large language models (LLMs) have revealed major limitations in effective question answering (QA) and reasoning over long and complex contexts for even the largest and most impressive cadre of models. While approaches like retrieval-augmented generation (RAG) and chunk-based re-ranking attempt to mitigate this issue, they are sensitive to chunking, embedding and retrieval strategies and models, and furthermore, rely on extensive pre-processing, knowledge acquisition and indexing steps. In this paper, we propose Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy that boosts LLM performance in long-context scenarios, without degrading and altering the integrity and composition of retrieved documents. We validate our hypothesis by augmenting two challenging and directly relevant question-answering benchmarks -- NoLima and NovelQA -- and show that tagging the context or even just adding tag definitions into QA prompts leads to consistent performance gains over the baseline -- up to 17% for 32K token contexts, and 2.9% in complex reasoning question-answering for multi-hop queries requiring knowledge across a wide span of text. Additional details are available at this https URL.
8. 【2510.22942】GTR-Mamba: Geometry-to-Tangent Routing for Hyperbolic POI Recommendation
链接:https://arxiv.org/abs/2510.22942
作者:Zhuoxuan Li,Jieyuan Pei,Tangwei Ye,Zhongyuan Lai,Zihan Liu,Fengyuan Xu,Qi Zhang,Liang Hu
类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Location-Based Social Networks, modern Location-Based Social, Graph Neural Networks, Social Networks, provide personalized recommendations
备注: 14 pages, 8 figures, 4 tables, submitted to ICDE 2026
点击查看摘要
Abstract:Next Point-of-Interest (POI) recommendation is a critical task in modern Location-Based Social Networks (LBSNs), aiming to model the complex decision-making process of human mobility to provide personalized recommendations for a user's next check-in location. Existing POI recommendation models, predominantly based on Graph Neural Networks and sequential models, have been extensively studied. However, these models face a fundamental limitation: they struggle to simultaneously capture the inherent hierarchical structure of spatial choices and the dynamics and irregular shifts of user-specific temporal contexts. To overcome this limitation, we propose GTR-Mamba, a novel framework for cross-manifold conditioning and routing. GTR-Mamba leverages the distinct advantages of different mathematical spaces for different tasks: it models the static, tree-like preference hierarchies in hyperbolic geometry, while routing the dynamic sequence updates to a novel Mamba layer in the computationally stable and efficient Euclidean tangent space. This process is coordinated by a cross-manifold channel that fuses spatio-temporal information to explicitly steer the State Space Model (SSM), enabling flexible adaptation to contextual changes. Extensive experiments on three real-world datasets demonstrate that GTR-Mamba consistently outperforms state-of-the-art baseline models in next POI recommendation.
9. 【2510.22888】MGFRec: Towards Reinforced Reasoning Recommendation with Multiple Groundings and Feedback
链接:https://arxiv.org/abs/2510.22888
作者:Shihao Cai,Chongming Gao,Haoyan Liu,Wentao Shi,Jianshan Sun,Ruiming Tang,Fuli Feng
类目:Information Retrieval (cs.IR)
关键词:large language models, actual item space, require in-depth reasoning, actual item, reasoning-based recommendation tasks
备注:
点击查看摘要
Abstract:The powerful reasoning and generative capabilities of large language models (LLMs) have inspired researchers to apply them to reasoning-based recommendation tasks, which require in-depth reasoning about user interests and the generation of recommended items. However, previous reasoning-based recommendation methods have typically performed inference within the language space alone, without incorporating the actual item space. This has led to over-interpreting user interests and deviating from real items. Towards this research gap, we propose performing multiple rounds of grounding during inference to help the LLM better understand the actual item space, which could ensure that its reasoning remains aligned with real items. Furthermore, we introduce a user agent that provides feedback during each grounding step, enabling the LLM to better recognize and adapt to user interests. Comprehensive experiments conducted on three Amazon review datasets demonstrate the effectiveness of incorporating multiple groundings and feedback. These findings underscore the critical importance of reasoning within the actual item space, rather than being confined to the language space, for recommendation tasks.
10. 【2510.22865】Civic Ground Truth in News Recommenders: A Method for Public Value Scoring
链接:https://arxiv.org/abs/2510.22865
作者:James Meese,Kyle Herbertson
类目:Information Retrieval (cs.IR)
关键词:integrate normative goals, continues to explore, integrate normative, normative goals, editorial objectives
备注: Presented at NORMalize 2025: The Third Workshop on the Normative Design and Evaluation of Recommender Systems, co-located with the ACM Conference on Recommender Systems 2025 (RecSys 2025), Prague
点击查看摘要
Abstract:Research in news recommendation systems (NRS) continues to explore the best ways to integrate normative goals such as editorial objectives and public service values into existing systems. Prior efforts have incorporated expert input or audience feedback to quantify these values, laying the groundwork for more civic-minded recommender systems. This paper contributes to that trajectory, introducing a method for embedding civic values into NRS through large-scale, structured audience evaluations. The proposed civic ground truth approach aims to generate value-based labels through a nationally representative survey that are generalisable across a wider news corpus, using automated metadata enrichment.
11. 【2510.22739】REVISION:Reflective Intent Mining and Online Reasoning Auxiliary for E-commerce Visual Search System Optimization
链接:https://arxiv.org/abs/2510.22739
作者:Yiwen Tang,Qiuyu Zhao,Zenghui Sun,Jinsong Lan,Xiaoyong Zhu,Bo Zheng,Kaifu Zhang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Taobao e-commerce visual, Taobao e-commerce, behavior analysis reveals, e-commerce visual search, implicit intent
备注:
点击查看摘要
Abstract:In Taobao e-commerce visual search, user behavior analysis reveals a large proportion of no-click requests, suggesting diverse and implicit user intents. These intents are expressed in various forms and are difficult to mine and discover, thereby leading to the limited adaptability and lag in platform strategies. This greatly restricts users' ability to express diverse intents and hinders the scalability of the visual search system. This mismatch between user implicit intent expression and system response defines the User-SearchSys Intent Discrepancy. To alleviate the issue, we propose a novel framework REVISION. This framework integrates offline reasoning mining with online decision-making and execution, enabling adaptive strategies to solve implicit user demands. In the offline stage, we construct a periodic pipeline to mine discrepancies from historical no-click requests. Leveraging large models, we analyze implicit intent factors and infer optimal suggestions by jointly reasoning over query and product metadata. These inferred suggestions serve as actionable insights for refining platform strategies. In the online stage, REVISION-R1-3B, trained on the curated offline data, performs holistic analysis over query images and associated historical products to generate optimization plans and adaptively schedule strategies across the search pipeline. Our framework offers a streamlined paradigm for integrating large models with traditional search systems, enabling end-to-end intelligent optimization across information aggregation and user interaction. Experimental results demonstrate that our approach improves the efficiency of implicit intent mining from large-scale search logs and significantly reduces the no-click rate.
12. 【2510.22733】$\text{E}^2\text{Rank}$: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
链接:https://arxiv.org/abs/2510.22733
作者:Qi Liu,Yanzhao Zhang,Mingxin Li,Dingkun Long,Pengjun Xie,Jiaxin Mao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:real-world search applications, search applications, fundamental component, component in real-world, real-world search
备注: Code and models are avaliable at [this https URL](https://alibaba-nlp.github.io/E2Rank)
点击查看摘要
Abstract:Text embedding models serve as a fundamental component in real-world search applications. By mapping queries and documents into a shared embedding space, they deliver competitive retrieval performance with high efficiency. However, their ranking fidelity remains limited compared to dedicated rerankers, especially recent LLM-based listwise rerankers, which capture fine-grained query-document and document-document interactions. In this paper, we propose a simple yet effective unified framework $\text{E}^2\text{Rank}$, means Efficient Embedding-based Ranking (also means Embedding-to-Rank), which extends a single text embedding model to perform both high-quality retrieval and listwise reranking through continued training under a listwise ranking objective, thereby achieving strong effectiveness with remarkable efficiency. By applying cosine similarity between the query and document embeddings as a unified ranking function, the listwise ranking prompt, which is constructed from the original query and its candidate documents, serves as an enhanced query enriched with signals from the top-K documents, akin to pseudo-relevance feedback (PRF) in traditional retrieval models. This design preserves the efficiency and representational quality of the base embedding model while significantly improving its reranking performance. Empirically, $\textrm{E}^2\text{Rank}$ achieves state-of-the-art results on the BEIR reranking benchmark and demonstrates competitive performance on the reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also show that the ranking training process improves embedding performance on the MTEB benchmark. Our findings indicate that a single embedding model can effectively unify retrieval and reranking, offering both computational efficiency and competitive ranking accuracy.
13. 【2510.22732】ATLAS: Actor-Critic Task-Completion with Look-ahead Action Simulation
链接:https://arxiv.org/abs/2510.22732
作者:Jiali Cheng,Anjishnu Kumar,Roshan Lal,Rishi Rajasekaran,Hani Ramezani,Omar Zia Khan,Oleg Rokhlenko,Sunny Chiu-Webster,Gang Hua,Hadi Amiri
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multiagent Systems (cs.MA); Robotics (cs.RO)
关键词:produce inefficient execution, inefficient execution plans, execution plans due, neural network fine-tuning, Look-ahead Action Simulation
备注: 9 pages, NeurIPS 2025 Workshop on Language Agents and World Models
点击查看摘要
Abstract:We observe that current state-of-the-art web-agents are unable to effectively adapt to new environments without neural network fine-tuning, without which they produce inefficient execution plans due to a lack of awareness of the structure and dynamics of the new environment. To address this limitation, we introduce ATLAS (Actor-Critic Task-completion with Look-ahead Action Simulation), a memory-augmented agent that is able to make plans grounded in a model of the environment by simulating the consequences of those actions in cognitive space. Our agent starts by building a "cognitive map" by performing a lightweight curiosity driven exploration of the environment. The planner proposes candidate actions; the simulator predicts their consequences in cognitive space; a critic analyzes the options to select the best roll-out and update the original plan; and a browser executor performs the chosen action. On the WebArena-Lite Benchmark, we achieve a 63% success rate compared to 53.9% success rate for the previously published state-of-the-art. Unlike previous systems, our modular architecture requires no website-specific LLM fine-tuning. Ablations show sizable drops without the world-model, hierarchical planner, and look-ahead-based replanner confirming their complementary roles within the design of our system
14. 【2510.22694】Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2510.22694
作者:Shu Zhao,Tianyi Shen,Nilesh Ahuja,Omesh Tickoo,Vijaykrishnan Narayanan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Large Language Models, Multimodal Large Language, Multimodal Retrieval-Augmented Generation, Language Models, Multimodal Large
备注: Accepted at NeurIPS 2025 UniReps Workshop
点击查看摘要
Abstract:Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modality to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs' ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves the generation quality by 17.07% while reducing 8.95% retrieval times.
15. 【2510.22681】Diversification as Risk Minimization
链接:https://arxiv.org/abs/2510.22681
作者:Rikiya Takehi,Fernando Diaz,Tetsuya Sakai
类目:Information Retrieval (cs.IR)
关键词:tend to remember, search session, Users tend, intents, query
备注: Preprint, accepted at WSDM 2026 (Full Paper). 16 pages, 8 figures
点击查看摘要
Abstract:Users tend to remember failures of a search session more than its many successes. This observation has led to work on search robustness, where systems are penalized if they perform very poorly on some queries. However, this principle of robustness has been overlooked within a single query. An ambiguous or underspecified query (e.g., ``jaguar'') can have several user intents, where popular intents often dominate the ranking, leaving users with minority intents unsatisfied. Although the diversification literature has long recognized this issue, existing metrics only model the average relevance across intents and provide no robustness guarantees. More surprisingly, we show theoretically and empirically that many well-known diversification algorithms are no more robust than a naive, non-diversified algorithm. To address this critical gap, we propose to frame diversification as a risk-minimization problem. We introduce VRisk, which measures the expected risk faced by the least-served fraction of intents in a query. Optimizing VRisk produces a robust ranking, reducing the likelihood of poor user experiences. We then propose VRisker, a fast greedy re-ranker with provable approximation guarantees. Finally, experiments on NTCIR INTENT-2, TREC Web 2012, and MovieLens show the vulnerability of existing methods. VRisker reduces worst-case intent failures by up to 33% with a minimal 2% drop in average performance.
16. 【2510.22670】ools are under-documented: Simple Document Expansion Boosts Tool Retrieval
链接:https://arxiv.org/abs/2510.22670
作者:Xuan Lu,Haohang Huang,Rui Meng,Yaohui Jin,Wenjun Zeng,Xiaoyu Shen
类目:Information Retrieval (cs.IR)
关键词:Large Language Models, Large Language, recently demonstrated strong, demonstrated strong capabilities, heterogeneous tool documentation
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have recently demonstrated strong capabilities in tool use, yet progress in tool retrieval remains hindered by incomplete and heterogeneous tool documentation. To address this challenge, we introduce Tool-DE, a new benchmark and framework that systematically enriches tool documentation with structured fields to enable more effective tool retrieval, together with two dedicated models, Tool-Embed and Tool-Rank. We design a scalable document expansion pipeline that leverages both open- and closed-source LLMs to generate, validate, and refine enriched tool profiles at low cost, producing large-scale corpora with 50k instances for embedding-based retrievers and 200k for rerankers. On top of this data, we develop two models specifically tailored for tool retrieval: Tool-Embed, a dense retriever, and Tool-Rank, an LLM-based reranker. Extensive experiments on ToolRet and Tool-DE demonstrate that document expansion substantially improves retrieval performance, with Tool-Embed and Tool-Rank achieving new state-of-the-art results on both benchmarks. We further analyze the contribution of individual fields to retrieval effectiveness, as well as the broader impact of document expansion on both training and evaluation. Overall, our findings highlight both the promise and limitations of LLM-driven document expansion, positioning Tool-DE, along with the proposed Tool-Embed and Tool-Rank, as a foundation for future research in tool retrieval.
17. 【2510.22590】ATOM: AdapTive and OptiMized dynamic temporal knowledge graph construction using LLMs
链接:https://arxiv.org/abs/2510.22590
作者:Yassir Lairgi,Ludovic Moncla,Khalid Benabdeslem,Rémy Cazabet,Pierre Cléau
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:today rapidly expanding, expanding data landscape, rapidly expanding data, dynamic memory frameworks, Temporal Knowledge Graphs
备注:
点击查看摘要
Abstract:In today's rapidly expanding data landscape, knowledge extraction from unstructured text is vital for real-time analytics, temporal inference, and dynamic memory frameworks. However, traditional static knowledge graph (KG) construction often overlooks the dynamic and time-sensitive nature of real-world data, limiting adaptability to continuous changes. Moreover, recent zero- or few-shot approaches that avoid domain-specific fine-tuning or reliance on prebuilt ontologies often suffer from instability across multiple runs, as well as incomplete coverage of key facts. To address these challenges, we introduce ATOM (AdapTive and OptiMized), a few-shot and scalable approach that builds and continuously updates Temporal Knowledge Graphs (TKGs) from unstructured texts. ATOM splits input documents into minimal, self-contained "atomic" facts, improving extraction exhaustivity and stability. Then, it constructs atomic TKGs from these facts while employing a dual-time modeling that distinguishes when information is observed from when it is valid. The resulting atomic TKGs are subsequently merged in parallel. Empirical evaluations demonstrate that ATOM achieves ~18% higher exhaustivity, ~17% better stability, and over 90% latency reduction compared to baseline methods, demonstrating a strong scalability potential for dynamic TKG construction.
18. 【2510.22521】Open Multimodal Retrieval-Augmented Factual Image Generation
链接:https://arxiv.org/abs/2510.22521
作者:Yang Tian,Fan Liu,Jingyuan Zhang,Wei Bi,Yupeng Hu,Liqiang Nie
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Large Multimodal Models, achieved remarkable progress, involve fine-grained attributes, contradict verifiable knowledge, prompts involve fine-grained
备注: Preprint
点击查看摘要
Abstract:Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.
19. 【2510.22344】FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2510.22344
作者:Mohammad Aghajani Asl,Majid Asgari-Bidhendi,Behrooz Minaei-Bidgoli
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Large Language Models, Language Models, Large Language, staleness in Large, require synthesizing information
备注: 30 pages, 5 figures, 5 tables. Keywords: Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), Agentic AI, Multi-hop Question Answering, Faithfulness
点击查看摘要
Abstract:While Retrieval-Augmented Generation (RAG) mitigates hallucination and knowledge staleness in Large Language Models (LLMs), existing frameworks often falter on complex, multi-hop queries that require synthesizing information from disparate sources. Current advanced RAG methods, employing iterative or adaptive strategies, lack a robust mechanism to systematically identify and fill evidence gaps, often propagating noise or failing to gather a comprehensive context. We introduce FAIR-RAG, a novel agentic framework that transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning process. At its core is an Iterative Refinement Cycle governed by a module we term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating mechanism: it deconstructs the initial query into a checklist of required findings and audits the aggregated evidence to identify confirmed facts and, critically, explicit informational gaps. These gaps provide a precise signal to an Adaptive Query Refinement agent, which generates new, targeted sub-queries to retrieve missing information. This cycle repeats until the evidence is verified as sufficient, ensuring a comprehensive context for a final, strictly faithful generation. We conducted experiments on challenging multi-hop QA benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified experimental setup, FAIR-RAG significantly outperforms strong baselines. On HotpotQA, it achieves an F1-score of 0.453 -- an absolute improvement of 8.3 points over the strongest iterative baseline -- establishing a new state-of-the-art for this class of methods on these benchmarks. Our work demonstrates that a structured, evidence-driven refinement process with explicit gap analysis is crucial for unlocking reliable and accurate reasoning in advanced RAG systems for complex, knowledge-intensive tasks.
20. 【2510.22264】PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding
链接:https://arxiv.org/abs/2510.22264
作者:Iliass Ayaou,Denis Cavallucci
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:prior art search, capture patent-specific challenges, enable prior art, inadequately capture patent-specific, technology landscaping
备注:
点击查看摘要
Abstract:Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15 tasks across retrieval, classification, paraphrase, and clustering, with 2.06 million examples. PatenTEB employs domain-stratified splits, domain specific hard negative mining, and systematic coverage of asymmetric fragment-to-document matching scenarios absent from general embedding benchmarks. We develop the patembed model family through multi-task training, spanning 67M to 344M parameters with context lengths up to 4096 tokens. External validation shows strong generalization: patembed-base achieves state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445 previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM. Systematic ablations reveal that multi-task training improves external generalization despite minor benchmark costs, and that domain-pretrained initialization provides consistent advantages across task families. All resources will be made available at this https URL. Keywords: patent retrieval, sentence embeddings, multi-task learning, asymmetric retrieval, benchmark evaluation, contrastive learning.
21. 【2510.22242】PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading
链接:https://arxiv.org/abs/2510.22242
作者:Yutao Wu,Xiao Liu,Yunhao Feng,Jiale Ding,Xingjun Ma
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, tasks remains under-evaluated, increasingly serve
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) increasingly serve as research assistants, yet their reliability in scholarly tasks remains under-evaluated. In this work, we introduce PaperAsk, a benchmark that systematically evaluates LLMs across four key research tasks: citation retrieval, content extraction, paper discovery, and claim verification. We evaluate GPT-4o, GPT-5, and Gemini-2.5-Flash under realistic usage conditions-via web interfaces where search operations are opaque to the user. Through controlled experiments, we find consistent reliability failures: citation retrieval fails in 48-98% of multi-reference queries, section-specific content extraction fails in 72-91% of cases, and topical paper discovery yields F1 scores below 0.32, missing over 60% of relevant literature. Further human analysis attributes these failures to the uncontrolled expansion of retrieved context and the tendency of LLMs to prioritize semantically relevant text over task instructions. Across basic tasks, the LLMs display distinct failure behaviors: ChatGPT often withholds responses rather than risk errors, whereas Gemini produces fluent but fabricated answers. To address these issues, we develop lightweight reliability classifiers trained on PaperAsk data to identify unreliable outputs. PaperAsk provides a reproducible and diagnostic framework for advancing the reliability evaluation of LLM-based scholarly assistance systems.
22. 【2510.22215】Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy
链接:https://arxiv.org/abs/2510.22215
作者:Juyeon Kim,Geon Lee,Dongwon Choi,Taeuk Kim,Kijung Shin
类目:Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
关键词:enterprise knowledge management, scientific search, legal discovery, knowledge management, documents is essential
备注:
点击查看摘要
Abstract:Retrieval over visually rich documents is essential for tasks such as legal discovery, scientific search, and enterprise knowledge management. Existing approaches fall into two paradigms: single-vector retrieval, which is efficient but coarse, and multi-vector retrieval, which is accurate but computationally expensive. To address this trade-off, we propose HEAVEN, a two-stage hybrid-vector framework. In the first stage, HEAVEN efficiently retrieves candidate pages using a single-vector method over Visually-Summarized Pages (VS-Pages), which assemble representative visual layouts from multiple pages. In the second stage, it reranks candidates with a multi-vector method while filtering query tokens by linguistic importance to reduce redundant computations. To evaluate retrieval systems under realistic conditions, we also introduce ViMDOC, the first benchmark for visually rich, multi-document, and long-document retrieval. Across four benchmarks, HEAVEN attains 99.87% of the Recall@1 performance of multi-vector models on average while reducing per-query computation by 99.82%, achieving efficiency and accuracy. Our code and datasets are available at: this https URL
23. 【2510.22101】Scaling Up Efficient Small Language Models Serving and Deployment for Semantic Job Search
链接:https://arxiv.org/abs/2510.22101
作者:Kayhan Behdin,Qingquan Song,Sriram Vasudevan,Jian Sheng,Xiaojing Ma,Z Zhou,Chuanrui Zhu,Guoyao Li,Chanh Nguyen,Sayan Ghosh,Hejian Sang,Ata Fatahi Baarzi,Sundara Raman Ramachandran,Xiaoqing Wang,Qing Lan,Vinay Y S,Qi Guo,Caleb Johnson,Zhipeng Wang,Fedor Borisyuk
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Small Language Model, demonstrated impressive quality, demonstrated impressive
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated impressive quality when applied to predictive tasks such as relevance ranking and semantic search. However, deployment of such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. Particularly, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to $40\%$ while maintaining the accuracy. Additionally, we present context compression techniques that allow us to reduce the input context length by up to $10$x with minimal loss of accuracy. Finally, we present practical lessons from optimizing the serving infrastructure for deploying such a system on GPUs at scale, serving millions of requests per second. Taken together, this allows us to increase our system's throughput by $10$x in a real-world deployment, while meeting our quality bar.
24. 【2510.22055】A Benchmark for Open-Domain Numerical Fact-Checking Enhanced by Claim Decomposition
链接:https://arxiv.org/abs/2510.22055
作者:V Venktesh,Deepali Prabhu,Avishek Anand
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:Fact-checking numerical claims, numbers provide mirage, fake potentially causing, potentially causing catastrophic, causing catastrophic impacts
备注: 16 pages
点击查看摘要
Abstract:Fact-checking numerical claims is critical as the presence of numbers provide mirage of veracity despite being fake potentially causing catastrophic impacts on society. The prior works in automatic fact verification do not primarily focus on natural numerical claims. A typical human fact-checker first retrieves relevant evidence addressing the different numerical aspects of the claim and then reasons about them to predict the veracity of the claim. Hence, the search process of a human fact-checker is a crucial skill that forms the foundation of the verification process. Emulating a real-world setting is essential to aid in the development of automated methods that encompass such skills. However, existing benchmarks employ heuristic claim decomposition approaches augmented with weakly supervised web search to collect evidences for verifying claims. This sometimes results in less relevant evidences and noisy sources with temporal leakage rendering a less realistic retrieval setting for claim verification. Hence, we introduce QuanTemp++: a dataset consisting of natural numerical claims, an open domain corpus, with the corresponding relevant evidence for each claim. The evidences are collected through a claim decomposition process approximately emulating the approach of human fact-checker and veracity labels ensuring there is no temporal leakage. Given this dataset, we also characterize the retrieval performance of key claim decomposition paradigms. Finally, we observe their effect on the outcome of the verification pipeline and draw insights. The code for data pipeline along with link to data can be found at this https URL
25. 【2510.22049】Massive Memorization with Hundreds of Trillions of Parameters for Sequential Transducer Generative Recommenders
链接:https://arxiv.org/abs/2510.22049
作者:Zhimin Chen,Chenyu Zhao,Ka Chun Mo,Yunjiang Jiang,Jane H. Lee,Shouwei Chen,Khushhall Chandra Mahajan,Ning Jiang,Kai Ren,Jinhui Li,Wen-Yun Yang
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Modern large-scale recommendation, Modern large-scale, systems rely heavily, rely heavily, recommendation systems rely
备注:
点击查看摘要
Abstract:Modern large-scale recommendation systems rely heavily on user interaction history sequences to enhance the model performance. The advent of large language models and sequential modeling techniques, particularly transformer-like architectures, has led to significant advancements recently (e.g., HSTU, SIM, and TWIN models). While scaling to ultra-long user histories (10k to 100k items) generally improves model performance, it also creates significant challenges on latency, queries per second (QPS) and GPU cost in industry-scale recommendation systems. Existing models do not adequately address these industrial scalability issues. In this paper, we propose a novel two-stage modeling framework, namely VIrtual Sequential Target Attention (VISTA), which decomposes traditional target attention from a candidate item to user history items into two distinct stages: (1) user history summarization into a few hundred tokens; followed by (2) candidate item attention to those tokens. These summarization token embeddings are then cached in storage system and then utilized as sequence features for downstream model training and inference. This novel design for scalability enables VISTA to scale to lifelong user histories (up to one million items) while keeping downstream training and inference costs fixed, which is essential in industry. Our approach achieves significant improvements in offline and online metrics and has been successfully deployed on an industry leading recommendation platform serving billions of users.
26. 【2510.22023】Multimodal Item Scoring for Natural Language Recommendation via Gaussian Process Regression with LLM Relevance Judgments
链接:https://arxiv.org/abs/2510.22023
作者:Yifan Liu,Qianfeng Wen,Jiazhou Liang,Mark Zhao,Justin Cui,Anton Korikov,Armin Torogh,Junyoung Kim,Scott Sanner
类目:Information Retrieval (cs.IR)
关键词:Natural Language Recommendation, Natural Language, Language Recommendation, generates item suggestions, item suggestions based
备注: 16 pages,20 figures
点击查看摘要
Abstract:Natural Language Recommendation (NLRec) generates item suggestions based on the relevance between user-issued NL requests and NL item description passages. Existing NLRec approaches often use Dense Retrieval (DR) to compute item relevance scores from aggregation of inner products between user request embeddings and relevant passage embeddings. However, DR views the request as the sole relevance label, thus leading to a unimodal scoring function centered on the query embedding that is often a weak proxy for query relevance. To better capture the potential multimodal distribution of the relevance scoring function that may arise from complex NLRec data, we propose GPR-LLM that uses Gaussian Process Regression (GPR) with LLM relevance judgments for a subset of candidate passages. Experiments on four NLRec datasets and two LLM backbones demonstrate that GPR-LLM with an RBF kernel, capable of modeling multimodal relevance scoring functions, consistently outperforms simpler unimodal kernels (dot product, cosine similarity), as well as baseline methods including DR, cross-encoder, and pointwise LLM-based relevance scoring by up to 65%. Overall, GPR-LLM provides an efficient and effective approach to NLRec within a minimal LLM labeling budget.
27. 【2510.21962】mporal Graph Theoretic Analysis of Geopolitical Dynamics in the U.S. Entity List
链接:https://arxiv.org/abs/2510.21962
作者:Yunsen Lei,Kexin Bai,Quan Li,H. Howie Huang
类目:Information Retrieval (cs.IR)
关键词:America most prominent, economic statecraft, prominent tools, tools of economic, Entity List
备注: 13 pages, 9 figures. Under review
点击查看摘要
Abstract:Export controls have become one of America's most prominent tools of economic statecraft. They aim to block rival countries' access to sensitive technologies, safeguard U.S. supply chains, protect national security, and shape geopolitical competition. Among various instruments, the U.S. Entity List has emerged as the most salient, yet its dynamics remain underexplored. This paper introduces a novel temporal graph framework that transforms the Entity List documents from a static registry of foreign entities of concern into a dynamic representation of geopolitical strategy. We construct the first event-based dataset of U.S. government foreign entity designations and model them as a temporal bipartite graph. Building on this representation, we develop a multi-level analytical approach that reveals shifting roles, enforcement strategy, and broader sanction ecosystems. Applied to 25 years of data, the framework uncovers dynamic patterns of escalation, persistence, and coordination that static views cannot capture. More broadly, our study demonstrates how temporal graph analysis offers systematic computational insights into the geopolitical dynamics of export controls.
28. 【2510.21862】A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model
链接:https://arxiv.org/abs/2510.21862
作者:Muhammad Tayyab Khan,Zane Yong,Lequn Chen,Wenhe Feng,Nicholas Yew Jin Tan,Seung Ki Moon
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:conveying design intent, design intent, production details, primary medium, medium for conveying
备注: This draft has been submitted to the 13th International Conference on Industrial Engineering and Applications (ICIEA 2026)
点击查看摘要
Abstract:Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging using manual methods, generic optical character recognition (OCR) systems, or traditional deep learning approaches, due to varied layouts, orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including measures, GDT symbols, and surface roughness indicators. The third stage employs two Donut-based, OCR-free VLMs for semantic content parsing: the Alphabetical VLM extracts textual and categorical information from title blocks and notes, while the Numerical VLM interprets quantitative data such as measures, GDT frames, and surface roughness. Two specialized datasets were developed to ensure robustness and generalization: 1,000 drawings for layout detection and 1,406 for annotation-level training. The Alphabetical VLM achieved an overall F1 score of 0.672, while the Numerical VLM reached 0.963, demonstrating strong performance in textual and quantitative interpretation, respectively. The unified JSON output enables seamless integration with CAD and manufacturing databases, providing a scalable solution for intelligent engineering drawing analysis.
29. 【2510.21831】Development of an Automated Web Application for Efficient Web Scraping: Design and Implementation
链接:https://arxiv.org/abs/2510.21831
作者:Alok Dutta,Nilanjana Roy,Rhythm Sen,Sougata Dutta,Prabhat Das
类目:Information Retrieval (cs.IR); Software Engineering (cs.SE)
关键词:web scraping, web scraping process, presents the design, design and implementation, simplifies and optimizes
备注:
点击查看摘要
Abstract:This paper presents the design and implementation of a user-friendly, automated web application that simplifies and optimizes the web scraping process for non-technical users. The application breaks down the complex task of web scraping into three main stages: fetching, extraction, and execution. In the fetching stage, the application accesses target websites using the HTTP protocol, leveraging the requests library to retrieve HTML content. The extraction stage utilizes powerful parsing libraries like BeautifulSoup and regular expressions to extract relevant data from the HTML. Finally, the execution stage structures the data into accessible formats, such as CSV, ensuring the scraped content is organized for easy use. To provide personalized and secure experiences, the application includes user registration and login functionalities, supported by MongoDB, which stores user data and scraping history. Deployed using the Flask framework, the tool offers a scalable, robust environment for web scraping. Users can easily input website URLs, define data extraction parameters, and download the data in a simplified format, without needing technical expertise. This automated tool not only enhances the efficiency of web scraping but also democratizes access to data extraction by empowering users of all technical levels to gather and manage data tailored to their needs. The methodology detailed in this paper represents a significant advancement in making web scraping tools accessible, efficient, and easy to use for a broader audience.
30. 【2510.21825】10 Simple Rules for Improving Your Standardized Fields and Terms
链接:https://arxiv.org/abs/2510.21825
作者:Rhiannon Cameron(1),Emma Griffiths(1),Damion Dooley(1),William Hsiao(1) ((1) Centre for Infectious Disease Genomics and One Health, Faculty of Health Sciences, Simon Fraser University, Burnaby, BC, Canada)
类目:Digital Libraries (cs.DL); Information Retrieval (cs.IR)
关键词:unsung hero, contextual data harmonization, data, structured vocabularies make, Abstract
备注: 17 pages, 1 figure Author Contributions: Conceptualization by EG and RC. Manuscript writing by RC. Revisions and Editing by RC, EG, DD, and WH. Acknowledgements: Charlotte Barclay
点击查看摘要
Abstract:Contextual metadata is the unsung hero of research data. When done right, standardized and structured vocabularies make your data findable, shareable, and reusable. When done wrong, they turn a well intended effort into data cleanup and curation nightmares. In this paper we tackle the surprisingly tricky process of vocabulary standardization with a mix of practical advice and grounded examples. Drawing from real-world experience in contextual data harmonization, we highlight common challenges (e.g., semantic noise and concept bombs) and provide actionable strategies to address them. Our rules emphasize alignment with Findability, Accessibility, Interoperability, and Reusability (FAIR) principles while remaining adaptable to evolving user and research needs. Whether you are curating datasets, designing a schema, or contributing to a standards body, these rules aim to help you create metadata that is not only technically sound but also meaningful to users.
31. 【2510.21812】Unifying Inductive, Cross-Domain, and Multimodal Learning for Robust and Generalizable Recommendation
链接:https://arxiv.org/abs/2510.21812
作者:Chanyoung Chung,Kyeongryul Lee,Sunbin Park,Joyce Jiyoung Whang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:incorporating diverse information, diverse information sources, Recommender systems, information sources, systems have long
备注: 7 pages, 3 figures, and 4 tables. International Workshop on Multimodal Generative Search and Recommendation (MMGenSR) at The 34th ACM International Conference on Information and Knowledge Management (CIKM 2025)
点击查看摘要
Abstract:Recommender systems have long been built upon the modeling of interactions between users and items, while recent studies have sought to broaden this paradigm by generalizing to new users and items, incorporating diverse information sources, and transferring knowledge across domains. Nevertheless, these efforts have largely focused on individual aspects, hindering their ability to tackle the complex recommendation scenarios that arise in daily consumptions across diverse domains. In this paper, we present MICRec, a unified framework that fuses inductive modeling, multimodal guidance, and cross-domain transfer to capture user contexts and latent preferences in heterogeneous and incomplete real-world data. Moving beyond the inductive backbone of INMO, our model refines expressive representations through modality-based aggregation and alleviates data sparsity by leveraging overlapping users as anchors across domains, thereby enabling robust and generalizable recommendation. Experiments show that MICRec outperforms 12 baselines, with notable gains in domains with limited training data.
32. 【2510.21805】DiffGRM: Diffusion-based Generative Recommendation Model
链接:https://arxiv.org/abs/2510.21805
作者:Zhao Liu,Yichen Zhu,Yiqing Yang,Guoping Tang,Rui Huang,Qiang Luo,Xiao Lv,Ruiming Tang,Kun Gai,Guorui Zhou
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:user history, SID conditioned, emerging paradigm, paradigm that represents, autoregressively generating
备注: 13 pages, 5 figures
点击查看摘要
Abstract:Generative recommendation (GR) is an emerging paradigm that represents each item via a tokenizer as an n-digit semantic ID (SID) and predicts the next item by autoregressively generating its SID conditioned on the user's history. However, two structural properties of SIDs make ARMs ill-suited. First, intra-item consistency: the n digits jointly specify one item, yet the left-to-right causality trains each digit only under its prefix and blocks bidirectional cross-digit evidence, collapsing supervision to a single causal path. Second, inter-digit heterogeneity: digits differ in semantic granularity and predictability, while the uniform next-token objective assigns equal weight to all digits, overtraining easy digits and undertraining hard digits. To address these two issues, we propose DiffGRM, a diffusion-based GR model that replaces the autoregressive decoder with a masked discrete diffusion model (MDM), thereby enabling bidirectional context and any-order parallel generation of SID digits for recommendation. Specifically, we tailor DiffGRM in three aspects: (1) tokenization with Parallel Semantic Encoding (PSE) to decouple digits and balance per-digit information; (2) training with On-policy Coherent Noising (OCN) that prioritizes uncertain digits via coherent masking to concentrate supervision on high-value signals; and (3) inference with Confidence-guided Parallel Denoising (CPD) that fills higher-confidence digits first and generates diverse Top-K candidates. Experiments show consistent gains over strong generative and discriminative recommendation baselines on multiple datasets, improving NDCG@10 by 6.9%-15.5%. Code is available at this https URL.
33. 【2510.21737】From Factoid Questions to Data Product Requests: Benchmarking Data Product Discovery over Tables and Text
链接:https://arxiv.org/abs/2510.21737
作者:Liangliang Zhang,Nandana Mihindukulasooriya,Niharika S. D'Souza,Sola Shirai,Sarthak Dash,Yao Ma,Horst Samulowitz
类目:Information Retrieval (cs.IR)
关键词:self-contained assets designed, Data Product, data product discovery, Data Product Requests, Data
备注: 9 pages, 1 figure, 2 tables
点击查看摘要
Abstract:Data products are reusable, self-contained assets designed for specific business use cases. Automating their discovery and generation is of great industry interest, as it enables discovery in large data lakes and supports analytical Data Product Requests (DPRs). Currently, there is no benchmark established specifically for data product discovery. Existing datasets focus on answering single factoid questions over individual tables rather than collecting multiple data assets for broader, coherent products. To address this gap, we introduce DPBench, the first user-request-driven data product benchmark over hybrid table-text corpora. Our framework systematically repurposes existing table-text QA datasets by clustering related tables and passages into coherent data products, generating professional-level analytical requests that span both data sources, and validating benchmark quality through multi-LLM evaluation. DPBench preserves full provenance while producing actionable, analyst-like data product requests. Baseline experiments with hybrid retrieval methods establish the feasibility of DPR evaluation, reveal current limitations, and point to new opportunities for automatic data product discovery research. Code and datasets are available at: this https URL
Comments:
9 pages, 1 figure, 2 tables
Subjects:
Information Retrieval (cs.IR)
MSC classes:
68T30, 68T50
ACMclasses:
I.2.7; I.2.4; H.3.3
Cite as:
arXiv:2510.21737 [cs.IR]
(or
arXiv:2510.21737v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2510.21737
Focus to learn more
arXiv-issued DOI via DataCite</p>
34. 【2510.21733】Augmenting Researchy Questions with Sub-question Judgments
链接:https://arxiv.org/abs/2510.21733
作者:Jia-Huei Ju,Eugene Yang,Trevor Adriaanse,Andrew Yates
类目:Information Retrieval (cs.IR)
关键词:Researchy Questions dataset, Researchy Questions, require retrieving information, require retrieving, question queries
备注: 3 pages
点击查看摘要
Abstract:The Researchy Questions dataset provides about 100k question queries with complex information needs that require retrieving information about several aspects of a topic. Each query in ResearchyQuestions is associated with sub-questions that were produced by prompting GPT-4. While ResearchyQuestions contains labels indicating what documents were clicked after issuing the query, there are no associations in the dataset between sub-questions and relevant documents. In this work, we augment the Researchy Questions dataset with LLM-judged labels for each sub-question using a Llama3.3 70B model. We intend these sub-question labels to serve as a resource for training retrieval models that better support complex information needs.
35. 【2510.21730】riMat: Context-aware Recommendation by Tri-Matrix Factorization
链接:https://arxiv.org/abs/2510.21730
作者:Hao Wang
类目:Information Retrieval (cs.IR)
关键词:technology of Web, frontier of Web, Search engine, Context-aware Recommender Systems, recommender systems
备注:
点击查看摘要
Abstract:Search engine is the symbolic technology of Web 2.0, and many people used to believe recommender systems is the new frontier of Web 3.0. In the past 10 years, with the advent of TikTok and similar apps, recommender systems has materialized the vision of the machine learning pioneers. However, many research topics of the field remain unfixed until today. One such topic is CARS (Context-aware Recommender Systems) , which is largely a theoretical topic without much advance in real-world applications. In this paper, we utilize tri-matrix factorization technique to incorporate contextual information into our matrix factorization framework, and prove that our technique is effective in improving both the accuracy and fairness metrics in our experiments.
36. 【2510.21729】CustomIR: Unsupervised Fine-Tuning of Dense Embeddings for Known Document Corpora
链接:https://arxiv.org/abs/2510.21729
作者:Nathan Paull
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:Dense embedding models, Dense embedding, modern information retrieval, pre-training distribution, critical for modern
备注:
点击查看摘要
Abstract:Dense embedding models have become critical for modern information retrieval, particularly in RAG pipelines, but their performance often degrades when applied to specialized corpora outside their pre-training distribution. To address thi we introduce CustomIR, a framework for unsupervised adaptation of pre-trained language embedding models to domain-specific corpora using synthetically generated query-document pairs. CustomIR leverages large language models (LLMs) to create diverse queries grounded in a known target corpus, paired with LLM-verified hard negatives, eliminating the need for costly human annotation. Experiments on enterprise email and messaging datasets show that CustomIR consistently improves retrieval effectiveness with small models gaining up to 2.3 points in Recall@10. This performance increase allows these small models to rival the performance of much larger alternatives, allowing for cheaper RAG deployments. These results highlight that targeted synthetic fine-tuning offers a scalable and cost-efficient strategy for increasing domain-specific performance.
37. 【2510.21728】Modeling Bias Evolution in Fashion Recommender Systems: A System Dynamics Approach
链接:https://arxiv.org/abs/2510.21728
作者:Mahsa Goodarzi,M. Abdullah Canbaz
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:existing societal stereotypes, amplifies existing societal, Fashion Recommender Systems, fashion e-commerce, societal stereotypes
备注: Published in the proceedings of the 43rd International System Dynamics Conference (ISDC 25): [this https URL](https://proceedings.systemdynamics.org/2025/papers/P1254.pdf)
点击查看摘要
Abstract:Bias in recommender systems not only distorts user experience but also perpetuates and amplifies existing societal stereotypes, particularly in sectors like fashion e-commerce. This study employs a dynamic modeling approach to scrutinize the mechanisms of bias activation and reinforcement within Fashion Recommender Systems (FRS). By leveraging system dynamics modeling and experimental simulations, we dissect the temporal evolution of bias and its multifaceted impacts on system performance. Our analysis reveals that inductive biases exert a more substantial influence on system outcomes than user biases, suggesting critical areas for intervention. We demonstrate that while current debiasing strategies, including data rebalancing and algorithmic regularization, are effective to an extent, they require further enhancement to comprehensively mitigate biases. This research underscores the necessity for advancing these strategies and extending system boundaries to incorporate broader contextual factors such as user demographics and item diversity, aiming to foster inclusivity and fairness in FRS. The findings advocate for a proactive approach in recommender system design to counteract bias propagation and ensure equitable user experiences.
38. 【2510.21727】Your Dense Retriever is Secretly an Expeditious Reasoner
链接:https://arxiv.org/abs/2510.21727
作者:Yichi Zhang,Jun Bai,Zhixin Cai,Shuhan Qin,Zhuofan Chen,Jinghua Guan,Wenge Rong
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Models, Dense retrievers enhance, retrievers enhance retrieval, continuous vectors, retrievers enhance
备注: 16 pages, 11 figures
点击查看摘要
Abstract:Dense retrievers enhance retrieval by encoding queries and documents into continuous vectors, but they often struggle with reasoning-intensive queries. Although Large Language Models (LLMs) can reformulate queries to capture complex reasoning, applying them universally incurs significant computational cost. In this work, we propose Adaptive Query Reasoning (AdaQR), a hybrid query rewriting framework. Within this framework, a Reasoner Router dynamically directs each query to either fast dense reasoning or deep LLM reasoning. The dense reasoning is achieved by the Dense Reasoner, which performs LLM-style reasoning directly in the embedding space, enabling a controllable trade-off between efficiency and accuracy. Experiments on large-scale retrieval benchmarks BRIGHT show that AdaQR reduces reasoning cost by 28% while preserving-or even improving-retrieval performance by 7%.
39. 【2510.21726】From Authors to Reviewers: Leveraging Rankings to Improve Peer Review
链接:https://arxiv.org/abs/2510.21726
作者:Weichen Wang,Chengchun Shi
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:JASA discussion paper, JASA discussion, ranking information, JASA, information
备注:
点击查看摘要
Abstract:This paper is a discussion of the 2025 JASA discussion paper by Su et al. (2025). We would like to congratulate the authors on conducting a comprehensive and insightful empirical investigation of the 2023 ICML ranking data. The review quality of machine learning (ML) conferences has become a big concern in recent years, due to the rapidly growing number of submitted manuscripts. In this discussion, we propose an approach alternative to Su et al. (2025) that leverages ranking information from reviewers rather than authors. We simulate review data that closely mimics the 2023 ICML conference submissions. Our results show that (i) incorporating ranking information from reviewers can significantly improve the evaluation of each paper's quality, often outperforming the use of ranking information from authors alone; and (ii) combining ranking information from both reviewers and authors yields the most accurate evaluation of submitted papers in most scenarios.
40. 【2510.21724】Words to Waves: Emotion-Adaptive Music Recommendation System
链接:https://arxiv.org/abs/2510.21724
作者:Apoorva Chavali,Reeve Menezes
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Current recommendation systems, static mood tags, historical listening patterns, Current recommendation, overlook emotional context
备注:
点击查看摘要
Abstract:Current recommendation systems often tend to overlook emotional context and rely on historical listening patterns or static mood tags. This paper introduces a novel music recommendation framework employing a variant of Wide and Deep Learning architecture that takes in real-time emotional states inferred directly from natural language as inputs and recommends songs that closely portray the mood. The system captures emotional contexts from user-provided textual descriptions by using transformer-based embeddings, which were finetuned to predict the emotional dimensions of valence-arousal. The deep component of the architecture utilizes these embeddings to generalize unseen emotional patterns, while the wide component effectively memorizes user-emotion and emotion-genre associations through cross-product features. Experimental results show that personalized music selections positively influence the user's emotions and lead to a significant improvement in emotional relevance.
41. 【2510.21714】Practice on Long Behavior Sequence Modeling in Tencent Advertising
链接:https://arxiv.org/abs/2510.21714
作者:Xian Hu,Ming Yue,Zhixiang Feng,Junwei Pan,Junjie Zhai,Ximei Wang,Xinrui Miao,Qian Li,Xun Liu,Shangyu Zhang,Letian Wang,Hua Lu,Zijian Zeng,Chen Cai,Wei Wang,Fei Xiong,Pengfei Xiong,Jintao Zhang,Zhiyuan Wu,Chunhui Zhang,Anan Liu,Jiulong You,Chao Deng,Yuekui Yang,Shudong Huang,Dapeng Liu,Haijie Gu
类目:Information Retrieval (cs.IR)
关键词:users' long-term preferences, capturing users' long-term, long-term preferences, indispensable frontier, frontier in recommendation
备注:
点击查看摘要
Abstract:Long-sequence modeling has become an indispensable frontier in recommendation systems for capturing users' long-term preferences. However, user behaviors within advertising domains are inherently sparse, posing a significant barrier to constructing long behavioral sequences using data from a single advertising domain alone. This motivates us to collect users' behaviors not only across diverse advertising scenarios, but also beyond the boundaries of the advertising domain into content domains-thereby constructing unified commercial behavior trajectories. This cross-domain or cross-scenario integration gives rise to the following challenges: (1) feature taxonomy gaps between distinct scenarios and domains, (2) inter-field interference arising from irrelevant feature field pairs, and (3) target-wise interference in temporal and semantic patterns when optimizing for different advertising targets. To address these challenges, we propose several practical approaches within the two-stage framework for long-sequence modeling. In the first (search) stage, we design a hierarchical hard search method for handling complex feature taxonomy hierarchies, alongside a decoupled embedding-based soft search to alleviate conflicts between attention mechanisms and feature representation. In the second (sequence modeling) stage, we introduce: (a) Decoupled Side Information Temporal Interest Networks (TIN) to mitigate inter-field conflicts; (b) Target-Decoupled Positional Encoding and Target-Decoupled SASRec to address target-wise interference; and (c) Stacked TIN to model high-order behavioral correlations. Deployed in production on Tencent's large-scale advertising platforms, our innovations delivered significant performance gains: an overall 4.22% GMV lift in WeChat Channels and an overall 1.96% GMV increase in WeChat Moments.
42. 【2510.21713】asLLR: LLM based Leads Ranking in Auto Sales
链接:https://arxiv.org/abs/2510.21713
作者:Yin Sun,Yiwen Liu,Junjie Song,Chenyu Zhang,Xinyuan Zhang,Lingjie Liu,Siqi Chen,Yuji Cao
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:score sequencing determines, high-quality lead score, lead score sequencing, Customer Relationship Management, auto sales system
备注:
点击查看摘要
Abstract:In the area of commercial auto sales system, high-quality lead score sequencing determines the priority of a sale's work and is essential for optimizing the efficiency of the sales system. Since CRM (Customer Relationship Management) system contains plenty of textual interaction features between sales and customers, traditional techniques such as Click Through Rate (CTR) prediction struggle with processing the complex information inherent in natural language features, which limits their effectiveness in sales lead ranking. Bridging this gap is critical for enhancing business intelligence and decision-making. Recently, the emergence of large language models (LLMs) has opened new avenues for improving recommendation systems, this study introduces asLLR (LLM-based Leads Ranking in Auto Sales), which integrates CTR loss and Question Answering (QA) loss within a decoder-only large language model architecture. This integration enables the simultaneous modeling of both tabular and natural language features. To verify the efficacy of asLLR, we constructed an innovative dataset derived from the customer lead pool of a prominent new energy vehicle brand, with 300,000 training samples and 40,000 testing samples. Our experimental results demonstrate that asLLR effectively models intricate patterns in commercial datasets, achieving the AUC of 0.8127, surpassing traditional CTR estimation methods by 0.0231. Moreover, asLLR enhances CTR models when used for extracting text features by 0.0058. In real-world sales scenarios, after rigorous online A/B testing, asLLR increased the sales volume by about 9.5% compared to the traditional method, providing a valuable tool for business intelligence and operational decision-making.
43. 【2510.21712】DecoupleSearch: Decouple Planning and Search via Hierarchical Reward Modeling
链接:https://arxiv.org/abs/2510.21712
作者:Hao Sun,Zile Qiao,Bo Wang,Guoxin Chen,Yingyan Hou,Yong Jiang,Pengjun Xie,Fei Huang,Yan Zhang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:enhancing Large Language, Retrieval-Augmented Generation, Large Language Models, Agentic RAG, Agentic RAG introduces
备注: EMNLP 2025 Main Conference
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) systems have emerged as a pivotal methodology for enhancing Large Language Models (LLMs) through the dynamic integration of external knowledge. To further improve RAG's flexibility, Agentic RAG introduces autonomous agents into the workflow. However, Agentic RAG faces several challenges: (1) the success of each step depends on both high-quality planning and accurate search, (2) the lack of supervision for intermediate reasoning steps, and (3) the exponentially large candidate space for planning and searching. To address these challenges, we propose DecoupleSearch, a novel framework that decouples planning and search processes using dual value models, enabling independent optimization of plan reasoning and search grounding. Our approach constructs a reasoning tree, where each node represents planning and search steps. We leverage Monte Carlo Tree Search to assess the quality of each step. During inference, Hierarchical Beam Search iteratively refines planning and search candidates with dual value models. Extensive experiments across policy models of varying parameter sizes, demonstrate the effectiveness of our method.
44. 【2510.21711】Improving E-commerce Search with Category-Aligned Retrieval
链接:https://arxiv.org/abs/2510.21711
作者:Rauf Aliev
类目:Information Retrieval (cs.IR)
关键词:Traditional e-commerce search, Traditional e-commerce, e-commerce search systems, Trainable Category Prototypes, semantic gap
备注:
点击查看摘要
Abstract:Traditional e-commerce search systems often struggle with the semantic gap between user queries and product catalogs. In this paper, we propose a Category-Aligned Retrieval System (CARS) that improves search relevance by first predicting the product category from a user's query and then boosting products within that category. We introduce a novel method for creating "Trainable Category Prototypes" from query embeddings. We evaluate this method with two models: a lightweight all-MiniLM-L6-v2 and OpenAI's text-embedding-ada-002. Our offline evaluation shows this method is highly effective, with the OpenAI model increasing Top-3 category prediction accuracy from a zero-shot baseline of 43.8% to 83.2% after training. The end-to-end simulation, however, highlights the limitations of blindly applying category boosts in a complex retrieval pipeline: while accuracy is high, naive integration can negatively affect search relevance metrics such as nDCG@10. We argue that this is partly due to dataset-specific ambiguities (e.g., polysemous queries in the Amazon ESCI corpus) and partly due to the sensitivity of retrieval systems to over-constraining filters. Crucially, these results do not diminish the value of the approach; rather, they emphasize the need for confidence-aware and adaptive integration strategies.
计算机视觉
1. 【2510.23607】Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
链接:https://arxiv.org/abs/2510.23607
作者:Yujia Zhang,Xiaoyang Wu,Yixing Lao,Chengyao Wang,Zhuotao Tian,Naiyan Wang,Hengshuang Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Humans learn abstract, learn abstract concepts, multisensory synergy, single modality, human concept learning
备注: NeurIPS 2025, produced by Pointcept, project page: [this https URL](https://pointcept.github.io/Concerto)
点击查看摘要
Abstract:Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
2. 【2510.23605】rack, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling
链接:https://arxiv.org/abs/2510.23605
作者:Shuhong Zheng,Ashkan Mirzaei,Igor Gilitschenski
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:optimized for photorealism, Current, generation, Abstract, efficiency
备注: NeurIPS 2025, 38 pages, 22 figures
点击查看摘要
Abstract:Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that align with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at this https URL.
3. 【2510.23603】PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
链接:https://arxiv.org/abs/2510.23603
作者:Yuqian Yuan,Wenqiao Zhang,Xin Li,Shihao Wang,Kehan Li,Wentong Li,Jun Xiao,Lei Zhang,Beng Chin Ooi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal large language, large language models, demonstrated strong general-purpose, strong general-purpose capabilities, Multimodal large
备注: 22 pages, 13 figures
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.
4. 【2510.23594】PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
链接:https://arxiv.org/abs/2510.23594
作者:Yusu Qian,Cheng Wan,Chao Jia,Yinfei Yang,Qingyu Zhao,Zhe Gan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable progress, Multimodal large language, large language models, reasoning processes remain, remain sometimes unreliable
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress on vision-language tasks, yet their reasoning processes remain sometimes unreliable. We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
5. 【2510.23589】InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras
链接:https://arxiv.org/abs/2510.23589
作者:Erich Liang,Roma Bhattacharjee,Sreemanti Dey,Rafael Moschopoulos,Caitlin Wang,Michel Liao,Grace Tan,Andrew Wang,Karhan Kayan,Stamatis Alexandropoulos,Jia Deng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurately tracking camera, Accurately tracking, camera intrinsics, dynamic camera intrinsics, tracking camera intrinsics
备注: Accepted at NeurIPS 2025 DB Track, Camera Ready Version. Supplementary material included
点击查看摘要
Abstract:Accurately tracking camera intrinsics is crucial for achieving 3D understanding from 2D video. However, most 3D algorithms assume that camera intrinsics stay constant throughout a video, which is often not true for many real-world in-the-wild videos. A major obstacle in this field is a lack of dynamic camera intrinsics benchmarks--existing benchmarks typically offer limited diversity in scene content and intrinsics variation, and none provide per-frame intrinsic changes for consecutive video frames. In this paper, we present Intrinsics in Flux (InFlux), a real-world benchmark that provides per-frame ground truth intrinsics annotations for videos with dynamic intrinsics. Compared to prior benchmarks, InFlux captures a wider range of intrinsic variations and scene diversity, featuring 143K+ annotated frames from 386 high-resolution indoor and outdoor videos with dynamic camera intrinsics. To ensure accurate per-frame intrinsics, we build a comprehensive lookup table of calibration experiments and extend the Kalibr toolbox to improve its accuracy and robustness. Using our benchmark, we evaluate existing baseline methods for predicting camera intrinsics and find that most struggle to achieve accurate predictions on videos with dynamic intrinsics. For the dataset, code, videos, and submission, please visit this https URL.
6. 【2510.23588】FARMER: Flow AutoRegressive Transformer over Pixels
链接:https://arxiv.org/abs/2510.23588
作者:Guangting Zheng,Qinyu Zhao,Tao Yang,Fei Xiao,Zhijie Lin,Jie Wu,Jiajun Deng,Yanyong Zhang,Rui Zhu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Large Language, machine learning area, successes in Large, Language Models
备注: Bytedance Seed Technical Report
点击查看摘要
Abstract:Directly modeling the explicit likelihood of the raw data distribution is key topic in the machine learning area, which achieves the scaling successes in Large Language Models by autoregressive modeling. However, continuous AR modeling over visual pixel data suffer from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
7. 【2510.23581】Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation
链接:https://arxiv.org/abs/2510.23581
作者:Junyoung Seo,Rodrigo Mira,Alexandros Haliassos,Stella Bounareli,Honglie Chen,Linh Tran,Seungryong Kim,Zoe Landgraf,Jie Shen
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:characters gradually lose, Audio-driven human animation, Audio-driven human, characters gradually, gradually lose
备注: Project page: [this https URL](https://lookahead-anchoring.github.io)
点击查看摘要
Abstract:Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures. Video results are available at the following link: this https URL.
8. 【2510.23576】UrbanVLA: A Vision-Language-Action Model for Urban Micromobility
链接:https://arxiv.org/abs/2510.23576
作者:Anqi Li,Zhiyong Wang,Jiazhao Zhang,Minghan Li,Yunpeng Qi,Zhibo Chen,Zhizheng Zhang,He Wang
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:long-horizon route instructions, Urban micromobility applications, micromobility applications, Urban micromobility, demand reliable navigation
备注:
点击查看摘要
Abstract:Urban micromobility applications, such as delivery robots, demand reliable navigation across large-scale urban environments while following long-horizon route instructions. This task is particularly challenging due to the dynamic and unstructured nature of real-world city areas, yet most existing navigation methods remain tailored to short-scale and controllable scenarios. Effective urban micromobility requires two complementary levels of navigation skills: low-level capabilities such as point-goal reaching and obstacle avoidance, and high-level capabilities, such as route-visual alignment. To this end, we propose UrbanVLA, a route-conditioned Vision-Language-Action (VLA) framework designed for scalable urban navigation. Our method explicitly aligns noisy route waypoints with visual observations during execution, and subsequently plans trajectories to drive the robot. To enable UrbanVLA to master both levels of navigation, we employ a two-stage training pipeline. The process begins with Supervised Fine-Tuning (SFT) using simulated environments and trajectories parsed from web videos. This is followed by Reinforcement Fine-Tuning (RFT) on a mixture of simulation and real-world data, which enhances the model's safety and adaptability in real-world settings. Experiments demonstrate that UrbanVLA surpasses strong baselines by more than 55% in the SocialNav task on MetaUrban. Furthermore, UrbanVLA achieves reliable real-world navigation, showcasing both scalability to large-scale urban environments and robustness against real-world uncertainties.
9. 【2510.23574】More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models
链接:https://arxiv.org/abs/2510.23574
作者:Hongkai Lin,Dingkang Liang,Mingyang Du,Xin Zhou,Xiang Bai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Generative depth estimation, demonstrating astonishing zero-shot, rich visual priors, visual priors stored, Generative depth
备注: Accepted by NeurIPS 2025. The code will be made available at [this https URL](https://github.com/H-EmbodVis/MERGE)
点击查看摘要
Abstract:Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degra- dation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than image generation, but also expand to depth estimation effortlessly. Specifically, MERGE introduces a play- and-plug framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and im- prove the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of- the-art performance across multiple depth estimation benchmarks. The code will be made available at this https URL
10. 【2510.23571】RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
链接:https://arxiv.org/abs/2510.23571
作者:Yash Jangir,Yidi Zhang,Kashu Yamazaki,Chenyu Zhang,Kuan-Hsun Tu,Tsung-Wei Ke,Lei Ke,Yonatan Bisk,Katerina Fragkiadaki
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:performing diverse tasks, instructable agents capable, performing diverse, diverse tasks, instructable agents
备注: Website: [this https URL](https://robotarenainf.github.io)
点击查看摘要
Abstract:The pursuit of robot generalists - instructable agents capable of performing diverse tasks across diverse environments - demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains and cannot assess models trained from real-world demonstrations or alternative simulation environments. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting VLA evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated VLM-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, such as textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.
11. 【2510.23569】EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
链接:https://arxiv.org/abs/2510.23569
作者:Baoqi Pei,Yifei Huang,Jilan Xu,Yuping He,Guo Chen,Fei Wu,Yu Qiao,Jiangmiao Pang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:shapes the environment, requiring inference, unobservable agent, camera who dynamically, dynamically shapes
备注: Accepted at NeurIPS 2025
点击查看摘要
Abstract:Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models MLLMs, which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ SFT on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning RFT to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at this https URL.
12. 【2510.23554】A U-Net and Transformer Pipeline for Multilingual Image Translation
链接:https://arxiv.org/abs/2510.23554
作者:Siddharth Sahay,Radhika Agarwal
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Neural Machine Translation, Neural Machine, Machine Translation, Transformer for Neural, Tesseract engine
备注: 6 pages, 3 figures, 5 tables, and 2 algorithms. Prepared in IEEE double-column format
点击查看摘要
Abstract:This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset , to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted text is fed into a custom Transformer model trained from scratch on a multilingual parallel corpus spanning 5 languages. Unlike systems reliant on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on its text detection accuracy, text recognition quality, and translation performance via BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.
13. 【2510.23538】JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
链接:https://arxiv.org/abs/2510.23538
作者:Qiushi Sun,Jingyang Gong,Yang Liu,Qiaosheng Chen,Lei Li,Kai Chen,Qipeng Guo,Ben Kao,Fei Yuan
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
关键词:rich visual outputs, neural code intelligence, text-based source code, programs generate, scope of neural
备注: Work in progress
点击查看摘要
Abstract:The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints will are available at this https URL.
14. 【2510.23525】DPGLA: Bridging the Gap between Synthetic and Real Data for Unsupervised Domain Adaptation in 3D LiDAR Semantic Segmentation
链接:https://arxiv.org/abs/2510.23525
作者:Wanmeng Li,Simone Mosco,Daniel Fusaro,Alberto Pretto
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Annotating real-world LiDAR, intelligent autonomous systems, Unsupervised Domain Adaptation, Annotating real-world, systems is costly
备注: This paper has been accepted for publication at the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
点击查看摘要
Abstract:Annotating real-world LiDAR point clouds for use in intelligent autonomous systems is costly. To overcome this limitation, self-training-based Unsupervised Domain Adaptation (UDA) has been widely used to improve point cloud semantic segmentation by leveraging synthetic point cloud data. However, we argue that existing methods do not effectively utilize unlabeled data, as they either rely on predefined or fixed confidence thresholds, resulting in suboptimal performance. In this paper, we propose a Dynamic Pseudo-Label Filtering (DPLF) scheme to enhance real data utilization in point cloud UDA semantic segmentation. Additionally, we design a simple and efficient Prior-Guided Data Augmentation Pipeline (PG-DAP) to mitigate domain shift between synthetic and real-world point clouds. Finally, we utilize data mixing consistency loss to push the model to learn context-free representations. We implement and thoroughly evaluate our approach through extensive comparisons with state-of-the-art methods. Experiments on two challenging synthetic-to-real point cloud semantic segmentation tasks demonstrate that our approach achieves superior performance. Ablation studies confirm the effectiveness of the DPLF and PG-DAP modules. We release the code of our method in this paper.
15. 【2510.23515】FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time
链接:https://arxiv.org/abs/2510.23515
作者:Yaoli Liu,Yao-Xiang Ding,Kun Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:paper proposes FreeFuse, multiple subject LoRAs, paper proposes, training-free approach, automatic fusion
备注:
点击查看摘要
Abstract:This paper proposes FreeFuse, a novel training-free approach for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to existing methods that either focus on pre-inference LoRA weight merging or rely on segmentation models and complex techniques like noise blending to isolate LoRA outputs, our key insight is that context-aware dynamic subject masks can be automatically derived from cross-attention layer weights. Mathematical analysis shows that directly applying these masks to LoRA outputs during inference well approximates the case where the subject LoRA is integrated into the diffusion model and used individually for the masked region. FreeFuse demonstrates superior practicality and efficiency as it requires no additional training, no modification to LoRAs, no auxiliary models, and no user-defined prompt templates or region specifications. Alternatively, it only requires users to provide the LoRA activation words for seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both generation quality and usability under the multi-subject generation tasks. The project page is at this https URL
16. 【2510.23512】Localising under the drape: proprioception in the era of distributed surgical robotic system
链接:https://arxiv.org/abs/2510.23512
作者:Martin Huber,Nicola A. Cavalcanti,Ayoob Davoodi,Ruixuan Li,Christopher E. Mower,Fabio Carrillo,Christoph J. Laux,Francois Teyssere,Thibault Chandanson,Antoine Harlé,Elie Saghbiny,Mazda Farshad,Guillaume Morel,Emmanuel Vander Poorten,Philipp Fürnstahl,Sébastien Ourselin,Christos Bergeles,Tom Vercauteren
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:robots remain blind, surgical robots remain, mechanical sophistication, remain blind, surgical robots
备注:
点击查看摘要
Abstract:Despite their mechanical sophistication, surgical robots remain blind to their surroundings. This lack of spatial awareness causes collisions, system recoveries, and workflow disruptions, issues that will intensify with the introduction of distributed robots with independent interacting arms. Existing tracking systems rely on bulky infrared cameras and reflective markers, providing only limited views of the surgical scene and adding hardware burden in crowded operating rooms. We present a marker-free proprioception method that enables precise localisation of surgical robots under their sterile draping despite associated obstruction of visual cues. Our method solely relies on lightweight stereo-RGB cameras and novel transformer-based deep learning models. It builds on the largest multi-centre spatial robotic surgery dataset to date (1.4M self-annotated images from human cadaveric and preclinical in vivo studies). By tracking the entire robot and surgical scene, rather than individual markers, our approach provides a holistic view robust to occlusions, supporting surgical scene understanding and context-aware control. We demonstrate an example of potential clinical benefits during in vivo breathing compensation with access to tissue dynamics, unobservable under state of the art tracking, and accurately locate in multi-robot systems for future intelligent interaction. In addition, and compared with existing systems, our method eliminates markers and improves tracking visibility by 25%. To our knowledge, this is the first demonstration of marker-free proprioception for fully draped surgical robots, reducing setup complexity, enhancing safety, and paving the way toward modular and autonomous robotic surgery.
17. 【2510.23504】Pac: Incorporating Intra-image Patch Context into Graph Neural Networks for Medical Image Classification
链接:https://arxiv.org/abs/2510.23504
作者:Usama Zidan,Mohamed Gaber,Mohammed M. Abdelsamea
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:graph neural network, neural network image, network image classification, image classification, structure and relationships
备注: Accepted for publication in the proceedings of ICONIP 2025
点击查看摘要
Abstract:Graph neural networks have emerged as a promising paradigm for image processing, yet their performance in image classification tasks is hindered by a limited consideration of the underlying structure and relationships among visual entities. This work presents iPac, a novel approach to introduce a new graph representation of images to enhance graph neural network image classification by recognizing the importance of underlying structure and relationships in medical image classification. iPac integrates various stages, including patch partitioning, feature extraction, clustering, graph construction, and graph-based learning, into a unified network to advance graph neural network image classification. By capturing relevant features and organising them into clusters, we construct a meaningful graph representation that effectively encapsulates the semantics of the image. Experimental evaluation on diverse medical image datasets demonstrates the efficacy of iPac, exhibiting an average accuracy improvement of up to 5% over baseline methods. Our approach offers a versatile and generic solution for image classification, particularly in the realm of medical images, by leveraging the graph representation and accounting for the inherent structure and relationships among visual entities.
18. 【2510.23497】VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
链接:https://arxiv.org/abs/2510.23497
作者:Walid Bousselham,Hilde Kuehne,Cordelia Schmid
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:image-text reasoning data, complex reasoning remains, high-quality image-text reasoning, challenging task, remains a challenging
备注: [this http URL](http://www.walidbousselham.com/VOLD/)
点击查看摘要
Abstract:Training vision-language models (VLMs) for complex reasoning remains a challenging task, i.a. due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leveraging them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
19. 【2510.23494】Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap
链接:https://arxiv.org/abs/2510.23494
作者:Elisabeth Jüttner,Leona Krath,Stefan Korfhage,Hannah Dröge,Matthias B. Hullin,Markus Plack
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:bringing captured performances, current approaches struggle, virtual worlds, essential for bringing, bringing captured
备注:
点击查看摘要
Abstract:Volumetric video relighting is essential for bringing captured performances into virtual worlds, but current approaches struggle to deliver temporally stable, production-ready results. Diffusion-based intrinsic decomposition methods show promise for single frames, yet suffer from stochastic noise and instability when extended to sequences, while video diffusion models remain constrained by memory and scale. We propose a hybrid relighting framework that combines diffusion-derived material priors with temporal regularization and physically motivated rendering. Our method aggregates multiple stochastic estimates of per-frame material properties into temporally consistent shading components, using optical-flow-guided regularization. For indirect effects such as shadows and reflections, we extract a mesh proxy from Gaussian Opacity Fields and render it within a standard graphics pipeline. Experiments on real and synthetic captures show that this hybrid strategy achieves substantially more stable relighting across sequences than diffusion-only baselines, while scaling beyond the clip lengths feasible for video diffusion. These results indicate that hybrid approaches, which balance learned priors with physically grounded constraints, are a practical step toward production-ready volumetric video relighting.
20. 【2510.23484】-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning
链接:https://arxiv.org/abs/2510.23484
作者:Julie Mordacq,David Loiseaux,Vicky Kalogeiton,Steve Oudot
类目:Machine Learning (cs.LG); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Self-supervised learning, Minimum Spanning Tree, rotations or blurring, powerful paradigm, enforcing invariance
备注: NeurIPS 2025
点击查看摘要
Abstract:Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data, often by enforcing invariance to input transformations such as rotations or blurring. Recent studies have highlighted two pivotal properties for effective representations: (i) avoiding dimensional collapse-where the learned features occupy only a low-dimensional subspace, and (ii) enhancing uniformity of the induced distribution. In this work, we introduce T-REGS, a simple regularization framework for SSL based on the length of the Minimum Spanning Tree (MST) over the learned representation. We provide theoretical analysis demonstrating that T-REGS simultaneously mitigates dimensional collapse and promotes distribution uniformity on arbitrary compact Riemannian manifolds. Several experiments on synthetic data and on classical SSL benchmarks validate the effectiveness of our approach at enhancing representation quality.
21. 【2510.23482】On the Faithfulness of Visual Thinking: Measurement and Enhancement
链接:https://arxiv.org/abs/2510.23482
作者:Zujing Liu,Junwen Pan,Qi She,Yuan Gao,Guisong Xia
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Recent large vision-language, Recent large, large vision-language models, visual information, visual
备注:
点击查看摘要
Abstract:Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual information incorporated in MCoT is often inaccurate, though still yield correct answers, indicating a lack of faithfulness in the MCoT reasoning process. We attribute this unfaithfulness to the RL reward in RFT, which solely incentivizes the format of interleaved vision-text cues, ie, it encourages the model to incorporate visual information into its text reasoning steps without considering the correctness of the visual information. In this paper, we first probe the faithfulness of MCoT by measuring how much the prediction changes when its visual and textual thoughts are intervened. Surprisingly, the model's predictions remain nearly unchanged under visual intervention but change significantly under textual intervention, indicating that the visual evidence is largely ignored. To further analyze visual information, we introduce an automated LVLM-based evaluation metric that quantifies the faithfulness of visual cues from two perspectives: reliability and sufficiency. Our evaluation reveals that the visual information in current MCoT traces is simultaneously unreliable and insufficient. To address this issue, we propose a novel MCoT learning strategy termed Sufficient-Component Cause Model (SCCM) learning. This approach encourages the MCoT to generate sufficient yet minimal visual components that are independently capable of leading to correct answers. We note that the proposed SCCM is annotation-free and compatible with various RFT for MCoT in a plug-and-play manner. Empirical results demonstrate that SCCM consistently improves the visual faithfulness across a suite of fine-grained perception and reasoning benchmarks. Code is available at this https URL.
22. 【2510.23479】MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
链接:https://arxiv.org/abs/2510.23479
作者:Xin Jin,Siyuan Li,Siyong Jian,Kai Yu,Huan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:large language models, multi-modal large language, Vision-language alignment, language models, typically relies
备注: Code Link: [this https URL](https://github.com/JinXins/MergeMix)
点击查看摘要
Abstract:Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL brings in a reward signal for training, but suffers from overhead and instability. These limitations highlight a trade-off between scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies an attention-aware image mixing via token merge with more cluster representation and spatial context, and then presents a preference-driven training paradigm for MLLMs by building preference pairs with mixed images and raw images, and optimizing via SimPO loss. As a mixup augmentation, MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.
23. 【2510.23478】UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception
链接:https://arxiv.org/abs/2510.23478
作者:Karthikeyan Chandra Sekaran,Markus Geisler,Dominik Rößle,Adithya Mohan,Daniel Cremers,Wolfgang Utschick,Michael Botsch,Werner Huber,Torsten Schön
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:advancing smart mobility, smart mobility applications, enabling information exchange, Recent cooperative perception, Recent cooperative
备注: Accepted to NeurIPS 2025. Including supplemental material. For code and dataset, see [this https URL](https://github.com/thi-ad/UrbanIng-V2X)
点击查看摘要
Abstract:Recent cooperative perception datasets have played a crucial role in advancing smart mobility applications by enabling information exchange between intelligent agents, helping to overcome challenges such as occlusions and improving overall scene understanding. While some existing real-world datasets incorporate both vehicle-to-vehicle and vehicle-to-infrastructure interactions, they are typically limited to a single intersection or a single vehicle. A comprehensive perception dataset featuring multiple connected vehicles and infrastructure sensors across several intersections remains unavailable, limiting the benchmarking of algorithms in diverse traffic environments. Consequently, overfitting can occur, and models may demonstrate misleadingly high performance due to similar intersection layouts and traffic participant behavior. To address this gap, we introduce UrbanIng-V2X, the first large-scale, multi-modal dataset supporting cooperative perception involving vehicles and infrastructure sensors deployed across three urban intersections in Ingolstadt, Germany. UrbanIng-V2X consists of 34 temporally aligned and spatially calibrated sensor sequences, each lasting 20 seconds. All sequences contain recordings from one of three intersections, involving two vehicles and up to three infrastructure-mounted sensor poles operating in coordinated scenarios. In total, UrbanIng-V2X provides data from 12 vehicle-mounted RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs. All sequences are annotated at a frequency of 10 Hz with 3D bounding boxes spanning 13 object classes, resulting in approximately 712k annotated instances across the dataset. We provide comprehensive evaluations using state-of-the-art cooperative perception methods and publicly release the codebase, dataset, HD map, and a digital twin of the complete data collection environment.
24. 【2510.23473】Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning
链接:https://arxiv.org/abs/2510.23473
作者:Shijian Wang,Jiarui Jin,Xingjian Wang,Linxin Song,Runhao Fu,Hecheng Wang,Zongyuan Ge,Yuan Lu,Xuelian Cheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
备注:
点击查看摘要
Abstract:Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.
25. 【2510.23451】Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences
链接:https://arxiv.org/abs/2510.23451
作者:Zhuoran Jin,Hongbang Yuan,Kejian Zhu,Jiachun Li,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Modality Imbalance, offering limited support, fixed binary preference, preference pairs fails, binary preference pairs
备注: 48 pages, 17 figures
点击查看摘要
Abstract:Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address the above challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.
26. 【2510.23444】FRBNet: Revisiting Low-Light Vision through Frequency-Domain Radial Basis Network
链接:https://arxiv.org/abs/2510.23444
作者:Fangtong Sun,Congyu Li,Ke Yang,Yuchen Pan,Hanwen Yu,Xichuan Zhang,Yiying Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:severe illumination degradation, computer vision due, Low-light vision remains, illumination degradation, textbf
备注:
点击查看摘要
Abstract:Low-light vision remains a fundamental challenge in computer vision due to severe illumination degradation, which significantly affects the performance of downstream tasks such as detection and segmentation. While recent state-of-the-art methods have improved performance through invariant feature learning modules, they still fall short due to incomplete modeling of low-light conditions. Therefore, we revisit low-light image formation and extend the classical Lambertian model to better characterize low-light conditions. By shifting our analysis to the frequency domain, we theoretically prove that the frequency-domain channel ratio can be leveraged to extract illumination-invariant features via a structured filtering process. We then propose a novel and end-to-end trainable module named \textbf{F}requency-domain \textbf{R}adial \textbf{B}asis \textbf{Net}work (\textbf{FRBNet}), which integrates the frequency-domain channel ratio operation with a learnable frequency domain filter for the overall illumination-invariant feature enhancement. As a plug-and-play module, FRBNet can be integrated into existing networks for low-light downstream tasks without modifying loss functions. Extensive experiments across various downstream tasks demonstrate that FRBNet achieves superior performance, including +2.2 mAP for dark object detection and +2.9 mIoU for nighttime segmentation. Code is available at: this https URL.
27. 【2510.23442】CURVETE: Curriculum Learning and Progressive Self-supervised Training for Medical Image Classification
链接:https://arxiv.org/abs/2510.23442
作者:Asmaa Abbas,Mohamed Gaber,Mohammed M. Abdelsamea
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:easily accessible annotated, Identifying high-quality, accessible annotated samples, annotated samples poses, high-quality and easily
备注: Accepted for publication in the proceedings of ICONIP 2025
点击查看摘要
Abstract:Identifying high-quality and easily accessible annotated samples poses a notable challenge in medical image analysis. Transfer learning techniques, leveraging pre-training data, offer a flexible solution to this issue. However, the impact of fine-tuning diminishes when the dataset exhibits an irregular distribution between classes. This paper introduces a novel deep convolutional neural network, named Curriculum Learning and Progressive Self-supervised Training (CURVETE). CURVETE addresses challenges related to limited samples, enhances model generalisability, and improves overall classification performance. It achieves this by employing a curriculum learning strategy based on the granularity of sample decomposition during the training of generic unlabelled samples. Moreover, CURVETE address the challenge of irregular class distribution by incorporating a class decomposition approach in the downstream task. The proposed method undergoes evaluation on three distinct medical image datasets: brain tumour, digital knee x-ray, and Mini-DDSM datasets. We investigate the classification performance using a generic self-supervised sample decomposition approach with and without the curriculum learning component in training the pretext task. Experimental results demonstrate that the CURVETE model achieves superior performance on test sets with an accuracy of 96.60% on the brain tumour dataset, 75.60% on the digital knee x-ray dataset, and 93.35% on the Mini-DDSM dataset using the baseline ResNet-50. Furthermore, with the baseline DenseNet-121, it achieved accuracies of 95.77%, 80.36%, and 93.22% on the brain tumour, digital knee x-ray, and Mini-DDSM datasets, respectively, outperforming other training strategies.
28. 【2510.23429】MiCADangelo: Fine-Grained Reconstruction of Constrained CAD Models from 3D Scans
链接:https://arxiv.org/abs/2510.23429
作者:Ahmet Serdar Karadeniz,Dimitrios Mallis,Danila Rukhovich,Kseniya Cherenkova,Anis Kacem,Djamila Aouada
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Computer-Aided Design, CAD models, plays a foundational, product development, CAD reverse engineering
备注: Accepted at NeurIPS 2025
点击查看摘要
Abstract:Computer-Aided Design (CAD) plays a foundational role in modern manufacturing and product development, often requiring designers to modify or build upon existing models. Converting 3D scans into parametric CAD representations--a process known as CAD reverse engineering--remains a significant challenge due to the high precision and structural complexity of CAD models. Existing deep learning-based approaches typically fall into two categories: bottom-up, geometry-driven methods, which often fail to produce fully parametric outputs, and top-down strategies, which tend to overlook fine-grained geometric details. Moreover, current methods neglect an essential aspect of CAD modeling: sketch-level constraints. In this work, we introduce a novel approach to CAD reverse engineering inspired by how human designers manually perform the task. Our method leverages multi-plane cross-sections to extract 2D patterns and capture fine parametric details more effectively. It enables the reconstruction of detailed and editable CAD models, outperforming state-of-the-art methods and, for the first time, incorporating sketch constraints directly into the reconstruction process.
29. 【2510.23416】Quality-controlled registration of urban MLS point clouds reducing drift effects by adaptive fragmentation
链接:https://arxiv.org/abs/2510.23416
作者:Marco Antonio Ortiz Rincon,Yihui Yang,Christoph Holst
类目:Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
关键词:mobile laser scanning, accurately register large-scale, register large-scale mobile, large-scale mobile laser, urban street scenarios
备注: 10 pages, 7 figures. This manuscript is currently under review at the International Journal of Applied Earth Observation and Geoinformation (Elsevier). A preprint version will also be available on SSRN (Elsevier Preprints) with a DOI once processed. This is the original preprint version submitted for peer review
点击查看摘要
Abstract:This study presents a novel workflow designed to efficiently and accurately register large-scale mobile laser scanning (MLS) point clouds to a target model point cloud in urban street scenarios. This workflow specifically targets the complexities inherent in urban environments and adeptly addresses the challenges of integrating point clouds that vary in density, noise characteristics, and occlusion scenarios, which are common in bustling city centers. Two methodological advancements are introduced. First, the proposed Semi-sphere Check (SSC) preprocessing technique optimally fragments MLS trajectory data by identifying mutually orthogonal planar surfaces. This step reduces the impact of MLS drift on the accuracy of the entire point cloud registration, while ensuring sufficient geometric features within each fragment to avoid local minima. Second, we propose Planar Voxel-based Generalized Iterative Closest Point (PV-GICP), a fine registration method that selectively utilizes planar surfaces within voxel partitions. This pre-process strategy not only improves registration accuracy but also reduces computation time by more than 50% compared to conventional point-to-plane ICP methods. Experiments on real-world datasets from Munich's inner city demonstrate that our workflow achieves sub-0.01 m average registration accuracy while significantly shortening processing times. The results underscore the potential of the proposed methods to advance automated 3D urban modeling and updating, with direct applications in urban planning, infrastructure management, and dynamic city monitoring.
30. 【2510.23415】owards Generalisable Foundation Models for 3D Brain MRI
链接:https://arxiv.org/abs/2510.23415
作者:Moona Mazher,Geoff J. M. Parker,Daniel C. Alexander
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:general-purpose feature learning, enabling general-purpose feature, transforming medical imaging, unlabeled datasets, artificial intelligence
备注:
点击查看摘要
Abstract:Foundation models in artificial intelligence (AI) are transforming medical imaging by enabling general-purpose feature learning from large-scale, unlabeled datasets. In this work, we introduce BrainFound, a self-supervised foundation model for brain MRI, built by extending DINO-v2, a vision transformer originally designed for 2D natural images. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, moving beyond conventional single-slice paradigms. It supports both single- and multimodal inputs, enabling a broad range of downstream tasks, including disease detection and image segmentation, while generalising across varied imaging protocols and clinical scenarios. We show that BrainFound consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. By integrating information from diverse 3D MRI modalities (e.g., T1, T2, FLAIR), it enhances diagnostic accuracy and reduces dependency on extensive expert annotations. This flexibility makes BrainFound a scalable and practical solution for 3D neuroimaging pipelines, with significant potential for clinical deployment and research innovation.
31. 【2510.23414】Symmetria: A Synthetic Dataset for Learning in Point Clouds
链接:https://arxiv.org/abs/2510.23414
作者:Ivan Sipiran,Gustavo Santelices,Lucas Oyarzún,Andrea Ranieri,Chiara Romanengo,Silvia Biasotti,Bianca Falcidieno
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:techniques frequently encounter, frequently encounter limitations, encounter limitations due, Unlike image, learning techniques frequently
备注: 40 pages
点击查看摘要
Abstract:Unlike image or text domains that benefit from an abundance of large-scale datasets, point cloud learning techniques frequently encounter limitations due to the scarcity of extensive datasets. To overcome this limitation, we present Symmetria, a formula-driven dataset that can be generated at any arbitrary scale. By construction, it ensures the absolute availability of precise ground truth, promotes data-efficient experimentation by requiring fewer samples, enables broad generalization across diverse geometric settings, and offers easy extensibility to new tasks and modalities. Using the concept of symmetry, we create shapes with known structure and high variability, enabling neural networks to learn point cloud features effectively. Our results demonstrate that this dataset is highly effective for point cloud self-supervised pre-training, yielding models with strong performance in downstream tasks such as classification and segmentation, which also show good few-shot learning capabilities. Additionally, our dataset can support fine-tuning models to classify real-world objects, highlighting our approach's practical utility and application. We also introduce a challenging task for symmetry detection and provide a benchmark for baseline comparisons. A significant advantage of our approach is the public availability of the dataset, the accompanying code, and the ability to generate very large collections, promoting further research and innovation in point cloud learning.
32. 【2510.23399】Color and Frequency Correction for Image Colorization
链接:https://arxiv.org/abs/2510.23399
作者:Yun Kai Zhuang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Autocolorization direction model, existing Autocolorization direction, direction model DDColor, Autocolorization direction, existing Autocolorization
备注: 7 pages, 5 tables
点击查看摘要
Abstract:The project has carried out the re-optimization of image coloring in accordance with the existing Autocolorization direction model DDColor. For the experiments on the existing weights of DDColor, we found that it has limitations in some frequency bands and the color cast problem caused by insufficient input dimension. We construct two optimization schemes and combine them, which achieves the performance improvement of indicators such as PSNR and SSIM of the images after DDColor.
33. 【2510.23397】VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations
链接:https://arxiv.org/abs/2510.23397
作者:Lu Dong,Haiyu Zhang,Han Lin,Ziang Yan,Xiangyu Zeng,Hongjie Zhang,Yifei Huang,Yi Wang,Zhen-Hua Ling,Limin Wang,Yali Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Multimodal Large Language, Video temporal grounding, locate precise segments, recent Multimodal Large
备注:
点击查看摘要
Abstract:Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries, which is a fundamental challenge in video understanding. While recent Multimodal Large Language Models (MLLMs) have shown promise in tackling VTG through reinforcement learning (RL), they overlook the challenges arising from both the quality and difficulty of training samples. (1) Partially annotated samples. Many samples contain relevant segments beyond the annotated interval, introducing ambiguous supervision. (2) Hard-to-ground samples. Samples with poor zero-shot performance produce consistently low and indistinguishable rewards during RL training, exhibiting no clear preference among multiple outputs and thus hindering learning efficiency. To address these challenges, we propose VideoTG-R1, a novel curriculum RL framework with reflected boundary annotations, enabling data-efficient training. Specifically, we propose a Boundary Reflection Agent that utilizes MLLMs to predict query-relevant timestamps outside the annotated intervals, allowing us to identify and filter out partially annotated samples, thereby reducing ambiguity. Furthermore, we introduce a Difficulty Estimation Agent to assess the training difficulty of each sample and design a curriculum RL strategy that dynamically masks the videos of hard-to-ground samples according to the training steps, easing the training difficulty and providing clearer preference. Experiments on the VTG and grounded VideoQA tasks demonstrate the effectiveness of our method. Remarkably, with only 10% of the training samples and 21% of the computational budget, VideoTG-R1 outperforms full-data counterparts under both group relative policy optimization (GRPO) and supervised fine-tuning (SFT). The code is available at this https URL.
34. 【2510.23382】An Efficient Remote Sensing Super Resolution Method Exploring Diffusion Priors and Multi-Modal Constraints for Crop Type Mapping
链接:https://arxiv.org/abs/2510.23382
作者:Songxi Yang,Tang Sui,Qunying Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:historically valuable remote, Super resolution offers, valuable remote sensing, remote sensing super, sensing super resolution
备注: 41 pages
点击查看摘要
Abstract:Super resolution offers a way to harness medium even lowresolution but historically valuable remote sensing image archives. Generative models, especially diffusion models, have recently been applied to remote sensing super resolution (RSSR), yet several challenges exist. First, diffusion models are effective but require expensive training from scratch resources and have slow inference speeds. Second, current methods have limited utilization of auxiliary information as real-world constraints to reconstruct scientifically realistic images. Finally, most current methods lack evaluation on downstream tasks. In this study, we present a efficient LSSR framework for RSSR, supported by a new multimodal dataset of paired 30 m Landsat 8 and 10 m Sentinel 2 imagery. Built on frozen pretrained Stable Diffusion, LSSR integrates crossmodal attention with auxiliary knowledge (Digital Elevation Model, land cover, month) and Synthetic Aperture Radar guidance, enhanced by adapters and a tailored Fourier NDVI loss to balance spatial details and spectral fidelity. Extensive experiments demonstrate that LSSR significantly improves crop boundary delineation and recovery, achieving state-of-the-art performance with Peak Signal-to-Noise Ratio/Structural Similarity Index Measure of 32.63/0.84 (RGB) and 23.99/0.78 (IR), and the lowest NDVI Mean Squared Error (0.042), while maintaining efficient inference (0.39 sec/image). Moreover, LSSR transfers effectively to NASA Harmonized Landsat and Sentinel (HLS) super resolution, yielding more reliable crop classification (F1: 0.86) than Sentinel-2 (F1: 0.85). These results highlight the potential of RSSR to advance precision agriculture.
35. 【2510.23368】PlanarTrack: A high-quality and challenging benchmark for large-scale planar object tracking
链接:https://arxiv.org/abs/2510.23368
作者:Yifan Jiao,Xinran Liu,Xiaoqiong Liu,Xiaohui Yuan,Heng Fan,Libo Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:drawn increasing interest, increasing interest owing, Planar tracking, augmented reality, drawn increasing
备注:
点击查看摘要
Abstract:Planar tracking has drawn increasing interest owing to its key roles in robotics and augmented reality. Despite recent great advancement, further development of planar tracking, particularly in the deep learning era, is largely limited compared to generic tracking due to the lack of large-scale platforms. To mitigate this, we propose PlanarTrack, a large-scale high-quality and challenging benchmark for planar tracking. Specifically, PlanarTrack consists of 1,150 sequences with over 733K frames, including 1,000 short-term and 150 new long-term videos, which enables comprehensive evaluation of short- and long-term tracking performance. All videos in PlanarTrack are recorded in unconstrained conditions from the wild, which makes PlanarTrack challenging but more realistic for real-world applications. To ensure high-quality annotations, each video frame is manually annotated by four corner points with multi-round meticulous inspection and refinement. To enhance target diversity of PlanarTrack, we only capture a unique target in one sequence, which is different from existing benchmarks. To our best knowledge, PlanarTrack is by far the largest and most diverse and challenging dataset dedicated to planar tracking. To understand performance of existing methods on PlanarTrack and to provide a comparison for future research, we evaluate 10 representative planar trackers with extensive comparison and in-depth analysis. Our evaluation reveals that, unsurprisingly, the top planar trackers heavily degrade on the challenging PlanarTrack, which indicates more efforts are required for improving planar tracking. Our data and results will be released at this https URL
36. 【2510.23363】Interpretable Tile-Based Classification of Paclitaxel Exposure
链接:https://arxiv.org/abs/2510.23363
作者:Sean Fletcher,Gabby Scott,Douglas Currie,Xin Zhang,Yuqi Song,Bruce MacLeod
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Medical image analysis, preclinical evaluation, objective readouts, accelerate decision-making, analysis is central
备注:
点击查看摘要
Abstract:Medical image analysis is central to drug discovery and preclinical evaluation, where scalable, objective readouts can accelerate decision-making. We address classification of paclitaxel (Taxol) exposure from phase-contrast microscopy of C6 glioma cells -- a task with subtle dose differences that challenges full-image models. We propose a simple tiling-and-aggregation pipeline that operates on local patches and combines tile outputs into an image label, achieving state-of-the-art accuracy on the benchmark dataset and improving over the published baseline by around 20 percentage points, with trends confirmed by cross-validation. To understand why tiling is effective, we further apply Grad-CAM and Score-CAM and attention analyses, which enhance model interpretability and point toward robustness-oriented directions for future medical image research. Code is released to facilitate reproduction and extension.
37. 【2510.23325】Multitask Multimodal Self-Supervised Learning for Medical Images
链接:https://arxiv.org/abs/2510.23325
作者:Cristian Simionescu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:medical image, medical image analysis, medical image datasets, extensive labeled datasets, medical imaging
备注:
点击查看摘要
Abstract:This thesis works to address a pivotal challenge in medical image analysis: the reliance on extensive labeled datasets, which are often limited due to the need for expert annotation and constrained by privacy and legal issues. By focusing on the development of self-supervised learning techniques and domain adaptation methods, this research aims to circumvent these limitations, presenting a novel approach to enhance the utility and efficacy of deep learning in medical imaging. Central to this thesis is the development of the Medformer, an innovative neural network architecture designed for multitask learning and deep domain adaptation. This model is adept at pre-training on diverse medical image datasets, handling varying sizes and modalities, and is equipped with a dynamic input-output adaptation mechanism. This enables efficient processing and integration of a wide range of medical image types, from 2D X-rays to complex 3D MRIs, thus mitigating the dependency on large labeled datasets. Further, the thesis explores the current state of self-supervised learning in medical imaging. It introduces novel pretext tasks that are capable of extracting meaningful information from unlabeled data, significantly advancing the model's interpretative abilities. This approach is validated through rigorous experimentation, including the use of the MedMNIST dataset, demonstrating the model's proficiency in learning generalized features applicable to various downstream tasks. In summary, this thesis contributes to the advancement of medical image analysis by offering a scalable, adaptable framework that reduces reliance on labeled data. It paves the way for more accurate, efficient diagnostic tools in healthcare, signifying a major step forward in the application of deep learning in medical imaging.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:
arXiv:2510.23325 [cs.CV]
(or
arXiv:2510.23325v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.23325
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Cristian Simionescu Drd [view email] [v1]
Mon, 27 Oct 2025 13:42:16 UTC (11,464 KB)
38. 【2510.23306】ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation
链接:https://arxiv.org/abs/2510.23306
作者:Jiahao Chang,Chongjie Ye,Yushuang Wu,Yuantao Chen,Yidan Zhang,Zhongjin Luo,Chenghong Li,Yihao Zhi,Xiaoguang Han
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:practice frequently yield, frequently yield severe, severe reconstruction incompleteness, methods heavily rely, yield severe reconstruction
备注: 18 pages, 7 figures
点击查看摘要
Abstract:Existing multi-view 3D object reconstruction methods heavily rely on sufficient overlap between input views, where occlusions and sparse coverage in practice frequently yield severe reconstruction incompleteness. Recent advancements in diffusion-based 3D generative techniques offer the potential to address these limitations by leveraging learned generative priors to hallucinate invisible parts of objects, thereby generating plausible 3D structures. However, the stochastic nature of the inference process limits the accuracy and reliability of generation results, preventing existing reconstruction frameworks from integrating such 3D generative priors. In this work, we comprehensively analyze the reasons why diffusion-based 3D generative methods fail to achieve high consistency, including (a) the insufficiency in constructing and leveraging cross-view connections when extracting multi-view image features as conditions, and (b) the poor controllability of iterative denoising during local detail generation, which easily leads to plausible but inconsistent fine geometric and texture details with inputs. Accordingly, we propose ReconViaGen to innovatively integrate reconstruction priors into the generative framework and devise several strategies that effectively address these issues. Extensive experiments demonstrate that our ReconViaGen can reconstruct complete and accurate 3D models consistent with input views in both global structure and local this http URL page: this https URL.
39. 【2510.23301】MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification
链接:https://arxiv.org/abs/2510.23301
作者:Yingying Feng,Jie Li,Jie Hu,Yukang Zhang,Lei Tan,Jiayi Ji
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Real-world object re-identification, Real-world object, face modality inconsistencies, object re-identification, systems often face
备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Real-world object re-identification (ReID) systems often face modality inconsistencies, where query and gallery images come from different sensors (e.g., RGB, NIR, TIR). However, most existing methods assume modality-matched conditions, which limits their robustness and scalability in practical applications. To address this challenge, we propose MDReID, a flexible any-to-any image-level ReID framework designed to operate under both modality-matched and modality-mismatched scenarios. MDReID builds on the insight that modality information can be decomposed into two components: modality-shared features that are predictable and transferable, and modality-specific features that capture unique, modality-dependent characteristics. To effectively leverage this, MDReID introduces two key components: the Modality Decoupling Learning (MDL) and Modality-aware Metric Learning (MML). Specifically, MDL explicitly decomposes modality features into modality-shared and modality-specific representations, enabling effective retrieval in both modality-aligned and mismatched scenarios. MML, a tailored metric learning strategy, further enforces orthogonality and complementarity between the two components to enhance discriminative power across modalities. Extensive experiments conducted on three challenging multi-modality ReID benchmarks (RGBNT201, RGBNT100, MSVR310) consistently demonstrate the superiority of MDReID. Notably, MDReID achieves significant mAP improvements of 9.8\%, 3.0\%, and 11.5\% in general modality-matched scenarios, and average gains of 3.4\%, 11.8\%, and 10.9\% in modality-mismatched scenarios, respectively. The code is available at: \textcolor{magenta}{this https URL}.
40. 【2510.23299】MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection
链接:https://arxiv.org/abs/2510.23299
作者:Haochen Zhao,Yuyao Kong,Yongxiu Xu,Gaopeng Gou,Hongbo Xu,Yubin Wang,Haoliang Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:overlooking potential semantic, methods predominantly focus, multimodal sarcasm detection, existing datasets, overlooking potential
备注:
点击查看摘要
Abstract:Despite progress in multimodal sarcasm detection, existing datasets and methods predominantly focus on single-image scenarios, overlooking potential semantic and affective relations across multiple images. This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. We further propose the Cross-Image Reasoning Model (CIRM), which performs targeted cross-image sequence modeling to capture latent inter-image connections. In addition, we introduce a relevance-guided, fine-grained cross-modal fusion mechanism based on text-image correspondence to reduce information loss during integration. We establish a comprehensive suite of strong and representative baselines and conduct extensive experiments, showing that MMSD3.0 is an effective and reliable benchmark that better reflects real-world conditions. Moreover, CIRM demonstrates state-of-the-art performance across MMSD, MMSD2.0 and MMSD3.0, validating its effectiveness in both single-image and multi-image scenarios.
41. 【2510.23285】Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling
链接:https://arxiv.org/abs/2510.23285
作者:Ruoyu Wang,Beier Zhu,Junzhi Li,Liangyu Yuan,Chi Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion-based generative processes, differential equation solving, frequently balance computational, balance computational speed, Diffusion-based generative
备注: To appear in NeurIPS 2025
点击查看摘要
Abstract:Diffusion-based generative processes, formulated as differential equation solving, frequently balance computational speed with sample quality. Our theoretical investigation of ODE- and SDE-based solvers reveals complementary weaknesses: ODE solvers accumulate irreducible gradient error along deterministic trajectories, while SDE methods suffer from amplified discretization errors when the step budget is limited. Building upon this insight, we introduce AdaSDE, a novel single-step SDE solver that aims to unify the efficiency of ODEs with the error resilience of SDEs. Specifically, we introduce a single per-step learnable coefficient, estimated via lightweight distillation, which dynamically regulates the error correction strength to accelerate diffusion sampling. Notably, our framework can be integrated with existing solvers to enhance their capabilities. Extensive experiments demonstrate state-of-the-art performance: at 5 NFE, AdaSDE achieves FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ and 6.96 on LSUN Bedroom. Codes are available in this https URL.
42. 【2510.23278】hYOLO Model: Enhancing Object Classification with Hierarchical Context in YOLOv8
链接:https://arxiv.org/abs/2510.23278
作者:Veska Tsenkova,Peter Stanchev,Daniel Petrov,Deyan Lazarov
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Current convolution neural, convolution neural network, Current convolution, neural network, convolution neural
备注: 39 pages, 12 figures, 4 tables, code available at [this https URL](https://github.com/ds2run/hyolo)
点击查看摘要
Abstract:Current convolution neural network (CNN) classification methods are predominantly focused on flat classification which aims solely to identify a specified object within an image. However, real-world objects often possess a natural hierarchical organization that can significantly help classification tasks. Capturing the presence of relations between objects enables better contextual understanding as well as control over the severity of mistakes. Considering these aspects, this paper proposes an end-to-end hierarchical model for image detection and classification built upon the YOLO model family. A novel hierarchical architecture, a modified loss function, and a performance metric tailored to the hierarchical nature of the model are introduced. The proposed model is trained and evaluated on two different hierarchical categorizations of the same dataset: a systematic categorization that disregards visual similarities between objects and a categorization accounting for common visual characteristics across classes. The results illustrate how the suggested methodology addresses the inherent hierarchical structure present in real-world objects, which conventional flat classification algorithms often overlook.
43. 【2510.23253】A Video Is Not Worth a Thousand Words
链接:https://arxiv.org/abs/2510.23253
作者:Sam Pollard,Michael Wray
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision language models, multiple-choice VQA task, multiple-choice VQA, increasingly dependent, dependent on vision
备注:
点击查看摘要
Abstract:As we become increasingly dependent on vision language models (VLMs) to answer questions about the world around us, there is a significant amount of research devoted to increasing both the difficulty of video question answering (VQA) datasets, and the context lengths of the models that they evaluate. The reliance on large language models as backbones has lead to concerns about potential text dominance, and the exploration of interactions between modalities is underdeveloped. How do we measure whether we're heading in the right direction, with the complexity that multi-modal models introduce? We propose a joint method of computing both feature attributions and modality scores based on Shapley values, where both the features and modalities are arbitrarily definable. Using these metrics, we compare $6$ VLM models of varying context lengths on $4$ representative datasets, focusing on multiple-choice VQA. In particular, we consider video frames and whole textual elements as equal features in the hierarchy, and the multiple-choice VQA task as an interaction between three modalities: video, question and answer. Our results demonstrate a dependence on text and show that the multiple-choice VQA task devolves into a model's ability to ignore distractors. Code available at this https URL.
44. 【2510.23241】Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation
链接:https://arxiv.org/abs/2510.23241
作者:Stefan M. Fischer,Johannes Kiechle,Laura Daza,Lina Felsner,Richard Osuala,Daniel M. Lang,Karim Lekadir,Jan C. Peeken,Julia A. Schnabel
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:introduce Progressive Growing, Progressive Growing, Dice score performance, Dice score, introduce Progressive
备注: Journal Extension of "Progressive Growing of Patch Size: Resource-Efficient Curriculum Learning for Dense Prediction Tasks" (MICCAI2024) submitted to MedIA
点击查看摘要
Abstract:In this work, we introduce Progressive Growing of Patch Size, an automatic curriculum learning approach for 3D medical image segmentation. Our approach progressively increases the patch size during model training, resulting in an improved class balance for smaller patch sizes and accelerated convergence of the training process. We evaluate our curriculum approach in two settings: a resource-efficient mode and a performance mode, both regarding Dice score performance and computational costs across 15 diverse and popular 3D medical image segmentation tasks. The resource-efficient mode matches the Dice score performance of the conventional constant patch size sampling baseline with a notable reduction in training time to only 44%. The performance mode improves upon constant patch size segmentation results, achieving a statistically significant relative mean performance gain of 1.28% in Dice Score. Remarkably, across all 15 tasks, our proposed performance mode manages to surpass the constant patch size baseline in Dice Score performance, while simultaneously reducing training time to only 89%. The benefits are particularly pronounced for highly imbalanced tasks such as lesion segmentation tasks. Rigorous experiments demonstrate that our performance mode not only improves mean segmentation performance but also reduces performance variance, yielding more trustworthy model comparison. Furthermore, our findings reveal that the proposed curriculum sampling is not tied to a specific architecture but represents a broadly applicable strategy that consistently boosts performance across diverse segmentation models, including UNet, UNETR, and SwinUNETR. In summary, we show that this simple yet elegant transformation on input data substantially improves both Dice Score performance and training runtime, while being compatible across diverse segmentation backbones.
45. 【2510.23240】Autoregressive Styled Text Image Generation, but Make it Reliable
链接:https://arxiv.org/abs/2510.23240
作者:Carmine Zaccagnino,Fabio Quattrini,Vittorio Pippi,Silvia Cascianelli,Alessio Tonioni,Rita Cucchiara
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Styled Handwritten Text, styled text images, readable styled text, Handwritten Text generation, Styled Handwritten
备注:
点击查看摘要
Abstract:Generating faithful and readable styled text images (especially for Styled Handwritten Text generation - HTG) is an open problem with several possible applications across graphic design, document understanding, and image editing. A lot of research effort in this task is dedicated to developing strategies that reproduce the stylistic characteristics of a given writer, with promising results in terms of style fidelity and generalization achieved by the recently proposed Autoregressive Transformer paradigm for HTG. However, this method requires additional inputs, lacks a proper stop mechanism, and might end up in repetition loops, generating visual artifacts. In this work, we rethink the autoregressive formulation by framing HTG as a multimodal prompt-conditioned generation task, and tackle the content controllability issues by introducing special textual input tokens for better alignment with the visual ones. Moreover, we devise a Classifier-Free-Guidance-based strategy for our autoregressive model. Through extensive experimental validation, we demonstrate that our approach, dubbed Eruku, compared to previous solutions requires fewer inputs, generalizes better to unseen styles, and follows more faithfully the textual prompt, improving content adherence.
46. 【2510.23225】hrough the Lens: Benchmarking Deepfake Detectors Against Moiré-Induced Distortions
链接:https://arxiv.org/abs/2510.23225
作者:Razaib Tariq,Minji Heo,Simon S. Woo,Shahroz Tariq
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:distort detection outcomes, introduces Moiré artifacts, remains a pressing, settings where smartphone-captured, smartphone-captured media
备注: 48 Pages, 29 Figures, 15 Tables
点击查看摘要
Abstract:Deepfake detection remains a pressing challenge, particularly in real-world settings where smartphone-captured media from digital screens often introduces Moiré artifacts that can distort detection outcomes. This study systematically evaluates state-of-the-art (SOTA) deepfake detectors on Moiré-affected videos, an issue that has received little attention. We collected a dataset of 12,832 videos, spanning 35.64 hours, from the Celeb-DF, DFD, DFDC, UADFV, and FF++ datasets, capturing footage under diverse real-world conditions, including varying screens, smartphones, lighting setups, and camera angles. To further examine the influence of Moiré patterns on deepfake detection, we conducted additional experiments using our DeepMoiréFake, referred to as (DMF) dataset and two synthetic Moiré generation techniques. Across 15 top-performing detectors, our results show that Moiré artifacts degrade performance by as much as 25.4%, while synthetically generated Moiré patterns lead to a 21.4% drop in accuracy. Surprisingly, demoiréing methods, intended as a mitigation approach, instead worsened the problem, reducing accuracy by up to 17.2%. These findings underscore the urgent need for detection models that can robustly handle Moiré distortions alongside other realworld challenges, such as compression, sharpening, and blurring. By introducing the DMF dataset, we aim to drive future research toward closing the gap between controlled experiments and practical deepfake detection.
47. 【2510.23224】Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment
链接:https://arxiv.org/abs/2510.23224
作者:Hongyi Wang,Zhengjie Zhu,Jiabo Ma,Fang Wang,Yue Shi,Bo Luo,Jili Wang,Qiuyu Cai,Xiuming Zhang,Yen-Wei Chen,Lanfen Lin,Hao Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:rapid digitization, digitization of histopathology, possibilities for computational, computational tools, retrieval
备注:
点击查看摘要
Abstract:The rapid digitization of histopathology slides has opened up new possibilities for computational tools in clinical and research workflows. Among these, content-based slide retrieval stands out, enabling pathologists to identify morphologically and semantically similar cases, thereby supporting precise diagnoses, enhancing consistency across observers, and assisting example-based education. However, effective retrieval of whole slide images (WSIs) remains challenging due to their gigapixel scale and the difficulty of capturing subtle semantic differences amid abundant irrelevant content. To overcome these challenges, we present PathSearch, a retrieval framework that unifies fine-grained attentive mosaic representations with global-wise slide embeddings aligned through vision-language contrastive learning. Trained on a corpus of 6,926 slide-report pairs, PathSearch captures both fine-grained morphological cues and high-level semantic patterns to enable accurate and flexible retrieval. The framework supports two key functionalities: (1) mosaic-based image-to-image retrieval, ensuring accurate and efficient slide research; and (2) multi-modal retrieval, where text queries can directly retrieve relevant slides. PathSearch was rigorously evaluated on four public pathology datasets and three in-house cohorts, covering tasks including anatomical site retrieval, tumor subtyping, tumor vs. non-tumor discrimination, and grading across diverse organs such as breast, lung, kidney, liver, and stomach. External results show that PathSearch outperforms traditional image-to-image retrieval frameworks. A multi-center reader study further demonstrates that PathSearch improves diagnostic accuracy, boosts confidence, and enhances inter-observer agreement among pathologists in real clinical scenarios. These results establish PathSearch as a scalable and generalizable retrieval solution for digital pathology.
48. 【2510.23205】VR-Drive: Viewpoint-Robust End-to-End Driving with Feed-Forward 3D Gaussian Splatting
链接:https://arxiv.org/abs/2510.23205
作者:Hoonhee Cho,Jae-Young Kang,Giwon Lee,Hyemin Yang,Heejun Park,Seokwoo Jung,Kuk-Jin Yoon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:unifies perception, promising paradigm, paradigm that unifies, data-driven framework, autonomous driving
备注: Accepted by NeurIPS2025
点击查看摘要
Abstract:End-to-end autonomous driving (E2E-AD) has emerged as a promising paradigm that unifies perception, prediction, and planning into a holistic, data-driven framework. However, achieving robustness to varying camera viewpoints, a common real-world challenge due to diverse vehicle configurations, remains an open problem. In this work, we propose VR-Drive, a novel E2E-AD framework that addresses viewpoint generalization by jointly learning 3D scene reconstruction as an auxiliary task to enable planning-aware view synthesis. Unlike prior scene-specific synthesis approaches, VR-Drive adopts a feed-forward inference strategy that supports online training-time augmentation from sparse views without additional annotations. To further improve viewpoint consistency, we introduce a viewpoint-mixed memory bank that facilitates temporal interaction across multiple viewpoints and a viewpoint-consistent distillation strategy that transfers knowledge from original to synthesized views. Trained in a fully end-to-end manner, VR-Drive effectively mitigates synthesis-induced noise and improves planning under viewpoint shifts. In addition, we release a new benchmark dataset to evaluate E2E-AD performance under novel camera viewpoints, enabling comprehensive analysis. Our results demonstrate that VR-Drive is a scalable and robust solution for the real-world deployment of end-to-end autonomous driving systems.
49. 【2510.23203】DecoDINO: 3D Human-Scene Contact Prediction with Semantic Classification
链接:https://arxiv.org/abs/2510.23203
作者:Lukas Bierling,Davide Pasero,Fleur Dolmans,Helia Ghasemi,Angelo Broere
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fidelity human object, human object interaction, Accurate vertex-level contact, high fidelity human, Accurate vertex-level
备注:
点击查看摘要
Abstract:Accurate vertex-level contact prediction between humans and surrounding objects is a prerequisite for high fidelity human object interaction models used in robotics, AR/VR, and behavioral simulation. DECO was the first in the wild estimator for this task but is limited to binary contact maps and struggles with soft surfaces, occlusions, children, and false-positive foot contacts. We address these issues and introduce DecoDINO, a three-branch network based on DECO's framework. It uses two DINOv2 ViT-g/14 encoders, class-balanced loss weighting to reduce bias, and patch-level cross-attention for improved local reasoning. Vertex features are finally passed through a lightweight MLP with a softmax to assign semantic contact labels. We also tested a vision-language model (VLM) to integrate text features, but the simpler architecture performed better and was used instead. On the DAMON benchmark, DecoDINO (i) raises the binary-contact F1 score by 7$\%$, (ii) halves the geodesic error, and (iii) augments predictions with object-level semantic labels. Ablation studies show that LoRA fine-tuning and the dual encoders are key to these improvements. DecoDINO outperformed the challenge baseline in both tasks of the DAMON Challenge. Our code is available at this https URL.
50. 【2510.23190】Evaluation of Vision-LLMs in Surveillance Video
链接:https://arxiv.org/abs/2510.23190
作者:Pascal Benschop,Cristian Meo,Justin Dauwels,Jelte P. Mense
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:human monitoring, society has created, created an overwhelming, overwhelming amount, capacity for human
备注: Accepted as poster in the NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI
点击查看摘要
Abstract:The widespread use of cameras in our society has created an overwhelming amount of video data, far exceeding the capacity for human monitoring. This presents a critical challenge for public safety and security, as the timely detection of anomalous or criminal events is crucial for effective response and prevention. The ability for an embodied agent to recognize unexpected events is fundamentally tied to its capacity for spatial reasoning. This paper investigates the spatial reasoning of vision-language models (VLMs) by framing anomalous action recognition as a zero-shot, language-grounded task, addressing the embodied perception challenge of interpreting dynamic 3D scenes from sparse 2D video. Specifically, we investigate whether small, pre-trained vision--LLMs can act as spatially-grounded, zero-shot anomaly detectors by converting video into text descriptions and scoring labels via textual entailment. We evaluate four open models on UCF-Crime and RWF-2000 under prompting and privacy-preserving conditions. Few-shot exemplars can improve accuracy for some models, but may increase false positives, and privacy filters -- especially full-body GAN transforms -- introduce inconsistencies that degrade accuracy. These results chart where current vision--LLMs succeed (simple, spatially salient events) and where they falter (noisy spatial cues, identity obfuscation). Looking forward, we outline concrete paths to strengthen spatial grounding without task-specific training: structure-aware prompts, lightweight spatial memory across clips, scene-graph or 3D-pose priors during description, and privacy methods that preserve action-relevant geometry. This positions zero-shot, language-grounded pipelines as adaptable building blocks for embodied, real-world video understanding. Our implementation for evaluating VLMs is publicly available at: this https URL
51. 【2510.23184】Finding 3D Scene Analogies with Multimodal Foundation Models
链接:https://arxiv.org/abs/2510.23184
作者:Junho Kim,Young Min Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Connecting current observations, Connecting current, current observations, observations with prior, prior experiences
备注: Accepted to FM4RoboPlan workshop at RSS 2025
点击查看摘要
Abstract:Connecting current observations with prior experiences helps robots adapt and plan in new, unseen 3D environments. Recently, 3D scene analogies have been proposed to connect two 3D scenes, which are smooth maps that align scene regions with common spatial relationships. These maps enable detailed transfer of trajectories or waypoints, potentially supporting demonstration transfer for imitation learning or task plan transfer across scenes. However, existing methods for the task require additional training and fixed object vocabularies. In this work, we propose to use multimodal foundation models for finding 3D scene analogies in a zero-shot, open-vocabulary setting. Central to our approach is a hybrid neural representation of scenes that consists of a sparse graph based on vision-language model features and a feature field derived from 3D shape foundation models. 3D scene analogies are then found in a coarse-to-fine manner, by first aligning the graph and refining the correspondence with feature fields. Our method can establish accurate correspondences between complex scenes, and we showcase applications in trajectory and waypoint transfer.
52. 【2510.23151】AG-Fusion: adaptive gated multimodal fusion for 3d object detection in complex scenes
链接:https://arxiv.org/abs/2510.23151
作者:Sixian Liu,Chen Xu,Qiang Wang,Donghai Shi,Yiwen Li
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Multimodal camera-LiDAR fusion, found extensive application, Multimodal camera-LiDAR, camera-LiDAR fusion technology, Adaptive Gated Fusion
备注:
点击查看摘要
Abstract:Multimodal camera-LiDAR fusion technology has found extensive application in 3D object detection, demonstrating encouraging performance. However, existing methods exhibit significant performance degradation in challenging scenarios characterized by sensor degradation or environmental disturbances. We propose a novel Adaptive Gated Fusion (AG-Fusion) approach that selectively integrates cross-modal knowledge by identifying reliable patterns for robust detection in complex scenes. Specifically, we first project features from each modality into a unified BEV space and enhance them using a window-based attention mechanism. Subsequently, an adaptive gated fusion module based on cross-modal attention is designed to integrate these features into reliable BEV representations robust to challenging environments. Furthermore, we construct a new dataset named Excavator3D (E3D) focusing on challenging excavator operation scenarios to benchmark performance in complex conditions. Our method not only achieves competitive performance on the standard KITTI dataset with 93.92% accuracy, but also significantly outperforms the baseline by 24.88% on the challenging E3D dataset, demonstrating superior robustness to unreliable modal information in complex industrial scenes.
53. 【2510.23145】Implicit Modeling for Transferability Estimation of Vision Foundation Models
链接:https://arxiv.org/abs/2510.23145
作者:Yaoyan Zheng,Huiqun Wang,Nan Zhou,Di Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:high computational cost, Transferability estimation identifies, estimation identifies, incurring the high, high computational
备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Transferability estimation identifies the best pre-trained models for downstream tasks without incurring the high computational cost of full fine-tuning. This capability facilitates deployment and advances the pre-training and fine-tuning paradigm. However, existing methods often struggle to accurately assess transferability for emerging pre-trained models with diverse architectures, training strategies, and task alignments. In this work, we propose Implicit Transferability Modeling (ITM), a novel framework that implicitly models each model's intrinsic transferability, coupled with a Divide-and-Conquer Variational Approximation (DVA) strategy to efficiently approximate embedding space evolution. This design enables generalization across a broader range of models and downstream tasks. Extensive experiments on a comprehensive benchmark--spanning extensive training regimes and a wider variety of model types--demonstrate that ITM consistently outperforms existing methods in terms of stability, effectiveness, and efficiency.
54. 【2510.23144】DQ3D: Depth-guided Query for Transformer-Based 3D Object Detection in Traffic Scenarios
链接:https://arxiv.org/abs/2510.23144
作者:Ziyu Wang,Wenhao Li,Ji Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:garnered significant attention, recent years, multi-view images, images in traffic, traffic scenarios
备注:
点击查看摘要
Abstract:3D object detection from multi-view images in traffic scenarios has garnered significant attention in recent years. Many existing approaches rely on object queries that are generated from 3D reference points to localize objects. However, a limitation of these methods is that some reference points are often far from the target object, which can lead to false positive detections. In this paper, we propose a depth-guided query generator for 3D object detection (DQ3D) that leverages depth information and 2D detections to ensure that reference points are sampled from the surface or interior of the object. Furthermore, to address partially occluded objects in current frame, we introduce a hybrid attention mechanism that fuses historical detection results with depth-guided queries, thereby forming hybrid queries. Evaluation on the nuScenes dataset demonstrates that our method outperforms the baseline by 6.3\% in terms of mean Average Precision (mAP) and 4.3\% in the NuScenes Detection Score (NDS).
55. 【2510.23140】Fast Voxel-Wise Kinetic Modeling in Dynamic PET using a Physics-Informed CycleGAN
链接:https://arxiv.org/abs/2510.23140
作者:Christian Salomonsen,Samuel Kuttner,Michael Kampffmeyer,Robert Jenssen,Kristoffer Wickstrøm,Jong Chul Ye,Elisabeth Wetzer
类目:Computer Vision and Pattern Recognition (cs.CV); Other Quantitative Biology (q-bio.OT)
关键词:Tracer kinetic modeling, input function estimation, kinetic modeling serves, invasive arterial input, arterial input function
备注: 5 pages, 1 figure. Pre-review preprint. Submitted to MedEurIPS 2025 (EurIPS workshop)
点击查看摘要
Abstract:Tracer kinetic modeling serves a vital role in diagnosis, treatment planning, tracer development and oncology, but burdens practitioners with complex and invasive arterial input function estimation (AIF). We adopt a physics-informed CycleGAN showing promise in DCE-MRI quantification to dynamic PET quantification. Our experiments demonstrate sound AIF predictions and parameter maps closely resembling the reference.
56. 【2510.23137】Note on the Construction of Structure Tensor
链接:https://arxiv.org/abs/2510.23137
作者:Josef Bigun,Fernado Alonso-Fernandez
类目:Computer Vision and Pattern Recognition (cs.CV); Spectral Theory (math.SP)
关键词:proposed by Bigun, Granlund and Knutsson, note presents, presents a theoretical, theoretical discussion
备注:
点击查看摘要
Abstract:This note presents a theoretical discussion of two structure tensor constructions: one proposed by Bigun and Granlund 1987, and the other by Granlund and Knutsson 1995. At first glance, these approaches may appear quite different--the former is implemented by averaging outer products of gradient filter responses, while the latter constructs the tensor from weighted outer products of tune-in frequency vectors of quadrature filters. We argue that when both constructions are viewed through the common lens of Total Least Squares (TLS) line fitting to the power spectrum, they can be reconciled to a large extent, and additional benefits emerge. From this perspective, the correction term introduced in Granlund and Knutsson 1995 becomes unnecessary. Omitting it ensures that the resulting tensor remains positive semi-definite, thereby simplifying the interpretation of its eigenvalues. Furthermore, this interpretation allows fitting more than a single 0rientation to the input by reinterpreting quadrature filter responses without relying on a structure tensor. It also removes the constraint that responses must originate strictly from quadrature filters, allowing the use of alternative filter types and non-angular tessellations. These alternatives include Gabor filters--which, although not strictly quadrature, are still suitable for structure tensor construction--even when they tessellate the spectrum in a Cartesian fashion, provided they are sufficiently concentrated.
57. 【2510.23124】DeepSalt: Bridging Laboratory and Satellite Spectra through Domain Adaptation and Knowledge Distillation for Large-Scale Soil Salinity Estimation
链接:https://arxiv.org/abs/2510.23124
作者:Rupasree Dey,Abdul Matin,Everett Lewark,Tanjim Bin Faruk,Andrei Bachinin,Sam Leuthold,M. Francesca Cotrufo,Shrideep Pallickara,Sangmi Lee Pallickara
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:reduces crop productivity, Soil salinization poses, limits plants' ability, Spectral Adaptation Unit, reduces crop
备注:
点击查看摘要
Abstract:Soil salinization poses a significant threat to both ecosystems and agriculture because it limits plants' ability to absorb water and, in doing so, reduces crop productivity. This phenomenon alters the soil's spectral properties, creating a measurable relationship between salinity and light reflectance that enables remote monitoring. While laboratory spectroscopy provides precise measurements, its reliance on in-situ sampling limits scalability to regional or global levels. Conversely, hyperspectral satellite imagery enables wide-area observation but lacks the fine-grained interpretability of laboratory instruments. To bridge this gap, we introduce DeepSalt, a deep-learning-based spectral transfer framework that leverages knowledge distillation and a novel Spectral Adaptation Unit to transfer high-resolution spectral insights from laboratory-based spectroscopy to satellite-based hyperspectral sensing. Our approach eliminates the need for extensive ground sampling while enabling accurate, large-scale salinity estimation, as demonstrated through comprehensive empirical benchmarks. DeepSalt achieves significant performance gains over methods without explicit domain adaptation, underscoring the impact of the proposed Spectral Adaptation Unit and the knowledge distillation strategy. The model also effectively generalized to unseen geographic regions, explaining a substantial portion of the salinity variance.
58. 【2510.23118】ask-Agnostic Fusion of Time Series and Imagery for Earth Observation
链接:https://arxiv.org/abs/2510.23118
作者:Gianfranco Basile,Johannes Jakubik,Benedikt Blumenstiel,Thomas Brunschwiler,Juan Bernabe Moreno
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling cross-modal generation, single timestamp images, robust downstream performance, time series, time series quantization
备注:
点击查看摘要
Abstract:We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R^2 and 2% in RMSE on average, and exceeds baseline methods by 50\% in R$^2$ and 12\% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and weights will be released under a permissive license.
59. 【2510.23117】Seeing Structural Failure Before it Happens: An Image-Based Physics-Informed Neural Network (PINN) for Spaghetti Bridge Load Prediction
链接:https://arxiv.org/abs/2510.23117
作者:Omer Jauhar Khan,Sudais Khan,Hafeez Anwar
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Informed Neural Networks, Physics Informed Neural, Neural Networks, Informed Neural, Physics Informed Kolmogorov
备注: 12 pages, 17 figures. Preprint
点击查看摘要
Abstract:Physics Informed Neural Networks (PINNs) are gaining attention for their ability to embed physical laws into deep learning models, which is particularly useful in structural engineering tasks with limited data. This paper aims to explore the use of PINNs to predict the weight of small scale spaghetti bridges, a task relevant to understanding load limits and potential failure modes in simplified structural models. Our proposed framework incorporates physics-based constraints to the prediction model for improved performance. In addition to standard PINNs, we introduce a novel architecture named Physics Informed Kolmogorov Arnold Network (PIKAN), which blends universal function approximation theory with physical insights. The structural parameters provided as input to the model are collected either manually or through computer vision methods. Our dataset includes 15 real bridges, augmented to 100 samples, and our best model achieves an $R^2$ score of 0.9603 and a mean absolute error (MAE) of 10.50 units. From applied perspective, we also provide a web based interface for parameter entry and prediction. These results show that PINNs can offer reliable estimates of structural weight, even with limited data, and may help inform early stage failure analysis in lightweight bridge designs. The complete data and code are available at this https URL.
Comments:
12 pages, 17 figures. Preprint
Subjects:
Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
MSC classes:
65M70 (Primary), 68T07 (Secondary)
ACMclasses:
I.2.6; I.4.8; G.1.8
Cite as:
arXiv:2510.23117 [cs.LG]
(or
arXiv:2510.23117v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2510.23117
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
60. 【2510.23116】Residual Diffusion Bridge Model for Image Restoration
链接:https://arxiv.org/abs/2510.23116
作者:Hebaixu Wang,Jing Zhang,Haoyang Chen,Haonan Guo,Di Wang,Jiayi Ma,Bo Du
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:establish probabilistic paths, exhibit great potential, models establish probabilistic, arbitrary paired distributions, bridge models establish
备注:
点击查看摘要
Abstract:Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat them as simple variants of stochastic interpolants, lacking a unified analytical perspective. Besides, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the Residual Diffusion Bridge Model (RDBM). Specifically, we theoretically reformulate the stochastic differential equations of generalized diffusion bridge and derive the analytical formulas of its forward and reverse processes. Crucially, we leverage the residuals from given distributions to modulate the noise injection and removal, enabling adaptive restoration of degraded regions while preserving intact others. Moreover, we unravel the fundamental mathematical essence of existing bridge models, all of which are special cases of RDBM and empirically demonstrate the optimality of our proposed models. Extensive experiments are conducted to demonstrate the state-of-the-art performance of our method both qualitatively and quantitatively across diverse image restoration tasks. Code is publicly available at this https URL.
61. 【2510.23095】Revisiting Multimodal Positional Encoding in Vision-Language Models
链接:https://arxiv.org/abs/2510.23095
作者:Jie Huang,Xuejing Liu,Sibo Song,Ruibing Hou,Hong Chang,Junyang Lin,Shuai Bai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal position encoding, position encoding, Rotary Positional Embedding, multimodal Rotary Positional, vision-language models
备注: 16 pages
点击查看摘要
Abstract:Multimodal position encoding is essential for vision-language models, yet there has been little systematic investigation into multimodal position encoding. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors-ensuring unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding. Code will be avaliable at this https URL.
62. 【2510.23087】EndoWave: Rational-Wavelet 4D Gaussian Splatting for Endoscopic Reconstruction
链接:https://arxiv.org/abs/2510.23087
作者:Taoyu Wu,Yiyi Miao,Jiaxin Guo,Ziyan Chen,Sihang Zhao,Zhuoxiao Li,Zhe Tang,Baoru Huang,Limin Yu
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:minimally invasive surgery, robot-assisted minimally invasive, invasive surgery, improved outcomes, robot-assisted minimally
备注:
点击查看摘要
Abstract:In robot-assisted minimally invasive surgery, accurate 3D reconstruction from endoscopic video is vital for downstream tasks and improved outcomes. However, endoscopic scenarios present unique challenges, including photometric inconsistencies, non-rigid tissue motion, and view-dependent highlights. Most 3DGS-based methods that rely solely on appearance constraints for optimizing 3DGS are often insufficient in this context, as these dynamic visual artifacts can mislead the optimization process and lead to inaccurate reconstructions. To address these limitations, we present EndoWave, a unified spatio-temporal Gaussian Splatting framework by incorporating an optical flow-based geometric constraint and a multi-resolution rational wavelet supervision. First, we adopt a unified spatio-temporal Gaussian representation that directly optimizes primitives in a 4D domain. Second, we propose a geometric constraint derived from optical flow to enhance temporal coherence and effectively constrain the 3D structure of the scene. Third, we propose a multi-resolution rational orthogonal wavelet as a constraint, which can effectively separate the details of the endoscope and enhance the rendering performance. Extensive evaluations on two real surgical datasets, EndoNeRF and StereoMIS, demonstrate that our method EndoWave achieves state-of-the-art reconstruction quality and visual accuracy compared to the baseline method.
63. 【2510.23079】Strategies for Robust Deep Learning Based Deformable Registration
链接:https://arxiv.org/abs/2510.23079
作者:Joel Honkamaa,Pekka Marttinen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep learning based, learning based deformable, based deformable registration, deformable registration methods, Deep learning
备注:
点击查看摘要
Abstract:Deep learning based deformable registration methods have become popular in recent years. However, their ability to generalize beyond training data distribution can be poor, significantly hindering their usability. LUMIR brain registration challenge for Learn2Reg 2025 aims to advance the field by evaluating the performance of the registration on contrasts and modalities different from those included in the training set. Here we describe our submission to the challenge, which proposes a very simple idea for significantly improving robustness by transforming the images into MIND feature space before feeding them into the model. In addition, a special ensembling strategy is proposed that shows a small but consistent improvement.
64. 【2510.23057】Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation
链接:https://arxiv.org/abs/2510.23057
作者:Oskar Natan,Jun Miura
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
关键词:realworld environments, autonomous legged navigation, legged robot navigation, legged navigation, present Seq-DeepIPC
备注: Preprint notice, this manuscript has been submitted to IEEE sensors journal for possible publication
点击查看摘要
Abstract:We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in realworld environments. Seq-DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi-modal perception (RGB-D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use EfficientNet-B0 as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead computing the bearing angle directly from consecutive GNSS positions. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally-aware systems. To support future research, we will release the codes to our GitHub repository at this https URL.
65. 【2510.23043】HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling
链接:https://arxiv.org/abs/2510.23043
作者:Joungbin An,Kristen Grauman
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:natural language query, requires capturing, Video temporal grounding, task of localizing, localizing the start
备注: Project Page: [this https URL](https://vision.cs.utexas.edu/projects/hieramamba/)
点击查看摘要
Abstract:Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine-grained temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or relying on fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which utilize Mamba's selective scanning to produce compact anchor tokens that summarize video content at multiple granularities. Two complementary objectives, anchor-conditioned and segment-pooled contrastive losses, encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.
66. 【2510.23028】Nested AutoRegressive Models
链接:https://arxiv.org/abs/2510.23028
作者:Hongyu Wu,Xuhui Fan,Zhangkai Wu,Longbing Cao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:achieving results comparable, achieving results, results comparable, image generation, NestAR
备注:
点击查看摘要
Abstract:AutoRegressive (AR) models have demonstrated competitive performance in image generation, achieving results comparable to those of diffusion models. However, their token-by-token image generation mechanism remains computationally intensive and existing solutions such as VAR often lead to limited sample diversity. In this work, we propose a Nested AutoRegressive~(NestAR) model, which proposes nested AutoRegressive architectures in generating images. NestAR designs multi-scale modules in a hierarchical order. These different scaled modules are constructed in an AR architecture, where one larger-scale module is conditioned on outputs from its previous smaller-scale module. Within each module, NestAR uses another AR structure to generate ``patches'' of tokens. The proposed nested AR architecture reduces the overall complexity from $\mathcal{O}(n)$ to $\mathcal{O}(\log n)$ in generating $n$ image tokens, as well as increases image diversities. NestAR further incorporates flow matching loss to use continuous tokens, and develops objectives to coordinate these multi-scale modules in model training. NestAR achieves competitive image generation performance while significantly lowering computational cost.
67. 【2510.23023】UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization
链接:https://arxiv.org/abs/2510.23023
作者:Huixuan Zhang,Xiaojun Wan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:generative models, significant concern, rapid proliferation, authenticity of digital, image generative models
备注:
点击查看摘要
Abstract:With the rapid proliferation of image generative models, the authenticity of digital images has become a significant concern. While existing studies have proposed various methods for detecting AI-generated content, current benchmarks are limited in their coverage of diverse generative models and image categories, often overlooking end-to-end image editing and artistic images. To address these limitations, we introduce UniAIDet, a unified and comprehensive benchmark that includes both photographic and artistic images. UniAIDet covers a wide range of generative models, including text-to-image, image-to-image, image inpainting, image editing, and deepfake models. Using UniAIDet, we conduct a comprehensive evaluation of various detection methods and answer three key research questions regarding generalization capability and the relation between detection and localization. Our benchmark and analysis provide a robust foundation for future research.
68. 【2510.23020】M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark
链接:https://arxiv.org/abs/2510.23020
作者:Huixuan Zhang,Xiaojun Wan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:struggle with generating, generating images, images that perfectly, textual prompts, perfectly align
备注:
点击查看摘要
Abstract:Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, especially overlooking the difficulty of prompts with multiple different instances belonging to the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation along with an object-detection-based evaluation metric, $AlignScore$, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This training-free post-editing method demonstrates improvements in image-text alignment across a broad range of diffusion models. \footnote{Our code and data has been released in supplementary material and will be made publicly available after the paper is accepted.}
69. 【2510.23009】UGAE: Unified Geometry and Attribute Enhancement for G-PCC Compressed Point Clouds
链接:https://arxiv.org/abs/2510.23009
作者:Pan Zhao,Hui Yuan,Chongzhen Tian,Tian Guo,Raouf Hamzaoui,Zhigeng Pan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:point clouds reduces, clouds reduces storage, Lossy compression, transmission costs, point clouds
备注:
点击查看摘要
Abstract:Lossy compression of point clouds reduces storage and transmission costs; however, it inevitably leads to irreversible distortion in geometry structure and attribute information. To address these issues, we propose a unified geometry and attribute enhancement (UGAE) framework, which consists of three core components: post-geometry enhancement (PoGE), pre-attribute enhancement (PAE), and post-attribute enhancement (PoAE). In PoGE, a Transformer-based sparse convolutional U-Net is used to reconstruct the geometry structure with high precision by predicting voxel occupancy probabilities. Building on the refined geometry structure, PAE introduces an innovative enhanced geometry-guided recoloring strategy, which uses a detail-aware K-Nearest Neighbors (DA-KNN) method to achieve accurate recoloring and effectively preserve high-frequency details before attribute compression. Finally, at the decoder side, PoAE uses an attribute residual prediction network with a weighted mean squared error (W-MSE) loss to enhance the quality of high-frequency regions while maintaining the fidelity of low-frequency regions. UGAE significantly outperformed existing methods on three benchmark datasets: 8iVFB, Owlii, and MVUB. Compared to the latest G-PCC test model (TMC13v29), UGAE achieved an average BD-PSNR gain of 9.98 dB and 90.98% BD-bitrate savings for geometry under the D1 metric, as well as a 3.67 dB BD-PSNR improvement with 56.88% BD-bitrate savings for attributes on the Y component. Additionally, it improved perceptual quality significantly.
70. 【2510.23007】CoMo: Compositional Motion Customization for Text-to-Video Generation
链接:https://arxiv.org/abs/2510.23007
作者:Youcan Xu,Zhen Wang,Jiaxin Shi,Kexin Li,Feifei Shao,Jun Xiao,Yi Yang,Jun Yu,Long Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generating diverse scenes, precise motion control, models excel, diverse scenes, excel at generating
备注:
点击查看摘要
Abstract:While recent text-to-video models excel at generating diverse scenes, they struggle with precise motion control, particularly for complex, multi-subject motions. Although methods for single-motion customization have been developed to address this gap, they fail in compositional scenarios due to two primary challenges: motion-appearance entanglement and ineffective multi-motion blending. This paper introduces CoMo, a novel framework for $\textbf{compositional motion customization}$ in text-to-video generation, enabling the synthesis of multiple, distinct motions within a single video. CoMo addresses these issues through a two-phase approach. First, in the single-motion learning phase, a static-dynamic decoupled tuning paradigm disentangles motion from appearance to learn a motion-specific module. Second, in the multi-motion composition phase, a plug-and-play divide-and-merge strategy composes these learned motions without additional training by spatially isolating their influence during the denoising process. To facilitate research in this new domain, we also introduce a new benchmark and a novel evaluation metric designed to assess multi-motion fidelity and blending. Extensive experiments demonstrate that CoMo achieves state-of-the-art performance, significantly advancing the capabilities of controllable video generation. Our project page is at this https URL.
71. 【2510.23003】An Intelligent Water-Saving Irrigation System Based on Multi-Sensor Fusion and Visual Servoing Control
链接:https://arxiv.org/abs/2510.23003
作者:ZhengKai Huang,YiKun Wang,ChenYu Hui,XiaoCheng
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
关键词:address critical challenges, poor terrain adaptability, intelligent water-saving irrigation, precision agriculture, paper introduces
备注:
点击查看摘要
Abstract:This paper introduces an intelligent water-saving irrigation system designed to address critical challenges in precision agriculture, such as inefficient water use and poor terrain adaptability. The system integrates advanced computer vision, robotic control, and real-time stabilization technologies via a multi-sensor fusion approach. A lightweight YOLO model, deployed on an embedded vision processor (K210), enables real-time plant container detection with over 96% accuracy under varying lighting conditions. A simplified hand-eye calibration algorithm-designed for 'handheld camera' robot arm configurations-ensures that the end effector can be precisely positioned, with a success rate exceeding 90%. The active leveling system, driven by the STM32F103ZET6 main control chip and JY901S inertial measurement data, can stabilize the irrigation platform on slopes up to 10 degrees, with a response time of 1.8 seconds. Experimental results across three simulated agricultural environments (standard greenhouse, hilly terrain, complex lighting) demonstrate a 30-50% reduction in water consumption compared to conventional flood irrigation, with water use efficiency exceeding 92% in all test cases.
72. 【2510.22995】LoMix: Learnable Weighted Multi-Scale Logits Mixing for Medical Image Segmentation
链接:https://arxiv.org/abs/2510.22995
作者:Md Mostafijur Rahman,Radu Marculescu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:multiple spatial scales, multiple spatial, spatial scales, fine detail, logits
备注: 25 pages, 13 figures, NeurIPS 2025 accepted paper
点击查看摘要
Abstract:U-shaped networks output logits at multiple spatial scales, each capturing a different blend of coarse context and fine detail. Yet, training still treats these logits in isolation - either supervising only the final, highest-resolution logits or applying deep supervision with identical loss weights at every scale - without exploring mixed-scale combinations. Consequently, the decoder output misses the complementary cues that arise only when coarse and fine predictions are fused. To address this issue, we introduce LoMix (Logits Mixing), a NAS-inspired, differentiable plug-and-play module that generates new mixed-scale outputs and learns how exactly each of them should guide the training process. More precisely, LoMix mixes the multi-scale decoder logits with four lightweight fusion operators: addition, multiplication, concatenation, and attention-based weighted fusion, yielding a rich set of synthetic mutant maps. Every original or mutant map is given a softplus loss weight that is co-optimized with network parameters, mimicking a one-step architecture search that automatically discovers the most useful scales, mixtures, and operators. Plugging LoMix into recent U-shaped architectures (i.e., PVT-V2-B2 backbone with EMCAD decoder) on Synapse 8-organ dataset improves DICE by +4.2% over single-output supervision, +2.2% over deep supervision, and +1.5% over equally weighted additive fusion, all with zero inference overhead. When training data are scarce (e.g., one or two labeled scans), the advantage grows to +9.23%, underscoring LoMix's data efficiency. Across four benchmarks and diverse U-shaped networks, LoMiX improves DICE by up to +13.5% over single-output supervision, confirming that learnable weighted mixed-scale fusion generalizes broadly while remaining data efficient, fully interpretable, and overhead-free at inference. Our code is available at this https URL.
73. 【2510.22994】SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency
链接:https://arxiv.org/abs/2510.22994
作者:Quanjian Song,Donghao Zhou,Jingyu Lin,Fei Shen,Jiaze Wang,Xiaowei Hu,Cunjian Chen,Pheng-Ann Heng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:revolutionized image generation, maintaining concept consistency, models have revolutionized, revolutionized image, scene consistency
备注: Accepted by NeurIPS 2025; Project Page: [this https URL](https://lulupig12138.github.io/SceneDecorator)
点击查看摘要
Abstract:Recent text-to-image models have revolutionized image generation, but they still struggle with maintaining concept consistency across generated images. While existing works focus on character consistency, they often overlook the crucial role of scenes in storytelling, which restricts their creativity in practice. This paper introduces scene-oriented story generation, addressing two key challenges: (i) scene planning, where current methods fail to ensure scene-level narrative coherence by relying solely on text descriptions, and (ii) scene consistency, which remains largely unexplored in terms of maintaining scene consistency across multiple stories. We propose SceneDecorator, a training-free framework that employs VLM-Guided Scene Planning to ensure narrative coherence across different scenes in a ``global-to-local'' manner, and Long-Term Scene-Sharing Attention to maintain long-term scene consistency and subject diversity across generated stories. Extensive experiments demonstrate the superior performance of SceneDecorator, highlighting its potential to unleash creativity in the fields of arts, films, and games.
74. 【2510.22981】Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction
链接:https://arxiv.org/abs/2510.22981
作者:Jin Hu,Jiakai Wang,Linna Jing,Haolin Li,Haodong Liu,Haotong Qin,Aishan Liu,Ke Xu,Xianglong Liu
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:future research due, flexible attacking forms, natural language instructions, directly generated, generated from natural
备注: NeurIPS 2025
点击查看摘要
Abstract:Recently, semantically constrained adversarial examples (SemanticAE), which are directly generated from natural language instructions, have become a promising avenue for future research due to their flexible attacking forms. To generate SemanticAEs, current methods fall short of satisfactory attacking ability as the key underlying factors of semantic uncertainty in human instructions, such as referring diversity, descriptive incompleteness, and boundary ambiguity, have not been fully investigated. To tackle the issues, this paper develops a multi-dimensional instruction uncertainty reduction (InSUR) framework to generate more satisfactory SemanticAE, i.e., transferable, adaptive, and effective. Specifically, in the dimension of the sampling method, we propose the residual-driven attacking direction stabilization to alleviate the unstable adversarial optimization caused by the diversity of language references. By coarsely predicting the language-guided sampling process, the optimization process will be stabilized by the designed ResAdv-DDIM sampler, therefore releasing the transferable and robust adversarial capability of multi-step diffusion models. In task modeling, we propose the context-encoded attacking scenario constraint to supplement the missing knowledge from incomplete human instructions. Guidance masking and renderer integration are proposed to regulate the constraints of 2D/3D SemanticAE, activating stronger scenario-adapted attacks. Moreover, in the dimension of generator evaluation, we propose the semantic-abstracted attacking evaluation enhancement by clarifying the evaluation boundary, facilitating the development of more effective SemanticAE generators. Extensive experiments demonstrate the superiority of the transfer attack performance of InSUR. Moreover, we realize the reference-free generation of semantically constrained 3D adversarial examples for the first time.
75. 【2510.22975】VoMP: Predicting Volumetric Mechanical Property Fields
链接:https://arxiv.org/abs/2510.22975
作者:Rishit Dagli,Donglai Xiang,Vismay Modi,Charles Loop,Clement Fuji Tsang,Anka He Chen,Anita Hu,Gavriel State,David I.W. Levin,Maria Shugrina
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
关键词:Physical simulation relies, Physical simulation, spatially-varying mechanical properties, laboriously hand-crafted, simulation relies
备注: hi-res paper and other details at: [this https URL](https://research.nvidia.com/labs/sil/projects/vomp)
点击查看摘要
Abstract:Physical simulation relies on spatially-varying mechanical properties, often laboriously hand-crafted. VoMP is a feed-forward method trained to predict Young's modulus ($E$), Poisson's ratio ($\nu$), and density ($\rho$) throughout the volume of 3D objects, in any representation that can be rendered and voxelized. VoMP aggregates per-voxel multi-view features and passes them to our trained Geometry Transformer to predict per-voxel material latent codes. These latents reside on a manifold of physically plausible materials, which we learn from a real-world dataset, guaranteeing the validity of decoded per-voxel materials. To obtain object-level training data, we propose an annotation pipeline combining knowledge from segmented 3D datasets, material databases, and a vision-language model, along with a new benchmark. Experiments show that VoMP estimates accurate volumetric properties, far outperforming prior art in accuracy and speed.
76. 【2510.22973】Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method
链接:https://arxiv.org/abs/2510.22973
作者:Bohan Li,Xin Jin,Hu Zhu,Hongsi Liu,Ruikai Li,Jiazhe Guo,Kaiwen Cai,Chao Ma,Yueming Jin,Hao Zhao,Xiaokang Yang,Wenjun Zeng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling downstream applications, including perception, planning evaluation, critical domain, perception and planning
备注: [this https URL](https://github.com/Arlo0o/UniScene-Unified-Occupancy-centric-Driving-Scene-Generation/tree/v2)
点击查看摘要
Abstract:Driving scene generation is a critical domain for autonomous driving, enabling downstream applications, including perception and planning evaluation. Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities; however, their performance heavily depends on annotated occupancy data, which still remains scarce. To overcome this limitation, we curate Nuplan-Occ, the largest semantic occupancy dataset to date, constructed from the widely used Nuplan benchmark. Its scale and diversity facilitate not only large-scale generative modeling but also autonomous driving downstream applications. Based on this dataset, we develop a unified framework that jointly synthesizes high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. Our approach incorporates a spatio-temporal disentangled architecture to support high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy. To bridge modal gaps, we further propose two novel techniques: a Gaussian splatting-based sparse point map rendering strategy that enhances multi-view video generation, and a sensor-aware embedding strategy that explicitly models LiDAR sensor properties for realistic multi-LiDAR simulation. Extensive experiments demonstrate that our method achieves superior generation fidelity and scalability compared to existing approaches, and validates its practical value in downstream tasks. Repo: this https URL
77. 【2510.22970】VALA: Learning Latent Anchors for Training-Free and Temporally Consistent
链接:https://arxiv.org/abs/2510.22970
作者:Zhangkai Wu,Xuhui Fan,Zhongyuan Xie,Kaize Shi,Longbing Cao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent advances, textbf, precise cross-frame generation, leveraging pre-trained, enabled lightweight
备注:
点击查看摘要
Abstract:Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose~\textbf{VALA} (\textbf{V}ariational \textbf{A}lignment for \textbf{L}atent \textbf{A}nchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA propose a variational framework with a contrastive learning objective. Therefore, it can transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence. Our method can be fully integrated into training-free text-to-image based video editing models. Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.
78. 【2510.22964】Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges
链接:https://arxiv.org/abs/2510.22964
作者:Liling Yang,Ning Chen,Jun Yue,Yidan Liu,Jiayi Ma,Pedram Ghamisi,Antonio Plaza,Leyuan Fang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:transformed natural language, natural language processing, sensing image analysis, reshaping remote sensing, remote sensing image
备注:
点击查看摘要
Abstract:Foundation models have transformed natural language processing and computer vision, and their impact is now reshaping remote sensing image analysis. With powerful generalization and transfer learning capabilities, they align naturally with the multimodal, multi-resolution, and multi-temporal characteristics of remote sensing data. To address unique challenges in the field, multimodal geospatial foundation models (GFMs) have emerged as a dedicated research frontier. This survey delivers a comprehensive review of multimodal GFMs from a modality-driven perspective, covering five core visual and vision-language modalities. We examine how differences in imaging physics and data representation shape interaction design, and we analyze key techniques for alignment, integration, and knowledge transfer to tackle modality heterogeneity, distribution shifts, and semantic gaps. Advances in training paradigms, architectures, and task-specific adaptation strategies are systematically assessed alongside a wealth of emerging benchmarks. Representative multimodal visual and vision-language GFMs are evaluated across ten downstream tasks, with insights into their architectures, performance, and application scenarios. Real-world case studies, spanning land cover mapping, agricultural monitoring, disaster response, climate studies, and geospatial intelligence, demonstrate the practical potential of GFMs. Finally, we outline pressing challenges in domain generalization, interpretability, efficiency, and privacy, and chart promising avenues for future research.
79. 【2510.22960】FAME: Fairness-aware Attention-modulated Video Editing
链接:https://arxiv.org/abs/2510.22960
作者:Zhangkai Wu,Xuhui Fan,Zhongyuan Xie,Kaize Shi,Zhidong Li,Longbing Cao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Training-free video editing, Attention-modulated Video Editing, Fairness-aware Attention-modulated Video, Training-free video, video editing
备注:
点击查看摘要
Abstract:Training-free video editing (VE) models tend to fall back on gender stereotypes when rendering profession-related prompts. We propose \textbf{FAME} for \textit{Fairness-aware Attention-modulated Video Editing} that mitigates profession-related gender biases while preserving prompt alignment and temporal consistency for coherent VE. We derive fairness embeddings from existing minority representations by softly injecting debiasing tokens into the text encoder. Simultaneously, FAME integrates fairness modulation into both temporal self attention and prompt-to-region cross attention to mitigate the motion corruption and temporal inconsistency caused by directly introducing fairness cues. For temporal self attention, FAME introduces a region constrained attention mask combined with time decay weighting, which enhances intra-region coherence while suppressing irrelevant inter-region interactions. For cross attention, it reweights tokens to region matching scores by incorporating fairness sensitive similarity masks derived from debiasing prompt embeddings. Together, these modulations keep fairness-sensitive semantics tied to the right visual regions and prevent temporal drift across frames. Extensive experiments on new VE fairness-oriented benchmark \textit{FairVE} demonstrate that FAME achieves stronger fairness alignment and semantic fidelity, surpassing existing VE baselines.
80. 【2510.22946】LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
链接:https://arxiv.org/abs/2510.22946
作者:Zeyu Wang,Zilong Chen,Chenhui Gou,Feng Li,Chaorui Deng,Deyao Zhu,Kunchang Li,Weihao Yu,Haoqin Tu,Haoqi Fan,Cihang Xie
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:substantial computational resources, recently shown remarkable, shown remarkable gains, require substantial computational, capability and versatility
备注: Preprint. Project page: [this https URL](https://ucsc-vlaa.github.io/LightBagel/)
点击查看摘要
Abstract:Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~ 35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
81. 【2510.22943】Switchable Token-Specific Codebook Quantization For Face Image Compression
链接:https://arxiv.org/abs/2510.22943
作者:Yongbo Wang,Haonan Wang,Guodong Mu,Ruixin Zhang,Jiaqi Chen,Jingyun Zhang,Jun Wang,Yuan Xie,Zhizhong Zhang,Shouhong Ding
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:modern information systems, visual data, lossless transmission, interpretation and understanding, information systems
备注:
点击查看摘要
Abstract:With the ever-increasing volume of visual data, the efficient and lossless transmission, along with its subsequent interpretation and understanding, has become a critical bottleneck in modern information systems. The emerged codebook-based solution utilize a globally shared codebook to quantize and dequantize each token, controlling the bpp by adjusting the number of tokens or the codebook size. However, for facial images, which are rich in attributes, such global codebook strategies overlook both the category-specific correlations within images and the semantic differences among tokens, resulting in suboptimal performance, especially at low bpp. Motivated by these observations, we propose a Switchable Token-Specific Codebook Quantization for face image compression, which learns distinct codebook groups for different image categories and assigns an independent codebook to each token. By recording the codebook group to which each token belongs with a small number of bits, our method can reduce the loss incurred when decreasing the size of each codebook group. This enables a larger total number of codebooks under a lower overall bpp, thereby enhancing the expressive capability and improving reconstruction performance. Owing to its generalizable design, our method can be integrated into any existing codebook-based representation learning approach and has demonstrated its effectiveness on face recognition datasets, achieving an average accuracy of 93.51% for reconstructed images at 0.05 bpp.
82. 【2510.22937】Bi-Encoder Contrastive Learning for Fingerprint and Iris Biometrics
链接:https://arxiv.org/abs/2510.22937
作者:Matthew So,Judah Goldfeder,Mark Lis,Hod Lipson
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:statistically uncorrelated, Vision Transformer backbones, historic assumption, matching, ROC AUC score
备注:
点击查看摘要
Abstract:There has been a historic assumption that the biometrics of an individual are statistically uncorrelated. We test this assumption by training Bi-Encoder networks on three verification tasks, including fingerprint-to-fingerprint matching, iris-to-iris matching, and cross-modal fingerprint-to-iris matching using 274 subjects with $\sim$100k fingerprints and 7k iris images. We trained ResNet-50 and Vision Transformer backbones in Bi-Encoder architectures such that the contrastive loss between images sampled from the same individual is minimized. The iris ResNet architecture reaches 91 ROC AUC score for iris-to-iris matching, providing clear evidence that the left and right irises of an individual are correlated. Fingerprint models reproduce the positive intra-subject suggested by prior work in this space. This is the first work attempting to use Vision Transformers for this matching. Cross-modal matching rises only slightly above chance, which suggests that more data and a more sophisticated pipeline is needed to obtain compelling results. These findings continue challenge independence assumptions of biometrics and we plan to extend this work to other biometrics in the future. Code available: this https URL.
83. 【2510.22936】Positional Preservation Embedding for Multimodal Large Language Models
链接:https://arxiv.org/abs/2510.22936
作者:Mouxiao Huang,Borui Jiang,Dehua Zheng,Hailin Hu,Kai Han,Xinghao Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal large language, large language models, Multimodal large, achieved strong performance, language models
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. Existing token merging methods reduce sequence length but frequently disrupt spatial layouts and temporal continuity by disregarding positional relationships. In this work, we propose a novel encoding operator dubbed as \textbf{P}ositional \textbf{P}reservation \textbf{E}mbedding (\textbf{PPE}), which has the main hallmark of preservation of spatiotemporal structure during visual token compression. PPE explicitly introduces the disentangled encoding of 3D positions in the token dimension, enabling each compressed token to encapsulate different positions from multiple original tokens. Furthermore, we show that PPE can effectively support cascade clustering -- a progressive token compression strategy that leads to better performance retention. PPE is a parameter-free and generic operator that can be seamlessly integrated into existing token merging methods without any adjustments. Applied to state-of-the-art token merging framework, PPE achieves consistent improvements of $2\%\sim5\%$ across multiple vision-language benchmarks, including MMBench (general vision understanding), TextVQA (layout understanding) and VideoMME (temporal understanding). These results demonstrate that preserving positional cues is critical for efficient and effective MLLM reasoning.
84. 【2510.22930】Gen-LangSplat: Generalized Language Gaussian Splatting with Pre-Trained Feature Compression
链接:https://arxiv.org/abs/2510.22930
作者:Pranav Saxena
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:intuitive human-AI interaction, Modeling open-vocabulary language, physical environments, Modeling open-vocabulary, essential for intuitive
备注:
点击查看摘要
Abstract:Modeling open-vocabulary language fields in 3D is essential for intuitive human-AI interaction and querying within physical environments. State-of-the-art approaches, such as LangSplat, leverage 3D Gaussian Splatting to efficiently construct these language fields, encoding features distilled from high-dimensional models like CLIP. However, this efficiency is currently offset by the requirement to train a scene-specific language autoencoder for feature compression, introducing a costly, per-scene optimization bottleneck that hinders deployment scalability. In this work, we introduce Gen-LangSplat, that eliminates this requirement by replacing the scene-wise autoencoder with a generalized autoencoder, pre-trained extensively on the large-scale ScanNet dataset. This architectural shift enables the use of a fixed, compact latent space for language features across any new scene without any scene-specific training. By removing this dependency, our entire language field construction process achieves a efficiency boost while delivering querying performance comparable to, or exceeding, the original LangSplat method. To validate our design choice, we perform a thorough ablation study empirically determining the optimal latent embedding dimension and quantifying representational fidelity using Mean Squared Error and cosine similarity between the original and reprojected 512-dimensional CLIP embeddings. Our results demonstrate that generalized embeddings can efficiently and accurately support open-vocabulary querying in novel 3D scenes, paving the way for scalable, real-time interactive 3D AI applications.
85. 【2510.22916】Estimating Pasture Biomass from Top-View Images: A Dataset for Precision Agriculture
链接:https://arxiv.org/abs/2510.22916
作者:Qiyu Liao,Dadong Wang,Rebecca Haling,Jiajun Liu,Xun Li,Martyna Plomecka,Andrew Robson,Matthew Pringle,Rhys Pirie,Megan Walker,Joshua Whelan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:livestock production systems, Accurate estimation, important for decision-making, decision-making in livestock, livestock production
备注: 9 pages, 2 figures, 2 tables, The dataset is available on the official Kaggle webpage: [this https URL](https://www.kaggle.com/competitions/csiro-biomass)
点击查看摘要
Abstract:Accurate estimation of pasture biomass is important for decision-making in livestock production systems. Estimates of pasture biomass can be used to manage stocking rates to maximise pasture utilisation, while minimising the risk of overgrazing and promoting overall system health. We present a comprehensive dataset of 1,162 annotated top-view images of pastures collected across 19 locations in Australia. The images were taken across multiple seasons and include a range of temperate pasture species. Each image captures a 70cm * 30cm quadrat and is paired with on-ground measurements including biomass sorted by component (green, dead, and legume fraction), vegetation height, and Normalized Difference Vegetation Index (NDVI) from Active Optical Sensors (AOS). The multidimensional nature of the data, which combines visual, spectral, and structural information, opens up new possibilities for advancing the use of precision grazing management. The dataset is released and hosted in a Kaggle competition that challenges the international Machine Learning community with the task of pasture biomass estimation. The dataset is available on the official Kaggle webpage: this https URL
86. 【2510.22868】Seeing the Unseen: Towards Zero-Shot Inspection for Wind Turbine Blades using Knowledge-Augmented Vision Language Models
链接:https://arxiv.org/abs/2510.22868
作者:Yang Zhang,Qianyu Zhou,Farhad Imani,Jiong Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Wind turbine blades, making timely damage, Wind turbine, turbine blades operate, harsh environments
备注:
点击查看摘要
Abstract:Wind turbine blades operate in harsh environments, making timely damage detection essential for preventing failures and optimizing maintenance. Drone-based inspection and deep learning are promising, but typically depend on large, labeled datasets, which limit their ability to detect rare or evolving damage types. To address this, we propose a zero-shot-oriented inspection framework that integrates Retrieval-Augmented Generation (RAG) with Vision-Language Models (VLM). A multimodal knowledge base is constructed, comprising technical documentation, representative reference images, and domain-specific guidelines. A hybrid text-image retriever with keyword-aware reranking assembles the most relevant context to condition the VLM at inference, injecting domain knowledge without task-specific training. We evaluate the framework on 30 labeled blade images covering diverse damage categories. Although the dataset is small due to the difficulty of acquiring verified blade imagery, it covers multiple representative defect types. On this test set, the RAG-grounded VLM correctly classified all samples, whereas the same VLM without retrieval performed worse in both accuracy and precision. We further compare against open-vocabulary baselines and incorporate uncertainty Clopper-Pearson confidence intervals to account for the small-sample setting. Ablation studies indicate that the key advantage of the framework lies in explainability and generalizability: retrieved references ground the reasoning process and enable the detection of previously unseen defects by leveraging domain knowledge rather than relying solely on visual cues. This research contributes a data-efficient solution for industrial inspection that reduces dependence on extensive labeled datasets.
87. 【2510.22851】Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models
链接:https://arxiv.org/abs/2510.22851
作者:Lexiang Xiong,Chengyu Liu,Jingwen Ye,Yan Liu,Yuecong Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:compromise generative quality, mitigating harmful content, Semantic Surgery, models is crucial, crucial for mitigating
备注: Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025). Code is available at [this https URL](https://github.com/Lexiang-Xiong/Semantic-Surgery)
点击查看摘要
Abstract:Concept erasure in text-to-image diffusion models is crucial for mitigating harmful content, yet existing methods often compromise generative quality. We introduce Semantic Surgery, a novel training-free, zero-shot framework for concept erasure that operates directly on text embeddings before the diffusion process. It dynamically estimates the presence of target concepts in a prompt and performs a calibrated vector subtraction to neutralize their influence at the source, enhancing both erasure completeness and locality. The framework includes a Co-Occurrence Encoding module for robust multi-concept erasure and a visual feedback loop to address latent concept persistence. As a training-free method, Semantic Surgery adapts dynamically to each prompt, ensuring precise interventions. Extensive experiments on object, explicit content, artistic style, and multi-celebrity erasure tasks show our method significantly outperforms state-of-the-art approaches. We achieve superior completeness and robustness while preserving locality and image quality (e.g., 93.58 H-score in object erasure, reducing explicit content to just 1 instance, and 8.09 H_a in style erasure with no quality degradation). This robustness also allows our framework to function as a built-in threat detection system, offering a practical solution for safer text-to-image generation.
88. 【2510.22842】FastJAM: a Fast Joint Alignment Model for Images
链接:https://arxiv.org/abs/2510.22842
作者:Omri Hirsch,Ron Shapira Weber,Shira Ifergane,Oren Freifeld
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:unified coordinate frame, coordinate frame, spatial locations, aims to align, align a collection
备注: Accepted to NeurIPS 2025. Pages 1-10 are the Main Paper. Pages 23-31 are Supplemental Material. FastJAM website - [this https URL](https://bgu-cs-vil.github.io/FastJAM/)
点击查看摘要
Abstract:Joint Alignment (JA) of images aims to align a collection of images into a unified coordinate frame, such that semantically-similar features appear at corresponding spatial locations. Most existing approaches often require long training times, large-capacity models, and extensive hyperparameter tuning. We introduce FastJAM, a rapid, graph-based method that drastically reduces the computational complexity of joint alignment tasks. FastJAM leverages pairwise matches computed by an off-the-shelf image matcher, together with a rapid nonparametric clustering, to construct a graph representing intra- and inter-image keypoint relations. A graph neural network propagates and aggregates these correspondences, efficiently predicting per-image homography parameters via image-level pooling. Utilizing an inverse-compositional loss, that eliminates the need for a regularization term over the predicted transformations (and thus also obviates the hyperparameter tuning associated with such terms), FastJAM performs image JA quickly and effectively. Experimental results on several benchmarks demonstrate that FastJAM achieves results better than existing modern JA methods in terms of alignment quality, while reducing computation time from hours or minutes to mere seconds. Our code is available at our project webpage, this https URL
89. 【2510.22838】Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models
链接:https://arxiv.org/abs/2510.22838
作者:Aya Nakayama,Brian Wong,Yuji Nishimura,Kaito Tanaka
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Vision-Language Models, Vision-Language Models, challenge for Large, Large Vision-Language, hindering robust semantic
备注:
点击查看摘要
Abstract:The "style trap" poses a significant challenge for Large Vision-Language Models (LVLMs), hindering robust semantic understanding across diverse visual styles, especially in in-context learning (ICL). Existing methods often fail to effectively decouple style from content, hindering generalization. To address this, we propose the Semantic-Preserving Cross-Style Visual Reasoner (SP-CSVR), a novel framework for stable semantic understanding and adaptive cross-style visual reasoning. SP-CSVR integrates a Cross-Style Feature Encoder (CSFE) for style-content disentanglement, a Semantic-Aligned In-Context Decoder (SAICD) for efficient few-shot style adaptation, and an Adaptive Semantic Consistency Module (ASCM) employing multi-task contrastive learning to enforce cross-style semantic invariance. Extensive experiments on a challenging multi-style dataset demonstrate SP-CSVR's state-of-the-art performance across visual captioning, visual question answering, and in-context style adaptation. Comprehensive evaluations, including ablation studies and generalization analysis, confirm SP-CSVR's efficacy in enhancing robustness, generalization, and efficiency across diverse visual styles.
90. 【2510.22829】LLM-based Fusion of Multi-modal Features for Commercial Memorability Prediction
链接:https://arxiv.org/abs/2510.22829
作者:Aleksandar Pramov
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
关键词:Predicting movie, workshop competition, commercial memorability, addresses the prediction, commercial
备注:
点击查看摘要
Abstract:This paper addresses the prediction of commercial (brand) memorability as part of "Subtask 2: Commercial/Ad Memorability" within the "Memorability: Predicting movie and commercial memorability" task at the MediaEval 2025 workshop competition. We propose a multimodal fusion system with a Gemma-3 LLM backbone that integrates pre-computed visual (ViT) and textual (E5) features by multi-modal projections. The model is adapted using Low-Rank Adaptation (LoRA). A heavily-tuned ensemble of gradient boosted trees serves as a baseline. A key contribution is the use of LLM-generated rationale prompts, grounded in expert-derived aspects of memorability, to guide the fusion model. The results demonstrate that the LLM-based system exhibits greater robustness and generalization performance on the final test set, compared to the baseline. The paper's codebase can be found at this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Cite as:
arXiv:2510.22829 [cs.CV]
(or
arXiv:2510.22829v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.22829
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
91. 【2510.22827】FairJudge: MLLM Judging for Social Attributes and Prompt Image Alignment
链接:https://arxiv.org/abs/2510.22827
作者:Zahraa Al Sahili,Maryam Fetanat,Maimuna Nowaz,Ioannis Patras,Matthew Purver
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:images match prompts, systems lack simple, treat social attributes, models treat social, images match
备注:
点击查看摘要
Abstract:Text-to-image (T2I) systems lack simple, reproducible ways to evaluate how well images match prompts and how models treat social attributes. Common proxies -- face classifiers and contrastive similarity -- reward surface cues, lack calibrated abstention, and miss attributes only weakly visible (for example, religion, culture, disability). We present FairJudge, a lightweight protocol that treats instruction-following multimodal LLMs as fair judges. It scores alignment with an explanation-oriented rubric mapped to [-1, 1]; constrains judgments to a closed label set; requires evidence grounded in the visible content; and mandates abstention when cues are insufficient. Unlike CLIP-only pipelines, FairJudge yields accountable, evidence-aware decisions; unlike mitigation that alters generators, it targets evaluation fairness. We evaluate gender, race, and age on FairFace, PaTA, and FairCoT; extend to religion, culture, and disability; and assess profession correctness and alignment on IdenProf, FairCoT-Professions, and our new DIVERSIFY-Professions. We also release DIVERSIFY, a 469-image corpus of diverse, non-iconic scenes. Across datasets, judge models outperform contrastive and face-centric baselines on demographic prediction and improve mean alignment while maintaining high profession accuracy, enabling more reliable, reproducible fairness audits.
92. 【2510.22810】MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control
链接:https://arxiv.org/abs/2510.22810
作者:Fatemeh Nazarieh,Zhenhua Feng,Diptesh Kanojia,Muhammad Awais,Josef Kittler
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:gained significant attention, Audio-driven talking face, talking face generation, Audio-driven talking, virtual avatars
备注:
点击查看摘要
Abstract:Audio-driven talking face generation has gained significant attention for applications in digital media and virtual avatars. While recent methods improve audio-lip synchronization, they often struggle with temporal consistency, identity preservation, and customization, especially in long video generation. To address these issues, we propose MAGIC-Talk, a one-shot diffusion-based framework for customizable and temporally stable talking face generation. MAGIC-Talk consists of ReferenceNet, which preserves identity and enables fine-grained facial editing via text prompts, and AnimateNet, which enhances motion coherence using structured motion priors. Unlike previous methods requiring multiple reference images or fine-tuning, MAGIC-Talk maintains identity from a single image while ensuring smooth transitions across frames. Additionally, a progressive latent fusion strategy is introduced to improve long-form video quality by reducing motion inconsistencies and flickering. Extensive experiments demonstrate that MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy, offering a robust solution for talking face generation.
93. 【2510.22803】MedXplain-VQA: Multi-Component Explainable Medical Visual Question Answering
链接:https://arxiv.org/abs/2510.22803
作者:Hai-Dang Nguyen,Minh-Anh Dang,Minh-Tan Le,Minh-Tuan Le
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:visual question answering, trust AI-generated diagnoses, physicians require transparent, Explainability is critical, medical visual question
备注: 10 pages, 4 figures, IEEE conference format
点击查看摘要
Abstract:Explainability is critical for the clinical adoption of medical visual question answering (VQA) systems, as physicians require transparent reasoning to trust AI-generated diagnoses. We present MedXplain-VQA, a comprehensive framework integrating five explainable AI components to deliver interpretable medical image analysis. The framework leverages a fine-tuned BLIP-2 backbone, medical query reformulation, enhanced Grad-CAM attention, precise region extraction, and structured chain-of-thought reasoning via multi-modal language models. To evaluate the system, we introduce a medical-domain-specific framework replacing traditional NLP metrics with clinically relevant assessments, including terminology coverage, clinical structure quality, and attention region relevance. Experiments on 500 PathVQA histopathology samples demonstrate substantial improvements, with the enhanced system achieving a composite score of 0.683 compared to 0.378 for baseline methods, while maintaining high reasoning confidence (0.890). Our system identifies 3-5 diagnostically relevant regions per sample and generates structured explanations averaging 57 words with appropriate clinical terminology. Ablation studies reveal that query reformulation provides the most significant initial improvement, while chain-of-thought reasoning enables systematic diagnostic processes. These findings underscore the potential of MedXplain-VQA as a robust, explainable medical VQA system. Future work will focus on validation with medical experts and large-scale clinical datasets to ensure clinical readiness.
94. 【2510.22785】Self-Calibrated Consistency can Fight Back for Adversarial Robustness in Vision-Language Models
链接:https://arxiv.org/abs/2510.22785
作者:Jiaxiang Liu,Jiawei Du,Xiao Liu,Prayag Tiwari,Mingkun Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Pre-trained vision-language models, remain highly vulnerable, demonstrated strong zero-shot, strong zero-shot capabilities, disrupt image-text alignment
备注:
点击查看摘要
Abstract:Pre-trained vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot capabilities across diverse domains, yet remain highly vulnerable to adversarial perturbations that disrupt image-text alignment and compromise reliability. Existing defenses typically rely on adversarial fine-tuning with labeled data, limiting their applicability in zero-shot settings. In this work, we identify two key weaknesses of current CLIP adversarial attacks -- lack of semantic guidance and vulnerability to view variations -- collectively termed semantic and viewpoint fragility. To address these challenges, we propose Self-Calibrated Consistency (SCC), an effective test-time defense. SCC consists of two complementary modules: Semantic consistency, which leverages soft pseudo-labels from counterattack warm-up and multi-view predictions to regularize cross-modal alignment and separate the target embedding from confusable negatives; and Spatial consistency, aligning perturbed visual predictions via augmented views to stabilize inference under adversarial perturbations. Together, these modules form a plug-and-play inference strategy. Extensive experiments on 22 benchmarks under diverse attack settings show that SCC consistently improves the zero-shot robustness of CLIP while maintaining accuracy, and can be seamlessly integrated with other VLMs for further gains. These findings highlight the great potential of establishing an adversarially robust paradigm from CLIP, with implications extending to broader vision-language domains such as BioMedCLIP.
95. 【2510.22743】ConMatFormer: A Multi-attention and Transformer Integrated ConvNext based Deep Learning Model for Enhanced Diabetic Foot Ulcer Classification
链接:https://arxiv.org/abs/2510.22743
作者:Raihan Ahamed Rifat,Fuyad Hasan Bhoyan,Md Humaion Kabir Mehedi,Md Kaviul Hossain,Md. Jakir Hossen,M. F. Mridha
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:challenging task due, challenging task, task due, scarcity and variability, variability of publicly
备注:
点击查看摘要
Abstract:Diabetic foot ulcer (DFU) detection is a clinically significant yet challenging task due to the scarcity and variability of publicly available datasets. To solve these problems, we propose ConMatFormer, a new hybrid deep learning architecture that combines ConvNeXt blocks, multiple attention mechanisms convolutional block attention module (CBAM) and dual attention network (DANet), and transformer modules in a way that works together. This design facilitates the extraction of better local features and understanding of the global context, which allows us to model small skin patterns across different types of DFU very accurately. To address the class imbalance, we used data augmentation methods. A ConvNeXt block was used to obtain detailed local features in the initial stages. Subsequently, we compiled the model by adding a transformer module to enhance long-range dependency. This enabled us to pinpoint the DFU classes that were underrepresented or constituted minorities. Tests on the DS1 (DFUC2021) and DS2 (diabetic foot ulcer (DFU)) datasets showed that ConMatFormer outperformed state-of-the-art (SOTA) convolutional neural network (CNN) and Vision Transformer (ViT) models in terms of accuracy, reliability, and flexibility. The proposed method achieved an accuracy of 0.8961 and a precision of 0.9160 in a single experiment, which is a significant improvement over the current standards for classifying DFUs. In addition, by 4-fold cross-validation, the proposed model achieved an accuracy of 0.9755 with a standard deviation of only 0.0031. We further applied explainable artificial intelligence (XAI) methods, such as Grad-CAM, Grad-CAM++, and LIME, to consistently monitor the transparency and trustworthiness of the decision-making process.. Our findings set a new benchmark for DFU classification and provide a hybrid attention transformer framework for medical image analysis.
96. 【2510.22736】Cross-view Localization and Synthesis -- Datasets, Challenges and Opportunities
链接:https://arxiv.org/abs/2510.22736
作者:Ningli Xu,Rongjun Qin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:cross-view visual understanding, satellite or aerial, Cross-view, Cross-view localization, visual understanding
备注: 15 Figures
点击查看摘要
Abstract:Cross-view localization and synthesis are two fundamental tasks in cross-view visual understanding, which deals with cross-view datasets: overhead (satellite or aerial) and ground-level imagery. These tasks have gained increasing attention due to their broad applications in autonomous navigation, urban planning, and augmented reality. Cross-view localization aims to estimate the geographic position of ground-level images based on information provided by overhead imagery while cross-view synthesis seeks to generate ground-level images based on information from the overhead imagery. Both tasks remain challenging due to significant differences in viewing perspective, resolution, and occlusion, which are widely embedded in cross-view datasets. Recent years have witnessed rapid progress driven by the availability of large-scale datasets and novel approaches. Typically, cross-view localization is formulated as an image retrieval problem where ground-level features are matched with tiled overhead images feature, extracted by convolutional neural networks (CNNs) or vision transformers (ViTs) for cross-view feature embedding. Cross-view synthesis, on the other hand, seeks to generate ground-level views based on information from overhead imagery, generally using generative adversarial networks (GANs) or diffusion models. This paper presents a comprehensive survey of advances in cross-view localization and synthesis, reviewing widely used datasets, highlighting key challenges, and providing an organized overview of state-of-the-art techniques. Furthermore, it discusses current limitations, offers comparative analyses, and outlines promising directions for future research. We also include the project page via this https URL.
97. 【2510.22728】S-Chain: Structured Visual Chain-of-Thought For Medicine
链接:https://arxiv.org/abs/2510.22728
作者:Khai Le-Duc,Duy M. H. Nguyen,Phuong T. H. Trinh,Tien-Phat Nguyen,Nghiem T. Diep,An Ngo,Tung Vu,Trinh Vuong,Anh-Tien Nguyen,Mau Nguyen,Van Trung Hoang,Khai-Nguyen Nguyen,Hy Nguyen,Chris Ngo,Anji Liu,Nhat Ho,Anne-Christin Hauschild,Khanh Xuan Nguyen,Thanh Nguyen-Tang,Pengtao Xie,Daniel Sonntag,James Zou,Mathias Niepert,Anh Totti Nguyen
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:medical vision-language models, Faithful reasoning, vision-language models, accurate predictions, textual rationales
备注: First version
点击查看摘要
Abstract:Faithful reasoning in medical vision-language models (VLMs) requires not only accurate predictions but also transparent alignment between textual rationales and visual evidence. While Chain-of-Thought (CoT) prompting has shown promise in medical visual question answering (VQA), no large-scale expert-level dataset has captured stepwise reasoning with precise visual grounding. We introduce S-Chain, the first large-scale dataset of 12,000 expert-annotated medical images with bounding boxes and structured visual CoT (SV-CoT), explicitly linking visual regions to reasoning steps. The dataset further supports 16 languages, totaling over 700k VQA pairs for broad multilingual applicability. Using S-Chain, we benchmark state-of-the-art medical VLMs (ExGra-Med, LLaVA-Med) and general-purpose VLMs (Qwen2.5-VL, InternVL2.5), showing that SV-CoT supervision significantly improves interpretability, grounding fidelity, and robustness. Beyond benchmarking, we study its synergy with retrieval-augmented generation, revealing how domain knowledge and visual grounding interact during autoregressive reasoning. Finally, we propose a new mechanism that strengthens the alignment between visual evidence and reasoning, improving both reliability and efficiency. S-Chain establishes a new benchmark for grounded medical reasoning and paves the way toward more trustworthy and explainable medical VLMs.
98. 【2510.22718】Edge Collaborative Gaussian Splatting with Integrated Rendering and Communication
链接:https://arxiv.org/abs/2510.22718
作者:Yujie Wan,Chenxuan Liu,Shuai Wang,Tong Zhang,James Jianqiao Yu,Kejiang Ye,Dusit Niyato,Chengzhong Xu
类目:Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian splatting, degraded rendering quality, struggles with degraded, low-cost devices, quality on low-cost
备注: 5 pages and 7 figures, submitted for possible publication
点击查看摘要
Abstract:Gaussian splatting (GS) struggles with degraded rendering quality on low-cost devices. To address this issue, we present edge collaborative GS (ECO-GS), where each user can switch between a local small GS model to guarantee timeliness and a remote large GS model to guarantee fidelity. However, deciding how to engage the large GS model is nontrivial, due to the interdependency between rendering requirements and resource conditions. To this end, we propose integrated rendering and communication (IRAC), which jointly optimizes collaboration status (i.e., deciding whether to engage large GS) and edge power allocation (i.e., enabling remote rendering) under communication constraints across different users by minimizing a newly-derived GS switching function. Despite the nonconvexity of the problem, we propose an efficient penalty majorization minimization (PMM) algorithm to obtain the critical point solution. Furthermore, we develop an imitation learning optimization (ILO) algorithm, which reduces the computational time by over 100x compared to PMM. Experiments demonstrate the superiority of PMM and the real-time execution capability of ILO.
99. 【2510.22716】LRW-Persian: Lip-reading in the Wild Dataset for Persian Language
链接:https://arxiv.org/abs/2510.22716
作者:Zahra Taghizadeh,Mohammad Shahverdikondori,Arian Noori,Alireza Dadgarnia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:developing robust speech, increasingly important research, important research area, robust speech recognition, speech recognition systems
备注: 12 pages, 6 figures
点击查看摘要
Abstract:Lipreading has emerged as an increasingly important research area for developing robust speech recognition systems and assistive technologies for the hearing-impaired. However, non-English resources for visual speech recognition remain limited. We introduce LRW-Persian, the largest in-the-wild Persian word-level lipreading dataset, comprising $743$ target words and over $414{,}000$ video samples extracted from more than $1{,}900$ hours of footage across $67$ television programs. Designed as a benchmark-ready resource, LRW-Persian provides speaker-disjoint training and test splits, wide regional and dialectal coverage, and rich per-clip metadata including head pose, age, and gender. To ensure large-scale data quality, we establish a fully automated end-to-end curation pipeline encompassing transcription based on Automatic Speech Recognition(ASR), active-speaker localization, quality filtering, and pose/mask screening. We further fine-tune two widely used lipreading architectures on LRW-Persian, establishing reference performance and demonstrating the difficulty of Persian visual speech recognition. By filling a critical gap in low-resource languages, LRW-Persian enables rigorous benchmarking, supports cross-lingual transfer, and provides a foundation for advancing multimodal speech research in underrepresented linguistic contexts. The dataset is publicly available at: this https URL.
100. 【2510.22706】IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
链接:https://arxiv.org/abs/2510.22706
作者:Hao Li,Zhengyu Zou,Fangfu Liu,Xuanyang Zhang,Fangzhou Hong,Yukang Cao,Yushi Lan,Manyuan Zhang,Gang Yu,Dingwen Zhang,Ziwei Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Humans naturally perceive, Humans naturally, world as intertwined, intertwined dimensions, naturally perceive
备注: [this https URL](https://github.com/lifuguan/IGGT_official)
点击查看摘要
Abstract:Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose InstanceGrounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.
101. 【2510.22702】Atlas Urban Index: A VLM-Based Approach for Spatially and Temporally Calibrated Urban Development Monitoring
链接:https://arxiv.org/abs/2510.22702
作者:Mithul Chander,Sai Pragnya Ranga,Prathamesh Mayekar
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Image and Video Processing (eess.IV)
关键词:Atlas Urban Index, Difference Built-up Index, Normalized Difference Built-up, Atlas Urban, measuring urban development
备注: An abridged version of this paper will be presented at and appear in the Proceedings of ACM IKDD CODS 2025
点击查看摘要
Abstract:We introduce the {\em Atlas Urban Index} (AUI), a metric for measuring urban development computed using Sentinel-2 \citep{spoto2012sentinel2} satellite imagery. Existing approaches, such as the {\em Normalized Difference Built-up Index} (NDBI), often struggle to accurately capture urban development due to factors like atmospheric noise, seasonal variation, and cloud cover. These limitations hinder large-scale monitoring of human development and urbanization. To address these challenges, we propose an approach that leverages {\em Vision-Language Models }(VLMs) to provide a development score for regions. Specifically, we collect a time series of Sentinel-2 images for each region. Then, we further process the images within fixed time windows to get an image with minimal cloud cover, which serves as the representative image for that time window. To ensure consistent scoring, we adopt two strategies: (i) providing the VLM with a curated set of reference images representing different levels of urbanization, and (ii) supplying the most recent past image to both anchor temporal consistency and mitigate cloud-related noise in the current image. Together, these components enable AUI to overcome the challenges of traditional urbanization indices and produce more reliable and stable development scores. Our qualitative experiments on Bangalore suggest that AUI outperforms standard indices such as NDBI.
102. 【2510.22697】WaveMAE: Wavelet decomposition Masked Auto-Encoder for Remote Sensing
链接:https://arxiv.org/abs/2510.22697
作者:Vittorio Bernuzzi,Leonardo Rossi,Tomaso Fontanini,Massimo Bertozzi,Andrea Prati
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:annotated data limits, fully supervised approaches, Self-supervised learning, Discrete Wavelet Transform, recently emerged
备注:
点击查看摘要
Abstract:Self-supervised learning (SSL) has recently emerged as a key strategy for building foundation models in remote sensing, where the scarcity of annotated data limits the applicability of fully supervised approaches. In this work, we introduce WaveMAE, a masked autoencoding framework tailored for multispectral satellite imagery. Unlike conventional pixel-based reconstruction, WaveMAE leverages a multi-level Discrete Wavelet Transform (DWT) to disentangle frequency components and guide the encoder toward learning scale-aware high-frequency representations. We further propose a Geo-conditioned Positional Encoding (GPE), which incorporates geographical priors via Spherical Harmonics, encouraging embeddings that respect both semantic and geospatial structure. To ensure fairness in evaluation, all methods are pretrained on the same dataset (fMoW-S2) and systematically evaluated on the diverse downstream tasks of the PANGAEA benchmark, spanning semantic segmentation, regression, change detection, and multilabel classification. Extensive experiments demonstrate that WaveMAE achieves consistent improvements over prior state-of-the-art approaches, with substantial gains on segmentation and regression benchmarks. The effectiveness of WaveMAE pretraining is further demonstrated by showing that even a lightweight variant, containing only 26.4% of the parameters, achieves state-of-the-art performance. Our results establish WaveMAE as a strong and geographically informed foundation model for multispectral remote sensing imagery.
103. 【2510.22694】Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2510.22694
作者:Shu Zhao,Tianyi Shen,Nilesh Ahuja,Omesh Tickoo,Vijaykrishnan Narayanan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Large Language Models, Multimodal Large Language, Multimodal Retrieval-Augmented Generation, Language Models, Multimodal Large
备注: Accepted at NeurIPS 2025 UniReps Workshop
点击查看摘要
Abstract:Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising method to generate factual and up-to-date responses of Multimodal Large Language Models (MLLMs) by incorporating non-parametric knowledge from external knowledge bases. However, existing MRAG approaches suffer from static retrieval strategies, inflexible modality selection, and suboptimal utilization of retrieved information, leading to three critical challenges: determining when to retrieve, what modality to incorporate, and how to utilize retrieved information effectively. To address these challenges, we introduce Windsock, a query-dependent module making decisions on retrieval necessity and modality selection, effectively reducing computational overhead and improving response quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction Tuning, an adaptive training strategy that enhances MLLMs' ability to utilize retrieved information while maintaining robustness against noise. Moreover, we adopt a self-assessment approach leveraging knowledge within MLLMs to convert question-answering datasets to MRAG training datasets. Extensive experiments demonstrate that our proposed method significantly improves the generation quality by 17.07% while reducing 8.95% retrieval times.
104. 【2510.22693】VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree
链接:https://arxiv.org/abs/2510.22693
作者:Wenlong Li,Yifei Xu,Yuan Rao,Zhenhua Wang,Shuiguang Deng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:focuses on identifying, Generic Event, generic event nodes, Hierarchical Granularityaware Tree, Event Boundary Detection
备注: NeurIPS 2025 Camera Ready
点击查看摘要
Abstract:Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree that utilizes a Hierarchical Granularityaware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at this https URL.
105. 【2510.22684】RoboSVG: A Unified Framework for Interactive SVG Generation with Multi-modal Guidance
链接:https://arxiv.org/abs/2510.22684
作者:Jiuniu Wang,Gongjie Zhang,Quanhao Qian,Junlong Gao,Deli Zhao,Ran Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Scalable Vector Graphics, Scalable Vector, Vector Graphics, robot control, fundamental to digital
备注: 15 pages, 5 figures
点击查看摘要
Abstract:Scalable Vector Graphics (SVGs) are fundamental to digital design and robot control, encoding not only visual structure but also motion paths in interactive drawings. In this work, we introduce RoboSVG, a unified multimodal framework for generating interactive SVGs guided by textual, visual, and numerical signals. Given an input query, the RoboSVG model first produces multimodal guidance, then synthesizes candidate SVGs through dedicated generation modules, and finally refines them under numerical guidance to yield high-quality outputs. To support this framework, we construct RoboDraw, a large-scale dataset of one million examples, each pairing an SVG generation condition (e.g., text, image, and partial SVG) with its corresponding ground-truth SVG code. RoboDraw dataset enables systematic study of four tasks, including basic generation (Text-to-SVG, Image-to-SVG) and interactive generation (PartialSVG-to-SVG, PartialImage-to-SVG). Extensive experiments demonstrate that RoboSVG achieves superior query compliance and visual fidelity across tasks, establishing a new state of the art in versatile SVG generation. The dataset and source code of this project will be publicly available soon.
106. 【2510.22683】Estimation of Fireproof Structure Class and Construction Year for Disaster Risk Assessment
链接:https://arxiv.org/abs/2510.22683
作者:Hibiki Ayabe,Kazushi Okamoto,Koki Karube,Atsushi Shibata,Kei Harada
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:pricing in Japan, disaster risk assessment, risk assessment, Structural fireproof, Structural fireproof classification
备注:
点击查看摘要
Abstract:Structural fireproof classification is vital for disaster risk assessment and insurance pricing in Japan. However, key building metadata such as construction year and structure type are often missing or outdated, particularly in the second-hand housing market. This study proposes a multi-task learning model that predicts these attributes from facade images. The model jointly estimates the construction year, building structure, and property type, from which the structural fireproof class - defined as H (non-fireproof), T (semi-fireproof), or M (fireproof) - is derived via a rule-based mapping based on official insurance criteria. We trained and evaluated the model using a large-scale dataset of Japanese residential images, applying rigorous filtering and deduplication. The model achieved high accuracy in construction-year regression and robust classification across imbalanced categories. Qualitative analyses show that it captures visual cues related to building age and materials. Our approach demonstrates the feasibility of scalable, interpretable, image-based risk-profiling systems, offering potential applications in insurance, urban planning, and disaster preparedness.
107. 【2510.22675】DAMap: Distance-aware MapNet for High Quality HD Map Construction
链接:https://arxiv.org/abs/2510.22675
作者:Jinpeng Dong,Chen Li,Yutong Lin,Jingwen Fu,Sanping Zhou,Nanning Zheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Predicting High-definition, autonomous driving vehicles, localization scores, driving vehicles, high quality
备注: Accepted to ICCV2025
点击查看摘要
Abstract:Predicting High-definition (HD) map elements with high quality (high classification and localization scores) is crucial to the safety of autonomous driving vehicles. However, current methods perform poorly in high quality predictions due to inherent task misalignment. Two main factors are responsible for misalignment: 1) inappropriate task labels due to one-to-many matching queries sharing the same labels, and 2) sub-optimal task features due to task-shared sampling mechanism. In this paper, we reveal two inherent defects in current methods and develop a novel HD map construction method named DAMap to address these problems. Specifically, DAMap consists of three components: Distance-aware Focal Loss (DAFL), Hybrid Loss Scheme (HLS), and Task Modulated Deformable Attention (TMDA). The DAFL is introduced to assign appropriate classification labels for one-to-many matching samples. The TMDA is proposed to obtain discriminative task-specific features. Furthermore, the HLS is proposed to better utilize the advantages of the DAFL. We perform extensive experiments and consistently achieve performance improvement on the NuScenes and Argoverse2 benchmarks under different metrics, baselines, splits, backbones, and schedules. Code will be available at this https URL.
108. 【2510.22673】Alias-Free ViT: Fractional Shift Invariance via Linear Attention
链接:https://arxiv.org/abs/2510.22673
作者:Hagay Michaeli,Daniel Soudry
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:architectural inductive bias, Vision Transformers, vision tasks, lack the architectural, architectural inductive
备注: Accepted at NeurIPS 2025. Code is available at [this https URL](https://github.com/hmichaeli/alias_free_vit)
点击查看摘要
Abstract:Transformers have emerged as a competitive alternative to convnets in vision tasks, yet they lack the architectural inductive bias of convnets, which may hinder their potential performance. Specifically, Vision Transformers (ViTs) are not translation-invariant and are more sensitive to minor image translations than standard convnets. Previous studies have shown, however, that convnets are also not perfectly shift-invariant, due to aliasing in downsampling and nonlinear layers. Consequently, anti-aliasing approaches have been proposed to certify convnets' translation robustness. Building on this line of work, we propose an Alias-Free ViT, which combines two main components. First, it uses alias-free downsampling and nonlinearities. Second, it uses linear cross-covariance attention that is shift-equivariant to both integer and fractional translations, enabling a shift-invariant global representation. Our model maintains competitive performance in image classification and outperforms similar-sized models in terms of robustness to adversarial translations.
109. 【2510.22672】Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
链接:https://arxiv.org/abs/2510.22672
作者:Anna Deichler,Jonas Beskow
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
关键词:studying referential communication, Meta Project Aria, exocentric perspectives, Project Aria smart, communication across egocentric
备注: 10 pages, 6 figures, 2 tables. Accepted to the NeurIPS 2025 Workshop on SPACE in Vision, Language, and Embodied AI (SpaVLE)
点击查看摘要
Abstract:We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
110. 【2510.22669】LVD-GS: Gaussian Splatting SLAM for Dynamic Scenes via Hierarchical Explicit-Implicit Representation Collaboration Rendering
链接:https://arxiv.org/abs/2510.22669
作者:Wenkai Zhu,Xu Li,Qimin Xu,Benwu Wang,Kun Wei,Yiming Peng,Zihang Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Gaussian Splatting SLAM, Gaussian Splatting, Splatting SLAM, Splatting SLAM system, spatial intelligence
备注:
点击查看摘要
Abstract:3D Gaussian Splatting SLAM has emerged as a widely used technique for high-fidelity mapping in spatial intelligence. However, existing methods often rely on a single representation scheme, which limits their performance in large-scale dynamic outdoor scenes and leads to cumulative pose errors and scale ambiguity. To address these challenges, we propose \textbf{LVD-GS}, a novel LiDAR-Visual 3D Gaussian Splatting SLAM system. Motivated by the human chain-of-thought process for information seeking, we introduce a hierarchical collaborative representation module that facilitates mutual reinforcement for mapping optimization, effectively mitigating scale drift and enhancing reconstruction robustness. Furthermore, to effectively eliminate the influence of dynamic objects, we propose a joint dynamic modeling module that generates fine-grained dynamic masks by fusing open-world segmentation with implicit residual constraints, guided by uncertainty estimates from DINO-Depth features. Extensive evaluations on KITTI, nuScenes, and self-collected datasets demonstrate that our approach achieves state-of-the-art performance compared to existing methods.
111. 【2510.22665】SARCLIP: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery
链接:https://arxiv.org/abs/2510.22665
作者:Qiwei Ma,Zhiyu Wang,Wang Liu,Xukun Lu,Bin Deng,Puhong Duan,Xudong Kang,Shutao Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, crucial imaging modality, imaging modality due
备注: 9 pages, 6 figures
点击查看摘要
Abstract:Synthetic Aperture Radar (SAR) has emerged as a crucial imaging modality due to its all-weather capabilities. While recent advancements in self-supervised learning and Masked Image Modeling (MIM) have paved the way for SAR foundation models, these approaches primarily focus on low-level visual features, often overlooking multimodal alignment and zero-shot target recognition within SAR imagery. To address this limitation, we construct SARCLIP-1M, a large-scale vision language dataset comprising over one million text-image pairs aggregated from existing datasets. We further introduce SARCLIP, the first vision language foundation model tailored for the SAR domain. Our SARCLIP model is trained using a contrastive vision language learning approach by domain transferring strategy, enabling it to bridge the gap between SAR imagery and textual descriptions. Extensive experiments on image-text retrieval and zero-shot classification tasks demonstrate the superior performance of SARCLIP in feature extraction and interpretation, significantly outperforming state-of-the-art foundation models and advancing the semantic understanding of SAR imagery. The code and datasets will be released soon.
112. 【2510.22650】Self-Attention Decomposition For Training Free Diffusion Editing
链接:https://arxiv.org/abs/2510.22650
作者:Tharun Anand,Mohammad Hassan Vali,Arno Solin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieve remarkable fidelity, models achieve remarkable, editing remains challenging, remains challenging, Diffusion models achieve
备注: 4 pages (ICASSP Format)
点击查看摘要
Abstract:Diffusion models achieve remarkable fidelity in image synthesis, yet precise control over their outputs for targeted editing remains challenging. A key step toward controllability is to identify interpretable directions in the model's latent representations that correspond to semantic attributes. Existing approaches for finding interpretable directions typically rely on sampling large sets of images or training auxiliary networks, which limits efficiency. We propose an analytical method that derives semantic editing directions directly from the pretrained parameters of diffusion models, requiring neither additional data nor fine-tuning. Our insight is that self-attention weight matrices encode rich structural information about the data distribution learned during training. By computing the eigenvectors of these weight matrices, we obtain robust and interpretable editing directions. Experiments demonstrate that our method produces high-quality edits across multiple datasets while reducing editing time significantly by 60% over current benchmarks.
113. 【2510.22647】A Critical Study on Tea Leaf Disease Detection using Deep Learning Techniques
链接:https://arxiv.org/abs/2510.22647
作者:Nabajyoti Borah,Raju Moni Borah,Bandan Boruah,Purnendu Bikash Acharjee,Sajal Saha,Ripjyoti Hazarika
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Deep Learning Technique, Red spider mite, Learning Technique, Faster R-CNN, SSD MobileNet
备注:
点击查看摘要
Abstract:The proposed solution is Deep Learning Technique that will be able classify three types of tea leaves diseases from which two diseases are caused by the pests and one due to pathogens (infectious organisms) and environmental conditions and also show the area damaged by a disease in leaves. Namely Red Rust, Helopeltis and Red spider mite respectively. In this paper we have evaluated two models namely SSD MobileNet V2 and Faster R-CNN ResNet50 V1 for the object detection. The SSD MobileNet V2 gave precision of 0.209 for IOU range of 0.50:0.95 with recall of 0.02 on IOU 0.50:0.95 and final mAP of 20.9%. While Faster R-CNN ResNet50 V1 has precision of 0.252 on IOU range of 0.50:0.95 and recall of 0.044 on IOU of 0.50:0.95 with a mAP of 25%, which is better than SSD. Also used Mask R-CNN for Object Instance Segmentation where we have implemented our custom method to calculate the damaged diseased portion of leaves. Keywords: Tea Leaf Disease, Deep Learning, Red Rust, Helopeltis and Red Spider Mite, SSD MobileNet V2, Faster R-CNN ResNet50 V1 and Mask RCNN.
114. 【2510.22630】Robust Atypical Mitosis Classification with DenseNet121: Stain-Aware Augmentation and Hybrid Loss for Domain Generalization
链接:https://arxiv.org/abs/2510.22630
作者:Adinath Dukre,Ankan Deria,Yutong Xie,Imran Razzak
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:reliable recognition remains, recognition remains challenging, remains challenging due, severe class imbalance, Atypical mitotic figures
备注: MIDOG 2025 MICCAI Workshop accepted
点击查看摘要
Abstract:Atypical mitotic figures are important biomarkers of tumor aggressiveness in histopathology, yet reliable recognition remains challenging due to severe class imbalance and variability across imaging domains. We present a DenseNet-121-based framework tailored for atypical mitosis classification in the MIDOG 2025 (Track 2) setting. Our method integrates stain-aware augmentation (Macenko), geometric and intensity transformations, and imbalance-aware learning via weighted sampling with a hybrid objective combining class-weighted binary cross-entropy and focal loss. Trained end-to-end with AdamW and evaluated across multiple independent domains, the model demonstrates strong generalization under scanner and staining shifts, achieving balanced accuracy 85.0%, AUROC 0.927, sensitivity 89.2%, and specificity 80.9% on the official test set. These results indicate that combining DenseNet-121 with stain-aware augmentation and imbalance-adaptive objectives yields a robust, domain-generalizable framework for atypical mitosis classification suitable for real-world computational pathology workflows.
115. 【2510.22622】DeepfakeBench-MM: A Comprehensive Benchmark for Multimodal Deepfake Detection
链接:https://arxiv.org/abs/2510.22622
作者:Kangran Zhao,Yupeng Chen,Xiaoyu Zhang,Yize Chen,Weinan Guan,Baicheng Chen,Chengzhe Sun,Soumyya Kanti Datta,Qingshan Liu,Siwei Lyu,Baoyuan Wu
类目:Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:human-centric audiovisual content, substantial societal risks, poses substantial societal, forged human-centric audiovisual, multimodal deepfake
备注: Preprint
点击查看摘要
Abstract:The misuse of advanced generative AI models has resulted in the widespread proliferation of falsified data, particularly forged human-centric audiovisual content, which poses substantial societal risks (e.g., financial fraud and social instability). In response to this growing threat, several works have preliminarily explored countermeasures. However, the lack of sufficient and diverse training data, along with the absence of a standardized benchmark, hinder deeper exploration. To address this challenge, we first build Mega-MMDF, a large-scale, diverse, and high-quality dataset for multimodal deepfake detection. Specifically, we employ 21 forgery pipelines through the combination of 10 audio forgery methods, 12 visual forgery methods, and 6 audio-driven face reenactment methods. Mega-MMDF currently contains 0.1 million real samples and 1.1 million forged samples, making it one of the largest and most diverse multimodal deepfake datasets, with plans for continuous expansion. Building on it, we present DeepfakeBench-MM, the first unified benchmark for multimodal deepfake detection. It establishes standardized protocols across the entire detection pipeline and serves as a versatile platform for evaluating existing methods as well as exploring novel approaches. DeepfakeBench-MM currently supports 5 datasets and 11 multimodal deepfake detectors. Furthermore, our comprehensive evaluations and in-depth analyses uncover several key findings from multiple perspectives (e.g., augmentation, stacked forgery). We believe that DeepfakeBench-MM, together with our large-scale Mega-MMDF, will serve as foundational infrastructures for advancing multimodal deepfake detection.
116. 【2510.22618】Cross-Species Transfer Learning in Agricultural AI: Evaluating ZebraPose Adaptation for Dairy Cattle Pose Estimation
链接:https://arxiv.org/abs/2510.22618
作者:Mackenzie Tapp,Sibi Chakravarthy Parivendan,Kashfia Sailunaz,Suresh Neethirajan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Pose estimation serves, understanding animal posture, Pose estimation, animal posture, estimation serves
备注: 20 pages, 11 figures, 6 Tables
点击查看摘要
Abstract:Pose estimation serves as a cornerstone of computer vision for understanding animal posture, behavior, and welfare. Yet, agricultural applications remain constrained by the scarcity of large, annotated datasets for livestock, especially dairy cattle. This study evaluates the potential and limitations of cross-species transfer learning by adapting ZebraPose - a vision transformer-based model trained on synthetic zebra imagery - for 27-keypoint detection in dairy cows under real barn conditions. Using three configurations - a custom on-farm dataset (375 images, Sussex, New Brunswick, Canada), a subset of the APT-36K benchmark dataset, and their combination, we systematically assessed model accuracy and generalization across environments. While the combined model achieved promising performance (AP = 0.86, AR = 0.87, PCK 0.5 = 0.869) on in-distribution data, substantial generalization failures occurred when applied to unseen barns and cow populations. These findings expose the synthetic-to-real domain gap as a major obstacle to agricultural AI deployment and emphasize that morphological similarity between species is insufficient for cross-domain transfer. The study provides practical insights into dataset diversity, environmental variability, and computational constraints that influence real-world deployment of livestock monitoring systems. We conclude with a call for agriculture-first AI design, prioritizing farm-level realism, cross-environment robustness, and open benchmark datasets to advance trustworthy and scalable animal-centric technologies.
117. 【2510.22607】SWAN: Self-supervised Wavelet Neural Network for Hyperspectral Image Unmixing
链接:https://arxiv.org/abs/2510.22607
作者:Yassh Ramchandani,Vijayashekhar S S,Jignesh S. Bhatt
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:present SWAN, Biorthogonal wavelet basis, joint estimation, input wavelet coefficients, Biorthogonal wavelet
备注:
点击查看摘要
Abstract:In this article, we present SWAN: a three-stage, self-supervised wavelet neural network for joint estimation of endmembers and abundances from hyperspectral imagery. The contiguous and overlapping hyperspectral band images are first expanded to Biorthogonal wavelet basis space that provides sparse, distributed, and multi-scale representations. The idea is to exploit latent symmetries from thus obtained invariant and covariant features using a self-supervised learning paradigm. The first stage, SWANencoder maps the input wavelet coefficients to a compact lower-dimensional latent space. The second stage, SWANdecoder uses the derived latent representation to reconstruct the input wavelet coefficients. Interestingly, the third stage SWANforward learns the underlying physics of the hyperspectral image. A three-stage combined loss function is formulated in the image acquisition domain that eliminates the need for ground truth and enables self-supervised training. Adam is employed for optimizing the proposed loss function, while Sigmoid with a dropout of 0.3 is incorporated to avoid possible overfitting. Kernel regularizers bound the magnitudes and preserve spatial variations in the estimated endmember coefficients. The output of SWANencoder represents estimated abundance maps during inference, while weights of SWANdecoder are retrieved to extract endmembers. Experiments are conducted on two benchmark synthetic data sets with different signal-to-noise ratios as well as on three real benchmark hyperspectral data sets while comparing the results with several state-of-the-art neural network-based unmixing methods. The qualitative, quantitative, and ablation results show performance enhancement by learning a resilient unmixing function as well as promoting self-supervision and compact network parameters for practical applications.
118. 【2510.22605】Projection Embedded Diffusion Bridge for CT Reconstruction from Incomplete Data
链接:https://arxiv.org/abs/2510.22605
作者:Yuang Wang,Pengfei Jin,Siyeop Yoon,Matthew Tivnan,Shaoyang Zhang,Li Zhang,Quanzheng Li,Zhiqiang Chen,Dufan Wu
类目:Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
关键词:Filtered Back Projection, projection data, remains challenging due, incomplete projection data, data remains challenging
备注: 53 pages, 7 figures, submitted to Medical Image Analysis
点击查看摘要
Abstract:Reconstructing CT images from incomplete projection data remains challenging due to the ill-posed nature of the problem. Diffusion bridge models have recently shown promise in restoring clean images from their corresponding Filtered Back Projection (FBP) reconstructions, but incorporating data consistency into these models remains largely underexplored. Incorporating data consistency can improve reconstruction fidelity by aligning the reconstructed image with the observed projection data, and can enhance detail recovery by integrating structural information contained in the projections. In this work, we propose the Projection Embedded Diffusion Bridge (PEDB). PEDB introduces a novel reverse stochastic differential equation (SDE) to sample from the distribution of clean images conditioned on both the FBP reconstruction and the incomplete projection data. By explicitly conditioning on the projection data in sampling the clean images, PEDB naturally incorporates data consistency. We embed the projection data into the score function of the reverse SDE. Under certain assumptions, we derive a tractable expression for the posterior score. In addition, we introduce a free parameter to control the level of stochasticity in the reverse process. We also design a discretization scheme for the reverse SDE to mitigate discretization error. Extensive experiments demonstrate that PEDB achieves strong performance in CT reconstruction from three types of incomplete data, including sparse-view, limited-angle, and truncated projections. For each of these types, PEDB outperforms evaluated state-of-the-art diffusion bridge models across standard, noisy, and domain-shift evaluations.
119. 【2510.22589】PSScreen V2: Partially Supervised Multiple Retinal Disease Screening
链接:https://arxiv.org/abs/2510.22589
作者:Boyi Zheng,Yalin Zheng,Hrvoje Bogunović,Qing Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:retinal disease screening, partially supervised self-training, multiple retinal disease, supervised self-training framework, disease screening
备注:
点击查看摘要
Abstract:In this work, we propose PSScreen V2, a partially supervised self-training framework for multiple retinal disease screening. Unlike previous methods that rely on fully labelled or single-domain datasets, PSScreen V2 is designed to learn from multiple partially labelled datasets with different distributions, addressing both label absence and domain shift challenges. To this end, PSScreen V2 adopts a three-branch architecture with one teacher and two student networks. The teacher branch generates pseudo labels from weakly augmented images to address missing labels, while the two student branches introduce novel feature augmentation strategies: Low-Frequency Dropout (LF-Dropout), which enhances domain robustness by randomly discarding domain-related low-frequency components, and Low-Frequency Uncertainty (LF-Uncert), which estimates uncertain domain variability via adversarially learned Gaussian perturbations of low-frequency statistics. Extensive experiments on multiple in-domain and out-of-domain fundus datasets demonstrate that PSScreen V2 achieves state-of-the-art performance and superior domain generalization ability. Furthermore, compatibility tests with diverse backbones, including the vision foundation model DINOv2, as well as evaluations on chest X-ray datasets, highlight the universality and adaptability of the proposed framework. The codes are available at this https URL.
120. 【2510.22582】Cross-View UAV Geo-Localization with Precision-Focused Efficient Design: A Hierarchical Distillation Approach with Multi-view Refinement
链接:https://arxiv.org/abs/2510.22582
作者:Jian Sun,Kangdao Liu,Chi Zhang,Chuangquan Chen,Junge Shen,Chi-Man Vong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:geo-tagged satellite databases, matching aerial images, enables UAV localization, Cross-view geo-localization, satellite databases
备注:
点击查看摘要
Abstract:Cross-view geo-localization (CVGL) enables UAV localization by matching aerial images to geo-tagged satellite databases, which is critical for autonomous navigation in GNSS-denied environments. However, existing methods rely on resource-intensive fine-grained feature extraction and alignment, where multiple branches and modules significantly increase inference costs, limiting their deployment on edge devices. We propose Precision-Focused Efficient Design (PFED), a resource-efficient framework combining hierarchical knowledge transfer and multi-view representation refinement. This innovative method comprises two key components: 1) During training, Hierarchical Distillation paradigm for fast and accurate CVGL (HD-CVGL), coupled with Uncertainty-Aware Prediction Alignment (UAPA) to distill essential information and mitigate the data imbalance without incurring additional inference overhead. 2) During inference, an efficient Multi-view Refinement Module (MRM) leverages mutual information to filter redundant samples and effectively utilize the multi-view data. Extensive experiments show that PFED achieves state-of-the-art performance in both accuracy and efficiency, reaching 97.15\% Recall@1 on University-1652 while being over $5 \times$ more efficient in FLOPs and $3 \times$ faster than previous top methods. Furthermore, PFED runs at 251.5 FPS on the AGX Orin edge device, demonstrating its practical viability for real-time UAV applications. The project is available at this https URL
121. 【2510.22577】From Pixels to Views: Learning Angular-Aware and Physics-Consistent Representations for Light Field Microscopy
链接:https://arxiv.org/abs/2510.22577
作者:Feng He,Guodong Tan,Qiankun Li,Jun Yu,Quan Wen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:high temporal resolution, single-exposure volumetric imaging, Light field microscopy, large-scale neural imaging, temporal resolution
备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Light field microscopy (LFM) has become an emerging tool in neuroscience for large-scale neural imaging in vivo, notable for its single-exposure volumetric imaging, broad field of view, and high temporal resolution. However, learning-based 3D reconstruction in XLFM remains underdeveloped due to two core challenges: the absence of standardized datasets and the lack of methods that can efficiently model its angular-spatial structure while remaining physically grounded. We address these challenges by introducing three key contributions. First, we construct the XLFM-Zebrafish benchmark, a large-scale dataset and evaluation suite for XLFM reconstruction. Second, we propose Masked View Modeling for Light Fields (MVN-LF), a self-supervised task that learns angular priors by predicting occluded views, improving data efficiency. Third, we formulate the Optical Rendering Consistency Loss (ORC Loss), a differentiable rendering constraint that enforces alignment between predicted volumes and their PSF-based forward projections. On the XLFM-Zebrafish benchmark, our method improves PSNR by 7.7% over state-of-the-art baselines.
122. 【2510.22575】MELDAE: A Framework for Micro-Expression Spotting, Detection, and Automatic Evaluation in In-the-Wild Conversational Scenes
链接:https://arxiv.org/abs/2510.22575
作者:Yigui Feng,Qinglin Wang,Yang Liu,Ke Liu,Haotian Mo,Enhao Huang,Gencheng Liu,Mingzhe Liu,Jie Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurately analyzing spontaneous, true human emotions, revealing true human, task remains challenging, Accurately analyzing
备注:
点击查看摘要
Abstract:Accurately analyzing spontaneous, unconscious micro-expressions is crucial for revealing true human emotions, but this task remains challenging in wild scenarios, such as natural conversation. Existing research largely relies on datasets from controlled laboratory environments, and their performance degrades dramatically in the real world. To address this issue, we propose three contributions: the first micro-expression dataset focused on conversational-in-the-wild scenarios; an end-to-end localization and detection framework, MELDAE; and a novel boundary-aware loss function that improves temporal accuracy by penalizing onset and offset errors. Extensive experiments demonstrate that our framework achieves state-of-the-art results on the WDMD dataset, improving the key F1_{DR} localization metric by 17.72% over the strongest baseline, while also demonstrating excellent generalization capabilities on existing benchmarks.
123. 【2510.22571】STATUS Bench: A Rigorous Benchmark for Evaluating Object State Understanding in Vision-Language Models
链接:https://arxiv.org/abs/2510.22571
作者:Mahiro Ukai,Shuhei Kurita,Nakamasa Inoue
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
关键词:STATUS Bench, Object state, STATUS Bench introduces, STATUS, open or closed
备注:
点击查看摘要
Abstract:Object state recognition aims to identify the specific condition of objects, such as their positional states (e.g., open or closed) and functional states (e.g., on or off). While recent Vision-Language Models (VLMs) are capable of performing a variety of multimodal tasks, it remains unclear how precisely they can identify object states. To alleviate this issue, we introduce the STAte and Transition UnderStanding Benchmark (STATUS Bench), the first benchmark for rigorously evaluating the ability of VLMs to understand subtle variations in object states in diverse situations. Specifically, STATUS Bench introduces a novel evaluation scheme that requires VLMs to perform three tasks simultaneously: object state identification (OSI), image retrieval (IR), and state change identification (SCI). These tasks are defined over our fully hand-crafted dataset involving image pairs, their corresponding object state descriptions and state change descriptions. Furthermore, we introduce a large-scale training dataset, namely STATUS Train, which consists of 13 million semi-automatically created descriptions. This dataset serves as the largest resource to facilitate further research in this area. In our experiments, we demonstrate that STATUS Bench enables rigorous consistency evaluation and reveal that current state-of-the-art VLMs still significantly struggle to capture subtle object state distinctions. Surprisingly, under the proposed rigorous evaluation scheme, most open-weight VLMs exhibited chance-level zero-shot performance. After fine-tuning on STATUS Train, Qwen2.5-VL achieved performance comparable to Gemini 2.0 Flash. These findings underscore the necessity of STATUS Bench and Train for advancing object state recognition in VLM research.
124. 【2510.22534】SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning
链接:https://arxiv.org/abs/2510.22534
作者:Chen Chen,Majid Abdolshah,Violetta Shevchenko,Hongdong Li,Chang Xu,Pulak Purkait
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Existing diffusion-based super-resolution, Existing diffusion-based, exhibit semantic ambiguities, semantic ambiguities due, diffusion-based super-resolution approaches
备注: Accepted at NeurIPS 2025
点击查看摘要
Abstract:Existing diffusion-based super-resolution approaches often exhibit semantic ambiguities due to inaccuracies and incompleteness in their text conditioning, coupled with the inherent tendency for cross-attention to divert towards irrelevant pixels. These limitations can lead to semantic misalignment and hallucinated details in the generated high-resolution outputs. To address these, we propose a novel, plug-and-play spatially re-focused super-resolution (SRSR) framework that consists of two core components: first, we introduce Spatially Re-focused Cross-Attention (SRCA), which refines text conditioning at inference time by applying visually-grounded segmentation masks to guide cross-attention. Second, we introduce a Spatially Targeted Classifier-Free Guidance (STCFG) mechanism that selectively bypasses text influences on ungrounded pixels to prevent hallucinations. Extensive experiments on both synthetic and real-world datasets demonstrate that SRSR consistently outperforms seven state-of-the-art baselines in standard fidelity metrics (PSNR and SSIM) across all datasets, and in perceptual quality measures (LPIPS and DISTS) on two real-world benchmarks, underscoring its effectiveness in achieving both high semantic fidelity and perceptual quality in super-resolution.
125. 【2510.22529】Bag-of-Word-Groups (BoWG): A Robust and Efficient Loop Closure Detection Method Under Perceptual Aliasing
链接:https://arxiv.org/abs/2510.22529
作者:Xiang Fei,Tina Tian,Howie Choset,Lu Li
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:ensure global mapping, Simultaneous Localization, reduce accumulative drift, global mapping consistency, critical in Simultaneous
备注: This paper has been accepted by IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025
点击查看摘要
Abstract:Loop closure is critical in Simultaneous Localization and Mapping (SLAM) systems to reduce accumulative drift and ensure global mapping consistency. However, conventional methods struggle in perceptually aliased environments, such as narrow pipes, due to vector quantization, feature sparsity, and repetitive textures, while existing solutions often incur high computational costs. This paper presents Bag-of-Word-Groups (BoWG), a novel loop closure detection method that achieves superior precision-recall, robustness, and computational efficiency. The core innovation lies in the introduction of word groups, which captures the spatial co-occurrence and proximity of visual words to construct an online dictionary. Additionally, drawing inspiration from probabilistic transition models, we incorporate temporal consistency directly into similarity computation with an adaptive scheme, substantially improving precision-recall performance. The method is further strengthened by a feature distribution analysis module and dedicated post-verification mechanisms. To evaluate the effectiveness of our method, we conduct experiments on both public datasets and a confined-pipe dataset we constructed. Results demonstrate that BoWG surpasses state-of-the-art methods, including both traditional and learning-based approaches, in terms of precision-recall and computational efficiency. Our approach also exhibits excellent scalability, achieving an average processing time of 16 ms per image across 17,565 images in the Bicocca25b dataset.
126. 【2510.22528】AesCrop: Aesthetic-driven Cropping Guided by Composition
链接:https://arxiv.org/abs/2510.22528
作者:Yen-Hong Wong,Lai-Kuan Wong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:impacts user engagement, significantly impacts user, appeal significantly impacts, Aesthetic-driven image cropping, visual appeal significantly
备注: Accepted at the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025
点击查看摘要
Abstract:Aesthetic-driven image cropping is crucial for applications like view recommendation and thumbnail generation, where visual appeal significantly impacts user engagement. A key factor in visual appeal is composition--the deliberate arrangement of elements within an image. Some methods have successfully incorporated compositional knowledge through evaluation-based and regression-based paradigms. However, evaluation-based methods lack globality while regression-based methods lack diversity. Recently, hybrid approaches that integrate both paradigms have emerged, bridging the gap between these two to achieve better diversity and globality. Notably, existing hybrid methods do not incorporate photographic composition guidance, a key attribute that defines photographic aesthetics. In this work, we introduce AesCrop, a composition-aware hybrid image-cropping model that integrates a VMamba image encoder, augmented with a novel Mamba Composition Attention Bias (MCAB) and a transformer decoder to perform end-to-end rank-based image cropping, generating multiple crops along with the corresponding quality scores. By explicitly encoding compositional cues into the attention mechanism, MCAB directs AesCrop to focus on the most compositionally salient regions. Extensive experiments demonstrate that AesCrop outperforms current state-of-the-art methods, delivering superior quantitative metrics and qualitatively more pleasing crops.
127. 【2510.22521】Open Multimodal Retrieval-Augmented Factual Image Generation
链接:https://arxiv.org/abs/2510.22521
作者:Yang Tian,Fan Liu,Jingyuan Zhang,Wei Bi,Yupeng Hu,Liqiang Nie
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Large Multimodal Models, achieved remarkable progress, involve fine-grained attributes, contradict verifiable knowledge, prompts involve fine-grained
备注: Preprint
点击查看摘要
Abstract:Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.
128. 【2510.22507】GateFuseNet: An Adaptive 3D Multimodal Neuroimaging Fusion Network for Parkinson's Disease Diagnosis
链接:https://arxiv.org/abs/2510.22507
作者:Rui Jin,Chen Chen,Yin Liu,Hongfu Sun,Min Zeng,Min Li,Yang Gao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:MRI remains challenging, remains challenging due, Quantitative Susceptibility Mapping, Parkinson disease, Accurate diagnosis
备注: The first two authors contributed equally to this work. Correspondence to: Yang Gao, E-mail: [this http URL](http://yang.gao) @csu. [this http URL](http://edu.cn)
点击查看摘要
Abstract:Accurate diagnosis of Parkinson's disease (PD) from MRI remains challenging due to symptom variability and pathological heterogeneity. Most existing methods rely on conventional magnitude-based MRI modalities, such as T1-weighted images (T1w), which are less sensitive to PD pathology than Quantitative Susceptibility Mapping (QSM), a phase-based MRI technique that quantifies iron deposition in deep gray matter nuclei. In this study, we propose GateFuseNet, an adaptive 3D multimodal fusion network that integrates QSM and T1w images for PD diagnosis. The core innovation lies in a gated fusion module that learns modality-specific attention weights and channel-wise gating vectors for selective feature modulation. This hierarchical gating mechanism enhances ROI-aware features while suppressing irrelevant signals. Experimental results show that our method outperforms three existing state-of-the-art approaches, achieving 85.00% accuracy and 92.06% AUC. Ablation studies further validate the contributions of ROI guidance, multimodal integration, and fusion positioning. Grad-CAM visualizations confirm the model's focus on clinically relevant pathological regions. The source codes and pretrained models can be found at this https URL
129. 【2510.22491】LAMP: Data-Efficient Linear Affine Weight-Space Models for Parameter-Controlled 3D Shape Generation and Extrapolation
链接:https://arxiv.org/abs/2510.22491
作者:Ghadi Nehme,Yanxia Zhang,Dule Shu,Matt Klenk,Faez Ahmed
类目:Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
关键词:Generating high-fidelity, specific parameter constraints, satisfy specific parameter, Linear Affine Mixing, satisfy specific
备注:
点击查看摘要
Abstract:Generating high-fidelity 3D geometries that satisfy specific parameter constraints has broad applications in design and engineering. However, current methods typically rely on large training datasets and struggle with controllability and generalization beyond the training distributions. To overcome these limitations, we introduce LAMP (Linear Affine Mixing of Parametric shapes), a data-efficient framework for controllable and interpretable 3D generation. LAMP first aligns signed distance function (SDF) decoders by overfitting each exemplar from a shared initialization, then synthesizes new geometries by solving a parameter-constrained mixing problem in the aligned weight space. To ensure robustness, we further propose a safety metric that detects geometry validity via linearity mismatch. We evaluate LAMP on two 3D parametric benchmarks: DrivAerNet++ and BlendedNet. We found that LAMP enables (i) controlled interpolation within bounds with as few as 100 samples, (ii) safe extrapolation by up to 100% parameter difference beyond training ranges, (iii) physics performance-guided optimization under fixed parameters. LAMP significantly outperforms conditional autoencoder and Deep Network Interpolation (DNI) baselines in both extrapolation and data efficiency. Our results demonstrate that LAMP advances controllable, data-efficient, and safe 3D generation for design exploration, dataset generation, and performance-driven optimization.
130. 【2510.22480】Single-Teacher View Augmentation: Boosting Knowledge Distillation via Angular Diversity
链接:https://arxiv.org/abs/2510.22480
作者:Seonghoon Yu,Dongjun Nam,Dina Katabi,Jeany Son
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:lightweight student model, aims to train, train a lightweight, model by transferring, teacher
备注: Accepted to NeurIPS 2025
点击查看摘要
Abstract:Knowledge Distillation (KD) aims to train a lightweight student model by transferring knowledge from a large, high-capacity teacher. Recent studies have shown that leveraging diverse teacher perspectives can significantly improve distillation performance; however, achieving such diversity typically requires multiple teacher networks, leading to high computational costs. In this work, we propose a novel cost-efficient knowledge augmentation method for KD that generates diverse multi-views by attaching multiple branches to a single teacher. To ensure meaningful semantic variation across multi-views, we introduce two angular diversity objectives: 1) constrained inter-angle diversify loss, which maximizes angles between augmented views while preserving proximity to the original teacher output, and 2) intra-angle diversify loss, which encourages an even distribution of views around the original output. The ensembled knowledge from these angularly diverse views, along with the original teacher, is distilled into the student. We further theoretically demonstrate that our objectives increase the diversity among ensemble members and thereby reduce the upper bound of the ensemble's expected loss, leading to more effective distillation. Experimental results show that our method surpasses an existing knowledge augmentation method across diverse configurations. Moreover, the proposed method is compatible with other KD frameworks in a plug-and-play fashion, providing consistent improvements in generalization performance.
131. 【2510.22473】DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss
链接:https://arxiv.org/abs/2510.22473
作者:Jing Yang,Yufeng Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Recent advancements, expanded the capabilities, Gaussian Splatting, Recent, generative models
备注:
点击查看摘要
Abstract:Recent advancements in 2D and 3D generative models have expanded the capabilities of computer vision. However, generating high-quality 4D dynamic content from a single static image remains a significant challenge. Traditional methods have limitations in modeling temporal dependencies and accurately capturing dynamic geometry changes, especially when considering variations in camera perspective. To address this issue, we propose DynaPose4D, an innovative solution that integrates 4D Gaussian Splatting (4DGS) techniques with Category-Agnostic Pose Estimation (CAPE) technology. This framework uses 3D Gaussian Splatting to construct a 3D model from single images, then predicts multi-view pose keypoints based on one-shot support from a chosen view, leveraging supervisory signals to enhance motion consistency. Experimental results show that DynaPose4D achieves excellent coherence, consistency, and fluidity in dynamic motion generation. These findings not only validate the efficacy of the DynaPose4D framework but also indicate its potential applications in the domains of computer vision and animation production.
132. 【2510.22454】SemiETPicker: Fast and Label-Efficient Particle Picking for CryoET Tomography Using Semi-Supervised Learning
链接:https://arxiv.org/abs/2510.22454
作者:Linhan Wang,Jianwen Dou,Wang Li,Shengkun Wang,Zhiwu Xie,Chang-Tien Lu,Yinlin Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Cryogenic Electron Tomography, Cryogenic Electron, Electron Tomography, imaging modality capable, structures inside cells
备注:
点击查看摘要
Abstract:Cryogenic Electron Tomography (CryoET) combined with sub-volume averaging (SVA) is the only imaging modality capable of resolving protein structures inside cells at molecular resolution. Particle picking, the task of localizing and classifying target proteins in 3D CryoET volumes, remains the main bottleneck. Due to the reliance on time-consuming manual labels, the vast reserve of unlabeled tomograms remains underutilized. In this work, we present a fast, label-efficient semi-supervised framework that exploits this untapped data. Our framework consists of two components: (i) an end-to-end heatmap-supervised detection model inspired by keypoint detection, and (ii) a teacher-student co-training mechanism that enhances performance under sparse labeling conditions. Furthermore, we introduce multi-view pseudo-labeling and a CryoET-specific DropBlock augmentation strategy to further boost performance. Extensive evaluations on the large-scale CZII dataset show that our approach improves F1 by 10% over supervised baselines, underscoring the promise of semi-supervised learning for leveraging unlabeled CryoET data.
133. 【2510.22443】Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents
链接:https://arxiv.org/abs/2510.22443
作者:Vijay Veerabadran,Fanyi Xiao,Nitin Kamra,Pedro Matias,Joy Chen,Caley Drooff,Brett D Roads,Riley Williams,Ethan Henderson,Xuanyi Zhao,Kevin Carlberg,Joseph Tighe,Karl Ridgeway
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:wearable form factors, assistive wearable agents, smart glasses, assistive wearable, form factors
备注: Accepted as a spotlight paper at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
点击查看摘要
Abstract:There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) who take assistive actions toward a user's goal/query (e.g. "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multi-modal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results that evaluate several families of modern vision-language models show that larger models perform significantly better on the task, yet remain far from practical usefulness, as they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities.
134. 【2510.22436】3D Roadway Scene Object Detection with LIDARs in Snowfall Conditions
链接:https://arxiv.org/abs/2510.22436
作者:Ghazal Farhani,Taufiq Rahman,Syed Mostaquim Ali,Andrew Liu,Mohamed Zaki,Dominique Charlebois,Benoit Anctil
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:obtain exceptional situational, exceptional situational awareness, Light Detection, autonomous driving systems, roadway environment
备注: 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), pp. 1441--1448, Sept. 2024
点击查看摘要
Abstract:Because 3D structure of a roadway environment can be characterized directly by a Light Detection and Ranging (LiDAR) sensors, they can be used to obtain exceptional situational awareness for assitive and autonomous driving systems. Although LiDARs demonstrate good performance in clean and clear weather conditions, their performance significantly deteriorates in adverse weather conditions such as those involving atmospheric precipitation. This may render perception capabilities of autonomous systems that use LiDAR data in learning based models to perform object detection and ranging ineffective. While efforts have been made to enhance the accuracy of these models, the extent of signal degradation under various weather conditions remains largely not quantified. In this study, we focus on the performance of an automotive grade LiDAR in snowy conditions in order to develop a physics-based model that examines failure modes of a LiDAR sensor. Specifically, we investigated how the LiDAR signal attenuates with different snowfall rates and how snow particles near the source serve as small but efficient reflectors. Utilizing our model, we transform data from clear conditions to simulate snowy scenarios, enabling a comparison of our synthetic data with actual snowy conditions. Furthermore, we employ this synthetic data, representative of different snowfall rates, to explore the impact on a pre-trained object detection model, assessing its performance under varying levels of snowfall
135. 【2510.22431】Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration
链接:https://arxiv.org/abs/2510.22431
作者:Zheng Wei,Mingchen Li,Zeqian Zhang,Ruibin Yuan,Pan Hui,Huamin Qu,James Evans,Maneesh Agrawala,Anyi Rao
类目:Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated significant potential, Recent advancements, long video generation, demonstrated significant, significant potential
备注:
点击查看摘要
Abstract:Recent advancements in multi-agent systems have demonstrated significant potential for enhancing creative task performance, such as long video generation. This study introduces three innovations to improve multi-agent collaboration. First, we propose OmniAgent, a hierarchical, graph-based multi-agent framework for long video generation that leverages a film-production-inspired architecture to enable modular specialization and scalable inter-agent collaboration. Second, inspired by context engineering, we propose hypergraph nodes that enable temporary group discussions among agents lacking sufficient context, reducing individual memory requirements while ensuring adequate contextual information. Third, we transition from directed acyclic graphs (DAGs) to directed cyclic graphs with limited retries, allowing agents to reflect and refine outputs iteratively, thereby improving earlier stages through feedback from subsequent nodes. These contributions lay the groundwork for developing more robust multi-agent systems in creative tasks.
136. 【2510.22391】op-Down Semantic Refinement for Image Captioning
链接:https://arxiv.org/abs/2510.22391
作者:Jusheng Zhang,Kaitong Cai,Jing Yang,Jian Wang,Chengpei Tang,Keze Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Vision-Language Models, myopic decision-making process, powerful single-step generation, single-step generation capabilities, Large Vision-Language
备注:
点击查看摘要
Abstract:Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visual-guided parallel expansion and a lightweight value network, our TDSR reduces the call frequency to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that our TDSR, as a plug-and-play module, can significantly enhance the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL) by achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.
137. 【2510.22390】A Fully Interpretable Statistical Approach for Roadside LiDAR Background Subtraction
链接:https://arxiv.org/abs/2510.22390
作者:Aitor Iglesias,Nerea Aranjuelo,Patricia Javierre,Ainhoa Menendez,Ignacio Arganda-Carreras,Marcos Nieto
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enhancing infrastructure-based perception, flexible statistical method, aimed at enhancing, automated driving, roadside LiDAR data
备注:
点击查看摘要
Abstract:We present a fully interpretable and flexible statistical method for background subtraction in roadside LiDAR data, aimed at enhancing infrastructure-based perception in automated driving. Our approach introduces both a Gaussian distribution grid (GDG), which models the spatial statistics of the background using background-only scans, and a filtering algorithm that uses this representation to classify LiDAR points as foreground or background. The method supports diverse LiDAR types, including multiline 360 degree and micro-electro-mechanical systems (MEMS) sensors, and adapts to various configurations. Evaluated on the publicly available RCooper dataset, it outperforms state-of-the-art techniques in accuracy and flexibility, even with minimal background data. Its efficient implementation ensures reliable performance on low-resource hardware, enabling scalable real-world deployment.
138. 【2510.22387】Privacy-Aware Federated nnU-Net for ECG Page Digitization
链接:https://arxiv.org/abs/2510.22387
作者:Nader Nemati
类目:Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Deep neural networks, Deep neural, convert ECG page, analyzable waveforms, deployment constraints
备注:
点击查看摘要
Abstract:Deep neural networks can convert ECG page images into analyzable waveforms, yet centralized training often conflicts with cross-institutional privacy and deployment constraints. A cross-silo federated digitization framework is presented that trains a full-model nnU-Net segmentation backbone without sharing images and aggregates updates across sites under realistic non-IID heterogeneity (layout, grid style, scanner profile, noise). The protocol integrates three standard server-side aggregators--FedAvg, FedProx, and FedAdam--and couples secure aggregation with central, user-level differential privacy to align utility with formal guarantees. Key features include: (i) end-to-end full-model training and synchronization across clients; (ii) secure aggregation so the server only observes a clipped, weighted sum once a participation threshold is met; (iii) central Gaussian DP with Renyi accounting applied post-aggregation for auditable user-level privacy; and (iv) a calibration-aware digitization pipeline comprising page normalization, trace segmentation, grid-leakage suppression, and vectorization to twelve-lead signals. Experiments on ECG pages rendered from PTB-XL show consistently faster convergence and higher late-round plateaus with adaptive server updates (FedAdam) relative to FedAvg and FedProx, while approaching centralized performance. The privacy mechanism maintains competitive accuracy while preventing exposure of raw images or per-client updates, yielding deployable, auditable guarantees suitable for multi-institution settings.
Subjects:
Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:
arXiv:2510.22387 [cs.CR]
(or
arXiv:2510.22387v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2510.22387
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
139. 【2510.22383】Dynamic Dropout: Leveraging Conway's Game of Life for Neural Networks Regularization
链接:https://arxiv.org/abs/2510.22383
作者:David Freire-Obregón,José Salas-Cáceres,Modesto Castrillón-Santana
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Regularization techniques play, play a crucial, crucial role, role in preventing, preventing overfitting
备注: Accepted for presentation at the 5th International Conference on Computing and Machine Intelligence (ICMI 2026)
点击查看摘要
Abstract:Regularization techniques play a crucial role in preventing overfitting and improving the generalization performance of neural networks. Dropout, a widely used regularization technique, randomly deactivates units during training to introduce redundancy and prevent co-adaptation among neurons. Despite its effectiveness, dropout has limitations, such as its static nature and lack of interpretability. In this paper, we propose a novel approach to regularization by substituting dropout with Conway's Game of Life (GoL), a cellular automata with simple rules that govern the evolution of a grid of cells. We introduce dynamic unit deactivation during training by representing neural network units as cells in a GoL grid and applying the game's rules to deactivate units. This approach allows for the emergence of spatial patterns that adapt to the training data, potentially enhancing the network's ability to generalize. We demonstrate the effectiveness of our approach on the CIFAR-10 dataset, showing that dynamic unit deactivation using GoL achieves comparable performance to traditional dropout techniques while offering insights into the network's behavior through the visualization of evolving patterns. Furthermore, our discussion highlights the applicability of our proposal in deeper architectures, demonstrating how it enhances the performance of different dropout techniques.
140. 【2510.22380】Efficient Large-Deformation Medical Image Registration via Recurrent Dynamic Correlation
链接:https://arxiv.org/abs/2510.22380
作者:Tianran Li,Marius Staring,Yuchuan Qiao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Deformable image registration, image registration estimates, registration estimates voxel-wise, Deformable image, estimates voxel-wise correspondences
备注:
点击查看摘要
Abstract:Deformable image registration estimates voxel-wise correspondences between images through spatial transformations, and plays a key role in medical imaging. While deep learning methods have significantly reduced runtime, efficiently handling large deformations remains a challenging task. Convolutional networks aggregate local features but lack direct modeling of voxel correspondences, promoting recent works to explore explicit feature matching. Among them, voxel-to-region matching is more efficient for direct correspondence modeling by computing local correlation features whithin neighbourhoods, while region-to-region matching incurs higher redundancy due to excessive correlation pairs across large regions. However, the inherent locality of voxel-to-region matching hinders the capture of long-range correspondences required for large deformations. To address this, we propose a Recurrent Correlation-based framework that dynamically relocates the matching region toward more promising positions. At each step, local matching is performed with low cost, and the estimated offset guides the next search region, supporting efficient convergence toward large deformations. In addition, we uses a lightweight recurrent update module with memory capacity and decouples motion-related and texture features to suppress semantic redundancy. We conduct extensive experiments on brain MRI and abdominal CT datasets under two settings: with and without affine pre-registration. Results show that our method exibits a strong accuracy-computation trade-off, surpassing or matching the state-of-the-art performance. For example, it achieves comparable performance on the non-affine OASIS dataset, while using only 9.5% of the FLOPs and running 96% faster than RDP, a representative high-performing method.
141. 【2510.22373】VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
链接:https://arxiv.org/abs/2510.22373
作者:Yupeng Xie,Zhiyang Zhang,Yifan Wu,Sirong Lu,Jiayi Zhang,Zhaoyang Yu,Jinlin Wang,Sirui Hong,Bang Liu,Chenglin Wu,Yuyu Luo
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:turn complex datasets, form of imagery, intuitive insights, faithfully represented, domain-specific yet widely
备注: 53 pages, 26 figures, 5 tables
点击查看摘要
Abstract:Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.551 and a correlation with human ratings of only 0.429. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.442 (a 19.8% reduction) and increasing the consistency with human experts to 0.681 (a 58.7% improvement) compared to GPT-5. The benchmark is available at this https URL.
142. 【2510.22370】BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles
链接:https://arxiv.org/abs/2510.22370
作者:Seyed Ahmad Hosseini Miangoleh,Amin Jalal Aghdasian,Farzaneh Abdollahi
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Software Engineering (cs.SE)
关键词:Bootstrapped Language-Image Pretraining-driven, Proximal Policy Optimization, Language-Image Pretraining-driven Fused, Pretraining-driven Fused State, propose Bootstrapped Language-Image
备注: [this https URL](https://github.com/Amin-A96/BLIP-FusePPO-A-Vision-Language-Deep-Reinforcement-Learning-Framework-for-Lane-Keeping-in-Autonomous.git)
点击查看摘要
Abstract:In this paper, we propose Bootstrapped Language-Image Pretraining-driven Fused State Representation in Proximal Policy Optimization (BLIP-FusePPO), a novel multimodal reinforcement learning (RL) framework for autonomous lane-keeping (LK), in which semantic embeddings generated by a vision-language model (VLM) are directly fused with geometric states, LiDAR observations, and Proportional-Integral-Derivative-based (PID) control feedback within the agent observation space. The proposed method lets the agent learn driving rules that are aware of their surroundings and easy to understand by combining high-level scene understanding from the VLM with low-level control and spatial signals. Our architecture brings together semantic, geometric, and control-aware representations to make policy learning more robust. A hybrid reward function that includes semantic alignment, LK accuracy, obstacle avoidance, and speed regulation helps learning to be more efficient and generalizable. Our method is different from the approaches that only use semantic models to shape rewards. Instead, it directly embeds semantic features into the state representation. This cuts down on expensive runtime inference and makes sure that semantic guidance is always available. The simulation results show that the proposed model is better at LK stability and adaptability than the best vision-based and multimodal RL baselines in a wide range of difficult driving situations. We make our code publicly available.
143. 【2510.22366】2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models
链接:https://arxiv.org/abs/2510.22366
作者:Jindong Yang,Han Fang,Weiming Zhang,Nenghai Yu,Kejiang Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:intellectual property protection, producing high-fidelity images, Diffusion models, recent years, producing high-fidelity
备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Diffusion models have advanced rapidly in recent years, producing high-fidelity images while raising concerns about intellectual property protection and the misuse of generative AI. Image watermarking for diffusion models, particularly Noise-as-Watermark (NaW) methods, encode watermark as specific standard Gaussian noise vector for image generation, embedding the infomation seamlessly while maintaining image quality. For detection, the generation process is inverted to recover the initial noise vector containing the watermark before extraction. However, existing NaW methods struggle to balance watermark robustness with generation diversity. Some methods achieve strong robustness by heavily constraining initial noise sampling, which degrades user experience, while others preserve diversity but prove too fragile for real-world deployment. To address this issue, we propose T2SMark, a two-stage watermarking scheme based on Tail-Truncated Sampling (TTS). Unlike prior methods that simply map bits to positive or negative values, TTS enhances robustness by embedding bits exclusively in the reliable tail regions while randomly sampling the central zone to preserve the latent distribution. Our two-stage framework then ensures sampling diversity by integrating a randomly generated session key into both encryption pipelines. We evaluate T2SMark on diffusion models with both U-Net and DiT backbones. Extensive experiments show that it achieves an optimal balance between robustness and diversity. Our code is available at \href{this https URL}{this https URL}.
144. 【2510.22359】EndoSfM3D: Learning to 3D Reconstruct Any Endoscopic Surgery Scene using Self-supervised Foundation Model
链接:https://arxiv.org/abs/2510.22359
作者:Changhao Zhang,Matthew J. Clarkson,Mobarak I. Hoque
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:surgery scenes plays, enhancing scene perception, supporting context-aware decision-making, endoscopic surgery scenes, surgery scenes
备注: 11 pages
点击查看摘要
Abstract:3D reconstruction of endoscopic surgery scenes plays a vital role in enhancing scene perception, enabling AR visualization, and supporting context-aware decision-making in image-guided surgery. A critical yet challenging step in this process is the accurate estimation of the endoscope's intrinsic parameters. In real surgical settings, intrinsic calibration is hindered by sterility constraints and the use of specialized endoscopes with continuous zoom and telescope rotation. Most existing methods for endoscopic 3D reconstruction do not estimate intrinsic parameters, limiting their effectiveness for accurate and reliable reconstruction. In this paper, we integrate intrinsic parameter estimation into a self-supervised monocular depth estimation framework by adapting the Depth Anything V2 (DA2) model for joint depth, pose, and intrinsics prediction. We introduce an attention-based pose network and a Weight-Decomposed Low-Rank Adaptation (DoRA) strategy for efficient fine-tuning of DA2. Our method is validated on the SCARED and C3VD public datasets, demonstrating superior performance compared to recent state-of-the-art approaches in self-supervised monocular depth estimation and 3D reconstruction. Code and model weights can be found in project repository: this https URL.
145. 【2510.22340】DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry
链接:https://arxiv.org/abs/2510.22340
作者:Changti Wu,Shijie Lian,Zihao Liu,Lei Zhang,Laurence Tianruo Yang,Kai Chen
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Solid geometry problem, problem solving demands, geometry problem solving, solving demands spatial, Solid geometry
备注: The code and dataset are available at \href{ [this https URL](https://zgca-ai4edu.github.io/DynaSolidGeo/) }{DynaSolidGeo}
点击查看摘要
Abstract:Solid geometry problem solving demands spatial mathematical reasoning that integrates spatial intelligence and symbolic reasoning. However, most existing multimodal mathematical reasoning benchmarks focus primarily on 2D plane geometry, rely on static datasets prone to data contamination and memorization, and evaluate models solely by final answers, overlooking the reasoning process. To address these limitations, we introduce DynaSolidGeo, the first dynamic benchmark for evaluating genuine spatial reasoning in Vision-Language Models (VLMs). Constructed through a semi-automatic annotation pipeline, DynaSolidGeo contains 503 expert-curated seed questions that can, in principle, dynamically generate an unbounded number of diverse multimodal text-visual instances. Beyond answer accuracy, we incorporate process evaluation based on expert-annotated reasoning chains to measure logical validity and causal coherence. Experiments across representative open-source and closed-source VLMs reveal large performance gaps, severe degradation in dynamic settings, and poor performance on tasks requiring high-level spatial intelligence, such as mental rotation and visualization. The code and dataset are available at \href{this https URL}{DynaSolidGeo}.
146. 【2510.22337】GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation
链接:https://arxiv.org/abs/2510.22337
作者:Phillip Mueller,Talip Uenlue,Sebastian Schmidt,Marcel Kollovieh,Jiajie Fan,Stephan Guennemann,Lars Mikelsons
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:essential for engineering, Precise geometric control, creative industries, object features accurately, geometric
备注:
点击查看摘要
Abstract:Precise geometric control in image generation is essential for engineering \ product design and creative industries to control 3D object features accurately in image space. Traditional 3D editing approaches are time-consuming and demand specialized skills, while current image-based generative methods lack accuracy in geometric conditioning. To address these challenges, we propose GeoDiffusion, a training-free framework for accurate and efficient geometric conditioning of 3D features in image generation. GeoDiffusion employs a class-specific 3D object as a geometric prior to define keypoints and parametric correlations in 3D space. We ensure viewpoint consistency through a rendered image of a reference 3D object, followed by style transfer to meet user-defined appearance specifications. At the core of our framework is GeoDrag, improving accuracy and speed of drag-based image editing on geometry guidance tasks and general instructions on DragBench. Our results demonstrate that GeoDiffusion enables precise geometric modifications across various iterative design workflows.
147. 【2510.22335】Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction
链接:https://arxiv.org/abs/2510.22335
作者:Xu Zhang,Ruijie Quan,Wenguan Wang,Yi Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Reconstructing visual stimuli, central challenge bridging, challenge bridging machine, bridging machine learning, learning and neuroscience
备注:
点击查看摘要
Abstract:Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single high-level embedding, using it as fixed guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively-aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67x faster inference, and more deterministic results than the diffusion-based baselines.
148. 【2510.22322】Beyond Augmentation: Leveraging Inter-Instance Relation in Self-Supervised Representation Learning
链接:https://arxiv.org/abs/2510.22322
作者:Ali Javidani,Babak Nadjar Araabi,Mohammad Amin Sadeghi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:self-supervised representation learning, integrates graph theory, paper introduces, approach that integrates, theory into self-supervised
备注: Accepted in IEEE Signal Processing Letters, 2025
点击查看摘要
Abstract:This paper introduces a novel approach that integrates graph theory into self-supervised representation learning. Traditional methods focus on intra-instance variations generated by applying augmentations. However, they often overlook important inter-instance relationships. While our method retains the intra-instance property, it further captures inter-instance relationships by constructing k-nearest neighbor (KNN) graphs for both teacher and student streams during pretraining. In these graphs, nodes represent samples along with their latent representations. Edges encode the similarity between instances. Following pretraining, a representation refinement phase is performed. In this phase, Graph Neural Networks (GNNs) propagate messages not only among immediate neighbors but also across multiple hops, thereby enabling broader contextual integration. Experimental results on CIFAR-10, ImageNet-100, and ImageNet-1K demonstrate accuracy improvements of 7.3%, 3.2%, and 1.0%, respectively, over state-of-the-art methods. These results highlight the effectiveness of the proposed graph based mechanism. The code is publicly available at this https URL.
149. 【2510.22319】GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
链接:https://arxiv.org/abs/2510.22319
作者:Jing Wang,Jiajun Liang,Jie Liu,Henglin Liu,Gongye Liu,Jun Zheng,Wanyuan Pang,Ao Ma,Zhenyu Xie,Xintao Wang,Meng Wang,Pengfei Wan,Xiaodan Liang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:GRPO-based reinforcement learning, shown remarkable progress, optimizing flow-matching models, GRPO-based reinforcement, reinforcement learning
备注:
点击查看摘要
Abstract:Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution-its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage-while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.
150. 【2510.22300】2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model
链接:https://arxiv.org/abs/2510.22300
作者:Chenyu Zhang,Tairen Zhang,Lanjun Wang,Ruidong Chen,Wenhui Li,Anan Liu
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:pornography and violent, risky, risky text prompts, critical task, models
备注: AAAI under review
点击查看摘要
Abstract:Using risky text prompts, such as pornography and violent prompts, to test the safety of text-to-image (T2I) models is a critical task. However, existing risky prompt datasets are limited in three key areas: 1) limited risky categories, 2) coarse-grained annotation, and 3) low effectiveness. To address these limitations, we introduce T2I-RiskyPrompt, a comprehensive benchmark designed for evaluating safety-related tasks in T2I models. Specifically, we first develop a hierarchical risk taxonomy, which consists of 6 primary categories and 14 fine-grained subcategories. Building upon this taxonomy, we construct a pipeline to collect and annotate risky prompts. Finally, we obtain 6,432 effective risky prompts, where each prompt is annotated with both hierarchical category labels and detailed risk reasons. Moreover, to facilitate the evaluation, we propose a reason-driven risky image detection method that explicitly aligns the MLLM with safety annotations. Based on T2I-RiskyPrompt, we conduct a comprehensive evaluation of eight T2I models, nine defense methods, five safety filters, and five attack strategies, offering nine key insights into the strengths and limitations of T2I model safety. Finally, we discuss potential applications of T2I-RiskyPrompt across various research fields. The dataset and code are provided in this https URL.
151. 【2510.22282】CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
链接:https://arxiv.org/abs/2510.22282
作者:Tianhui Liu,Hetian Pang,Xin Zhang,Jie Feng,Yong Li,Pan Hui
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:sustainable development goals, achieving global sustainable, global sustainable development, Harnessing publicly, large-scale web data
备注:
点击查看摘要
Abstract:Harnessing publicly available, large-scale web data, such as street view and satellite imagery, urban socio-economic sensing is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce \textbf{CityRiSE}, a novel framework for \textbf{R}eason\textbf{i}ng urban \textbf{S}ocio-\textbf{E}conomic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE with emergent reasoning process significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly for prediction on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.
152. 【2510.22276】WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models
链接:https://arxiv.org/abs/2510.22276
作者:Issa Sugiura,Shuhei Kurita,Yusuke Oda,Daisuke Kawahara,Yasuo Okabe,Naoaki Okazaki
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:high-quality image-text pair, Japanese image-text pair, Large-scale and high-quality, developing high-performing Vision-Language, high-quality Japanese image-text
备注: 9 pages, 5 figures
点击查看摘要
Abstract:Large-scale and high-quality image-text pair datasets play an important role in developing high-performing Vision-Language Models (VLMs). In this work, we introduce WAON, a large-scale and high-quality Japanese image-text pair dataset containing approximately 155 million examples, collected from Common Crawl. Our dataset construction pipeline employs various techniques, including filtering and deduplication, which have been shown to be effective in previous studies. To evaluate its effectiveness, we also construct WAON-Bench, a manually curated benchmark for Japanese cultural image classification, consisting of 374 classes. To assess the effectiveness of our dataset, we conduct experiments using both WAON and the Japanese subset of ReLAION, one of the most widely used vision-language datasets. We fine-tune SigLIP2, a strong multilingual model, on both datasets. The results demonstrate that WAON enhances model performance on WAON-Bench more efficiently than ReLAION and achieves higher accuracy across all evaluated benchmarks. Furthermore, the model fine-tuned on WAON achieves state-of-the-art performance on several Japanese cultural benchmarks. We release our dataset, model, and code at this https URL.
153. 【2510.22268】GSAlign: Geometric and Semantic Alignment Network for Aerial-Ground Person Re-Identification
链接:https://arxiv.org/abs/2510.22268
作者:Qiao Li,Jie Li,Yukang Zhang,Lei Tan,Jing Chen,Jiayi Ji
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:ground-based surveillance cameras, unmanned aerial vehicles, Aerial-Ground person re-identification, match pedestrian images, pedestrian images captured
备注: Accepted by Neurips 2025
点击查看摘要
Abstract:Aerial-Ground person re-identification (AG-ReID) is an emerging yet challenging task that aims to match pedestrian images captured from drastically different viewpoints, typically from unmanned aerial vehicles (UAVs) and ground-based surveillance cameras. The task poses significant challenges due to extreme viewpoint discrepancies, occlusions, and domain gaps between aerial and ground imagery. While prior works have made progress by learning cross-view representations, they remain limited in handling severe pose variations and spatial misalignment. To address these issues, we propose a Geometric and Semantic Alignment Network (GSAlign) tailored for AG-ReID. GSAlign introduces two key components to jointly tackle geometric distortion and semantic misalignment in aerial-ground matching: a Learnable Thin Plate Spline (LTPS) Module and a Dynamic Alignment Module (DAM). The LTPS module adaptively warps pedestrian features based on a set of learned keypoints, effectively compensating for geometric variations caused by extreme viewpoint changes. In parallel, the DAM estimates visibility-aware representation masks that highlight visible body regions at the semantic level, thereby alleviating the negative impact of occlusions and partial observations in cross-view correspondence. A comprehensive evaluation on CARGO with four matching protocols demonstrates the effectiveness of GSAlign, achieving significant improvements of +18.8\% in mAP and +16.8\% in Rank-1 accuracy over previous state-of-the-art methods on the aerial-ground setting. The code is available at: \textcolor{magenta}{this https URL}.
154. 【2510.22260】Accident Anticipation via Temporal Occurrence Prediction
链接:https://arxiv.org/abs/2510.22260
作者:Tianhao Zhao,Yiyang Zou,Zihao Mao,Peilun Xiao,Yulin Huang,Hongda Yang,Yuxuan Li,Qun Li,Guobin Wu,Yutian Lin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling timely alerts, enhance road safety, predict potential collisions, Accident anticipation aims, online manner
备注: Accepted by NIPS 2025
点击查看摘要
Abstract:Accident anticipation aims to predict potential collisions in an online manner, enabling timely alerts to enhance road safety. Existing methods typically predict frame-level risk scores as indicators of hazard. However, these approaches rely on ambiguous binary supervision (labeling all frames in accident videos as positive) despite the fact that risk varies continuously over time, leading to unreliable learning and false alarms. To address this, we propose a novel paradigm that shifts the prediction target from current-frame risk scoring to directly estimating accident scores at multiple future time steps (e.g., 0.1s-2.0s ahead), leveraging precisely annotated accident timestamps as supervision. Our method employs a snippet-level encoder to jointly model spatial and temporal dynamics, and a Transformer-based temporal decoder that predicts accident scores for all future horizons simultaneously using dedicated temporal queries. Furthermore, we introduce a refined evaluation protocol that reports Time-to-Accident (TTA) and recall (evaluated at multiple pre-accident intervals (0.5s, 1.0s, and 1.5s)) only when the false alarm rate (FAR) remains within an acceptable range, ensuring practical relevance. Experiments show that our method achieves superior performance in both recall and TTA under realistic FAR constraints.
155. 【2510.22243】Real-Time Semantic Segmentation on FPGA for Autonomous Vehicles Using LMIINet with the CGRA4ML Framework
链接:https://arxiv.org/abs/2510.22243
作者:Amir Mohammad Khadem Hosseini,Sattar Mirzakuchaki
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:computer vision, gaining particular importance, autonomous driving, fundamental problem, problem in computer
备注:
点击查看摘要
Abstract:Semantic segmentation has emerged as a fundamental problem in computer vision, gaining particular importance in real-time applications such as autonomous driving. The main challenge is achieving high accuracy while operating under computational and hardware constraints. In this research, we present an FPGA-based implementation of real-time semantic segmentation leveraging the lightweight LMIINet architecture and the Coarse-Grained Reconfigurable Array for Machine Learning (CGRA4ML) hardware framework. The model was trained using Quantization-Aware Training (QAT) with 8-bit precision on the Cityscapes dataset, reducing memory footprint by a factor of four while enabling efficient fixed-point computations. Necessary modifications were applied to adapt the model to CGRA4ML constraints, including simplifying skip connections, employing hardware-friendly operations such as depthwise-separable and 1A-1 convolutions, and redesigning parts of the Flatten Transformer. Our implementation achieves approximately 90% pixel accuracy and 45% mean Intersection-over-Union (mIoU), operating in real-time at 20 frames per second (FPS) with 50.1 ms latency on the ZCU104 FPGA board. The results demonstrate the potential of CGRA4ML, with its flexibility in mapping modern layers and off-chip memory utilization for skip connections, provides a path for implementing advanced semantic segmentation networks on FPGA for real-time applications to outperform traditional GPU solutions in terms of power efficiency while maintaining competitive accuracy. The code for this project is publicly available at this https URL cgra4ml_semantic_segmentation
156. 【2510.22236】DiffusionLane: Diffusion Model for Lane Detection
链接:https://arxiv.org/abs/2510.22236
作者:Kunyang Zhou,Yeqin Shao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:noisy lane anchors, lane anchors, denoising diffusion process, lane detection task, noisy lane
备注:
点击查看摘要
Abstract:In this paper, we present a novel diffusion-based model for lane detection, called DiffusionLane, which treats the lane detection task as a denoising diffusion process in the parameter space of the lane. Firstly, we add the Gaussian noise to the parameters (the starting point and the angle) of ground truth lanes to obtain noisy lane anchors, and the model learns to refine the noisy lane anchors in a progressive way to obtain the target lanes. Secondly, we propose a hybrid decoding strategy to address the poor feature representation of the encoder, resulting from the noisy lane anchors. Specifically, we design a hybrid diffusion decoder to combine global-level and local-level decoders for high-quality lane anchors. Then, to improve the feature representation of the encoder, we employ an auxiliary head in the training stage to adopt the learnable lane anchors for enriching the supervision on the encoder. Experimental results on four benchmarks, Carlane, Tusimple, CULane, and LLAMAS, show that DiffusionLane possesses a strong generalization ability and promising detection performance compared to the previous state-of-the-art methods. For example, DiffusionLane with ResNet18 surpasses the existing methods by at least 1\% accuracy on the domain adaptation dataset Carlane. Besides, DiffusionLane with MobileNetV4 gets 81.32\% F1 score on CULane, 96.89\% accuracy on Tusimple with ResNet34, and 97.59\% F1 score on LLAMAS with ResNet101. Code will be available at this https URL.
157. 【2510.22229】Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation
链接:https://arxiv.org/abs/2510.22229
作者:Jeongin Kim,Wonho Bae,YouLee Han,Giyeong Oh,Youngjae Yu,Danica J. Sutherland,Junhyug Noh
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:constrained labeling budgets, demands dense pixel-level, extremely constrained labeling, Semantic segmentation demands, segmentation demands dense
备注: Accepted to NeurIPS 2025
点击查看摘要
Abstract:Semantic segmentation demands dense pixel-level annotations, which can be prohibitively expensive - especially under extremely constrained labeling budgets. In this paper, we address the problem of low-budget active learning for semantic segmentation by proposing a novel two-stage selection pipeline. Our approach leverages a pre-trained diffusion model to extract rich multi-scale features that capture both global structure and fine details. In the first stage, we perform a hierarchical, representation-based candidate selection by first choosing a small subset of representative pixels per image using MaxHerding, and then refining these into a diverse global pool. In the second stage, we compute an entropy-augmented disagreement score (eDALD) over noisy multi-scale diffusion features to capture both epistemic uncertainty and prediction confidence, selecting the most informative pixels for annotation. This decoupling of diversity and uncertainty lets us achieve high segmentation accuracy with only a tiny fraction of labeled pixels. Extensive experiments on four benchmarks (CamVid, ADE-Bed, Cityscapes, and Pascal-Context) demonstrate that our method significantly outperforms existing baselines under extreme pixel-budget regimes. Our code is available at this https URL.
158. 【2510.22225】Audio Frequency-Time Dual Domain Evaluation on Depression Diagnosis
链接:https://arxiv.org/abs/2510.22225
作者:Yu Luo,Nan Huang,Sophie Yu,Hendry Xu,Jerry Wang,Colin Wang,Zhichao Liu,Chen Zeng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:typical mental disorder, impacting public health, significantly impacting public, prevalent issue significantly, issue significantly impacting
备注:
点击查看摘要
Abstract:Depression, as a typical mental disorder, has become a prevalent issue significantly impacting public health. However, the prevention and treatment of depression still face multiple challenges, including complex diagnostic procedures, ambiguous criteria, and low consultation rates, which severely hinder timely assessment and intervention. To address these issues, this study adopts voice as a physiological signal and leverages its frequency-time dual domain multimodal characteristics along with deep learning models to develop an intelligent assessment and diagnostic algorithm for depression. Experimental results demonstrate that the proposed method achieves excellent performance in the classification task for depression diagnosis, offering new insights and approaches for the assessment, screening, and diagnosis of depression.
159. 【2510.22217】Enpowering Your Pansharpening Models with Generalizability: Unified Distribution is All You Need
链接:https://arxiv.org/abs/2510.22217
作者:Yongchuan Cui,Peng Liu,Hui Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Existing deep learning-based, exhibit exceptional performance, remote sensing pansharpening, sensing pansharpening exhibit, pansharpening exhibit exceptional
备注: Accepted to ICCV 2025
点击查看摘要
Abstract:Existing deep learning-based models for remote sensing pansharpening exhibit exceptional performance on training datasets. However, due to sensor-specific characteristics and varying imaging conditions, these models suffer from substantial performance degradation when applied to unseen satellite data, lacking generalizability and thus limiting their applicability. We argue that the performance drops stem primarily from distributional discrepancies from different sources and the key to addressing this challenge lies in bridging the gap between training and testing distributions. To validate the idea and further achieve a "train once, deploy forever" capability, this paper introduces a novel and intuitive approach to enpower any pansharpening models with generalizability by employing a unified distribution strategy (UniPAN). Specifically, we construct a distribution transformation function that normalizes the pixels sampled from different sources to conform to an identical distribution. The deep models are trained on the transformed domain, and during testing on new datasets, the new data are also transformed to match the training distribution. UniPAN aims to train and test the model on a unified and consistent distribution, thereby enhancing its generalizability. Extensive experiments validate the efficacy of UniPAN, demonstrating its potential to significantly enhance the performance of deep pansharpening models across diverse satellite sensors. Codes: this https URL.
160. 【2510.22215】Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy
链接:https://arxiv.org/abs/2510.22215
作者:Juyeon Kim,Geon Lee,Dongwon Choi,Taeuk Kim,Kijung Shin
类目:Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
关键词:enterprise knowledge management, scientific search, legal discovery, knowledge management, documents is essential
备注:
点击查看摘要
Abstract:Retrieval over visually rich documents is essential for tasks such as legal discovery, scientific search, and enterprise knowledge management. Existing approaches fall into two paradigms: single-vector retrieval, which is efficient but coarse, and multi-vector retrieval, which is accurate but computationally expensive. To address this trade-off, we propose HEAVEN, a two-stage hybrid-vector framework. In the first stage, HEAVEN efficiently retrieves candidate pages using a single-vector method over Visually-Summarized Pages (VS-Pages), which assemble representative visual layouts from multiple pages. In the second stage, it reranks candidates with a multi-vector method while filtering query tokens by linguistic importance to reduce redundant computations. To evaluate retrieval systems under realistic conditions, we also introduce ViMDOC, the first benchmark for visually rich, multi-document, and long-document retrieval. Across four benchmarks, HEAVEN attains 99.87% of the Recall@1 performance of multi-vector models on average while reducing per-query computation by 99.82%, achieving efficiency and accuracy. Our code and datasets are available at: this https URL
161. 【2510.22214】GALA: A GlobAl-LocAl Approach for Multi-Source Active Domain Adaptation
链接:https://arxiv.org/abs/2510.22214
作者:Juepeng Zheng,Peifeng Zhang,Yibin Wen,Qingmei Li,Yang Zhang,Haohuan Fu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:leveraging knowledge learned, Domain Adaptation, Multi-Source Domain Adaptation, Active Domain Adaptation, tackle target-domain tasks
备注:
点击查看摘要
Abstract:Domain Adaptation (DA) provides an effective way to tackle target-domain tasks by leveraging knowledge learned from source domains. Recent studies have extended this paradigm to Multi-Source Domain Adaptation (MSDA), which exploits multiple source domains carrying richer and more diverse transferable information. However, a substantial performance gap still remains between adaptation-based methods and fully supervised learning. In this paper, we explore a more practical and challenging setting, named Multi-Source Active Domain Adaptation (MS-ADA), to further enhance target-domain performance by selectively acquiring annotations from the target domain. The key difficulty of MS-ADA lies in designing selection criteria that can jointly handle inter-class diversity and multi-source domain variation. To address these challenges, we propose a simple yet effective GALA strategy (GALA), which combines a global k-means clustering step for target-domain samples with a cluster-wise local selection criterion, effectively tackling the above two issues in a complementary manner. Our proposed GALA is plug-and-play and can be seamlessly integrated into existing DA frameworks without introducing any additional trainable parameters. Extensive experiments on three standard DA benchmarks demonstrate that GALA consistently outperforms prior active learning and active DA methods, achieving performance comparable to the fully-supervised upperbound while using only 1% of the target annotations.
162. 【2510.22213】DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
链接:https://arxiv.org/abs/2510.22213
作者:Yaokun Li,Lihe Ding,Xiao Chen,Guang Tan,Tianfan Xue
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:virtual reality, world simulation, wide applications, applications in virtual, Gaussian Splatting
备注: Project Page: [this https URL](https://dynamictree-dev.github.io/DynamicTree.github.io/)
点击查看摘要
Abstract:Generating dynamic and interactive 3D objects, such as trees, has wide applications in virtual reality, games, and world simulation. Nevertheless, existing methods still face various challenges in generating realistic 4D motion for complex real trees. In this paper, we propose DynamicTree, the first framework that can generate long-term, interactive animation of 3D Gaussian Splatting trees. Unlike prior optimization-based methods, our approach generates dynamics in a fast feed-forward manner. The key success of our approach is the use of a compact sparse voxel spectrum to represent the tree movement. Given a 3D tree from Gaussian Splatting reconstruction, our pipeline first generates mesh motion using the sparse voxel spectrum and then binds Gaussians to deform the mesh. Additionally, the proposed sparse voxel spectrum can also serve as a basis for fast modal analysis under external forces, allowing real-time interactive responses. To train our model, we also introduce 4DTree, the first large-scale synthetic 4D tree dataset containing 8,786 animated tree meshes with semantic labels and 100-frame motion sequences. Extensive experiments demonstrate that our method achieves realistic and responsive tree animations, significantly outperforming existing approaches in both visual quality and computational efficiency.
163. 【2510.22208】Simplifying Knowledge Transfer in Pretrained Models
链接:https://arxiv.org/abs/2510.22208
作者:Siddharth Jain,Shyamgopal Karthik,Vineet Gandhi
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:deep learning landscape, offering strong results, current deep learning, learning landscape, offering strong
备注: 12 pages, 3 figures, 6 tables, Accepted at TMLR 2025
点击查看摘要
Abstract:Pretrained models are ubiquitous in the current deep learning landscape, offering strong results on a broad range of tasks. Recent works have shown that models differing in various design choices exhibit categorically diverse generalization behavior, resulting in one model grasping distinct data-specific insights unavailable to the other. In this paper, we propose to leverage large publicly available model repositories as an auxiliary source of model improvements. We introduce a data partitioning strategy where pretrained models autonomously adopt either the role of a student, seeking knowledge, or that of a teacher, imparting knowledge. Experiments across various tasks demonstrate the effectiveness of our proposed approach. In image classification, we improved the performance of ViT-B by approximately 1.4% through bidirectional knowledge transfer with ViT-T. For semantic segmentation, our method boosted all evaluation metrics by enabling knowledge transfer both within and across backbone architectures. In video saliency prediction, our approach achieved a new state-of-the-art. We further extend our approach to knowledge transfer between multiple models, leading to considerable performance improvements for all model participants.
164. 【2510.22205】rajGATFormer: A Graph-Based Transformer Approach for Worker and Obstacle Trajectory Prediction in Off-site Construction Environments
链接:https://arxiv.org/abs/2510.22205
作者:Mohammed Alduais,Xinming Li,Qipei Mei
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:safety risks due, demand grows, industry for processes, brings new safety, offsite construction
备注:
点击查看摘要
Abstract:As the demand grows within the construction industry for processes that are not only faster but also safer and more efficient, offsite construction has emerged as a solution, though it brings new safety risks due to the close interaction between workers, machinery, and moving obstacles. Predicting the future trajectories of workers and taking into account social and environmental factors is a crucial step for developing collision-avoidance systems to mitigate such risks. Traditional methods often struggle to adapt to the dynamic and unpredictable nature of construction environments. Many rely on simplified assumptions or require hand-crafted features, limiting their ability to respond to complex, real-time interactions between workers and moving obstacles. While recent data-driven methods have improved the modeling of temporal patterns, they still face challenges in capturing long-term behavior and accounting for the spatial and social context crucial to collision risk assessment. To address these limitations, this paper proposes a framework integrating YOLOv10n and DeepSORT for precise detection and tracking, along with two novel trajectory prediction models: TrajGATFormer and TrajGATFormer-Obstacle. YOLOv10n serves as the backbone for object detection, accurately identifying workers and obstacles in diverse scenes, while DeepSORT efficiently tracks them over time with unique IDs for continuity. Both models employ a transformer encoder-decoder with Graph Attention Networks (GAT) to capture temporal and spatial interactions. TrajGATFormer predicts worker trajectories with an ADE of 1.25 m and FDE of 2.3 m over a 4.8 s horizon, while TrajGATFormer-Obstacle extends prediction to both workers and obstacles, achieving higher accuracy (ADE 1.15 m, FDE 2.2 m). Comparative analysis shows both models outperform traditional methods, reducing ADE and FDE by up to 35% and 38%, respectively.
165. 【2510.22200】LongCat-Video Technical Report
链接:https://arxiv.org/abs/2510.22200
作者:Meituan LongCat Team:Xunliang Cai,Qilong Huang,Zhuoliang Kang,Hongyu Li,Shijun Liang,Liya Ma,Siyu Ren,Xiaoming Wei,Rixu Xie,Tong Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:long video generation, Video generation, long video, efficient long video, Video
备注:
点击查看摘要
Abstract:Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos; Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.
166. 【2510.22199】MOGRAS: Human Motion with Grasping in 3D Scenes
链接:https://arxiv.org/abs/2510.22199
作者:Kunal Bhosikar,Siddharth Katageri,Vivek Madhavaram,Kai Han,Charu Sharma
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
关键词:virtual reality, applications in robotics, critical for applications, grasping, scenes
备注: British Machine Vision Conference Workshop - From Scene Understanding to Human Modeling
点击查看摘要
Abstract:Generating realistic full-body motion interacting with objects is critical for applications in robotics, virtual reality, and human-computer interaction. While existing methods can generate full-body motion within 3D scenes, they often lack the fidelity for fine-grained tasks like object grasping. Conversely, methods that generate precise grasping motions typically ignore the surrounding 3D scene. This gap, generating full-body grasping motions that are physically plausible within a 3D scene, remains a significant challenge. To address this, we introduce MOGRAS (Human MOtion with GRAsping in 3D Scenes), a large-scale dataset that bridges this gap. MOGRAS provides pre-grasping full-body walking motions and final grasping poses within richly annotated 3D indoor scenes. We leverage MOGRAS to benchmark existing full-body grasping methods and demonstrate their limitations in scene-aware generation. Furthermore, we propose a simple yet effective method to adapt existing approaches to work seamlessly within 3D scenes. Through extensive quantitative and qualitative experiments, we validate the effectiveness of our dataset and highlight the significant improvements our proposed method achieves, paving the way for more realistic human-scene interactions.
167. 【2510.22196】Scaling Non-Parametric Sampling with Representation
链接:https://arxiv.org/abs/2510.22196
作者:Vincent Lu,Aaron Truong,Zeyu Yun,Yubei Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:produced strikingly photorealistic, strikingly photorealistic image, remain opaque, architectural advances, advances have produced
备注:
点击查看摘要
Abstract:Scaling and architectural advances have produced strikingly photorealistic image generative models, yet their mechanisms still remain opaque. Rather than advancing scaling, our goal is to strip away complicated engineering tricks and propose a simple, non-parametric generative model. Our design is grounded in three principles of natural images-(i) spatial non-stationarity, (ii) low-level regularities, and (iii) high-level semantics-and defines each pixel's distribution from its local context window. Despite its minimal architecture and no training, the model produces high-fidelity samples on MNIST and visually compelling CIFAR-10 images. This combination of simplicity and strong empirical performance points toward a minimal theory of natural-image structure. The model's white-box nature also allows us to have a mechanistic understanding of how the model generalizes and generates diverse images. We study it by tracing each generated pixel back to its source images. These analyses reveal a simple, compositional procedure for "part-whole generalization", suggesting a hypothesis for how large neural network generative models learn to generalize.
168. 【2510.22171】HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models
链接:https://arxiv.org/abs/2510.22171
作者:Erum Mushtaq,Zalan Fabian,Yavuz Faruk Bakman,Anil Ramakrishna,Mahdi Soltanolkotabi,Salman Avestimehr
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:visually impaired individuals, impaired individuals necessitates, individuals necessitates reliable, necessitates reliable mechanisms, growing deployment
备注:
点击查看摘要
Abstract:The growing deployment of Vision-Language Models (VLMs) in high-stakes applications such as autonomous driving and assistive technologies for visually impaired individuals necessitates reliable mechanisms to assess the trustworthiness of their generation. Uncertainty Estimation (UE) plays a central role in quantifying the reliability of model outputs and reducing unsafe generations via selective prediction. In this regard, most existing probability-based UE approaches rely on output probability distributions, aggregating token probabilities into a single uncertainty score using predefined functions such as length-normalization. Another line of research leverages model hidden representations and trains MLP-based models to predict uncertainty. However, these methods often fail to capture the complex multimodal relationships between semantic and textual tokens and struggle to identify biased probabilities often influenced by language priors. Motivated by these observations, we propose a novel UE framework, HARMONY, that jointly leverages fused multimodal information in model activations and the output distribution of the VLM to determine the reliability of responses. The key hypothesis of our work is that both the model's internal belief in its visual understanding, captured by its hidden representations, and the produced token probabilities carry valuable reliability signals that can be jointly leveraged to improve UE performance, surpassing approaches that rely on only one of these components. Experimental results on three open-ended VQA benchmarks, A-OKVQA, VizWiz, and PathVQA, and three state-of-the-art VLMs, LLaVa-7b, LLaVA-13b and InstructBLIP demonstrate that our method consistently performs on par with or better than existing approaches, achieving up to 4\% improvement in AUROC, and 6\% in PRR, establishing new state of the art in uncertainty estimation for VLMs.
169. 【2510.22164】LT-Exosense: A Vision-centric Multi-session Mapping System for Lifelong Safe Navigation of Exoskeletons
链接:https://arxiv.org/abs/2510.22164
作者:Jianeng Wang,Matias Mattamala,Christina Kassab,Nived Chebrolu,Guillaume Burger,Fabio Elnecave,Marine Petriaux,Maurice Fallon
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Self-balancing exoskeletons offer, promising mobility solution, Self-balancing exoskeletons, lower-limb disabilities, offer a promising
备注: 8 pages, 4 figures
点击查看摘要
Abstract:Self-balancing exoskeletons offer a promising mobility solution for individuals with lower-limb disabilities. For reliable long-term operation, these exoskeletons require a perception system that is effective in changing environments. In this work, we introduce LT-Exosense, a vision-centric, multi-session mapping system designed to support long-term (semi)-autonomous navigation for exoskeleton users. LT-Exosense extends single-session mapping capabilities by incrementally fusing spatial knowledge across multiple sessions, detecting environmental changes, and updating a persistent global map. This representation enables intelligent path planning, which can adapt to newly observed obstacles and can recover previous routes when obstructions are removed. We validate LT-Exosense through several real-world experiments, demonstrating a scalable multi-session map that achieves an average point-to-point error below 5 cm when compared to ground-truth laser scans. We also illustrate the potential application of adaptive path planning in dynamically changing indoor environments.
170. 【2510.22161】I2-NeRF: Learning Neural Radiance Fields Under Physically-Grounded Media Interactions
链接:https://arxiv.org/abs/2510.22161
作者:Shuhong Liu,Lin Gu,Ziteng Cui,Xuangeng Chu,Tatsuya Harada
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Participating in efforts, neural radiance field, radiance field framework, efforts to endow, endow generative
备注:
点击查看摘要
Abstract:Participating in efforts to endow generative AI with the 3D physical world perception, we propose I2-NeRF, a novel neural radiance field framework that enhances isometric and isotropic metric perception under media degradation. While existing NeRF models predominantly rely on object-centric sampling, I2-NeRF introduces a reverse-stratified upsampling strategy to achieve near-uniform sampling across 3D space, thereby preserving isometry. We further present a general radiative formulation for media degradation that unifies emission, absorption, and scattering into a particle model governed by the Beer-Lambert attenuation law. By composing the direct and media-induced in-scatter radiance, this formulation extends naturally to complex media environments such as underwater, haze, and even low-light scenes. By treating light propagation uniformly in both vertical and horizontal directions, I2-NeRF enables isotropic metric perception and can even estimate medium properties such as water depth. Experiments on real-world datasets demonstrate that our method significantly improves both reconstruction fidelity and physical plausibility compared to existing approaches.
171. 【2510.22160】SentiMaithili: A Benchmark Dataset for Sentiment and Reason Generation for the Low-Resource Maithili Language
链接:https://arxiv.org/abs/2510.22160
作者:Rahul Ranjan,Mahendra Kumar Gurve,Anuj,Nitin,Yamuna Prasad
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:poses significant challenges, languages poses significant, Developing benchmark datasets, low-resource languages poses, primarily due
备注:
点击查看摘要
Abstract:Developing benchmark datasets for low-resource languages poses significant challenges, primarily due to the limited availability of native linguistic experts and the substantial time and cost involved in annotation. Given these challenges, Maithili is still underrepresented in natural language processing research. It is an Indo-Aryan language spoken by more than 13 million people in the Purvanchal region of India, valued for its rich linguistic structure and cultural significance. While sentiment analysis has achieved remarkable progress in high-resource languages, resources for low-resource languages, such as Maithili, remain scarce, often restricted to coarse-grained annotations and lacking interpretability mechanisms. To address this limitation, we introduce a novel dataset comprising 3,221 Maithili sentences annotated for sentiment polarity and accompanied by natural language justifications. Moreover, the dataset is carefully curated and validated by linguistic experts to ensure both label reliability and contextual fidelity. Notably, the justifications are written in Maithili, thereby promoting culturally grounded interpretation and enhancing the explainability of sentiment models. Furthermore, extensive experiments using both classical machine learning and state-of-the-art transformer architectures demonstrate the dataset's effectiveness for interpretable sentiment analysis. Ultimately, this work establishes the first benchmark for explainable affective computing in Maithili, thus contributing a valuable resource to the broader advancement of multilingual NLP and explainable AI.
172. 【2510.22149】Power to the Clients: Federated Learning in a Dictatorship Setting
链接:https://arxiv.org/abs/2510.22149
作者:Mohammadsajad Alipour,Mohammad Mohammadi Amiri
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
关键词:Federated learning, local data, promising paradigm, collaboratively learn, learn a shared
备注:
点击查看摘要
Abstract:Federated learning (FL) has emerged as a promising paradigm for decentralized model training, enabling multiple clients to collaboratively learn a shared model without exchanging their local data. However, the decentralized nature of FL also introduces vulnerabilities, as malicious clients can compromise or manipulate the training process. In this work, we introduce dictator clients, a novel, well-defined, and analytically tractable class of malicious participants capable of entirely erasing the contributions of all other clients from the server model, while preserving their own. We propose concrete attack strategies that empower such clients and systematically analyze their effects on the learning process. Furthermore, we explore complex scenarios involving multiple dictator clients, including cases where they collaborate, act independently, or form an alliance in order to ultimately betray one another. For each of these settings, we provide a theoretical analysis of their impact on the global model's convergence. Our theoretical algorithms and findings about the complex scenarios including multiple dictator clients are further supported by empirical evaluations on both computer vision and natural language processing benchmarks.
173. 【2510.22142】Attention Residual Fusion Network with Contrast for Source-free Domain Adaptation
链接:https://arxiv.org/abs/2510.22142
作者:Renrong Shao,Wei Zhang,Jun Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Source-free domain adaptation, Source-free domain, Attention Residual Fusion, source domain, domain
备注: 13 pages, 8 figures
点击查看摘要
Abstract:Source-free domain adaptation (SFDA) involves training a model on source domain and then applying it to a related target domain without access to the source data and labels during adaptation. The complexity of scene information and lack of the source domain make SFDA a difficult task. Recent studies have shown promising results, but many approaches to domain adaptation concentrate on domain shift and neglect the effects of negative transfer, which may impede enhancements of model performance during adaptation. n this paper, addressing this issue, we propose a novel framework of Attention Residual Fusion Network (ARFNet) based on contrast learning for SFDA to alleviate negative transfer and domain shift during the progress of adaptation, in which attention residual fusion, global-local attention contrast, and dynamic centroid evaluation are exploited. Concretely, the attention mechanism is first exploited to capture the discriminative region of the target object. Then, in each block, attention features are decomposed into spatial-wise and channel-wise attentions to achieve the cross-layer attention residual fusion progressively and self-distillation. During adaptation progress, we contrast global and local representations to improve the perceptual capabilities of different categories, which enables the model to discriminate variations between inner-class and intra-class. Finally, a dynamic centroid evaluation strategy is exploited to evaluate the trustworthy centroids and labels for self-supervised self-distillation, which aims to accurately approximate the center of the source domain and pseudo-labels to mitigate domain shift. To validate the efficacy, we execute comprehensive experiments on five benchmarks of varying scales. Experimental outcomes indicate that our method surpasses other techniques, attaining superior performance across SFDA benchmarks.
174. 【2510.22141】LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction
链接:https://arxiv.org/abs/2510.22141
作者:Yuhang Gao,Xiang Xiang,Sheng Zhong,Guoyou Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
关键词:shown significant progress, shown significant, significant progress, Vision-Language Models, Densely Contrastive Learning
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) have shown significant progress in open-set challenges. However, the limited availability of 3D datasets hinders their effective application in 3D scene understanding. We propose LOC, a general language-guided framework adaptable to various occupancy networks, supporting both supervised and self-supervised learning paradigms. For self-supervised tasks, we employ a strategy that fuses multi-frame LiDAR points for dynamic/static scenes, using Poisson reconstruction to fill voids, and assigning semantics to voxels via K-Nearest Neighbor (KNN) to obtain comprehensive voxel representations. To mitigate feature over-homogenization caused by direct high-dimensional feature distillation, we introduce Densely Contrastive Learning (DCL). DCL leverages dense voxel semantic information and predefined textual prompts. This efficiently enhances open-set recognition without dense pixel-level supervision, and our framework can also leverage existing ground truth to further improve performance. Our model predicts dense voxel features embedded in the CLIP feature space, integrating textual and image pixel information, and classifies based on text and semantic similarity. Experiments on the nuScenes dataset demonstrate the method's superior performance, achieving high-precision predictions for known classes and distinguishing unknown classes without additional training data.
175. 【2510.22140】STG-Avatar: Animatable Human Avatars via Spacetime Gaussian
链接:https://arxiv.org/abs/2510.22140
作者:Guangan Jiang,Tianzi Zhang,Dong Li,Zhenjun Zhao,Haoang Li,Mingrui Li,Hongyu Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Realistic animatable human, immersive virtual experiences, advancing human-robot interaction, enhancing immersive virtual, Realistic animatable
备注: Accepted by the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025
点击查看摘要
Abstract:Realistic animatable human avatars from monocular videos are crucial for advancing human-robot interaction and enhancing immersive virtual experiences. While recent research on 3DGS-based human avatars has made progress, it still struggles with accurately representing detailed features of non-rigid objects (e.g., clothing deformations) and dynamic regions (e.g., rapidly moving limbs). To address these challenges, we present STG-Avatar, a 3DGS-based framework for high-fidelity animatable human avatar reconstruction. Specifically, our framework introduces a rigid-nonrigid coupled deformation framework that synergistically integrates Spacetime Gaussians (STG) with linear blend skinning (LBS). In this hybrid design, LBS enables real-time skeletal control by driving global pose transformations, while STG complements it through spacetime adaptive optimization of 3D Gaussians. Furthermore, we employ optical flow to identify high-dynamic regions and guide the adaptive densification of 3D Gaussians in these regions. Experimental results demonstrate that our method consistently outperforms state-of-the-art baselines in both reconstruction quality and operational efficiency, achieving superior quantitative metrics while retaining real-time rendering capabilities. Our code is available at this https URL
176. 【2510.22129】goEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks
链接:https://arxiv.org/abs/2510.22129
作者:Matthias Jammot,Bjöern Braun,Paul Streli,Rafael Wampfler,Christian Holz
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:anticipating human behavior, person emotional states, benchmarks largely ignore, current egocentric vision, Understanding affect
备注: Accepted for publication at NeurIPS 2025
点击查看摘要
Abstract:Understanding affect is central to anticipating human behavior, yet current egocentric vision benchmarks largely ignore the person's emotional states that shape their decisions and actions. Existing tasks in egocentric perception focus on physical activities, hand-object interactions, and attention modeling - assuming neutral affect and uniform personality. This limits the ability of vision systems to capture key internal drivers of behavior. In this paper, we present egoEMOTION, the first dataset that couples egocentric visual and physiological signals with dense self-reports of emotion and personality across controlled and real-world scenarios. Our dataset includes over 50 hours of recordings from 43 participants, captured using Meta's Project Aria glasses. Each session provides synchronized eye-tracking video, headmounted photoplethysmography, inertial motion data, and physiological baselines for reference. Participants completed emotion-elicitation tasks and naturalistic activities while self-reporting their affective state using the Circumplex Model and Mikels' Wheel as well as their personality via the Big Five model. We define three benchmark tasks: (1) continuous affect classification (valence, arousal, dominance); (2) discrete emotion classification; and (3) trait-level personality inference. We show that a classical learning-based method, as a simple baseline in real-world affect prediction, produces better estimates from signals captured on egocentric vision systems than processing physiological signals. Our dataset establishes emotion and personality as core dimensions in egocentric perception and opens new directions in affect-driven modeling of behavior, intent, and interaction.
177. 【2510.22127】Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions
链接:https://arxiv.org/abs/2510.22127
作者:Wenxuan Bao,Ruxi Deng,Jingrui He
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Pretrained vision-language models, achieve strong zero-shot, strong zero-shot generalization, distribution shifts caused, CLIP achieve strong
备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Pretrained vision-language models such as CLIP achieve strong zero-shot generalization but remain vulnerable to distribution shifts caused by input corruptions. In this work, we investigate how corruptions affect CLIP's image embeddings and uncover a consistent phenomenon we term as embedding variance collapse, where both intra-class and inter-class variances shrink as corruption severity increases. We find that this collapse is closely tied to performance degradation, with inter-class variance strongly correlated with classification accuracy. To explain this phenomenon, we analyze how corruptions alter the structure of the embedding space. Our theoretical results suggest that the visual encoder tends to encode corruption-related signals, which dilute class-discriminative features and compress the representation geometry. We further show that maximizing inter-class variance, even when estimated from pseudo-labels, can provably enhance embedding quality. Based on this insight, we propose Mint, a simple test-time adaptation method that maximizes pseudo-label-based inter-class variance on the fly using a mean accumulator and a gradient accumulator. Mint operates effectively with small batch sizes and consistently improves performance across multiple corruption benchmarks and CLIP architectures. Our code is available at this https URL .
178. 【2510.22119】CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding
链接:https://arxiv.org/abs/2510.22119
作者:Lihuang Fang,Xiao Hu,Yuchen Zou,Hong Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep stereo matching, Deep stereo, matching has advanced, advanced significantly, significantly on benchmark
备注: 9 pages, 6 figures
点击查看摘要
Abstract:Deep stereo matching has advanced significantly on benchmark datasets through fine-tuning but falls short of the zero-shot generalization seen in foundation models in other vision tasks. We introduce CogStereo, a novel framework that addresses challenging regions, such as occlusions or weak textures, without relying on dataset-specific priors. CogStereo embeds implicit spatial cognition into the refinement process by using monocular depth features as priors, capturing holistic scene understanding beyond local correspondences. This approach ensures structurally coherent disparity estimation, even in areas where geometry alone is inadequate. CogStereo employs a dual-conditional refinement mechanism that combines pixel-wise uncertainty with cognition-guided features for consistent global correction of mismatches. Extensive experiments on Scene Flow, KITTI, Middlebury, ETH3D, EuRoc, and real-world demonstrate that CogStereo not only achieves state-of-the-art results but also excels in cross-domain generalization, shifting stereo vision towards a cognition-driven approach.
179. 【2510.22118】GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation
链接:https://arxiv.org/abs/2510.22118
作者:Karim Elmaaroufi,Liheng Lai,Justin Svegliato,Yutong Bai,Sanjit A. Seshia,Matei Zaharia
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Vision Language Models, Vision Language, achieve strong performance, Language Models, strong performance
备注: 22 pages, 3 figures, 3 tables, project page: [this https URL](https://ke7.github.io/graid/)
点击查看摘要
Abstract:Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning$\unicode{x2014}$a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets that are of higher quality than existing tools that produce similar datasets as validated by human evaluations. We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs creating questions spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16% human-validated accuracy$\unicode{x2014}$compared to 57.6% on a dataset generated by recent work. Critically, we demonstrate that when trained on GRAID data, models learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5% on BDD and 37.9% on NuImages for Llama 3.2B 11B, and when trained on all questions types, achieve improvements on several existing benchmarks such as BLINK. The GRAID framework, datasets, and additional information can be found $\href{this https URL}{here}$.
180. 【2510.22107】Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation
链接:https://arxiv.org/abs/2510.22107
作者:Bailey Trang,Parham Saremi,Alan Q. Wang,Fangrui Huang,Zahra TehraniNasab,Amar Kumar,Tal Arbel,Li Fei-Fei,Ehsan Adeli
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:prompt-based image generation, Generative Flow Networks, image, image generation, diverse
备注:
点击查看摘要
Abstract:Capturing diversity is crucial in conditional and prompt-based image generation, particularly when conditions contain uncertainty that can lead to multiple plausible outputs. To generate diverse images reflecting this diversity, traditional methods often modify random seeds, making it difficult to discern meaningful differences between samples, or diversify the input prompt, which is limited in verbally interpretable diversity. We propose Rainbow, a novel conditional image generation framework, applicable to any pretrained conditional generative model, that addresses inherent condition/prompt uncertainty and generates diverse plausible images. Rainbow is based on a simple yet effective idea: decomposing the input condition into diverse latent representations, each capturing an aspect of the uncertainty and generating a distinct image. First, we integrate a latent graph, parameterized by Generative Flow Networks (GFlowNets), into the prompt representation computation. Second, leveraging GFlowNets' advanced graph sampling capabilities to capture uncertainty and output diverse trajectories over the graph, we produce multiple trajectories that collectively represent the input condition, leading to diverse condition representations and corresponding output images. Evaluations on natural image and medical image datasets demonstrate Rainbow's improvement in both diversity and fidelity across image synthesis, image generation, and counterfactual generation tasks.
181. 【2510.22102】Mitigating Coordinate Prediction Bias from Positional Encoding Failures
链接:https://arxiv.org/abs/2510.22102
作者:Xingjian Tao,Yiwei Wang,Yujun Cai,Yihong Luo,Jing Tang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Multimodal large language, Multimodal large, VQA and document, prediction remains challenging, large language models
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) excel at vision-language tasks such as VQA and document understanding, yet precise coordinate prediction remains challenging. High-resolution inputs exacerbate this difficulty by producing long token sequences that weaken positional encodings and introduce directional biases in coordinate outputs. We investigate this phenomenon by analyzing how MLLMs behave when visual positional encodings (VPEs) are deliberately perturbed through shuffling. Our analysis reveals that such perturbations induce predictable, non-random coordinate biases rather than random errors, suggesting that models rely on internal positional priors when spatial grounding signals are degraded. Crucially, we observe similar directional error patterns in natural high-resolution datasets, indicating that positional encoding failures are a key bottleneck for accurate coordinate prediction at scale. To address this issue, we propose Vision-PE Shuffle Guidance (VPSG), a training-free test-time method that leverages the directional nature of these biases for correction. VPSG runs auxiliary decoding with shuffled VPEs to isolate position-unconditioned tendencies, then uses this as negative evidence to guide digit prediction while preserving coordinate format through a lightweight finite-state machine. Experiments on ScreenSpot-Pro demonstrate reliable improvements, highlighting positional encoding robustness as a critical factor for spatial reasoning in MLLMs.
182. 【2510.22073】Scanner-Agnostic MRI Harmonization via SSIM-Guided Disentanglement
链接:https://arxiv.org/abs/2510.22073
作者:Luca Caldera,Lara Cavinato,Francesca Ieva
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:imaging sites hinders, hinders consistent analysis, MRI scanner models, Structural Similarity Index, introduced by differences
备注:
点击查看摘要
Abstract:The variability introduced by differences in MRI scanner models, acquisition protocols, and imaging sites hinders consistent analysis and generalizability across multicenter studies. We present a novel image-based harmonization framework for 3D T1-weighted brain MRI, which disentangles anatomical content from scanner- and site-specific variations. The model incorporates a differentiable loss based on the Structural Similarity Index (SSIM) to preserve biologically meaningful features while reducing inter-site variability. This loss enables separate evaluation of image luminance, contrast, and structural components. Training and validation were performed on multiple publicly available datasets spanning diverse scanners and sites, with testing on both healthy and clinical populations. Harmonization using multiple style targets, including style-agnostic references, produced consistent and high-quality outputs. Visual comparisons, voxel intensity distributions, and SSIM-based metrics demonstrated that harmonized images achieved strong alignment across acquisition settings while maintaining anatomical fidelity. Following harmonization, structural SSIM reached 0.97, luminance SSIM ranged from 0.98 to 0.99, and Wasserstein distances between mean voxel intensity distributions decreased substantially. Downstream tasks showed substantial improvements: mean absolute error for brain age prediction decreased from 5.36 to 3.30 years, and Alzheimer's disease classification AUC increased from 0.78 to 0.85. Overall, our framework enhances cross-site image consistency, preserves anatomical fidelity, and improves downstream model performance, providing a robust and generalizable solution for large-scale multicenter neuroimaging studies.
183. 【2510.22070】MAGIC-Flow: Multiscale Adaptive Conditional Flows for Generation and Interpretable Classification
链接:https://arxiv.org/abs/2510.22070
作者:Luca Caldera,Giacomo Bottacini,Lara Cavinato
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
关键词:imaging remains limited, medical imaging remains, representation learning, remains limited, task alignment
备注:
点击查看摘要
Abstract:Generative modeling has emerged as a powerful paradigm for representation learning, but its direct applicability to challenging fields like medical imaging remains limited: mere generation, without task alignment, fails to provide a robust foundation for clinical use. We propose MAGIC-Flow, a conditional multiscale normalizing flow architecture that performs generation and classification within a single modular framework. The model is built as a hierarchy of invertible and differentiable bijections, where the Jacobian determinant factorizes across sub-transformations. We show how this ensures exact likelihood computation and stable optimization, while invertibility enables explicit visualization of sample likelihoods, providing an interpretable lens into the model's reasoning. By conditioning on class labels, MAGIC-Flow supports controllable sample synthesis and principled class-probability estimation, effectively aiding both generative and discriminative objectives. We evaluate MAGIC-Flow against top baselines using metrics for similarity, fidelity, and diversity. Across multiple datasets, it addresses generation and classification under scanner noise, and modality-specific synthesis and identification. Results show MAGIC-Flow creates realistic, diverse samples and improves classification. MAGIC-Flow is an effective strategy for generation and classification in data-limited domains, with direct benefits for privacy-preserving augmentation, robust generalization, and trustworthy medical AI.
184. 【2510.22067】Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation
链接:https://arxiv.org/abs/2510.22067
作者:Zheng Qi,Chao Shang,Evangelia Spiliopoulou,Nikolaos Pappas
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision language models, Vision language, language models, visual, user query
备注:
点击查看摘要
Abstract:Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
185. 【2510.22056】Human-Centric Anomaly Detection in Surveillance Videos Using YOLO-World and Spatio-Temporal Deep Learning
链接:https://arxiv.org/abs/2510.22056
作者:Mohammad Ali Etemadi Naeen,Hoda Mohammadzade,Saeed Bagheri Shouraki
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:scene-dependent visual clutter, challenging task due, abnormal events, visual clutter, remains a challenging
备注:
点击查看摘要
Abstract:Anomaly detection in surveillance videos remains a challenging task due to the diversity of abnormal events, class imbalance, and scene-dependent visual clutter. To address these issues, we propose a robust deep learning framework that integrates human-centric preprocessing with spatio-temporal modeling for multi-class anomaly classification. Our pipeline begins by applying YOLO-World - an open-vocabulary vision-language detector - to identify human instances in raw video clips, followed by ByteTrack for consistent identity-aware tracking. Background regions outside detected bounding boxes are suppressed via Gaussian blurring, effectively reducing scene-specific distractions and focusing the model on behaviorally relevant foreground content. The refined frames are then processed by an ImageNet-pretrained InceptionV3 network for spatial feature extraction, and temporal dynamics are captured using a bidirectional LSTM (BiLSTM) for sequence-level classification. Evaluated on a five-class subset of the UCF-Crime dataset (Normal, Burglary, Fighting, Arson, Explosion), our method achieves a mean test accuracy of 92.41% across three independent trials, with per-class F1-scores consistently exceeding 0.85. Comprehensive evaluation metrics - including confusion matrices, ROC curves, and macro/weighted averages - demonstrate strong generalization and resilience to class imbalance. The results confirm that foreground-focused preprocessing significantly enhances anomaly discrimination in real-world surveillance scenarios.
186. 【2510.22045】VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT
链接:https://arxiv.org/abs/2510.22045
作者:Hyeonsu Kang,Emily Bao,Anjan Goswami
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Vision-language models, understanding remains underexplored, including presentation slides, evaluate multimodal content, slide-specific understanding remains
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Evaluating the Evolving LLM Lifecycle - Benchmarks, Emergent Abilities, and Scaling
点击查看摘要
Abstract:Vision-language models (VLMs) are increasingly used to evaluate multimodal content, including presentation slides, yet their slide-specific understanding remains underexplored {despite their growing role as critics in agentic, model-forward pipelines}. We introduce VLM-SlideEval, an evaluation framework that probes VLMs along three axes: (1) element-level extraction from slide images aligned to ground truth; (2) robustness to controlled perturbations in geometry, style, and text; and (3) higher-level comprehension, such as recovering a deck's narrative order from shuffled slides. Using publicly available decks from Zenodo (this https URL), we standardize ground-truth element metadata from PowerPoint XML and live renderings into a unified, verifiable schema. Empirically, VLMs underperform on pixel-accurate extraction and show non-trivial agreement, fidelity, and consistency under controlled perturbations, while performing better on single-slide content understanding; however, they do not reliably capture narrative structure across slides. These results highlight the limits of current VLMs for slide evaluation and motivate calibrated, critic-in-the-loop evaluators that drive iterative refinement and selection in agentic pipelines.
187. 【2510.22035】Caption-Driven Explainability: Probing CNNs for Bias via CLIP
链接:https://arxiv.org/abs/2510.22035
作者:Patrick Koller(Northwestern University, Evanston, Illinois, United States),Amil V. Dravid(University of California, Berkeley, California, United States),Guido M. Schuster(Eastern Switzerland University of Applied Sciences, Rapperswil, St. Gallen, Switzerland),Aggelos K. Katsaggelos(Northwestern University, Evanston, Illinois, United States)
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:machine learning, XAI, caption-based XAI model, caption-based XAI, critical problems
备注: Accepted and presented at the IEEE ICIP 2025 Satellite Workshop "Generative AI for World Simulations and Communications Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos", Anchorage, Alaska, USA, September 14, 2025. Camera-ready preprint; the official IEEE Xplore publication will follow. Code is available at [this https URL](https://github.com/patch0816/caption-driven-xai>)
点击查看摘要
Abstract:Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the models prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models. Our code is available at this https URL.
188. 【2510.22011】Reconnaissance Automatique des Langues des Signes : Une Approche Hybridée CNN-LSTM Basée sur Mediapipe
链接:https://arxiv.org/abs/2510.22011
作者:Fraisse Sacré Takouchouang,Ho Tuong Vinh
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Sign languages play, automatic sign language, sign language recognition, deaf communities, limiting access
备注: in French language
点击查看摘要
Abstract:Sign languages play a crucial role in the communication of deaf communities, but they are often marginalized, limiting access to essential services such as healthcare and education. This study proposes an automatic sign language recognition system based on a hybrid CNN-LSTM architecture, using Mediapipe for gesture keypoint extraction. Developed with Python, TensorFlow and Streamlit, the system provides real-time gesture translation. The results show an average accuracy of 92\%, with very good performance for distinct gestures such as ``Hello'' and ``Thank you''. However, some confusions remain for visually similar gestures, such as ``Call'' and ``Yes''. This work opens up interesting perspectives for applications in various fields such as healthcare, education and public services.
189. 【2510.22010】FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing
链接:https://arxiv.org/abs/2510.22010
作者:Or Ronai,Vladimir Kulikov,Tomer Michaeli
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
关键词:controlled generation tasks, generation tasks, remarkable success, success of diffusion, diffusion and flow-matching
备注: Project's webpage at [this https URL](https://orronai.github.io/FlowOpt/)
点击查看摘要
Abstract:The remarkable success of diffusion and flow-matching models has ignited a surge of works on adapting them at test time for controlled generation tasks. Examples range from image editing to restoration, compression and personalization. However, due to the iterative nature of the sampling process in those models, it is computationally impractical to use gradient-based optimization to directly control the image generated at the end of the process. As a result, existing methods typically resort to manipulating each timestep separately. Here we introduce FlowOpt - a zero-order (gradient-free) optimization framework that treats the entire flow process as a black box, enabling optimization through the whole sampling path without backpropagation through the model. Our method is both highly efficient and allows users to monitor the intermediate optimization results and perform early stopping if desired. We prove a sufficient condition on FlowOpt's step-size, under which convergence to the global optimum is guaranteed. We further show how to empirically estimate this upper bound so as to choose an appropriate step-size. We demonstrate how FlowOpt can be used for image editing, showcasing two options: (i) inversion (determining the initial noise that generates a given image), and (ii) directly steering the edited image to be similar to the source image while conforming to a target text prompt. In both cases, FlowOpt achieves state-of-the-art results while using roughly the same number of neural function evaluations (NFEs) as existing methods. Code and examples are available on the project's webpage.
190. 【2510.22004】LiteDiff
链接:https://arxiv.org/abs/2510.22004
作者:Ruchir Namjoshi,Nagasai Thadishetty,Vignesh Kumar,Hemanth Venkateshwara
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:high-fidelity image synthesis, demonstrated remarkable success, recent years, image synthesis, demonstrated remarkable
备注:
点击查看摘要
Abstract:In recent years, diffusion models have demonstrated remarkable success in high-fidelity image synthesis. However, fine-tuning these models for specialized domains, such as medical imaging, remains challenging due to limited domain-specific data and the high computational cost of full model adaptation. In this paper, we introduce Lite-Diff (Lightweight Diffusion Model Adaptation), a novel finetuning approach that integrates lightweight adaptation layers into a frozen diffusion U-Net while enhancing training with a latent morphological autoencoder (for domain-specific latent consistency) and a pixel level discriminator(for adversarial alignment). By freezing weights of the base model and optimizing only small residual adapter modules, LiteDiff significantly reduces the computational overhead and mitigates overfitting, even in minimal-data settings. Additionally, we conduct ablation studies to analyze the effects of selectively integrating adaptation layers in different U-Net blocks, revealing an optimal balance between efficiency and performance. Experiments on three chest X-ray datasets - (1) Kaggle Chest X-Ray Pneumonia, (2) NIH Chest X-ray14 and (3) VinBigData Chest X_ray demonstrate that LiteDiff achieves superior adaptation efficiency compared to naive full fine-tuning. Our framework provides a promising direction for transfer learning in diffusion models, facilitating their deployment in diverse low data domains.
191. 【2510.21986】Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
链接:https://arxiv.org/abs/2510.21986
作者:Dogyun Park,Moayed Haji-Ali,Yanyu Li,Willi Menapace,Sergey Tulyakov,Hyunwoo J. Kim,Aliaksandr Siarohin,Anil Kag
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:pretraining prohibitively expensive, sequence length makes, length makes large-scale, makes large-scale pretraining, large-scale pretraining prohibitively
备注:
点击查看摘要
Abstract:Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse--Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning to close the train--inference gap. On ImageNet-1K 256x256, SPRINT achieves 9.8x training savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.
192. 【2510.21898】A supervised discriminant data representation: application to pattern classification
链接:https://arxiv.org/abs/2510.21898
作者:Fadi Dornaika,Ahmad Khoder,Abdelmalik Moujahid,Wassim Khoder
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:recognition algorithms generally, algorithms generally depends, pattern recognition algorithms, machine learning algorithms, machine learning
备注:
点击查看摘要
Abstract:The performance of machine learning and pattern recognition algorithms generally depends on data representation. That is why, much of the current effort in performing machine learning algorithms goes into the design of preprocessing frameworks and data transformations able to support effective machine learning. The method proposed in this work consists of a hybrid linear feature extraction scheme to be used in supervised multi-class classification problems. Inspired by two recent linear discriminant methods: robust sparse linear discriminant analysis (RSLDA) and inter-class sparsitybased discriminative least square regression (ICS_DLSR), we propose a unifying criterion that is able to retain the advantages of these two powerful methods. The resulting transformation relies on sparsity-promoting techniques both to select the features that most accurately represent the data and to preserve the row-sparsity consistency property of samples from the same class. The linear transformation and the orthogonal matrix are estimated using an iterative alternating minimization scheme based on steepest descent gradient method and different initialization schemes. The proposed framework is generic in the sense that it allows the combination and tuning of other linear discriminant embedding methods. According to the experiments conducted on several datasets including faces, objects, and digits, the proposed method was able to outperform competing methods in most cases.
193. 【2510.21887】Generative AI in Depth: A Survey of Recent Advances, Model Variants, and Real-World Applications
链接:https://arxiv.org/abs/2510.21887
作者:Shamim Yazdani,Akansha Singh,Nripsuta Saxena,Zichong Wang,Avash Palikhe,Deng Pan,Umapada Pal,Jie Yang,Wenbin Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Generative Adversarial Networks, deep learning based, Variational Autoencoders, Adversarial Networks, learning based generative
备注: Accepted by the Journal of Big Data
点击查看摘要
Abstract:In recent years, deep learning based generative models, particularly Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models (DMs), have been instrumental in in generating diverse, high-quality content across various domains, such as image and video synthesis. This capability has led to widespread adoption of these models and has captured strong public interest. As they continue to advance at a rapid pace, the growing volume of research, expanding application areas, and unresolved technical challenges make it increasingly difficult to stay current. To address this need, this survey introduces a comprehensive taxonomy that organizes the literature and provides a cohesive framework for understanding the development of GANs, VAEs, and DMs, including their many variants and combined approaches. We highlight key innovations that have improved the quality, diversity, and controllability of generated outputs, reflecting the expanding potential of generative artificial intelligence. In addition to summarizing technical progress, we examine rising ethical concerns, including the risks of misuse and the broader societal impact of synthetic media. Finally, we outline persistent challenges and propose future research directions, offering a structured and forward looking perspective for researchers in this fast evolving field.
194. 【2510.21879】rnaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge
链接:https://arxiv.org/abs/2510.21879
作者:Shu-Hao Zhang,Wei-Cheng Tang,Chen Wu,Peng Hu,Nan Li,Liang-Jie Zhang,Qi Zhang,Shao-Qun Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Contrastive Language-Image Pretraining, image-text contrastive modeling, Recent years, Language-Image Pretraining, contrastive modeling
备注:
点击查看摘要
Abstract:Recent years have witnessed an increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose the TernaryCLIP, a lightweight computational framework that converts connection weights of both vision and text encoders of CLIP into the ternary format, instead of full-precision or floating ones. TernaryCLIP incorporates quantization-aware training and distillation modules, preventing precision degradation and enabling low-cost and high-efficiency computations. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99\% ternarized weights with 1.58-bit representation, 16.98 $\times$ compression ratio, 2.3 $\times$ inference acceleration, 16 $\times$ storage reduction, 10 $\times$ memory optimization, and 60\% sparsity while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.
195. 【2510.21876】AI Powered Urban Green Infrastructure Assessment Through Aerial Imagery of an Industrial Township
链接:https://arxiv.org/abs/2510.21876
作者:Anisha Dutta
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:effective environmental monitoring, Accurate assessment, effective environmental, environmental monitoring, climate change
备注: Presented at IIIE Conference 2024, Jamshedpur
点击查看摘要
Abstract:Accurate assessment of urban canopy coverage is crucial for informed urban planning, effective environmental monitoring, and mitigating the impacts of climate change. Traditional practices often face limitations due to inadequate technical requirements, difficulties in scaling and data processing, and the lack of specialized expertise. This study presents an efficient approach for estimating green canopy coverage using artificial intelligence, specifically computer vision techniques, applied to aerial imageries. Our proposed methodology utilizes object-based image analysis, based on deep learning algorithms to accurately identify and segment green canopies from high-resolution drone images. This approach allows the user for detailed analysis of urban vegetation, capturing variations in canopy density and understanding spatial distribution. To overcome the computational challenges associated with processing large datasets, it was implemented over a cloud platform utilizing high-performance processors. This infrastructure efficiently manages space complexity and ensures affordable latency, enabling the rapid analysis of vast amounts of drone imageries. Our results demonstrate the effectiveness of this approach in accurately estimating canopy coverage at the city scale, providing valuable insights for urban forestry management of an industrial township. The resultant data generated by this method can be used to optimize tree plantation and assess the carbon sequestration potential of urban forests. By integrating these insights into sustainable urban planning, we can foster more resilient urban environments, contributing to a greener and healthier future.
196. 【2510.21867】Addressing Corner Cases in Autonomous Driving: A World Model-based Approach with Mixture of Experts and LLMs
链接:https://arxiv.org/abs/2510.21867
作者:Haicheng Liao,Bonan Wang,Junxian Yang,Chengyue Wang,Zhengbin He,Guohui Zhang,Chengzhong Xu,Zhenning Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Accurate and reliable, reliable motion forecasting, autonomous vehicles, safe deployment, deployment of autonomous
备注:
点击查看摘要
Abstract:Accurate and reliable motion forecasting is essential for the safe deployment of autonomous vehicles (AVs), particularly in rare but safety-critical scenarios known as corner cases. Existing models often underperform in these situations due to an over-representation of common scenes in training data and limited generalization capabilities. To address this limitation, we present WM-MoE, the first world model-based motion forecasting framework that unifies perception, temporal memory, and decision making to address the challenges of high-risk corner-case scenarios. The model constructs a compact scene representation that explains current observations, anticipates future dynamics, and evaluates the outcomes of potential actions. To enhance long-horizon reasoning, we leverage large language models (LLMs) and introduce a lightweight temporal tokenizer that maps agent trajectories and contextual cues into the LLM's feature space without additional training, enriching temporal context and commonsense priors. Furthermore, a mixture-of-experts (MoE) is introduced to decompose complex corner cases into subproblems and allocate capacity across scenario types, and a router assigns scenes to specialized experts that infer agent intent and perform counterfactual rollouts. In addition, we introduce nuScenes-corner, a new benchmark that comprises four real-world corner-case scenarios for rigorous evaluation. Extensive experiments on four benchmark datasets (nuScenes, NGSIM, HighD, and MoCAD) showcase that WM-MoE consistently outperforms state-of-the-art (SOTA) baselines and remains robust under corner-case and data-missing conditions, indicating the promise of world model-based architectures for robust and generalizable motion forecasting in fully AVs.
197. 【2510.21864】LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation
链接:https://arxiv.org/abs/2510.21864
作者:Xin Lu,Chuanqing Zhuang,Chenxi Jin,Zhengda Lu,Yiqun Wang,Wu Liu,Jun Xiao
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:synchronized digital humans, attracted increasing interest, temporally synchronized digital, digital humans, attracted increasing
备注:
点击查看摘要
Abstract:Speech-driven 3D facial animation has attracted increasing interest since its potential to generate expressive and temporally synchronized digital humans. While recent works have begun to explore emotion-aware animation, they still depend on explicit one-hot encodings to represent identity and emotion with given emotion and identity labels, which limits their ability to generalize to unseen speakers. Moreover, the emotional cues inherently present in speech are often neglected, limiting the naturalness and adaptability of generated animations. In this work, we propose LSF-Animation, a novel framework that eliminates the reliance on explicit emotion and identity feature representations. Specifically, LSF-Animation implicitly extracts emotion information from speech and captures the identity features from a neutral facial mesh, enabling improved generalization to unseen speakers and emotional states without requiring manual labels. Furthermore, we introduce a Hierarchical Interaction Fusion Block (HIFB), which employs a fusion token to integrate dual transformer features and more effectively integrate emotional, motion-related and identity-related cues. Extensive experiments conducted on the 3DMEAD dataset demonstrate that our method surpasses recent state-of-the-art approaches in terms of emotional expressiveness, identity generalization, and animation realism. The source code will be released at: this https URL.
198. 【2510.21862】A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model
链接:https://arxiv.org/abs/2510.21862
作者:Muhammad Tayyab Khan,Zane Yong,Lequn Chen,Wenhe Feng,Nicholas Yew Jin Tan,Seung Ki Moon
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:conveying design intent, design intent, production details, primary medium, medium for conveying
备注: This draft has been submitted to the 13th International Conference on Industrial Engineering and Applications (ICIEA 2026)
点击查看摘要
Abstract:Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging using manual methods, generic optical character recognition (OCR) systems, or traditional deep learning approaches, due to varied layouts, orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including measures, GDT symbols, and surface roughness indicators. The third stage employs two Donut-based, OCR-free VLMs for semantic content parsing: the Alphabetical VLM extracts textual and categorical information from title blocks and notes, while the Numerical VLM interprets quantitative data such as measures, GDT frames, and surface roughness. Two specialized datasets were developed to ensure robustness and generalization: 1,000 drawings for layout detection and 1,406 for annotation-level training. The Alphabetical VLM achieved an overall F1 score of 0.672, while the Numerical VLM reached 0.963, demonstrating strong performance in textual and quantitative interpretation, respectively. The unified JSON output enables seamless integration with CAD and manufacturing databases, providing a scalable solution for intelligent engineering drawing analysis.
199. 【2510.21857】Poisson Flow Consistency Training
链接:https://arxiv.org/abs/2510.21857
作者:Anthony Zhang,Mahmut Gokmen,Dennis Hein,Rongjun Ge,Wenjun Xia,Ge Wang,Jin Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Poisson Flow Consistency, robust Poisson Flow, Poisson Flow, Poisson Flow Generative, Flow Consistency Model
备注: 5 pages, 3 figures, 1 table
点击查看摘要
Abstract:The Poisson Flow Consistency Model (PFCM) is a consistency-style model based on the robust Poisson Flow Generative Model++ (PFGM++) which has achieved success in unconditional image generation and CT image denoising. Yet the PFCM can only be trained in distillation which limits the potential of the PFCM in many data modalities. The objective of this research was to create a method to train the PFCM in isolation called Poisson Flow Consistency Training (PFCT). The perturbation kernel was leveraged to remove the pretrained PFGM++, and the sinusoidal discretization schedule and Beta noise distribution were introduced in order to facilitate adaptability and improve sample quality. The model was applied to the task of low dose computed tomography image denoising and improved the low dose image in terms of LPIPS and SSIM. It also displayed similar denoising effectiveness as models like the Consistency Model. PFCT is established as a valid method of training the PFCM from its effectiveness in denoising CT images, showing potential with competitive results to other generative models. Further study is needed in the precise optimization of PFCT and in its applicability to other generative modeling tasks. The framework of PFCT creates more flexibility for the ways in which a PFCM can be created and can be applied to the field of generative modeling.
200. 【2510.21850】SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
链接:https://arxiv.org/abs/2510.21850
作者:Gyubeum Lim,Yemo Koo,Vijay Krishna Madisetti
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:long-context visual information, visual information remains, Understanding long-context visual, GUI control, information remains
备注:
点击查看摘要
Abstract:Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to reduce the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.
201. 【2510.21842】Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?
链接:https://arxiv.org/abs/2510.21842
作者:Michael Aerni,Joshua Swanson,Kristina Nikolić,Florian Tramèr
类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
关键词:models accurately memorize, accurately memorize concepts, memorize concepts visually, current unified multimodal, unified multimodal models
备注:
点击查看摘要
Abstract:We present modal aphasia, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually but fail to articulate them in writing, despite being trained on images and text simultaneously. For one, we show that leading frontier models can generate near-perfect reproductions of iconic movie artwork, but confuse crucial details when asked for textual descriptions. We corroborate those findings through controlled experiments on synthetic datasets in multiple architectures. Our experiments confirm that modal aphasia reliably emerges as a fundamental property of current unified multimodal models, not just as a training artifact. In practice, modal aphasia can introduce vulnerabilities in AI safety frameworks, as safeguards applied to one modality may leave harmful concepts accessible in other modalities. We demonstrate this risk by showing how a model aligned solely on text remains capable of generating unsafe images.
202. 【2510.21841】RatioWaveNet: A Learnable RDWT Front-End for Robust and Interpretable EEG Motor-Imagery Classification
链接:https://arxiv.org/abs/2510.21841
作者:Marco Siino,Giuseppe Bonomo,Rosario Sorbello,Ilenia Tinnirello
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
关键词:non-invasive EEG remains, EEG remains challenging, translate covert movement, covert movement intentions, remains challenging due
备注:
点击查看摘要
Abstract:Brain-computer interfaces (BCIs) based on motor imagery (MI) translate covert movement intentions into actionable commands, yet reliable decoding from non-invasive EEG remains challenging due to nonstationarity, low SNR, and subject variability. We present RatioWaveNet, which augments a strong temporal CNN-Transformer backbone (TCFormer) with a trainable, Rationally-Dilated Wavelet Transform (RDWT) front end. The RDWT performs an undecimated, multi-resolution subband decomposition that preserves temporal length and shift-invariance, enhancing sensorimotor rhythms while mitigating jitter and mild artifacts; subbands are fused via lightweight grouped 1-D convolutions and passed to a multi-kernel CNN for local temporal-spatial feature extraction, a grouped-query attention encoder for long-range context, and a compact TCN head for causal temporal integration. Our goal is to test whether this principled wavelet front end improves robustness precisely where BCIs typically fail - on the hardest subjects - and whether such gains persist on average across seeds under both intra- and inter-subject protocols. On BCI-IV-2a and BCI-IV-2b, across five seeds, RatioWaveNet improves worst-subject accuracy over the Transformer backbone by +0.17 / +0.42 percentage points (Sub-Dependent / LOSO) on 2a and by +1.07 / +2.54 percentage points on 2b, with consistent average-case gains and modest computational overhead. These results indicate that a simple, trainable wavelet front end is an effective plug-in to strengthen Transformer-based BCIs, improving worst-case reliability without sacrificing efficiency.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Cite as:
arXiv:2510.21841 [cs.CV]
(or
arXiv:2510.21841v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.21841
Focus to learn more
arXiv-issued DOI via DataCite</p>
203. 【2510.21840】Improving the Physics of Video Generation with VJEPA-2 Reward Signal
链接:https://arxiv.org/abs/2510.21840
作者:Jianhao Yuan,Xiaofeng Zhang,Felix Friedrich,Nicolas Beltran-Velez,Melissa Hall,Reyhane Askari-Hemmat,Xiaochuang Han,Nicolas Ballas,Michal Drozdzal,Adriana Romero-Soriano
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:Perception Test Workshop, Workshop at ICCV, Perception Test, Test Workshop, PhysicsIQ Challenge
备注: 2 pages
点击查看摘要
Abstract:This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State-of-the-art video generative models exhibit severely limited physical understanding, and often produce implausible videos. The Physics IQ benchmark has shown that visual realism does not imply physics understanding. Yet, intuitive physics understanding has shown to emerge from SSL pretraining on natural videos. In this report, we investigate whether we can leverage SSL-based video world models to improve the physics plausibility of video generative models. In particular, we build ontop of the state-of-the-art video generative model MAGI-1 and couple it with the recently introduced Video Joint Embedding Predictive Architecture 2 (VJEPA-2) to guide the generation process. We show that by leveraging VJEPA-2 as reward signal, we can improve the physics plausibility of state-of-the-art video generative models by ~6%.
204. 【2510.21839】Evaluating ChatGPT's Performance in Classifying Pneumonia from Chest X-Ray Images
链接:https://arxiv.org/abs/2510.21839
作者:Pragna Prahallad,Pranathi Prahallad
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:classify chest X-ray, NORMAL or PNEUMONIA, chest X-ray images, chest X-ray, ability of OpenAI
备注:
点击查看摘要
Abstract:In this study, we evaluate the ability of OpenAI's gpt-4o model to classify chest X-ray images as either NORMAL or PNEUMONIA in a zero-shot setting, without any prior fine-tuning. A balanced test set of 400 images (200 from each class) was used to assess performance across four distinct prompt designs, ranging from minimal instructions to detailed, reasoning-based prompts. The results indicate that concise, feature-focused prompts achieved the highest classification accuracy of 74\%, whereas reasoning-oriented prompts resulted in lower performance. These findings highlight that while ChatGPT exhibits emerging potential for medical image interpretation, its diagnostic reliability remains limited. Continued advances in visual reasoning and domain-specific adaptation are required before such models can be safely applied in clinical practice.
205. 【2510.21835】A Multimodal, Multitask System for Generating E Commerce Text Listings from Images
链接:https://arxiv.org/abs/2510.21835
作者:Nayan Kumar Singh
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Manually generating catchy, generating catchy descriptions, Manually generating, generating catchy, catchy descriptions
备注: 24 pages, 10 figures, 11 tables. Code can be found at: [this https URL](https://github.com/SinghNayanKumar/multimodal-product-lister/)
点击查看摘要
Abstract:Manually generating catchy descriptions and names is labor intensive and a slow process for retailers. Although generative AI provides an automation solution in form of Vision to Language Models (VLM), the current VLMs are prone to factual "hallucinations". Siloed, single task models are not only inefficient but also fail to capture interdependent relationships between features. To address these challenges, we propose an end to end, multi task system that generates factually grounded textual listings from a single image. The contributions of this study are two proposals for the model architecture. First, application of multi task learning approach for fine tuning a vision encoder where a single vision backbone is jointly trained on attribute prediction such as color, hemline and neck style and price regression. Second, introduction of a hierarchical generation process where the model's own predicted attributes are embedded in a prompt and fed to the text decoder to improve factual consistency. The experiments demonstrate the superiority of this architecture. The multi tasking approach outperforms both the independent price regression, with a 3.6% better R2 Value and attribute classification, with a 6.6% improvement F1 score. Critically, the hierarchical generation process proves highly effective, slashing the factual hallucination rate from 12.7% to 7.1%, a 44.5% relative reduction, compared to a non hierarchical ablation. The hierarchical approach also reduces the latency of the autoregressive text generation process by a factor of 3.5 when compared to direct vision to language model of similar size. One minor caveat is that the model does perform 3.5% worse than direct vision-to-language model on ROUGE-L score.
206. 【2510.21833】owards Accurate and Efficient Waste Image Classification: A Hybrid Deep Learning and Machine Learning Approach
链接:https://arxiv.org/abs/2510.21833
作者:Ngoc-Bao-Quang Nguyen,Tuan-Minh Do,Cong-Tam Phan,Thi-Thu-Hong Phan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Automated image-based garbage, integrate Machine Learning, solutions remain underdeveloped, Support Vector Machine, image-based garbage classification
备注: 31 pages; 7 figures; 16 tables
点击查看摘要
Abstract:Automated image-based garbage classification is a critical component of global waste management; however, systematic benchmarks that integrate Machine Learning (ML), Deep Learning (DL), and efficient hybrid solutions remain underdeveloped. This study provides a comprehensive comparison of three paradigms: (1) machine learning algorithms using handcrafted features, (2) deep learning architectures, including ResNet variants and EfficientNetV2S, and (3) a hybrid approach that utilizes deep models for feature extraction combined with classical classifiers such as Support Vector Machine and Logistic Regression to identify the most effective strategy. Experiments on three public datasets - TrashNet, Garbage Classification, and a refined Household Garbage Dataset (with 43 corrected mislabels)- demonstrate that the hybrid method consistently outperforms the others, achieving up to 100% accuracy on TrashNet and the refined Household set, and 99.87% on Garbage Classification, thereby surpassing state-of-the-art benchmarks. Furthermore, feature selection reduces feature dimensionality by over 95% without compromising accuracy, resulting in faster training and inference. This work establishes more reliable benchmarks for waste classification and introduces an efficient hybrid framework that achieves high accuracy while reducing inference cost, making it suitable for scalable deployment in resource-constrained environments.
207. 【2510.21829】A Flow Model with Low-Rank Transformers for Incomplete Multimodal Survival Analysis
链接:https://arxiv.org/abs/2510.21829
作者:Yi Yin,Yuntao Shou,Zao Dai,Yun Peng,Tao Meng,Wei Ai,Keqin Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:medical data-based survival, multimodal medical data-based, multimodal survival analysis, data-based survival analysis, incomplete multimodal survival
备注: 12 pages, 4 figures
点击查看摘要
Abstract:In recent years, multimodal medical data-based survival analysis has attracted much attention. However, real-world datasets often suffer from the problem of incomplete modality, where some patient modality information is missing due to acquisition limitations or system failures. Existing methods typically infer missing modalities directly from observed ones using deep neural networks, but they often ignore the distributional discrepancy across modalities, resulting in inconsistent and unreliable modality reconstruction. To address these challenges, we propose a novel framework that combines a low-rank Transformer with a flow-based generative model for robust and flexible multimodal survival prediction. Specifically, we first formulate the concerned problem as incomplete multimodal survival analysis using the multi-instance representation of whole slide images (WSIs) and genomic profiles. To realize incomplete multimodal survival analysis, we propose a class-specific flow for cross-modal distribution alignment. Under the condition of class labels, we model and transform the cross-modal distribution. By virtue of the reversible structure and accurate density modeling capabilities of the normalizing flow model, the model can effectively construct a distribution-consistent latent space of the missing modality, thereby improving the consistency between the reconstructed data and the true distribution. Finally, we design a lightweight Transformer architecture to model intra-modal dependencies while alleviating the overfitting problem in high-dimensional modality fusion by virtue of the low-rank Transformer. Extensive experiments have demonstrated that our method not only achieves state-of-the-art performance under complete modality settings, but also maintains robust and superior accuracy under the incomplete modalities scenario.
208. 【2510.21828】Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images
链接:https://arxiv.org/abs/2510.21828
作者:Yichi Zhang,Zhuo Chen,Lingbing Guo,Lei Liang,Wen Zhang,Huajun Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:visual modality presents, modality presents significant, presents significant challenges, current multi-modal large, multi-modal large language
备注: Work in Progress. Code and data will be released at [this https URL](https://github.com/zjukg/STAR)
点击查看摘要
Abstract:Understanding and reasoning with abstractive information from the visual modality presents significant challenges for current multi-modal large language models (MLLMs). Among the various forms of abstractive information, Multi-Modal Relational Knowledge (MMRK), which represents abstract relational structures between multi-modal entities using node-edge formats, remains largely under-explored. In particular, STructured and Abstractive Reasoning (STAR) on such data has received little attention from the research community. To bridge the dual gaps in large-scale high-quality data and capability enhancement methodologies, this paper makes the following key contributions: (i). An automatic STAR data engine capable of synthesizing images with MMRK to build multi-modal instruction data with reliable chain-of-thought thinking for various STAR tasks and (ii). A comprehsive two-stage capability enhancement training framework, accompanied by a suite of evaluation protocols tailored to different STAR tasks. Based upon these contributions, we introduce STAR-64K, a dataset comprising 64K high-quality multi-modal instruction samples, and conduct experiments across 5 open-source MLLMs. Experimental results show that our two-stage enhancement framework enables smaller 3B/7B models to significantly outperform GPT-4o in STAR. Additionally, we provide in-depth analysis regarding the effectiveness of various designs, data transferability, and scalability.
209. 【2510.21827】Precise classification of low quality G-banded Chromosome Images by reliability metrics and data pruning classifier
链接:https://arxiv.org/abs/2510.21827
作者:Mojtaba Moattari
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:accurate meta-phase analyzes, high resolution cameras, high quality train, quality train data, meta-phase analyzes
备注:
点击查看摘要
Abstract:In the last decade, due to high resolution cameras and accurate meta-phase analyzes, the accuracy of chromosome classification has improved substantially. However, current Karyotyping systems demand large number of high quality train data to have an adequately plausible Precision per each chromosome. Such provision of high quality train data with accurate devices are not yet accomplished in some out-reached pathological laboratories. To prevent false positive detections in low-cost systems and low-quality images settings, this paper improves the classification Precision of chromosomes using proposed reliability thresholding metrics and deliberately engineered features. The proposed method has been evaluated using a variation of deep Alex-Net neural network, SVM, K Nearest-Neighbors, and their cascade pipelines to an automated filtering of semi-straight chromosome. The classification results have highly improved over 90% for the chromosomes with more common defections and translocations. Furthermore, a comparative analysis over the proposed thresholding metrics has been conducted and the best metric is bolded with its salient characteristics. The high Precision results provided for a very low-quality G-banding database verifies suitability of the proposed metrics and pruning method for Karyotyping facilities in poor countries and lowbudget pathological laboratories.
210. 【2510.21823】Explainable Deep Learning in Medical Imaging: Brain Tumor and Pneumonia Detection
链接:https://arxiv.org/abs/2510.21823
作者:Sai Teja Erukude,Viswa Chaitanya Marella,Suhasnadh Reddy Veluru
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:holds enormous potential, improving medical imaging, hampers clinical trust, Convolutional Neural Networks, Deep Learning
备注: Published in IEEE
点击查看摘要
Abstract:Deep Learning (DL) holds enormous potential for improving medical imaging diagnostics, yet the lack of interpretability in most models hampers clinical trust and adoption. This paper presents an explainable deep learning framework for detecting brain tumors in MRI scans and pneumonia in chest X-ray images using two leading Convolutional Neural Networks, ResNet50 and DenseNet121. These models were trained on publicly available Kaggle datasets comprising 7,023 brain MRI images and 5,863 chest X-ray images, achieving high classification performance. DenseNet121 consistently outperformed ResNet50 with 94.3 percent vs. 92.5 percent accuracy for brain tumors and 89.1 percent vs. 84.4 percent accuracy for pneumonia. For better explainability, Gradient-weighted Class Activation Mapping (Grad-CAM) was integrated to create heatmap visualizations superimposed on the test images, indicating the most influential image regions in the decision-making process. Interestingly, while both models produced accurate results, Grad-CAM showed that DenseNet121 consistently focused on core pathological regions, whereas ResNet50 sometimes scattered attention to peripheral or non-pathological areas. Combining deep learning and explainable AI offers a promising path toward reliable, interpretable, and clinically useful diagnostic tools.
211. 【2510.21822】Wavelet-based GAN Fingerprint Detection using ResNet50
链接:https://arxiv.org/abs/2510.21822
作者:Sai Teja Erukude,Suhasnadh Reddy Veluru,Viswa Chaitanya Marella
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Generative Adversarial Networks, Identifying images generated, digital image forensics, Generative Adversarial, Adversarial Networks
备注: 6 pages; Published in IEEE
点击查看摘要
Abstract:Identifying images generated by Generative Adversarial Networks (GANs) has become a significant challenge in digital image forensics. This research presents a wavelet-based detection method that uses discrete wavelet transform (DWT) preprocessing and a ResNet50 classification layer to differentiate the StyleGAN-generated images from real ones. Haar and Daubechies wavelet filters are applied to convert the input images into multi-resolution representations, which will then be fed to a ResNet50 network for classification, capitalizing on subtle artifacts left by the generative process. Moreover, the wavelet-based models are compared to an identical ResNet50 model trained on spatial data. The Haar and Daubechies preprocessed models achieved a greater accuracy of 93.8 percent and 95.1 percent, much higher than the model developed in the spatial domain (accuracy rate of 81.5 percent). The Daubechies-based model outperforms Haar, showing that adding layers of descriptive frequency patterns can lead to even greater distinguishing power. These results indicate that the GAN-generated images have unique wavelet-domain artifacts or "fingerprints." The method proposed illustrates the effectiveness of wavelet-domain analysis to detect GAN images and emphasizes the potential of further developing the capabilities of future deepfake detection systems.
212. 【2510.21821】Prompt fidelity of ChatGPT4o / Dall-E3 text-to-image visualisations
链接:https://arxiv.org/abs/2510.21821
作者:Dirk HR Spennemann
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:autogenously generated prompts, resulting images, autogenously generated, visualisations, correctly rendered
备注:
点击查看摘要
Abstract:This study examines the prompt fidelity of ChatGPT4o / DALL-E3 text-to-image visualisations by analysing whether attributes explicitly specified in autogenously generated prompts are correctly rendered in the resulting images. Using two public-domain datasets comprising 200 visualisations of women working in the cultural and creative industries and 230 visualisations of museum curators, the study assessed accuracy across personal attributes (age, hair), appearance (attire, glasses), and paraphernalia (name tags, clipboards). While correctly rendered in most cases, DALL-E3 deviated from prompt specifications in 15.6% of all attributes (n=710). Errors were lowest for paraphernalia, moderate for personal appearance, and highest for depictions of the person themselves, particularly age. These findings demonstrate measurable prompt-to-image fidelity gaps with implications for bias detection and model evaluation.
213. 【2510.21814】Gestura: A LVLM-Powered System Bridging Motion and Semantics for Real-Time Free-Form Gesture Understanding
链接:https://arxiv.org/abs/2510.21814
作者:Zhuoming Li,Aitong Liu,Mengxi Jia,Tengxiang Zhang,Dell Zhang,Xuelong Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:predefined gesture categories, human-computer interaction, Free-form gesture, appealing for human-computer, liberates users
备注: IMWUT2025
点击查看摘要
Abstract:Free-form gesture understanding is highly appealing for human-computer interaction, as it liberates users from the constraints of predefined gesture categories. However, the sole existing solution GestureGPT suffers from limited recognition accuracy and slow response times. In this paper, we propose Gestura, an end-to-end system for free-form gesture understanding. Gestura harnesses a pre-trained Large Vision-Language Model (LVLM) to align the highly dynamic and diverse patterns of free-form gestures with high-level semantic concepts. To better capture subtle hand movements across different styles, we introduce a Landmark Processing Module that compensate for LVLMs' lack of fine-grained domain knowledge by embedding anatomical hand priors. Further, a Chain-of-Thought (CoT) reasoning strategy enables step-by-step semantic inference, transforming shallow knowledge into deep semantic understanding and significantly enhancing the model's ability to interpret ambiguous or unconventional gestures. Together, these components allow Gestura to achieve robust and adaptable free-form gesture comprehension. Additionally, we have developed the first open-source dataset for free-form gesture intention reasoning and understanding with over 300,000 annotated QA pairs.
214. 【2510.21813】SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling
链接:https://arxiv.org/abs/2510.21813
作者:Samuel J. Barrett,Docko Sow
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Earth Observation, holds great promise, holds great, great promise, promise for simplifying
备注: 27 pages, 7 figures
点击查看摘要
Abstract:Earth Observation (EO) Foundation Modelling (FM) holds great promise for simplifying and improving the use of EO data for diverse real-world tasks. However, most existing models require additional adaptation before they can be used and are structured rigidly around particular data sources or training approaches. To address this, we take inspiration from large language models, where diverse tasks, both pre-training and downstream, are implicitly captured through next-token prediction over unified token sequences, leveraging the structure and diversity of the training data. We introduce SITS-DECO (Satellite Image Time Series-DECoder Only), a proof-of-concept generative model that applies this unified-sequence framing to EO data. Using a simple GPT-style decoder-only architecture, and demonstrate its ability to perform useful EO tasks (pixel-wise, multi-temporal, multi-modal crop-type classification) in a purely generative framework. Through symbolic prompting, we show that the model can perform multiple supervised and self-supervised tasks within a single unified architecture, without task- or modality-specific adaptation. Despite its simplicity and lack of spatial context, SITS-DECO outperforms much larger EO foundation models on crop-type classification (PASTIS-R) demonstrating that dense temporal sequence modelling is a critical missing ingredient in the current paradigm. This work exemplifies a data-centric modelling paradigm in which capability arises from the diversity and structure of the training data rather than from architectural complexity. SITS-DECO provides a lightweight, practical route to multi-modal, multi-task EO modelling, and a conceptual bridge toward future generative EO foundation models.
Comments:
27 pages, 7 figures
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:
arXiv:2510.21813 [cs.CV]
(or
arXiv:2510.21813v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.21813
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
215. 【2510.21811】Comparative Analysis of Object Detection Algorithms for Surface Defect Detection
链接:https://arxiv.org/abs/2510.21811
作者:Arpan Maity,Tamal Ghosh
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Fast R-CNN, industrial quality control, NEU-DET surface defect, comprising images representing, prominent object detection
备注: 14 pages, 8 figures
点击查看摘要
Abstract:This article compares the performance of six prominent object detection algorithms, YOLOv11, RetinaNet, Fast R-CNN, YOLOv8, RT-DETR, and DETR, on the NEU-DET surface defect detection dataset, comprising images representing various metal surface defects, a crucial application in industrial quality control. Each model's performance was assessed regarding detection accuracy, speed, and robustness across different defect types such as scratches, inclusions, and rolled-in scales. YOLOv11, a state-of-the-art real-time object detection algorithm, demonstrated superior performance compared to the other methods, achieving a remarkable 70% higher accuracy on average. This improvement can be attributed to YOLOv11s enhanced feature extraction capabilities and ability to process the entire image in a single forward pass, making it faster and more efficient in detecting minor surface defects. Additionally, YOLOv11's architecture optimizations, such as improved anchor box generation and deeper convolutional layers, contributed to more precise localization of defects. In conclusion, YOLOv11's outstanding performance in accuracy and speed solidifies its position as the most effective model for surface defect detection on the NEU dataset, surpassing competing algorithms by a substantial margin.
216. 【2510.21810】Hybrid Deep Learning Framework for Enhanced Diabetic Retinopathy Detection: Integrating Traditional Features with AI-driven Insights
链接:https://arxiv.org/abs/2510.21810
作者:Arpan Maity,Aviroop Pal,MD. Samiul Islam,Tamal Ghosh
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:highest diabetic populations, Dia-betes Mellitus, major global concern, Diabetic Retinopathy, complication of Dia-betes
备注: 11 pages, 3 figures
点击查看摘要
Abstract:Diabetic Retinopathy (DR), a vision-threatening complication of Dia-betes Mellitus (DM), is a major global concern, particularly in India, which has one of the highest diabetic populations. Prolonged hyperglycemia damages reti-nal microvasculature, leading to DR symptoms like microaneurysms, hemor-rhages, and fluid leakage, which, if undetected, cause irreversible vision loss. Therefore, early screening is crucial as DR is asymptomatic in its initial stages. Fundus imaging aids precise diagnosis by detecting subtle retinal lesions. This paper introduces a hybrid diagnostic framework combining traditional feature extraction and deep learning (DL) to enhance DR detection. While handcrafted features capture key clinical markers, DL automates hierarchical pattern recog-nition, improving early diagnosis. The model synergizes interpretable clinical data with learned features, surpassing standalone DL approaches that demon-strate superior classification and reduce false negatives. This multimodal AI-driven approach enables scalable, accurate DR screening, crucial for diabetes-burdened regions.
217. 【2510.21809】Embodied Navigation with Auxiliary Task of Action Description Prediction
链接:https://arxiv.org/abs/2510.21809
作者:Haru Kondoh,Asako Kanezaki
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:garnered significant attention, recent years, indoor environments, environments has garnered, garnered significant
备注: ICCV 2025 Poster
点击查看摘要
Abstract:The field of multimodal robot navigation in indoor environments has garnered significant attention in recent years. However, as tasks and methods become more advanced, the action decision systems tend to become more complex and operate as black-boxes. For a reliable system, the ability to explain or describe its decisions is crucial; however, there tends to be a trade-off in that explainable systems can not outperform non-explainable systems in terms of performance. In this paper, we propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate describing actions into reinforcement learning due to the absence of ground-truth data. We address this issue by leveraging knowledge distillation from pre-trained description generation models, such as vision-language models. We comprehensively evaluate our approach across various navigation tasks, demonstrating that it can describe actions while attaining high navigation performance. Furthermore, it achieves state-of-the-art performance in the particularly challenging multimodal navigation task of semantic audio-visual navigation.
218. 【2510.21808】Semantic Relation-Enhanced CLIP Adapter for Domain Adaptive Zero-Shot Learning
链接:https://arxiv.org/abs/2510.21808
作者:Jiaao Yu,Mingjie Han,Jinkun Jiang,Junyu Dong,Tao Gong,Man Lan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:training deep learning, deep learning models, data-limited scenarios, high cost, cost of data
备注: 5 pages
点击查看摘要
Abstract:The high cost of data annotation has spurred research on training deep learning models in data-limited scenarios. Existing paradigms, however, fail to balance cross-domain transfer and cross-category generalization, giving rise to the demand for Domain-Adaptive Zero-Shot Learning (DAZSL). Although vision-language models (e.g., CLIP) have inherent advantages in the DAZSL field, current studies do not fully exploit their potential. Applying CLIP to DAZSL faces two core challenges: inefficient cross-category knowledge transfer due to the lack of semantic relation guidance, and degraded cross-modal alignment during target domain fine-tuning. To address these issues, we propose a Semantic Relation-Enhanced CLIP (SRE-CLIP) Adapter framework, integrating a Semantic Relation Structure Loss and a Cross-Modal Alignment Retention Strategy. As the first CLIP-based DAZSL method, SRE-CLIP achieves state-of-the-art performance on the I2AwA and I2WebV benchmarks, significantly outperforming existing approaches.
219. 【2510.21807】Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs
链接:https://arxiv.org/abs/2510.21807
作者:Jiaao Yu,Shenwei Li,Mingjie Han,Yifei Yin,Wenzheng Song,Chenghao Jia,Man Lan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Recent breakthroughs, large language models, verifiable rewards, context and commonsense, markedly advanced
备注: 9 pages
点击查看摘要
Abstract:Recent breakthroughs in reasoning models have markedly advanced the reasoning capabilities of large language models, particularly via training on tasks with verifiable rewards. Yet, a significant gap persists in their adaptation to real world multimodal scenarios, most notably, vision language tasks, due to a heavy focus on single modal language settings. While efforts to transplant reinforcement learning techniques from NLP to VLMs have emerged, these approaches often remain confined to perception centric tasks or reduce images to textual summaries, failing to fully exploit visual context and commonsense knowledge, ultimately constraining the generalization of reasoning capabilities across diverse multimodal environments. To address this limitation, we introduce a novel fine tuning task, Masked Prediction via Context and Commonsense, which forces models to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images, thereby laying the foundation for generalized reasoning. To systematically evaluate the model performance in generalized reasoning, we developed a specialized evaluation benchmark, MPCC Eval, and employed various fine tuning strategies to guide reasoning. Among these, we introduced an innovative training method, Reinforcement Fine tuning with Prior Sampling, which not only enhances model performance but also improves its generalized reasoning capabilities in OOD and cross task scenarios.
220. 【2510.21806】Frame-Difference Guided Dynamic Region Perception for CLIP Adaptation in Text-Video Retrieval
链接:https://arxiv.org/abs/2510.21806
作者:Jiaao Yu,Mingjie Han,Tao Gong,Jian Zhang,Man Lan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:numerous application scenarios, text-video retrieval technology, recommendation and search, rapid growth, increasingly important
备注: 5 pages
点击查看摘要
Abstract:With the rapid growth of video data, text-video retrieval technology has become increasingly important in numerous application scenarios such as recommendation and search. Early text-video retrieval methods suffer from two critical drawbacks: first, they heavily rely on large-scale annotated video-text pairs, leading to high data acquisition costs; second, there is a significant modal gap between video and text features, which limits cross-modal alignment accuracy. With the development of vision-language model, adapting CLIP to video tasks has attracted great attention. However, existing adaptation methods generally lack enhancement for dynamic video features and fail to effectively suppress static redundant features. To address this issue, this paper proposes FDA-CLIP (Frame Difference Alpha-CLIP), which is a concise CLIP-based training framework for text-video alignment. Specifically, the method uses frame differences to generate dynamic region masks, which are input into Alpha-CLIP as an additional Alpha channel. This proactively guides the model to focus on semantically critical dynamic regions while suppressing static background redundancy. Experiments demonstrate that frame difference-guided video semantic encoding can effectively balance retrieval efficiency and accuracy.
221. 【2510.21802】It Takes Two to Tango: Two Parallel Samplers Improve Quality in Diffusion Models for Limited Steps
链接:https://arxiv.org/abs/2510.21802
作者:Pedro Cisneros-Velarde
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
关键词:limited number, denoising steps, make denoising steps, samplers, Abstract
备注:
点击查看摘要
Abstract:We consider the situation where we have a limited number of denoising steps, i.e., of evaluations of a diffusion model. We show that two parallel processors or samplers under such limitation can improve the quality of the sampled image. Particularly, the two samplers make denoising steps at successive times, and their information is appropriately integrated in the latent image. Remarkably, our method is simple both conceptually and to implement: it is plug--play, model agnostic, and does not require any additional fine-tuning or external models. We test our method with both automated and human evaluations for different diffusion models. We also show that a naive integration of the information from the two samplers lowers sample quality. Finally, we find that adding more parallel samplers does not necessarily improve sample quality.
222. 【2510.21801】Morphology-Aware KOA Classification: Integrating Graph Priors with Vision Models
链接:https://arxiv.org/abs/2510.21801
作者:Marouane Tliba,Mohamed Amine Kerkouri,Yassine Nasser,Nour Aburaed,Aladine Chetouani,Ulas Bagci,Rachid Jennane
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:radiographs remains challenging, remains challenging due, standard deep learning, deep learning models, learning models struggle
备注: Submitted to ICASSP 2026
点击查看摘要
Abstract:Knee osteoarthritis (KOA) diagnosis from radiographs remains challenging due to the subtle morphological details that standard deep learning models struggle to capture effectively. We propose a novel multimodal framework that combines anatomical structure with radiographic features by integrating a morphological graph representation - derived from Segment Anything Model (SAM) segmentations - with a vision encoder. Our approach enforces alignment between geometry-informed graph embeddings and radiographic features through mutual information maximization, significantly improving KOA classification accuracy. By constructing graphs from anatomical features, we introduce explicit morphological priors that mirror clinical assessment criteria, enriching the feature space and enhancing the model's inductive bias. Experiments on the Osteoarthritis Initiative dataset demonstrate that our approach surpasses single-modality baselines by up to 10\% in accuracy (reaching nearly 80\%), while outperforming existing state-of-the-art methods by 8\% in accuracy and 11\% in F1 score. These results underscore the critical importance of incorporating anatomical structure into radiographic analysis for accurate KOA severity grading.
223. 【2510.21798】AI-Boosted Video Annotation: Assessing the Process Enhancement
链接:https://arxiv.org/abs/2510.21798
作者:Juan Gutiérrez,Ángel Mora,Pablo Regodón,Silvia Rodriguez,José Luis Blanco
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
关键词:integrating automatic capabilities, assess their performance, video annotation, explore the enhancement, capabilities to ease
备注:
点击查看摘要
Abstract:We explore the enhancement of Human-in-the-Loop video annotation by integrating automatic capabilities to ease the task for annotators and assess their performance. The research delves into the practical implications of the annotation processes, the integration of AI components, and the evaluation of its outcomes. We analyze their impact on efficiency, accuracy, and overall annotation quality. Focusing on the Human-in-the-Loop for video annotation tasks, we implemented a single-iteration scheme using Label Studio and AI-powered zero-shot pre-annotations. Using this framework, we designed a test based on the annotation of the UCF-Crime dataset to discriminate between normal and abnormal activities in video footage. Our results evidence how automatic AI-based pre-annotation can streamline the video annotation workflow, empowering human annotators and optimizing the overall pipeline. Using the pre-annotated data, we observed a 35% reduction in the annotation time for 70% of the annotators with similar quality annotations, compared to the traditional manual annotation task. Results are consistent with asset duration and complexity. We also observed that while annotators rapidly learned to use the tool, the produced annotations are more coherent among annotators and better match the natural clustering of the video frames.
224. 【2510.21795】Xihe: Scalable Zero-Shot Time Series Learner Via Hierarchical Interleaved Block Attention
链接:https://arxiv.org/abs/2510.21795
作者:Yinbo Sun,Yuchen Fang,Zhibo Zhu,Jia Li,Yu Liu,Qiwen Deng,Jun Zhou,Hang Yu,Xingyu Lu,Lintao Ma
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:time series foundation, series foundation models, time series, time series data, rapid advancement
备注:
点击查看摘要
Abstract:The rapid advancement of time series foundation models (TSFMs) has been propelled by migrating architectures from language models. While existing TSFMs demonstrate impressive performance, their direct adoption of cross-domain architectures constrains effective capture of multiscale temporal dependencies inherent to time series data. This limitation becomes particularly pronounced during zero-shot transfer across datasets with divergent underlying patterns and sampling strategies. To address these challenges, we propose Hierarchical Interleaved Block Attention (HIBA) which employs hierarchical inter- and intra-block sparse attention to effectively capture multi-scale dependencies. Intra-block attention facilitates local information exchange, and inter-block attention operates across blocks to capture global temporal pattern interaction and dynamic evolution. Leveraging the HIBA architecture, we introduce Xihe, a scalable TSFM family spanning from an ultra-efficient 9.5M parameter configuration to high-capacity 1.5B variant. Evaluated on the comprehensive GIFT-Eval benchmark, our most compact Xihe-tiny model (9.5M) surpasses the majority of contemporary TSFMs, demonstrating remarkable parameter efficiency. More impressively, Xihe-max (1.5B) establishes new state-of-the-art zero-shot performance, surpassing previous best results by a substantial margin. This consistent performance excellence across the entire parameter spectrum provides compelling evidence for the exceptional generalization capabilities and architectural superiority of HIBA.
225. 【2510.21794】oken-Level Inference-Time Alignment for Vision-Language Models
链接:https://arxiv.org/abs/2510.21794
作者:Kejia Chen,Jiawen Zhang,Jiacong Hu,Kewei Gao,Jian Lou,Zunlei Feng,Mingli Song
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:modern multimodal intelligence, outputs remain prone, hallucination-plausible text misaligned, multimodal intelligence, visual inputs
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) have become essential backbones of modern multimodal intelligence, yet their outputs remain prone to hallucination-plausible text misaligned with visual inputs. Existing alignment approaches often rely on expensive fine-tuning with annotated preference data or sequence-level inference strategies that provide only coarse, delayed feedback. To overcome these limitations, we present TITA (Token-level Inference-Time Alignment), a lightweight framework that freezes the base VLM and instead trains a reward model to approximate its distribution. During inference, implicit preference signals are extracted as log-probability ratios between the reward model and the target VLM, yielding dense autoregressive feedback. This formulation can be viewed as an inference-time variant of Direct Preference Optimization (DPO), providing token-level corrective signals without retraining the backbone. Extensive evaluations on LLaVA-1.5-7B and 13B show consistent gains across 12 benchmarks, with improvements of 8.6% on MMVet and 6.7% on POPE, indicating stronger general understanding and reduced hallucinations. Additional experiments on Qwen2.5-VL-7B and DeepSeek-VL2-27.5B show comparable gains, especially in hallucination reduction and VQA accuracy, while incurring negligible inference overhead.
226. 【2510.21793】2D_3D Feature Fusion via Cross-Modal Latent Synthesis and Attention Guided Restoration for Industrial Anomaly Detection
链接:https://arxiv.org/abs/2510.21793
作者:Usman Ali,Ali Zia,Abdul Rehman,Umer Ramzan,Zohaib Hassan,Talha Sattar,Jing Wang,Wei Xiang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
关键词:fusion remains challenging, robust cross-modal fusion, cross-modal fusion remains, Industrial anomaly detection, Attention-Driven Fusion Restoration
备注: Accepted at 26th International Conference on Digital Image Computing: Techniques and Applications (DICTA 2025)
点击查看摘要
Abstract:Industrial anomaly detection (IAD) increasingly benefits from integrating 2D and 3D data, but robust cross-modal fusion remains challenging. We propose a novel unsupervised framework, Multi-Modal Attention-Driven Fusion Restoration (MAFR), which synthesises a unified latent space from RGB images and point clouds using a shared fusion encoder, followed by attention-guided, modality-specific decoders. Anomalies are localised by measuring reconstruction errors between input features and their restored counterparts. Evaluations on the MVTec 3D-AD and Eyecandies benchmarks demonstrate that MAFR achieves state-of-the-art results, with a mean I-AUROC of 0.972 and 0.901, respectively. The framework also exhibits strong performance in few-shot learning settings, and ablation studies confirm the critical roles of the fusion architecture and composite loss. MAFR offers a principled approach for fusing visual and geometric information, advancing the robustness and accuracy of industrial anomaly detection. Code is available at this https URL
227. 【2510.21791】Exploring the design space of diffusion and flow models for data fusion
链接:https://arxiv.org/abs/2510.21791
作者:Niraj Chaudhari,Manmeet Singh,Naveen Sudharsan,Amit Kumar Srivastava,Harsh Kamath,Dushyant Mahajan,Ayan Paul
类目:Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det)
关键词:Operational Linescan System, Imaging Radiometer Suite, Program Operational Linescan, Visible Infrared Imaging, Infrared Imaging Radiometer
备注:
点击查看摘要
Abstract:Data fusion is an essential task in various domains, enabling the integration of multi-source information to enhance data quality and insights. One key application is in satellite remote sensing, where fusing multi-sensor observations can improve spatial and temporal resolution. In this study, we explore the design space of diffusion and flow models for data fusion, focusing on the integration of Defense Meteorological Satellite Program's Operational Linescan System (DMSP-OLS) and Visible Infrared Imaging Radiometer Suite (VIIRS) nighttime lights data. Our approach leverages a diverse set of 2D image-to-image generative models, including UNET, diffusion, and flow modeling architectures. We evaluate the effectiveness of these architectures in satellite remote sensing data fusion, identifying diffusion models based on UNet as particularly adept at preserving fine-grained spatial details and generating high-fidelity fused images. We also provide guidance on the selection of noise schedulers in diffusion-based models, highlighting the trade-offs between iterative solvers for faster inference and discrete schedulers for higher-quality reconstructions. Additionally, we explore quantization techniques to optimize memory efficiency and computational cost without compromising performance. Our findings offer practical insights into selecting the most effective diffusion and flow model architectures for data fusion tasks, particularly in remote sensing applications, and provide recommendations for leveraging noise scheduling strategies to enhance fusion quality.
228. 【2510.21787】Mismatch reconstruction theory for unknown measurement matrix in imaging through multimode fiber bending
链接:https://arxiv.org/abs/2510.21787
作者:Le Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
关键词:Multimode fiber imaging, imaging requires strict, requires strict matching, fiber imaging requires, Multimode fiber
备注:
点击查看摘要
Abstract:Multimode fiber imaging requires strict matching between measurement value and measurement matrix to achieve image reconstruction. However, in practical applications, the measurement matrix often cannot be obtained due to unknown system configuration or difficulty in real-time alignment after arbitrary fiber bending, resulting in the failure of traditional reconstruction algorithms. This paper presents a novel mismatch reconstruction theory for solving the problem of image reconstruction when measurement matrix is unknown. We first propose mismatch equation and design matched and calibration solution algorithms to construct a new measurement matrix. In addition, we also provide a detailed proof of these equations and algorithms in the appendix. The experimental results show that under low noise levels, constructed matrix can be used for matched pair in traditional reconstruction algorithms, and reconstruct the original image successfully. Then, we analyze the impact of noise, computational precision and orthogonality on reconstruction performance. The results show that proposed algorithms have a certain degree of robustness. Finally, we discuss the limitations and potential applications of this theory. The code is available: this https URL.
229. 【2510.21786】EventFormer: A Node-graph Hierarchical Attention Transformer for Action-centric Video Event Prediction
链接:https://arxiv.org/abs/2510.21786
作者:Qile Su,Shoutai Zhu,Shuai Zhang,Baoyu Liang,Chao Tong
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
关键词:achieving remarkable success, Script event induction, Video Event Prediction, Action-centric Video Event, Video
备注: 15 pages, 7 figures, 6 tables
点击查看摘要
Abstract:Script event induction, which aims to predict the subsequent event based on the context, is a challenging task in NLP, achieving remarkable success in practical applications. However, human events are mostly recorded and presented in the form of videos rather than scripts, yet there is a lack of related research in the realm of vision. To address this problem, we introduce AVEP (Action-centric Video Event Prediction), a task that distinguishes itself from existing video prediction tasks through its incorporation of more complex logic and richer semantic information. We present a large structured dataset, which consists of about $35K$ annotated videos and more than $178K$ video clips of event, built upon existing video event datasets to support this task. The dataset offers more fine-grained annotations, where the atomic unit is represented as a multimodal event argument node, providing better structured representations of video events. Due to the complexity of event structures, traditional visual models that take patches or frames as input are not well-suited for AVEP. We propose EventFormer, a node-graph hierarchical attention based video event prediction model, which can capture both the relationships between events and their arguments and the coreferencial relationships between arguments. We conducted experiments using several SOTA video prediction models as well as LVLMs on AVEP, demonstrating both the complexity of the task and the value of the dataset. Our approach outperforms all these video prediction models. We will release the dataset and code for replicating the experiments and annotations.
230. 【2510.21785】Multi-Agent Pose Uncertainty: A Differentiable Rendering Cramér-Rao Bound
链接:https://arxiv.org/abs/2510.21785
作者:Arun Muthukkumar
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:estimation is essential, Pose estimation, computer vision, Abstract, Pose
备注: 5 pages, 3 figures, 1 table. Presented at IEEE/CVF International Conference on Computer Vision (ICCV 2025) and IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)
点击查看摘要
Abstract:Pose estimation is essential for many applications within computer vision and robotics. Despite its uses, few works provide rigorous uncertainty quantification for poses under dense or learned models. We derive a closed-form lower bound on the covariance of camera pose estimates by treating a differentiable renderer as a measurement function. Linearizing image formation with respect to a small pose perturbation on the manifold yields a render-aware Cramér-Rao bound. Our approach reduces to classical bundle-adjustment uncertainty, ensuring continuity with vision theory. It also naturally extends to multi-agent settings by fusing Fisher information across cameras. Our statistical formulation has downstream applications for tasks such as cooperative perception and novel view synthesis without requiring explicit keypoint correspondences.
231. 【2510.21783】Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models
链接:https://arxiv.org/abs/2510.21783
作者:Guo Li,Yuyang Yu,Xuemiao Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
关键词:generating high-quality images, Diffusion models, demonstrated powerful performance, Diffusion, demonstrated powerful
备注:
点击查看摘要
Abstract:Diffusion models have demonstrated powerful performance in generating high-quality images. A typical example is text-to-image generator like Stable Diffusion. However, their widespread use also poses potential privacy risks. A key concern is membership inference attacks, which attempt to determine whether a particular data sample was used in the model training process. We propose an efficient membership inference attack method against diffusion models. This method is based on the injection of slight noise and the evaluation of the aggregation degree of the noise distribution. The intuition is that the noise prediction patterns of diffusion models for training set samples and non-training set samples exhibit distinguishable this http URL, we suppose that member images exhibit higher aggregation of predicted noise around a certain time step of the diffusion process. In contrast, the predicted noises of non-member images exhibit a more discrete characteristic around the certain time step. Compared with other existing methods, our proposed method requires fewer visits to the target diffusion model. We inject slight noise into the image under test and then determine its membership by analyzing the aggregation degree of the noise distribution predicted by the model. Empirical findings indicate that our method achieves superior performance across multiple datasets. At the same time, our method can also show better attack effects in ASR and AUC when facing large-scale text-to-image diffusion models, proving the scalability of our method.
232. 【2510.21782】Promptable Fire Segmentation: Unleashing SAM2's Potential for Real-Time Mobile Deployment with Strategic Bounding Box Guidance
链接:https://arxiv.org/abs/2510.21782
作者:Emmanuel U. Ugwu,Zhang Xinming
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:flames' irregular boundaries, highly variable intensities, computer vision due, irregular boundaries, variable intensities
备注: Accepted for presentation at the 9th International Conference on Image and Graphics Processing (ICIGP 2026) will be held in Wuhan, China during January 16-18, 2026 (publication forthcoming). 6 pages, 3 figures, 3 tables
点击查看摘要
Abstract:Fire segmentation remains a critical challenge in computer vision due to flames' irregular boundaries, translucent edges, and highly variable intensities. While the Segment Anything Models (SAM and SAM2) have demonstrated impressive cross-domain generalization capabilities, their effectiveness in fire segmentation -- particularly under mobile deployment constraints -- remains largely unexplored. This paper presents the first comprehensive evaluation of SAM2 variants for fire segmentation, focusing on bounding box prompting strategies to enhance deployment feasibility. We systematically evaluate four SAM2.1 variants (tiny, small, base_plus, large) alongside mobile-oriented variants (TinySAM, MobileSAM) across three fire datasets using multiple prompting strategies: automatic, single positive point (SP), single positive point + single negative point (SP+SN), multiple positive points (MP), bounding box (Box), and hybrid variants (Box+SP and Box+MP). Our experimental results demonstrate that bounding box prompts consistently outperform automatic and single point-based approaches, with Box+MP achieving the highest mean IoU (0.64) and Dice coefficient (0.75) on the Khan dataset. Lightweight variants such as TinySAM and MobileSAM further reduce memory and computational costs, making them more suitable for latency-tolerant edge scenarios. Overall, this work provides critical insights for deploying promptable segmentation models in fire monitoring systems and establishes benchmarks for future research in domain-specific SAM applications. Code is available at: this https URL
233. 【2510.21781】EdgeSync: Accelerating Edge-Model Updates for Data Drift through Adaptive Continuous Learning
链接:https://arxiv.org/abs/2510.21781
作者:Runchu Donga,Peng Zhao,Guiqin Wang,Nan Qi,Jie Lin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Real-time video analytics, analytics systems typically, systems typically deploy, typically deploy lightweight, video analytics systems
备注:
点击查看摘要
Abstract:Real-time video analytics systems typically deploy lightweight models on edge devices to reduce latency. However, the distribution of data features may change over time due to various factors such as changing lighting and weather conditions, leading to decreased model accuracy. Recent frameworks try to address this issue by leveraging remote servers to continuously train and adapt lightweight edge models using more complex models in the cloud. Despite these advancements, existing methods face two key challenges: first, the retraining process is compute-intensive, causing significant delays in model updates; second, the new model may not align well with the evolving data distribution of the current video stream. To address these challenges, we introduce EdgeSync, an efficient edge-model updating approach that enhances sample filtering by incorporating timeliness and inference results, thus ensuring training samples are more relevant to the current video content while reducing update delays. Additionally, EdgeSync features a dynamic training management module that optimizes the timing and sequencing of model updates to improve their timeliness. Evaluations on diverse and complex real-world datasets demonstrate that EdgeSync improves accuracy by approximately 3.4% compared to existing methods and by about 10% compared to traditional approaches.
234. 【2510.21780】Bridging Accuracy and Interpretability: Deep Learning with XAI for Breast Cancer Detection
链接:https://arxiv.org/abs/2510.21780
作者:Bishal Chhetri,B.V. Rathish Kumar
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:fine needle aspirate, digitized fine needle, quantitative features extracted, deep learning framework, interpretable deep learning
备注: 15 pages, 14 figures
点击查看摘要
Abstract:In this study, we present an interpretable deep learning framework for the early detection of breast cancer using quantitative features extracted from digitized fine needle aspirate (FNA) images of breast masses. Our deep neural network, using ReLU activations, the Adam optimizer, and a binary cross-entropy loss, delivers state-of-the-art classification performance, achieving an accuracy of 0.992, precision of 1.000, recall of 0.977, and an F1 score of 0.988. These results substantially exceed the benchmarks reported in the literature. We evaluated the model under identical protocols against a suite of well-established algorithms (logistic regression, decision trees, random forests, stochastic gradient descent, K-nearest neighbors, and XGBoost) and found the deep model consistently superior on the same metrics. Recognizing that high predictive accuracy alone is insufficient for clinical adoption due to the black-box nature of deep learning models, we incorporated model-agnostic Explainable AI techniques such as SHAP and LIME to produce feature-level attributions and human-readable visualizations. These explanations quantify the contribution of each feature to individual predictions, support error analysis, and increase clinician trust, thus bridging the gap between performance and interpretability for real-world clinical use. The concave points feature of the cell nuclei is found to be the most influential feature positively impacting the classification task. This insight can be very helpful in improving the diagnosis and treatment of breast cancer by highlighting the key characteristics of breast tumor.
235. 【2510.21778】Ageing Drift in Binary Face Templates: A Bits-per-Decade Analysis
链接:https://arxiv.org/abs/2510.21778
作者:Abdelilah Ganmati,Karim Afdel,Lahcen Koutti
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:compact binary face, quantify ageing drift, ageing drift directly, binary face templates, modern face CNN
备注: 9 pages, 3 figures, 2 tables
点击查看摘要
Abstract:We study the longitudinal stability of compact binary face templates and quantify ageing drift directly in bits per decade. Float embeddings from a modern face CNN are compressed with PCA-ITQ into 64- and 128-bit codes. For each identity in AgeDB with at least three distinct ages, we form all genuine pairs and fit a per-identity linear model of Hamming distance versus absolute age gap. Across 566 identities, the median slope is 1.357 bits per decade for 64-bit templates and 2.571 bits per decade for 128-bit templates, with tight non-parametric 95 percent bootstrap confidence intervals. The distributions are predominantly positive, indicating a small but systematic increase in intra-class distance over time. Because drift scales with code length, shorter codes are inherently more age-stable at a fixed decision threshold. We connect these slopes to operating characteristics by reporting EER and TPR at FAR = 1 percent in three age bins. We discuss implications for smart-card and match-on-card deployments, including simple mitigations such as periodic re-enrolment and targeted parity on empirically unstable bit positions. Code and CSV artifacts are provided to support reproducibility.
236. 【2510.21775】Face-MakeUpV2: Facial Consistency Learning for Controllable Text-to-Image Generation
链接:https://arxiv.org/abs/2510.21775
作者:Dawei Dai,Yinxiu Zhou,Chenghang Li,Guolai Jiang,Chengfang Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
关键词:local semantic instructions, facial image generation, insufficient physical consistency, image generation, semantic instructions
备注:
点击查看摘要
Abstract:In facial image generation, current text-to-image models often suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions. In this study, we propose Face-MakeUpV2, a facial image generation model that aims to maintain the consistency of face ID and physical characteristics with the reference image. First, we constructed a large-scale dataset FaceCaptionMask-1M comprising approximately one million image-text-masks pairs that provide precise spatial supervision for the local semantic instructions. Second, we employed a general text-to-image pretrained model as the backbone and introduced two complementary facial information injection channels: a 3D facial rendering channel to incorporate the physical characteristics of the image and a global facial feature channel. Third, we formulated two optimization objectives for the supervised learning of our model: semantic alignment in the model's embedding space to mitigate the attribute leakage problem and perceptual loss on facial images to preserve ID consistency. Extensive experiments demonstrated that our Face-MakeUpV2 achieves best overall performance in terms of preserving face ID and maintaining physical consistency of the reference images. These results highlight the practical potential of Face-MakeUpV2 for reliable and controllable facial editing in diverse applications.
237. 【2510.21774】OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment
链接:https://arxiv.org/abs/2510.21774
作者:Yulong Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:human-annotated dataset designed, comprehensive human-annotated dataset, developing OCR quality, OCR quality assessment, quality assessment methods
备注:
点击查看摘要
Abstract:We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PNG images at 300 DPI, sampled from diverse real-world scenarios, including academic papers, textbooks, e-books, and multilingual documents. Each document has been processed using state-of-the-art Vision-Language Models (VLMs) and manually annotated with quality scores using a 4-level scoring system (1: Excellent, 2: Good, 3: Fair, 4: Poor). The dataset includes detailed source information, annotation guidelines, and representative cases across various difficulty levels. OCR-Quality addresses the critical need for reliable OCR quality assessment in real-world applications and provides a valuable benchmark for training and evaluating OCR verification systems. The dataset is publicly available at this https URL .
238. 【2510.21769】H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows
链接:https://arxiv.org/abs/2510.21769
作者:Harry Zhang,Luca Carlone
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:surrounding environment, computer vision, specifically reasoning, critical challenge, challenge in computer
备注:
点击查看摘要
Abstract:Understanding how humans interact with the surrounding environment, and specifically reasoning about object interactions and affordances, is a critical challenge in computer vision, robotics, and AI. Current approaches often depend on labor-intensive, hand-labeled datasets capturing real-world or simulated human-object interaction (HOI) tasks, which are costly and time-consuming to produce. Furthermore, most existing methods for 3D affordance understanding are limited to contact-based analysis, neglecting other essential aspects of human-object interactions, such as orientation (\eg, humans might have a preferential orientation with respect certain objects, such as a TV) and spatial occupancy (\eg, humans are more likely to occupy certain regions around an object, like the front of a microwave rather than its back). To address these limitations, we introduce \emph{H2OFlow}, a novel framework that comprehensively learns 3D HOI affordances -- encompassing contact, orientation, and spatial occupancy -- using only synthetic data generated from 3D generative models. H2OFlow employs a dense 3D-flow-based representation, learned through a dense diffusion process operating on point clouds. This learned flow enables the discovery of rich 3D affordances without the need for human annotations. Through extensive quantitative and qualitative evaluations, we demonstrate that H2OFlow generalizes effectively to real-world objects and surpasses prior methods that rely on manual annotations or mesh-based representations in modeling 3D affordance.
239. 【2510.21763】Proportion and Perspective Control for Flow-Based Image Generation
链接:https://arxiv.org/abs/2510.21763
作者:Julien Boudier,Hugo Caselles-Dupré
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:diffusion models generate, generate high-fidelity images, offer limited control, models generate high-fidelity, generate high-fidelity
备注: Technical report after open-source release
点击查看摘要
Abstract:While modern text-to-image diffusion models generate high-fidelity images, they offer limited control over the spatial and geometric structure of the output. To address this, we introduce and evaluate two ControlNets specialized for artistic control: (1) a proportion ControlNet that uses bounding boxes to dictate the position and scale of objects, and (2) a perspective ControlNet that employs vanishing lines to control the 3D geometry of the scene. We support the training of these modules with data pipelines that leverage vision-language models for annotation and specialized algorithms for conditioning image synthesis. Our experiments demonstrate that both modules provide effective control but exhibit limitations with complex constraints. Both models are released on HuggingFace: this https URL
240. 【2510.21761】J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception
链接:https://arxiv.org/abs/2510.21761
作者:Jesse Atuhurra,Hidetaka Kamigaito,Taro Watanabe,Koichiro Yoshino
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Japanese human-robot dialogue, human-robot dialogue scenarios, Japanese human-robot, Vision Language Models, introduce J-ORA
备注: Accepted to IROS2025
点击查看摘要
Abstract:We introduce J-ORA, a novel multimodal dataset that bridges the gap in robot perception by providing detailed object attribute annotations within Japanese human-robot dialogue scenarios. J-ORA is designed to support three critical perception tasks, object identification, reference resolution, and next-action prediction, by leveraging a comprehensive template of attributes (e.g., category, color, shape, size, material, and spatial relations). Extensive evaluations with both proprietary and open-source Vision Language Models (VLMs) reveal that incorporating detailed object attributes substantially improves multimodal perception performance compared to without object attributes. Despite the improvement, we find that there still exists a gap between proprietary and open-source VLMs. In addition, our analysis of object affordances demonstrates varying abilities in understanding object functionality and contextual relationships across different VLMs. These findings underscore the importance of rich, context-sensitive attribute annotations in advancing robot perception in dynamic environments. See project page at this https URL.
241. 【2510.21757】Agro-Consensus: Semantic Self-Consistency in Vision-Language Models for Crop Disease Management in Developing Countries
链接:https://arxiv.org/abs/2510.21757
作者:Mihir Gupta,Pratik Desai,Ross Greer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Nigeria faces significant, unreliable internet connectivity, expert plant pathologists, faces significant challenges, significant challenges due
备注:
点击查看摘要
Abstract:Agricultural disease management in developing countries such as India, Kenya, and Nigeria faces significant challenges due to limited access to expert plant pathologists, unreliable internet connectivity, and cost constraints that hinder the deployment of large-scale AI systems. This work introduces a cost-effective self-consistency framework to improve vision-language model (VLM) reliability for agricultural image captioning. The proposed method employs semantic clustering, using a lightweight (80MB) pre-trained embedding model to group multiple candidate responses. It then selects the most coherent caption -- containing a diagnosis, symptoms, analysis, treatment, and prevention recommendations -- through a cosine similarity-based consensus. A practical human-in-the-loop (HITL) component is incorporated, wherein user confirmation of the crop type filters erroneous generations, ensuring higher-quality input for the consensus mechanism. Applied to the publicly available PlantVillage dataset using a fine-tuned 3B-parameter PaliGemma model, our framework demonstrates improvements over standard decoding methods. Evaluated on 800 crop disease images with up to 21 generations per image, our single-cluster consensus method achieves a peak accuracy of 83.1% with 10 candidate generations, compared to the 77.5% baseline accuracy of greedy decoding. The framework's effectiveness is further demonstrated when considering multiple clusters; accuracy rises to 94.0% when a correct response is found within any of the top four candidate clusters, outperforming the 88.5% achieved by a top-4 selection from the baseline.
242. 【2510.21740】Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models
链接:https://arxiv.org/abs/2510.21740
作者:Alexa R. Tartaglini,Satchel Grant,Daniel Wurgaft,Christopher Potts,Judith E. Fan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:data visualization understanding, data visualization, Data, visualization understanding tasks, data points
备注:
点击查看摘要
Abstract:Data visualizations are vital components of many scientific articles and news stories. Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks, but the causes of failure remain unclear. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks, to precisely characterize potential sources of difficulty (e.g., extracting the position of data points, distances between them, and other summary statistics). We used FUGU to investigate three widely used VLMs. To diagnose the sources of errors produced by these models, we used activation patching and linear probes to trace information flow through models across a variety of prompting strategies. We found that some models fail to generate the coordinates of individual data points correctly, and these initial errors often lead to erroneous final responses. When these models are provided with the correct coordinates, performance improves substantially. Moreover, even when the model generates an incorrect response, the correct coordinates can be successfully read out from the latent representations in the vision encoder, suggesting that the source of these errors lies in the vision-language handoff. We further found that while providing correct coordinates helps with tasks involving one or a small number of data points, it generally worsens performance for tasks that require extracting statistical relationships across many data points. Fine-tuning models on FUGU also fails to yield ceiling performance. These findings point to architectural constraints in current VLMs that might pose significant challenges for reliable data visualization understanding.
243. 【2510.21732】A Robotic Stirring Method with Trajectory Optimization and Adaptive Speed Control for Accurate Pest Counting in Water Traps
链接:https://arxiv.org/abs/2510.21732
作者:Xumin Gao,Mark Stevens,Grzegorz Cielniak
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:pest counting, counting, pest, stirring, precision agriculture
备注: This paper has been submitted to ICRA 2026 and is currently under review
点击查看摘要
Abstract:Accurate monitoring of pest population dynamics is crucial for informed decision-making in precision agriculture. Currently, mainstream image-based pest counting methods primarily rely on image processing combined with machine learning or deep learning for pest counting. However, these methods have limitations and struggle to handle situations involving pest occlusion. To address this issue, this paper proposed a robotic stirring method with trajectory optimization and adaptive speed control for accurate pest counting in water traps. First, we developed an automated stirring system for pest counting in yellow water traps based on a robotic arm. Stirring alters the distribution of pests in the yellow water trap, making some of the occluded individuals visible for detection and counting. Then, we investigated the impact of different stirring trajectories on pest counting performance and selected the optimal trajectory for pest counting. Specifically, we designed six representative stirring trajectories, including circle, square, triangle, spiral, four small circles, and random lines, for the robotic arm to stir. And by comparing the overall average counting error and counting confidence of different stirring trajectories across various pest density scenarios, we determined the optimal trajectory. Finally, we proposed a counting confidence-driven closed-loop control system to achieve adaptive-speed stirring. It uses changes in pest counting confidence between consecutive frames as feedback to adjust the stirring speed. To the best of our knowledge, this is the first study dedicated to investigating the effects of different stirring trajectories on object counting in the dynamic liquid environment and to implement adaptive-speed stirring for this type of task. Experimental results show ...
244. 【2510.23561】Revising Second Order Terms in Deep Animation Video Coding
链接:https://arxiv.org/abs/2510.23561
作者:Konstantin Schmidt,Thomas Richter
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Order Motion Model, motion information derived, Order Motion, Motion Model, generative model
备注:
点击查看摘要
Abstract:First Order Motion Model is a generative model that animates human heads based on very little motion information derived from keypoints. It is a promising solution for video communication because first it operates at very low bitrate and second its computational complexity is moderate compared to other learning based video codecs. However, it has strong limitations by design. Since it generates facial animations by warping source-images, it fails to recreate videos with strong head movements. This works concentrates on one specific kind of head movements, namely head rotations. We show that replacing the Jacobian transformations in FOMM by a global rotation helps the system to perform better on items with head-rotations while saving 40% to 80% of bitrate on P-frames. Moreover, we apply state-of-the-art normalization techniques to the discriminator to stabilize the adversarial training which is essential for generating visually appealing videos. We evaluate the performance by the learned metics LPIPS and DISTS to show the success our optimizations.
245. 【2510.22990】USF-MAE: Ultrasound Self-Supervised Foundation Model with Masked Autoencoding
链接:https://arxiv.org/abs/2510.22990
作者:Youssef Megahed,Robin Ducharme,Mark Walker,Steven Hawken,Adrian D. C. Chan
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:diverse clinical domains, offering real-time, diagnostic modalities, radiation-free assessment, widely used diagnostic
备注:
点击查看摘要
Abstract:Ultrasound imaging is one of the most widely used diagnostic modalities, offering real-time, radiation-free assessment across diverse clinical domains. However, interpretation of ultrasound images remains challenging due to high noise levels, operator dependence, and limited field of view, resulting in substantial inter-observer variability. Current Deep Learning approaches are hindered by the scarcity of large labeled datasets and the domain gap between general and sonographic images, which limits the transferability of models pretrained on non-medical data. To address these challenges, we introduce the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), the first large-scale self-supervised MAE framework pretrained exclusively on ultrasound data. The model was pre-trained on 370,000 2D and 3D ultrasound images curated from 46 open-source datasets, collectively termed OpenUS-46, spanning over twenty anatomical regions. This curated dataset has been made publicly available to facilitate further research and reproducibility. Using a Vision Transformer encoder-decoder architecture, USF-MAE reconstructs masked image patches, enabling it to learn rich, modality-specific representations directly from unlabeled data. The pretrained encoder was fine-tuned on three public downstream classification benchmarks: BUS-BRA (breast cancer), MMOTU-2D (ovarian tumors), and GIST514-DB (gastrointestinal stromal tumors). Across all tasks, USF-MAE consistently outperformed conventional CNN and ViT baselines, achieving F1-scores of 81.6%, 79.6%, and 82.4%, respectively. Despite not using labels during pretraining, USF-MAE approached the performance of the supervised foundation model UltraSam on breast cancer classification and surpassed it on the other tasks, demonstrating strong cross-anatomical generalization.
246. 【2510.22772】Neural-HAR: A Dimension-Gated CNN Accelerator for Real-Time Radar Human Activity Recognition
链接:https://arxiv.org/abs/2510.22772
作者:Yizhuo Wu,Francesco Fioranelli,Chang Gao
类目:ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
关键词:Radar-based human activity, RNN solutions remain, human activity recognition, exceed practical compute, Radar-based human
备注:
点击查看摘要
Abstract:Radar-based human activity recognition (HAR) is attractive for unobtrusive and privacy-preserving monitoring, yet many CNN/RNN solutions remain too heavy for edge deployment, and even lightweight ViT/SSM variants often exceed practical compute and memory budgets. We introduce Neural-HAR, a dimension-gated CNN accelerator tailored for real-time radar HAR on resource-constrained platforms. At its core is GateCNN, a parameter-efficient Doppler-temporal network that (i) embeds Doppler vectors to emphasize frequency evolution over time and (ii) applies dual-path gated convolutions that modulate Doppler-aware content features with temporal gates, complemented by a residual path for stable training. On the University of Glasgow UoG2020 continuous radar dataset, GateCNN attains 86.4% accuracy with only 2.7k parameters and 0.28M FLOPs per inference, comparable to CNN-BiGRU at a fraction of the complexity. Our FPGA prototype on Xilinx Zynq-7000 Z-7007S reaches 107.5 $\mu$s latency and 15 mW dynamic power using LUT-based ROM and distributed RAM only (zero DSP/BRAM), demonstrating real-time, energy-efficient edge inference. Code and HLS conversion scripts are available at this https URL.
247. 【2510.22760】Understanding What Is Not Said:Referring Remote Sensing Image Segmentation with Scarce Expressions
链接:https://arxiv.org/abs/2510.22760
作者:Kai Ye,Bowen Liu,Jianghang Lin,Jiayi Ji,Pingyang Dai,Liujuan Cao
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Sensing Image Segmentation, Remote Sensing Image, Referring Image Segmentation, Referring Remote Sensing, Remote Sensing
备注:
点击查看摘要
Abstract:Referring Remote Sensing Image Segmentation (RRSIS) aims to segment instances in remote sensing images according to referring expressions. Unlike Referring Image Segmentation on general images, acquiring high-quality referring expressions in the remote sensing domain is particularly challenging due to the prevalence of small, densely distributed objects and complex backgrounds. This paper introduces a new learning paradigm, Weakly Referring Expression Learning (WREL) for RRSIS, which leverages abundant class names as weakly referring expressions together with a small set of accurate ones to enable efficient training under limited annotation conditions. Furthermore, we provide a theoretical analysis showing that mixed-referring training yields a provable upper bound on the performance gap relative to training with fully annotated referring expressions, thereby establishing the validity of this new setting. We also propose LRB-WREL, which integrates a Learnable Reference Bank (LRB) to refine weakly referring expressions through sample-specific prompt embeddings that enrich coarse class-name inputs. Combined with a teacher-student optimization framework using dynamically scheduled EMA updates, LRB-WREL stabilizes training and enhances cross-modal generalization under noisy weakly referring supervision. Extensive experiments on our newly constructed benchmark with varying weakly referring data ratios validate both the theoretical insights and the practical effectiveness of WREL and LRB-WREL, demonstrating that they can approach or even surpass models trained with fully annotated referring expressions.
248. 【2510.22603】Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMS
链接:https://arxiv.org/abs/2510.22603
作者:Anand,Umberto Cappellazzo,Stavros Petridis,Maja Pantic
类目:Audio and Speech Processing (eess.AS); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
关键词:Large language models, recently advanced auditory, visual speech recognition, speech recognition, advanced auditory speech
备注: The code is available at [this https URL](https://github.com/umbertocappellazzo/Llama-AVSR)
点击查看摘要
Abstract:Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.
249. 【2510.22565】Learning Event-guided Exposure-agnostic Video Frame Interpolation via Adaptive Feature Blending
链接:https://arxiv.org/abs/2510.22565
作者:Junsik Jung,Yoonki Cho,Woo Jae Kim,Lin Wang,Sune-eui Yoon
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:dynamic exposure conditions, recover sharp, inputs captured, video frame interpolation, aims to recover
备注: Accepted for BMVC2025
点击查看摘要
Abstract:Exposure-agnostic video frame interpolation (VFI) is a challenging task that aims to recover sharp, high-frame-rate videos from blurry, low-frame-rate inputs captured under unknown and dynamic exposure conditions. Event cameras are sensors with high temporal resolution, making them especially advantageous for this task. However, existing event-guided methods struggle to produce satisfactory results on severely low-frame-rate blurry videos due to the lack of temporal constraints. In this paper, we introduce a novel event-guided framework for exposure-agnostic VFI, addressing this limitation through two key components: a Target-adaptive Event Sampling (TES) and a Target-adaptive Importance Mapping (TIM). Specifically, TES samples events around the target timestamp and the unknown exposure time to better align them with the corresponding blurry frames. TIM then generates an importance map that considers the temporal proximity and spatial relevance of consecutive features to the target. Guided by this map, our framework adaptively blends consecutive features, allowing temporally aligned features to serve as the primary cues while spatially relevant ones offer complementary support. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our approach in exposure-agnostic VFI scenarios.
250. 【2510.22379】raceTrans: Translation and Spatial Tracing for Surgical Prediction
链接:https://arxiv.org/abs/2510.22379
作者:Xiyu Luo,Haodong LI,Xinxing Cheng,He Zhao,Yang Hu,Xuan Song,Tianyang Zhang
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:modeling disease progression, achieved notable success, disease progression, achieved notable, notable success
备注:
点击查看摘要
Abstract:Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.
251. 【2510.22166】Expert Validation of Synthetic Cervical Spine Radiographs Generated with a Denoising Diffusion Probabilistic Model
链接:https://arxiv.org/abs/2510.22166
作者:Austin A. Barr,Brij S. Karmur,Anthony J. Winder,Eddie Guo,John T. Lysack,James N. Scott,William F. Morrish,Muneer Eesa,Morgan Willson,David W. Cadotte,Michael M.H. Yang,Ian Y.M. Chan,Sanju Lama,Garnette R. Sutherland
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Machine learning, high-quality imaging datasets, assembling large, high-quality imaging, learning in neurosurgery
备注: 10 pages, 4 figures, 1 table
点击查看摘要
Abstract:Machine learning in neurosurgery is limited by challenges in assembling large, high-quality imaging datasets. Synthetic data offers a scalable, privacy-preserving solution. We evaluated the feasibility of generating realistic lateral cervical spine radiographs using a denoising diffusion probabilistic model (DDPM) trained on 4,963 images from the Cervical Spine X-ray Atlas. Model performance was monitored via training/validation loss and Frechet inception distance, and synthetic image quality was assessed in a blinded "clinical Turing test" with six neuroradiologists and two spine-fellowship trained neurosurgeons. Experts reviewed 50 quartets containing one real and three synthetic images, identifying the real image and rating realism on a 4-point Likert scale. Experts correctly identified the real image in 29% of trials (Fleiss' kappa=0.061). Mean realism scores were comparable between real (3.323) and synthetic images (3.228, 3.258, and 3.320; p=0.383, 0.471, 1.000). Nearest-neighbor analysis found no evidence of memorization. We also provide a dataset of 20,063 synthetic radiographs. These results demonstrate that DDPM-generated cervical spine X-rays are statistically indistinguishable in realism and quality from real clinical images, offering a novel approach to creating large-scale neuroimaging datasets for ML applications in landmarking, segmentation, and classification.
252. 【2510.22154】Frequency-Spatial Interaction Driven Network for Low-Light Image Enhancement
链接:https://arxiv.org/abs/2510.22154
作者:Yunhong Tao,Wenbing Tao,Xiang Xiang
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Signal Processing (eess.SP)
关键词:Low-light image enhancement, LLIE, aims at improving, poor illumination, improving the perception
备注:
点击查看摘要
Abstract:Low-light image enhancement (LLIE) aims at improving the perception or interpretability of an image captured in an environment with poor illumination. With the advent of deep learning, the LLIE technique has achieved significant breakthroughs. However, existing LLIE methods either ignore the important role of frequency domain information or fail to effectively promote the propagation and flow of information, limiting the LLIE performance. In this paper, we develop a novel frequency-spatial interaction-driven network (FSIDNet) for LLIE based on two-stage architecture. To be specific, the first stage is designed to restore the amplitude of low-light images to improve the lightness, and the second stage devotes to restore phase information to refine fine-grained structures. Considering that Frequency domain and spatial domain information are complementary and both favorable for LLIE, we further develop two frequency-spatial interaction blocks which mutually amalgamate the complementary spatial and frequency information to enhance the capability of the model. In addition, we construct the Information Exchange Module (IEM) to associate two stages by adequately incorporating cross-stage and cross-scale features to effectively promote the propagation and flow of information in the two-stage network structure. Finally, we conduct experiments on several widely used benchmark datasets (i.e., LOL-Real, LSRW-Huawei, etc.), which demonstrate that our method achieves the excellent performance in terms of visual results and quantitative metrics while preserving good model efficiency.
253. 【2510.21815】HDR Image Reconstruction using an Unsupervised Fusion Model
链接:https://arxiv.org/abs/2510.21815
作者:Kumbha Nagaswetha
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:High Dynamic Range, brightness levels present, conventional digital cameras, limited dynamic range, High Dynamic
备注:
点击查看摘要
Abstract:High Dynamic Range (HDR) imaging aims to reproduce the wide range of brightness levels present in natural scenes, which the human visual system can perceive but conventional digital cameras often fail to capture due to their limited dynamic range. To address this limitation, we propose a deep learning-based multi-exposure fusion approach for HDR image generation. The method takes a set of differently exposed Low Dynamic Range (LDR) images, typically an underexposed and an overexposed image, and learns to fuse their complementary information using a convolutional neural network (CNN). The underexposed image preserves details in bright regions, while the overexposed image retains information in dark regions; the network effectively combines these to reconstruct a high-quality HDR output. The model is trained in an unsupervised manner, without relying on ground-truth HDR images, making it practical for real-world applications where such data is unavailable. We evaluate our results using the Multi-Exposure Fusion Structural Similarity Index Measure (MEF-SSIM) and demonstrate that our approach achieves superior visual quality compared to existing fusion methods. A customized loss function is further introduced to improve reconstruction fidelity and optimize model performance.




