本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新549篇论文,其中:
- 自然语言处理88篇
- 信息检索10篇
- 计算机视觉99篇
自然语言处理
1. 【2510.20812】Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
链接:https://arxiv.org/abs/2510.20812
作者:Yuhan Liu,Lianhui Qin,Shengjie Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:fine-grained graphical elements, achieved remarkable progress, densely interleave textual, interleave textual annotations, multimodal understanding
备注:
点击查看摘要
Abstract:Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at this https URL
2. 【2510.20810】On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?
链接:https://arxiv.org/abs/2510.20810
作者:Mingmeng Geng,Thierry Poibeau
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
关键词:large language models, language models, large language, researchers have turned, turned their attention
备注:
点击查看摘要
Abstract:With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.
3. 【2510.20809】Real Deep Research for AI, Robotics and Beyond
链接:https://arxiv.org/abs/2510.20809
作者:Xueyan Zou,Jianglong Ye,Hao Zhang,Xiaoyu Xiang,Mingyu Ding,Zhaojing Yang,Yong Jae Lee,Zhuowen Tu,Sifei Liu,Xiaolong Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:rapid growth, increasingly difficult, Real Deep Research, Fast evolving trends, difficult for researchers
备注: website: [this https URL](https://realdeepresearch.github.io)
点击查看摘要
Abstract:With the rapid growth of research in AI and robotics now producing over 10,000 papers annually it has become increasingly difficult for researchers to stay up to date. Fast evolving trends, the rise of interdisciplinary work, and the need to explore domains beyond one's expertise all contribute to this challenge. To address these issues, we propose a generalizable pipeline capable of systematically analyzing any research area: identifying emerging trends, uncovering cross domain opportunities, and offering concrete starting points for new inquiry. In this work, we present Real Deep Research (RDR) a comprehensive framework applied to the domains of AI and robotics, with a particular focus on foundation models and robotics advancements. We also briefly extend our analysis to other areas of science. The main paper details the construction of the RDR pipeline, while the appendix provides extensive results across each analyzed topic. We hope this work sheds light for researchers working in the field of AI and beyond.
4. 【2510.20800】Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples
链接:https://arxiv.org/abs/2510.20800
作者:Shiva Sreeram,Alaa Maalouf,Pratyusha Sharma,Daniela Rus
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:pruning high-order components, chosen LLM weight, LLM weight matrices, boost downstream accuracy, method called
备注:
点击查看摘要
Abstract:Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM's weight matrices can boost downstream accuracy -- without any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) Only a small, carefully chosen subset of matrices needs to be inspected -- eliminating the layer-by-layer sweep, (ii) The gradient of each matrix's singular values pinpoints which matrices merit reduction, (iii) Increasing the factorization search space by allowing matrices rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) we discover that evaluating on just 100 samples rather than the full training data -- both for computing the indicative gradients and for measuring the final accuracy -- suffices to further reduce the search time; we explain that as adaptation to downstream tasks is dominated by prompting style, not dataset size. As a result, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets -- entirely without fine-tuning.
5. 【2510.20797】Simple Context Compression: Mean-Pooling and Multi-Ratio Training
链接:https://arxiv.org/abs/2510.20797
作者:Yair Feldman,Yoav Artzi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:shorter continuous representation, soft context compression, large language models, multiple compression ratios, compression ratios
备注: Code available at [this https URL](https://github.com/lil-lab/simple-context-compression)
点击查看摘要
Abstract:A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
6. 【2510.20792】BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
链接:https://arxiv.org/abs/2510.20792
作者:Liang Ye,Shengqin Chen,Jiazhu Dai
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
关键词:text-guided graph generation, graph generation, text-guided graph, rapid progress, graph
备注:
点击查看摘要
Abstract:The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method targeting latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: less than 10% poisoning rate can achieves 50% attack success rate, while 24% suffices for over 80% success rate, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities in latent diffusion models of text-guided graph generation, highlight the serious risks in models' applications such as drug discovery and underscore the need for robust defenses against the backdoor attack in such diffusion models.
7. 【2510.20787】Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
链接:https://arxiv.org/abs/2510.20787
作者:Mutian He,Philip N. Garner
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:entire input sequence, fixed-size recurrent state, recurrent state offer, finite memory induces, memory induces forgetfulness
备注: 19 pages, 5 figures
点击查看摘要
Abstract:Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
8. 【2510.20782】A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text
链接:https://arxiv.org/abs/2510.20782
作者:Alicia Sagae,Chia-Jung Lee,Sandeep Avula,Brandon Dang,Vanessa Murdock
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, evaluating large language, Current methods, language models, typically focus
备注: 24 pages with 3 figures, to appear in Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)
点击查看摘要
Abstract:Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.
9. 【2510.20780】Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost
链接:https://arxiv.org/abs/2510.20780
作者:Runzhe Zhan,Zhihong Huang,Xinyi Yang,Lidia S. Chao,Min Yang,Derek F. Wong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large reasoning models, generating final answers, complex downstream tasks, Recent advancements, reasoning models
备注: NeurIPS 2025
点击查看摘要
Abstract:Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provides the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing LRMs require tailored evaluation materials, tend to "overthink" simpler instances and have issues with scoring mechanisms leading to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach largely reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.
10. 【2510.20743】Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations
链接:https://arxiv.org/abs/2510.20743
作者:Lorenzo Stacchio,Andrea Ubaldi,Alessandro Galdelli,Maurizio Mauri,Emanuele Frontoni,Andrea Gaggioli
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Model, enriches Large Language, Language Model, Large Language, enriches Large
备注:
点击查看摘要
Abstract:We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users' emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational and smoothness alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users' emotional signals are critical yet often opaque in verbal exchanges.
11. 【2510.20727】Automated Extraction of Fluoropyrimidine Treatment and Treatment-Related Toxicities from Clinical Notes Using Natural Language Processing
链接:https://arxiv.org/abs/2510.20727
作者:Xizhi Wu,Madeline S. Kreider,Philip E. Empey,Chenyu Li,Yanshan Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Toggle, Code Toggle Papers, Toggle Hugging Face, treatment, Bibliographic Explorer Toggle
备注:
点击查看摘要
Abstract:Objective: Fluoropyrimidines are widely prescribed for colorectal and breast cancers, but are associated with toxicities such as hand-foot syndrome and cardiotoxicity. Since toxicity documentation is often embedded in clinical notes, we aimed to develop and evaluate natural language processing (NLP) methods to extract treatment and toxicity information. Materials and Methods: We constructed a gold-standard dataset of 236 clinical notes from 204,165 adult oncology patients. Domain experts annotated categories related to treatment regimens and toxicities. We developed rule-based, machine learning-based (Random Forest, Support Vector Machine [SVM], Logistic Regression [LR]), deep learning-based (BERT, ClinicalBERT), and large language models (LLM)-based NLP approaches (zero-shot and error-analysis prompting). Models used an 80:20 train-test split. Results: Sufficient data existed to train and evaluate 5 annotated categories. Error-analysis prompting achieved optimal precision, recall, and F1 scores (F1=1.000) for treatment and toxicities extraction, whereas zero-shot prompting reached F1=1.000 for treatment and F1=0.876 for toxicities this http URL and SVM ranked second for toxicities (F1=0.937). Deep learning underperformed, with BERT (F1=0.873 treatment; F1= 0.839 toxicities) and ClinicalBERT (F1=0.873 treatment; F1 = 0.886 toxicities). Rule-based methods served as our baseline with F1 scores of 0.857 in treatment and 0.858 in toxicities. Discussion: LMM-based approaches outperformed all others, followed by machine learning methods. Machine and deep learning approaches were limited by small training data and showed limited generalizability, particularly for rare categories. Conclusion: LLM-based NLP most effectively extracted fluoropyrimidine treatment and toxicity information from clinical notes, and has strong potential to support oncology research and pharmacovigilance.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2510.20727 [cs.CL]
(or
arXiv:2510.20727v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2510.20727
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Yanshan Wang [view email] [v1]
Thu, 23 Oct 2025 16:44:39 UTC (897 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled Automated Extraction of Fluoropyrimidine Treatment and Treatment-Related Toxicities from Clinical Notes Using Natural Language Processing, by Xizhi Wu and 4 other authorsView PDF
view license
Current browse context: cs.CL
prev
|
next
new
|
recent
| 2025-10
Change to browse by:
cs
cs.AI
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
12. 【2510.20721】User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios
链接:https://arxiv.org/abs/2510.20721
作者:Xiaoyuan Wu,Roshni Kaushik,Wenkai Li,Lujo Bauer,Koichi Onoue
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
关键词:Large language models, Large language, answering health questions, LLMs, language models
备注:
点击查看摘要
Abstract:Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs' ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users' perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM response. Further, we found high agreement among five proxy LLMs, while each individual LLM had low correlation with users' evaluations. These results indicate that the privacy and helpfulness of LLM responses are often specific to individuals, and proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need to conduct user-centered studies on measuring LLMs' ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users' perceived privacy and utility.
13. 【2510.20700】Structure-Conditional Minimum Bayes Risk Decoding
链接:https://arxiv.org/abs/2510.20700
作者:Bryan Eikema,Anna Rutkiewicz,Mario Giulianelli
类目:Computation and Language (cs.CL)
关键词:Minimum Bayes Risk, Minimum Bayes, Bayes Risk, traditional generation strategies, renewed interest
备注: EMNLP 2025 Camera-Ready
点击查看摘要
Abstract:Minimum Bayes Risk (MBR) decoding has seen renewed interest as an alternative to traditional generation strategies. While MBR has proven effective in machine translation, where the variability of a language model's outcome space is naturally constrained, it may face challenges in more open-ended tasks such as dialogue or instruction-following. We hypothesise that in such settings, applying MBR with standard similarity-based utility functions may result in selecting responses that are broadly representative of the model's distribution, yet sub-optimal with respect to any particular grouping of generations that share an underlying latent structure. In this work, we introduce three lightweight adaptations to the utility function, designed to make MBR more sensitive to structural variability in the outcome space. To test our hypothesis, we curate a dataset capturing three representative types of latent structure: dialogue act, emotion, and response structure (e.g., a sentence, a paragraph, or a list). We further propose two metrics to evaluate the structural optimality of MBR. Our analysis demonstrates that common similarity-based utility functions fall short by these metrics. In contrast, our proposed adaptations considerably improve structural optimality. Finally, we evaluate our approaches on real-world instruction-following benchmarks, AlpacaEval and MT-Bench, and show that increased structural sensitivity improves generation quality by up to 13.7 percentage points in win rate.
14. 【2510.20690】Neural Diversity Regularizes Hallucinations in Small Models
链接:https://arxiv.org/abs/2510.20690
作者:Kushal Chakrabarti,Nirmal Balachundhar
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Language models continue, continue to hallucinate, Language models, neural diversity, models continue
备注:
点击查看摘要
Abstract:Language models continue to hallucinate despite increases in parameters, compute, and data. We propose neural diversity -- decorrelated parallel representations -- as a principled mechanism that reduces hallucination rates at fixed parameter and data budgets. Inspired by portfolio theory, where uncorrelated assets reduce risk by $\sqrt{P}$, we prove hallucination probability is bounded by representational correlation: $P(H) \leq f(\sigma^2((1-\rho(P))/P + \rho(P)), \mu^2)$, which predicts that language models need an optimal amount of neurodiversity. To validate this, we introduce ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA adapters with Barlow Twins regularization, and demonstrate that ND-LoRA reduces hallucinations by up to 25.6% (and 14.6% on average) without degrading general accuracy. Ablations show LoRA adapters and regularization act synergistically, causal interventions prove neurodiversity as the mediating factor and correlational analyses indicate scale: a 0.1% neural correlation increase is associated with a 3.8% hallucination increase. Finally, task-dependent optimality emerges: different tasks require different amounts of optimal neurodiversity. Together, our results highlight neural diversity as a third axis of scaling -- orthogonal to parameters and data -- to improve the reliability of language models at fixed budgets.
15. 【2510.20674】Analyticup E-commerce Product Search Competition Technical Report from Team Tredence_AICOE
链接:https://arxiv.org/abs/2510.20674
作者:Rakshith R,Shubham Sharma,Mohammed Sameer Khan,Ankush Chopra
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:e-commerce search system, search system developed, AICOE team, multilingual e-commerce search, multilingual search query
备注:
点击查看摘要
Abstract:This study presents the multilingual e-commerce search system developed by the Tredence_AICOE team. The competition features two multilingual relevance tasks: Query-Category (QC) Relevance, which evaluates how well a user's search query aligns with a product category, and Query-Item (QI) Relevance, which measures the match between a multilingual search query and an individual product listing. To ensure full language coverage, we performed data augmentation by translating existing datasets into languages missing from the development set, enabling training across all target languages. We fine-tuned Gemma-3 12B and Qwen-2.5 14B model for both tasks using multiple strategies. The Gemma-3 12B (4-bit) model achieved the best QC performance using original and translated data, and the best QI performance using original, translated, and minority class data creation. These approaches secured 4th place on the final leaderboard, with an average F1-score of 0.8857 on the private test set.
16. 【2510.20670】\textsc{CantoNLU}: A benchmark for Cantonese natural language understanding
链接:https://arxiv.org/abs/2510.20670
作者:Junghyun Min,York Hay Ng,Sophia Chan,Helena Shunhua Zhao,En-Shiun Annie Lee
类目:Computation and Language (cs.CL)
关键词:remains under-resourced due, Cantonese, spoken by millions, policy and diglossia, under-resourced due
备注: 13 pages, 1 figure
点击查看摘要
Abstract:Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address this scarcity of evaluation frameworks for Cantonese, we introduce \textsc{\textbf{CantoNLU}}, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics, including word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we provide model baseline performance across a set of models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continual pre-training a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual models perform better on syntactic tasks. Mandarin models remain competitive in certain settings, indicating that direct transfer may be sufficient when Cantonese domain data is scarce. We release all datasets, code, and model weights to facilitate future research in Cantonese NLP.
17. 【2510.20647】he Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI
链接:https://arxiv.org/abs/2510.20647
作者:Alan Saji,Raj Dabre,Anoop Kunchukuttan,Ratish Puduppully
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Reasoning Models, abilities remain underexplored, multilingual reasoning abilities, reasoning abilities remain, achieve strong performance
备注: 14 pages, 13 figures, 5 tables
点击查看摘要
Abstract:Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM's reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these cognitive behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap increasing as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode - getting "Lost in Translation," where translation steps lead to errors that would have been avoided by question's language reasoning.
18. 【2510.20635】Why Did Apple Fall To The Ground: Evaluating Curiosity In Large Language Model
链接:https://arxiv.org/abs/2510.20635
作者:Haoyu Wang,Sihang Jiang,Yuyan Chen,Yitong Wang,Yanghua Xiao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:pivotal conduit, discover and learn, Curiosity, Thrill Seeking, Information Seeking
备注:
点击查看摘要
Abstract:Curiosity serves as a pivotal conduit for human beings to discover and learn new knowledge. Recent advancements of large language models (LLMs) in natural language processing have sparked discussions regarding whether these models possess capability of curiosity-driven learning akin to humans. In this paper, starting from the human curiosity assessment questionnaire Five-Dimensional Curiosity scale Revised (5DCR), we design a comprehensive evaluation framework that covers dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity to assess the extent of curiosity exhibited by LLMs. The results demonstrate that LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. We further investigated the relationship between curiosity and thinking of LLMs, confirming that curious behaviors can enhance the model's reasoning and active learning abilities. These findings suggest that LLMs have the potential to exhibit curiosity similar to that of humans, providing experimental support for the future development of learning capabilities and innovative research in LLMs.
19. 【2510.20610】BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection
链接:https://arxiv.org/abs/2510.20610
作者:Ali Zain,Sareem Farooqui,Muhammad Rafi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:GenEval Shared Task, GenEval Shared, Shared Task, paper details, details our submission
备注:
点击查看摘要
Abstract:This paper details our submission to the Ara- GenEval Shared Task on Arabic AI-generated text detection, where our team, BUSTED, se- cured 5th place. We investigated the effec- tiveness of three pre-trained transformer mod- els: AraELECTRA, CAMeLBERT, and XLM- RoBERTa. Our approach involved fine-tuning each model on the provided dataset for a binary classification task. Our findings revealed a sur- prising result: the multilingual XLM-RoBERTa model achieved the highest performance with an F1 score of 0.7701, outperforming the spe- cialized Arabic models. This work underscores the complexities of AI-generated text detection and highlights the strong generalization capa- bilities of multilingual models.
20. 【2510.20603】What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation
链接:https://arxiv.org/abs/2510.20603
作者:Heejin Do,Jaehui Hwang,Dongyoon Han,Seong Joon Oh,Sangdoo Yun
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Evaluating large language, Evaluating large, large language models, dominant paradigm, large language
备注:
点击查看摘要
Abstract:Evaluating large language models (LLMs) on final-answer correctness is the dominant paradigm. This approach, however, provides a coarse signal for model improvement and overlooks the quality of the underlying reasoning process. We argue that a more granular evaluation of reasoning offers a more effective path to building robust models. We decompose reasoning quality into two dimensions: relevance and coherence. Relevance measures if a step is grounded in the problem; coherence measures if it follows logically from prior steps. To measure these aspects reliably, we introduce causal stepwise evaluation (CaSE). This method assesses each reasoning step using only its preceding context, which avoids hindsight bias. We validate CaSE against human judgments on our new expert-annotated benchmarks, MRa-GSM8K and MRa-MATH. More importantly, we show that curating training data with CaSE-evaluated relevance and coherence directly improves final task performance. Our work provides a scalable framework for analyzing, debugging, and improving LLM reasoning, demonstrating the practical value of moving beyond validity checks.
21. 【2510.20584】Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks
链接:https://arxiv.org/abs/2510.20584
作者:Jiangang Hao,Wenju Cui,Patrick Kyllonen,Emily Kerzabi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:labor intensive task, Assessing communication, scale depends, labor intensive, Assessing
备注: 38 pages, 4 figures
点击查看摘要
Abstract:Assessing communication and collaboration at scale depends on a labor intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology exhibits bias against different demographic groups, such as gender and race, remains unclear. To fill this gap, this paper investigates ChatGPT-based automated coding of communication data using a typical coding framework for collaborative problem solving, examining differences across gender and racial groups. The analysis draws on data from three types of collaborative tasks: negotiation, problem solving, and decision making. Our results show that ChatGPT-based coding exhibits no significant bias across gender and racial groups, paving the road for its adoption in large-scale assessment of collaboration and communication.
22. 【2510.20567】Beyond Retrieval-Ranking: A Multi-Agent Cognitive Decision Framework for E-Commerce Search
链接:https://arxiv.org/abs/2510.20567
作者:Zhouwei Zhai,Mengxiang Chen,Haoyun Xia,Jin Li,Renquan Zhou,Min Yang
类目:Computation and Language (cs.CL)
关键词:query-item matching fundamentally, matching fundamentally misaligns, long dominated e-commerce, multi-stage cognitive decision, cognitive decision processes
备注:
点击查看摘要
Abstract:The retrieval-ranking paradigm has long dominated e-commerce search, but its reliance on query-item matching fundamentally misaligns with multi-stage cognitive decision processes of platform users. This misalignment introduces critical limitations: semantic gaps in complex queries, high decision costs due to cross-platform information foraging, and the absence of professional shopping guidance. To address these issues, we propose a Multi-Agent Cognitive Decision Framework (MACDF), which shifts the paradigm from passive retrieval to proactive decision support. Extensive offline evaluations demonstrate MACDF's significant improvements in recommendation accuracy and user satisfaction, particularly for complex queries involving negation, multi-constraint, or reasoning demands. Online A/B testing on JD search platform confirms its practical efficacy. This work highlights the transformative potential of multi-agent cognitive systems in redefining e-commerce search.
23. 【2510.20548】GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning
链接:https://arxiv.org/abs/2510.20548
作者:Jinchang Luo,Mingquan Cheng,Fan Wan,Ni Li,Xiaoling Xia,Shuangshuang Tian,Tingcheng Bian,Haiwei Wang,Haohuan Fu,Yan Tao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:improving retrieval-augmented generation, recently shown promise, retrieval-augmented generation, recently shown, shown promise
备注: 8 pages, 3 figures, 4 tables
点击查看摘要
Abstract:Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains limited by two fundamental limitations: (i) global planning absence to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
24. 【2510.20543】he Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts
链接:https://arxiv.org/abs/2510.20543
作者:Sangmitra Madhusudan,Kaige Chen,Ali Emami
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:models correctly parse, correctly parse, dogs chasing cats, language models correctly, analyzing syntax
备注:
点击查看摘要
Abstract:When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models whose plausibility advantage systematically widens with complexity, humans shows variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
25. 【2510.20535】ARC-Encoder: learning compressed text representations for large language models
链接:https://arxiv.org/abs/2510.20535
作者:Hippolyte Pilchen,Edouard Grave,Patrick Pérez
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Recent techniques, retrieval-augmented generation, increased inference costs, Recent, costs
备注:
点击查看摘要
Abstract:Recent techniques such as retrieval-augmented generation or chain-of-thought reasoning have led to longer contexts and increased inference costs. Context compression techniques can reduce these costs, but the most effective approaches require fine-tuning the target model or even modifying its architecture. This can degrade its general abilities when not used for this specific purpose. Here we explore an alternative approach: an encoder that compresses the context into continuous representations which replace token embeddings in decoder LLMs. First, we perform a systematic study of training strategies and architecture choices for the encoder. Our findings led to the design of an Adaptable text Representations Compressor, named ARC-Encoder, which outputs $x$-times fewer continuous representations (typically $x\!\in\!\{4,8\}$) than text tokens. We evaluate ARC-Encoder across a variety of LLM usage scenarios, ranging from in-context learning to context window extension, on both instruct and base decoders. Results show that ARC-Encoder achieves state-of-the-art performance on several benchmarks while improving computational efficiency at inference. Finally, we demonstrate that our models can be adapted to multiple decoders simultaneously, allowing a single encoder to generalize across different decoder LLMs. This makes ARC-Encoder a flexible and efficient solution for portable encoders that work seamlessly with multiple LLMs. We release a training code at this https URL , fine-tuning dataset and pretrained models are available at this https URL .
26. 【2510.20513】Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment
链接:https://arxiv.org/abs/2510.20513
作者:Zhiyu Lin,Jingwen Yang,Jiale Zhao,Meng Liu,Sunzhu Li,Benyou Wang
类目:ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:models generate intelligible, reliable evaluation metric, generate intelligible speech, lack natural expressiveness, largely due
备注: Submitted to ICASSP 2026. Demos and codes are available at [this https URL](https://github.com/FreedomIntelligence/ExpressiveSpeech)
点击查看摘要
Abstract:Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressive score (from 2.0 to 23.4 on a 100-point scale) of S2S models. Demos and codes are available at this https URL
27. 【2510.20508】Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset
链接:https://arxiv.org/abs/2510.20508
作者:Paul Lerner,François Yvon
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, Large Language, English surveys, answers to English
备注:
点击查看摘要
Abstract:The political biases of Large Language Models (LLMs) are usually assessed by simulating their answers to English surveys. In this work, we propose an alternative framing of political biases, relying on principles of fairness in multilingual translation. We systematically compare the translation quality of speeches in the European Parliament (EP), observing systematic differences with majority parties from left, center, and right being better translated than outsider parties. This study is made possible by a new, 21-way multiparallel version of EuroParl, the parliamentary proceedings of the EP, which includes the political affiliations of each speaker. The dataset consists of 1.5M sentences for a total of 40M words and 249M characters. It covers three years, 1000+ speakers, 7 countries, 12 EU parties, 25 EU committees, and hundreds of national parties.
28. 【2510.20505】Hierarchical Sequence Iteration for Heterogeneous Question Answering
链接:https://arxiv.org/abs/2510.20505
作者:Ruiyi Yang,Hao Xue,Imran Razzak,Hakim Hacid,Flora D. Salim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Heterogeneous Question Answering, heterogeneous evidence sources, Retrieval-augmented generation, remains brittle, brittle on multi-step
备注: 22 pages, 3 figures
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) remains brittle on multi-step questions and heterogeneous evidence sources, trading accuracy against latency and token/tool budgets. This paper introducesHierarchical Sequence (HSEQ) Iteration for Heterogeneous Question Answering, a unified framework that (i) linearize documents, tables, and knowledge graphs into a reversible hierarchical sequence with lightweight structural tags, and (ii) perform structure-aware iteration to collect just-enough evidence before answer synthesis. A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes canonicalized evidence to genearte the final answer, with an optional refinement loop to resolve detected contradictions. Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency. Besides, HSEQ exhibits three key advantages: (1) a format-agnostic unification that enables a single policy to operate across text, tables, and KGs without per-dataset specialization; (2) guided, budget-aware iteration that reduces unnecessary hops, tool calls, and tokens while preserving accuracy; and (3) evidence canonicalization for reliable QA, improving answers consistency and auditability.
29. 【2510.20498】Robust Preference Alignment via Directional Neighborhood Consensus
链接:https://arxiv.org/abs/2510.20498
作者:Ruochen Mao,Yuling Shi,Xiaodong Gu,Jiaheng Wei
类目:Computation and Language (cs.CL)
关键词:Aligning large language, Aligning large, large language models, controllable AI systems, large language
备注: Under review at ICLR 2026. 10 pages, 5 figures. Code and data available at [this https URL](https://github.com/rcmao/robust-preference-alignment)
点击查看摘要
Abstract:Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short in specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not be generalized to the full spectrum of diverse preferences. This brittleness means that when a user's request reflects a nuanced preference deviating from the training data's central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method by leveraging directional neighborhood consensus. Instead of forcing a model to generate a response from a single, highly specific preference, RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. It then selects the response that best aligns with the user's original intent. We provide a theoretical framework showing our neighborhood generation strategy is provably superior to a strong baseline that also samples multiple candidates. Comprehensive experiments across three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS consistently improves robustness against this baseline, achieving win rates of up to 69% on challenging preferences from under-represented regions of the space without any model retraining. Our work presents a practical, theoretically-grounded solution for enhancing the reliability of preference-aligned models.
30. 【2510.20487】Steering Evaluation-Aware Language Models To Act Like They Are Deployed
链接:https://arxiv.org/abs/2510.20487
作者:Tim Tian Hua,Andrew Qin,Samuel Marks,Neel Nanda
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, Python type hints, evaluated and adjust, Large
备注:
点击查看摘要
Abstract:Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. However, this gap can only be observed by removing the evaluation cue. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
31. 【2510.20479】RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging
链接:https://arxiv.org/abs/2510.20479
作者:Bowen Wang,Haiyuan Wan,Liwen Shi,Chen Yang,Peng He,Yue Ma,Haochen Han,Wenhao Li,Tiao Tan,Yongjian Li,Fangming Liu,Yifan Gong,Sheng Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:representation-aware model merging, model merging framework, large language models, serve as reliable, historical data
备注:
点击查看摘要
Abstract:We unveil that internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge, and propose RECALL, a novel representation-aware model merging framework for continual learning without access to historical data. RECALL computes inter-model similarity from layer-wise hidden representations over clustered typical samples, and performs adaptive, hierarchical parameter fusion to align knowledge across models. This design enables the preservation of domain-general features in shallow layers while allowing task-specific adaptation in deeper layers. Unlike prior methods that require task labels or incur performance trade-offs, RECALL achieves seamless multi-domain integration and strong resistance to catastrophic forgetting. Extensive experiments across five NLP tasks and multiple continual learning scenarios show that RECALL outperforms baselines in both knowledge retention and generalization, providing a scalable and data-free solution for evolving LLMs.
32. 【2510.20475】Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs
链接:https://arxiv.org/abs/2510.20475
作者:Lukas Edman,Alexander Fraser
类目:Computation and Language (cs.CL)
关键词:Masked Language Modeling, BabyLM Challenge, describe our strategy, Language Modeling, Masked Language
备注: Submission to the 2025 BabyLM Challenge
点击查看摘要
Abstract:We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is that of an improved form of Masked Language Modeling (MLM), which adapts the probabilities of the tokens masked according to the model's ability to predict them. The results show a substantial increase in performance on (Super)GLUE tasks over the standard MLM. We also incorporate sub-token embeddings, finding that this increases the model's morphological generalization capabilities. Our submission beats the baseline in the strict-small track.
33. 【2510.20460】Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models
链接:https://arxiv.org/abs/2510.20460
作者:Christian Hobelsberger,Theresa Winner,Andreas Nawroth,Oliver Mitevski,Anna-Carolina Haensch
类目:Computation and Language (cs.CL); Applications (stat.AP); Methodology (stat.ME)
关键词:Large language models, varying levels, Large language, levels of correctness, making their practical
备注:
点击查看摘要
Abstract:Large language models (LLMs) produce outputs with varying levels of uncertainty, and, just as often, varying levels of correctness; making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). For the evaluation of the approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
34. 【2510.20449】LM-mixup: Text Data Augmentation via Language Model based Mixup
链接:https://arxiv.org/abs/2510.20449
作者:Zhijie Deng,Zhouan Shen,Ling Li,Yao Zhou,Zhaowei Zhu,Yanji He,Wei Wang,Jiaheng Wei
类目:Computation and Language (cs.CL)
关键词:Large Language Models, aligning Large Language, Language Models, Large Language, aligning Large
备注:
点击查看摘要
Abstract:Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, by first performing supervised fine-tuning on MIXTURE and then optimizing it with reinforcement learning. This process uses three complementary reward signals: quality, semantic alignment, and format compliance, via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.
35. 【2510.20411】acher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction
链接:https://arxiv.org/abs/2510.20411
作者:Suchir Salhan,Hongyi Gu,Donya Rooein,Diana Galvan-Sosa,Gabrielle Gaudeau,Andrew Caines,Zheng Yuan,Paula Buttery
类目:Computation and Language (cs.CL)
关键词:property called contingency, exchanges between interlocutors, caregiver are characterized, property called, meaningful exchanges
备注: Outstanding Paper Award, EMNLP 2025 BabyLM Workshop - Oral presentation, Suzhou, China
点击查看摘要
Abstract:Multi-turn dialogues between a child and a caregiver are characterized by a property called contingency - that is, prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a teacher-student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. Using a novel alignment dataset for post-training, BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive teacher decoding strategies show limited additional gains. ContingentChat demonstrates the benefits of targeted post-training for dialogue quality and indicates that contingency remains a challenging goal for BabyLMs.
36. 【2510.20387】Relative-Based Scaling Law for Neural Language Models
链接:https://arxiv.org/abs/2510.20387
作者:Baoqing Yue,Jinyuan Zhou,Zixi Wei,Jingtao Zhan,Qingyao Ai,Yiqun Liu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:accurately predict model, Scaling laws aim, Relative-Based Scaling Law, aim to accurately, accurately predict
备注:
点击查看摘要
Abstract:Scaling laws aim to accurately predict model performance across different scales. Existing scaling-law studies almost exclusively rely on cross-entropy as the evaluation metric. However, cross-entropy provides only a partial view of performance: it measures the absolute probability assigned to the correct token, but ignores the relative ordering between correct and incorrect tokens. Yet, relative ordering is crucial for language models, such as in greedy-sampling scenario. To address this limitation, we investigate scaling from the perspective of relative ordering. We first propose the Relative-Based Probability (RBP) metric, which quantifies the probability that the correct token is ranked among the top predictions. Building on this metric, we establish the Relative-Based Scaling Law, which characterizes how RBP improves with increasing model size. Through extensive experiments on four datasets and four model families spanning five orders of magnitude, we demonstrate the robustness and accuracy of this law. Finally, we illustrate the broad application of this law with two examples, namely providing a deeper explanation of emergence phenomena and facilitating finding fundamental theories of scaling laws. In summary, the Relative-Based Scaling Law complements the cross-entropy perspective and contributes to a more complete understanding of scaling large language models. Thus, it offers valuable insights for both practical development and theoretical exploration.
37. 【2510.20386】NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew
链接:https://arxiv.org/abs/2510.20386
作者:Shaltiel Shmidman,Avi Shmidman,Moshe Koppel
类目:Computation and Language (cs.CL)
关键词:demonstrated exceptional performance, BERT models, demonstrated exceptional, exceptional performance, BERT
备注:
点击查看摘要
Abstract:Since their initial release, BERT models have demonstrated exceptional performance on a variety of tasks, despite their relatively small size (BERT-base has ~100M parameters). Nevertheless, the architectural choices used in these models are outdated compared to newer transformer-based models such as Llama3 and Qwen3. In recent months, several architectures have been proposed to close this gap. ModernBERT and NeoBERT both show strong improvements on English benchmarks and significantly extend the supported context window. Following their successes, we introduce NeoDictaBERT and NeoDictaBERT-bilingual: BERT-style models trained using the same architecture as NeoBERT, with a dedicated focus on Hebrew texts. These models outperform existing ones on almost all Hebrew benchmarks and provide a strong foundation for downstream tasks. Notably, the NeoDictaBERT-bilingual model shows strong results on retrieval tasks, outperforming other multilingual models of similar size. In this paper, we describe the training process and report results across various benchmarks. We release the models to the community as part of our goal to advance research and development in Hebrew NLP.
38. 【2510.20381】VLSP 2025 MLQA-TSR Challenge: Vietnamese Multimodal Legal Question Answering on Traffic Sign Regulation
链接:https://arxiv.org/abs/2510.20381
作者:Son T. Luu,Trung Vo,Hiep Nguyen,Khanh Quoc Tran,Kiet Van Nguyen,Vu Tran,Ngan Luu-Thuy Nguyen,Le-Minh Nguyen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:regulation shared task, multimodal question answering, traffic sign regulation, multimodal legal, sign regulation shared
备注: VLSP 2025 MLQA-TSR Share Task
点击查看摘要
Abstract:This paper presents the VLSP 2025 MLQA-TSR - the multimodal legal question answering on traffic sign regulation shared task at VLSP 2025. VLSP 2025 MLQA-TSR comprises two subtasks: multimodal legal retrieval and multimodal question answering. The goal is to advance research on Vietnamese multimodal legal text processing and to provide a benchmark dataset for building and evaluating intelligent systems in multimodal legal domains, with a focus on traffic sign regulation in Vietnam. The best-reported results on VLSP 2025 MLQA-TSR are an F2 score of 64.55% for multimodal legal retrieval and an accuracy of 86.30% for multimodal question answering.
39. 【2510.20377】IKnow: Instruction-Knowledge-Aware Continual Pretraining for Effective Domain Adaptation
链接:https://arxiv.org/abs/2510.20377
作者:Tianyi Zhang,Florian Mai,Lucie Flek
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:unlabeled test-time data, Continual pretraining promises, adapt large language, naively applying standard, large language models
备注:
点击查看摘要
Abstract:Continual pretraining promises to adapt large language models (LLMs) to new domains using only unlabeled test-time data, but naively applying standard self-supervised objectives to instruction-tuned models is known to degrade their instruction-following capability and semantic representations. Existing fixes assume access to the original base model or rely on knowledge from an external domain-specific database - both of which pose a realistic barrier in settings where the base model weights are withheld for safety reasons or reliable external corpora are unavailable. In this work, we propose Instruction-Knowledge-Aware Continual Adaptation (IKnow), a simple and general framework that formulates novel self-supervised objectives in the instruction-response dialogue format. Rather than depend- ing on external resources, IKnow leverages domain knowledge embedded within the text itself and learns to encode it at a deeper semantic level.
40. 【2510.20375】he Impact of Negated Text on Hallucination with Large Language Models
链接:https://arxiv.org/abs/2510.20375
作者:Jaehyung Seo,Hyeonseok Moon,Heuiseok Lim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, natural language processing, Recent studies, language models, language processing
备注: Accepted to the EMNLP 2025
点击查看摘要
Abstract:Recent studies on hallucination in large language models (LLMs) have been actively progressing in natural language processing. However, the impact of negated text on hallucination with LLMs remains largely unexplored. In this paper, we set three important yet unanswered research questions and aim to address them. To derive the answers, we investigate whether LLMs can recognize contextual shifts caused by negation and still reliably distinguish hallucinations comparable to affirmative cases. We also design the NegHalu dataset by reconstructing existing hallucination detection datasets with negated expressions. Our experiments demonstrate that LLMs struggle to detect hallucinations in negated text effectively, often producing logically inconsistent or unfaithful judgments. Moreover, we trace the internal state of LLMs as they process negated inputs at the token level and reveal the challenges of mitigating their unintended effects.
41. 【2510.20358】Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)
链接:https://arxiv.org/abs/2510.20358
作者:Francesca Padovani,Bastian Bunzeck,Manar Ali,Omar Momen,Arianna Bisazza,Hendrik Buschmeier,Sina Zarrieß
类目:Computation and Language (cs.CL)
关键词:functionally apt small, apt small language, small language models, dialogue data results, investigate whether pre-training
备注:
点击查看摘要
Abstract:We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce "more communicative" text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.
42. 【2510.20356】FreeChunker: A Cross-Granularity Chunking Framework
链接:https://arxiv.org/abs/2510.20356
作者:Wenxuan Zhang,Yuan-Hao Jiang,Yonghe Wu
类目:Computation and Language (cs.CL)
关键词:Retrieval-Augmented Generation, strategies significantly impact, impact the effectiveness, effectiveness of Retrieval-Augmented, RAG
备注: Submitted to arXiv, October 2025
点击查看摘要
Abstract:Chunking strategies significantly impact the effectiveness of Retrieval-Augmented Generation (RAG) systems. Existing methods operate within fixed-granularity paradigms that rely on static boundary identification, limiting their adaptability to diverse query requirements. This paper presents FreeChunker, a Cross-Granularity Encoding Framework that fundamentally transforms the traditional chunking paradigm: the framework treats sentences as atomic units and shifts from static chunk segmentation to flexible retrieval supporting arbitrary sentence combinations. This paradigm shift not only significantly reduces the computational overhead required for semantic boundary detection but also enhances adaptability to complex queries. Experimental evaluation on LongBench V2 demonstrates that FreeChunker achieves superior retrieval performance compared to traditional chunking methods, while significantly outperforming existing approaches in computational efficiency.
43. 【2510.20351】Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models
链接:https://arxiv.org/abs/2510.20351
作者:Matteo Silvestri,Flavio Giorgi,Fabrizio Silvestri,Gabriele Tolomei
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, Language Models, structured data, crucial confound
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly evaluated on their ability to reason over structured data, yet such assessments often overlook a crucial confound: dataset contamination. In this work, we investigate whether LLMs exhibit prior knowledge of widely used tabular benchmarks such as Adult Income, Titanic, and others. Through a series of controlled probing experiments, we reveal that contamination effects emerge exclusively for datasets containing strong semantic cues-for instance, meaningful column names or interpretable value categories. In contrast, when such cues are removed or randomized, performance sharply declines to near-random levels. These findings suggest that LLMs' apparent competence on tabular reasoning tasks may, in part, reflect memorization of publicly available datasets rather than genuine generalization. We discuss implications for evaluation protocols and propose strategies to disentangle semantic leakage from authentic reasoning ability in future LLM assessments.
44. 【2510.20342】aching Language Models to Reason with Tools
链接:https://arxiv.org/abs/2510.20342
作者:Chengpeng Li,Zhengyang Tang,Ziniu Li,Mingfeng Xue,Keqin Bao,Tian Ding,Ruoyu Sun,Benyou Wang,Xiang Wang,Junyang Lin,Dayiheng Liu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:shown impressive capabilities, Large reasoning models, Large reasoning, shown impressive, impressive capabilities
备注: NIPS2025 Accepted
点击查看摘要
Abstract:Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model's internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose \emph{Hint-Engineering}, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT's effectiveness, yielding absolute improvements of 4\% and 8\% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30\% for the 32B model and 50\% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: this https URL.
45. 【2510.20304】Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering
链接:https://arxiv.org/abs/2510.20304
作者:Lei Tang,Wei Zhou,Mohsen Mesgar
类目:Computation and Language (cs.CL)
关键词:Process reward models, large language models, Process reward, improve complex reasoning, aggregated step scores
备注:
点击查看摘要
Abstract:Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, like table question answering (TQA) remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.
46. 【2510.20303】Citation Failure: Definition, Analysis and Efficient Mitigation
链接:https://arxiv.org/abs/2510.20303
作者:Jan Buchmann,Iryna Gurevych
类目:Computation and Language (cs.CL)
关键词:LLM-based RAG systems, LLM-based RAG, RAG systems, simplify response verification, systems are supposed
备注: Under review. Paper repository: [this https URL](https://github.com/UKPLab/arxiv2025-citation-failure)
点击查看摘要
Abstract:Citations from LLM-based RAG systems are supposed to simplify response verification. However, this does not hold for citation failure, when a model generates a helpful response, but fails to cite complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to analyze failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To improve LLM citation efficiently, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings. We make our data and code publicly available.
47. 【2510.20280】Context-level Language Modeling by Learning Predictive Context Embeddings
链接:https://arxiv.org/abs/2510.20280
作者:Beiya Dai,Yuliang Liu,Daozheng Xue,Qipeng Guo,Kai Chen,Xinbing Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:modern large language, Next-token prediction, driving their unprecedented, text generation, cornerstone of modern
备注: 16pages,6 figures
点击查看摘要
Abstract:Next-token prediction (NTP) is the cornerstone of modern large language models (LLMs) pretraining, driving their unprecedented capabilities in text generation, reasoning, and instruction following. However, the token-level prediction limits the model's capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce \textbf{ContextLM}, a framework that augments standard pretraining with an inherent \textbf{next-context prediction} objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to $1.5$B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.
48. 【2510.20270】ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases
链接:https://arxiv.org/abs/2510.20270
作者:Ziqian Zhong,Aditi Raghunathan,Nicholas Carlini
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:poses significant risks, complete tasks poses, tasks poses significant, tendency to find, poses significant
备注:
点击查看摘要
Abstract:The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily implies a specification-violating shortcut. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine-grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompt, test access and feedback loop affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems. Our implementation can be found at this https URL.
Subjects:
Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:
arXiv:2510.20270 [cs.LG]
(or
arXiv:2510.20270v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2510.20270
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Ziqian Zhong [view email] [v1]
Thu, 23 Oct 2025 06:58:32 UTC (7,152 KB)
49. 【2510.20256】Calibrating Multimodal Consensus for Emotion Recognition
链接:https://arxiv.org/abs/2510.20256
作者:Guowei Zhong,Junjie Li,Huaiyu Zhu,Ruohong Huan,Yun Pan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:made substantial progress, Multimodal Emotion Recognition, Emotion Recognition, Multimodal Emotion, recent years
备注:
点击查看摘要
Abstract:In recent years, Multimodal Emotion Recognition (MER) has made substantial progress. Nevertheless, most existing approaches neglect the semantic inconsistencies that may arise across modalities, such as conflicting emotional cues between text and visual inputs. Besides, current methods are often dominated by the text modality due to its strong representational capacity, which can compromise recognition accuracy. To address these challenges, we propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels, enabling unimodal pretraining in a self-supervised fashion. It then employs a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for multimodal finetuning, thereby mitigating text dominance and guiding the fusion process toward a more reliable consensus. Experimental results demonstrate that CMC achieves performance on par with or superior to state-of-the-art methods across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and exhibits notable advantages in scenarios with semantic inconsistencies on CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible at this https URL.
50. 【2510.20239】ri-Modal Severity Fused Diagnosis across Depression and Post-traumatic Stress Disorders
链接:https://arxiv.org/abs/2510.20239
作者:Filippo Cenacchi,Deborah Richards,Longbing Cao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:post traumatic stress, complicating automated assessment, traumatic stress disorder, connected symptoms, complicating automated
备注:
点击查看摘要
Abstract:Depression and post traumatic stress disorder (PTSD) often co-occur with connected symptoms, complicating automated assessment, which is often binary and disorder specific. Clinically useful diagnosis needs severity aware cross disorder estimates and decision support explanations. Our unified tri modal affective severity framework synchronizes and fuses interview text with sentence level transformer embeddings, audio with log Mel statistics with deltas, and facial signals with action units, gaze, head and pose descriptors to output graded severities for diagnosing both depression (PHQ-8; 5 classes) and PTSD (3 classes). Standardized features are fused via a calibrated late fusion classifier, yielding per disorder probabilities and feature-level attributions. This severity aware tri-modal affective fusion approach is demoed on multi disorder concurrent depression and PTSD assessment. Stratified cross validation on DAIC derived corpora outperforms unimodal/ablation baselines. The fused model matches the strongest unimodal baseline on accuracy and weighted F1, while improving decision curve utility and robustness under noisy or missing modalities. For PTSD specifically, fusion reduces regression error and improves class concordance. Errors cluster between adjacent severities; extreme classes are identified reliably. Ablations show text contributes most to depression severity, audio and facial cues are critical for PTSD, whereas attributions align with linguistic and behavioral markers. Our approach offers reproducible evaluation and clinician in the loop support for affective clinical decision making.
51. 【2510.20229】Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
链接:https://arxiv.org/abs/2510.20229
作者:Ge Zheng,Jiaye Qian,Jiajin Tang,Sibei Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, Large Vision-Language, Vision-Language Models, made significant progress, progress in recent
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel "induce-detect-suppress" framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential object-level hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs' longer responses.
52. 【2510.20208】Decoding-Free Sampling Strategies for LLM Marginalization
链接:https://arxiv.org/abs/2510.20208
作者:David Pohl,Marco Cognetta,Junyoung Lee,Naoaki Okazaki
类目:Computation and Language (cs.CL)
关键词:Modern language models, Modern language, language models operate, operate on subword-tokenized, order to make
备注: 10 pages, 3 figures
点击查看摘要
Abstract:Modern language models operate on subword-tokenized text in order to make a trade-off between model size, inference speed, and vocabulary coverage. A side effect of this is that, during inference, models are evaluated by measuring the probability of only the specific tokenization produced as the output, despite there being many possible ways to represent the same text with a subword vocabulary. Recent studies have argued instead for evaluating LLMs by marginalization - the probability mass of all tokenizations of a given text. Marginalization is difficult due to the number of possible tokenizations of a text, so often approximate marginalization is done via sampling. However, a downside of sampling is that an expensive generation step must be performed by the LLM for each sample, which limits the number of samples that can be acquired given a runtime budget, and therefore also the accuracy of the approximation. Since computing the probability of a sequence given the tokenization is relatively cheap compared to actually generating it, we investigate sampling strategies that are decoding-free - they require no generation from the LLM, instead relying entirely on extremely cheap sampling strategies that are model and tokenizer agnostic. We investigate the approximation quality and speed of decoding-free sampling strategies for a number of open models to find that they provide sufficiently accurate marginal estimates at a small fraction of the runtime cost and demonstrate its use on a set of downstream inference tasks.
Comments:
10 pages, 3 figures
Subjects:
Computation and Language (cs.CL)
ACMclasses:
I.2.7
Cite as:
arXiv:2510.20208 [cs.CL]
(or
arXiv:2510.20208v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2510.20208
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
53. 【2510.20198】Stuck in the Matrix: Probing Spatial Reasoning in Large Language Models
链接:https://arxiv.org/abs/2510.20198
作者:Maggie Bai,Ava Kim Cohen,Eleanor Koss,Charlie Lichtenbaum
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:spatial reasoning capability, computational abilities, spatial reasoning, capability of large, textual input
备注: 20 pages, 24 figures
点击查看摘要
Abstract:This paper explores the spatial reasoning capability of large language models (LLMs) over textual input through a suite of five tasks aimed at probing their spatial understanding and computational abilities. The models were tested on both fundamental spatial reasoning and multi-step problem-solving within structured grid-based environments using tasks such as quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding. Each task was scaled in complexity through increasing grid dimensions, requiring models to extend beyond simple pattern recognition into abstract spatial reasoning. Our results reveal that while LLMs demonstrate moderate success in all tasks with small complexity and size, performance drops off rapidly as scale increases, with an average loss in accuracy of 42.7%, and reaching as high as 84%. Every test that began with over 50% accuracy showed a loss of at least 48%, illustrating the consistent nature of the deterioration. Furthermore, their struggles with scaling complexity hint at a lack of robust spatial representations in their underlying architectures. This paper underscores the gap between linguistic and spatial reasoning in LLMs, offering insights into their current limitations, and laying the groundwork for future integrative benchmarks at the intersection of language and geometry.
54. 【2510.20193】Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
链接:https://arxiv.org/abs/2510.20193
作者:Rahul Raja,Arpita Vats
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Question Answering, structured text data, structured metadata, structured text, traditionally relied
备注: In Proceedings of the 2nd ACM Workshop in AI-powered Question and Answering Systems (AIQAM '25), October 27-28, 2025, Dublin, Ireland. ACM, New York, NY, USA, 8 pages. [this https URL](https://doi.org/10.1145/3746274.3760393)
点击查看摘要
Abstract:Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.
55. 【2510.20187】Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values
链接:https://arxiv.org/abs/2510.20187
作者:Dian Yu,Yulai Zhao,Kishan Panaganti,Linfeng Song,Haitao Mi,Dong Yu
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:aligns Large Language, Large Language Model, propose Reinforcement Learning, Large Language, Reinforcement Learning
备注: 15 pages, 4 figures
点击查看摘要
Abstract:We propose Reinforcement Learning with Explicit Human Values (RLEV), a method that aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. While Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains models in objective domains using binary correctness rewards, it overlooks that not all tasks are equally significant. RLEV extends this framework by incorporating human-defined value signals directly into the reward function. Using exam-style data with explicit ground-truth value labels, RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales. Crucially, RLEV policies not only improve value-weighted accuracy but also learn a value-sensitive termination policy: concise for low-value prompts, thorough for high-value ones. We demonstrate this behavior stems from value-weighted gradient amplification on end-of-sequence tokens. Ablation studies confirm the gain is causally linked to value alignment. RLEV remains robust under noisy value signals, such as difficulty-based labels, demonstrating that optimizing for an explicit utility function offers a practical path to aligning LLMs with human priorities.
56. 【2510.20176】Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding
链接:https://arxiv.org/abs/2510.20176
作者:Yuhang Zhou,Mingrui Zhang,Ke Li,Mingyi Wang,Qiao Liu,Qifei wang,Jiayi Liu,Fei Liu,Serena Li,Weiwi Li,Mingze Gao,Abhishek Kumar,Xiangjun Fan,Zhuokai Zhao,Lizhu Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:real-world applications, critical capability, table, Carlo Tree Search, Abstract
备注: 18 pages, 4 figures
点击查看摘要
Abstract:Understanding and reasoning over tables is a critical capability for many real-world applications. Large language models (LLMs) have shown promise on this task, but current approaches remain limited. Fine-tuning based methods strengthen language reasoning; yet they are prone to arithmetic errors and hallucination. In contrast, tool-based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. These complementary drawbacks highlight the need for approaches that integrate robust reasoning with reliable table processing. In this work, we propose Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into three specialized roles: planning, coding, and answering. This design enables each agent to focus on a specific aspect of the task while leveraging code execution for precise table manipulation. Building on this workflow, we introduce a self-improvement training framework that employs Monte Carlo Tree Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents with reinforcement learning (RL). Extensive experiments show that Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and surpassing OpenAI-o4-mini-high. These results demonstrate the promise of combining structured multi-agent workflows with RL to advance table understanding.
57. 【2510.20168】DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking
链接:https://arxiv.org/abs/2510.20168
作者:Tian Lan,Bin Zhu,Qianghuai Jia,Junyang Ren,Haijun Li,Longyue Wang,Zhao Xu,Weihua Luo,Kaifu Zhang
类目:Computation and Language (cs.CL)
关键词:collection-a critical deficiency, scale information collection-a, information collection-a critical, comprehensive market analysis, simultaneously perform
备注:
点击查看摘要
Abstract:Current search agents fundamentally lack the ability to simultaneously perform \textit{deep} reasoning over multi-hop retrieval and \textit{wide}-scale information collection-a critical deficiency for real-world applications like comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data, each requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to converse established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating depth and width search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow-exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.
58. 【2510.20154】Are Stereotypes Leading LLMs' Zero-Shot Stance Detection ?
链接:https://arxiv.org/abs/2510.20154
作者:Anthony Dubreuil,Antoine Gourru,Christine Largeron,Amine Trabelsi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Natural Language Processing, Language Processing tasks, Large Language Models, Natural Language, Language Processing
备注: Accepted in EMNLP 2025 (Main)
点击查看摘要
Abstract:Large Language Models inherit stereotypes from their pretraining data, leading to biased behavior toward certain social groups in many Natural Language Processing tasks, such as hateful speech detection or sentiment analysis. Surprisingly, the evaluation of this kind of bias in stance detection methods has been largely overlooked by the community. Stance Detection involves labeling a statement as being against, in favor, or neutral towards a specific target and is among the most sensitive NLP tasks, as it often relates to political leanings. In this paper, we focus on the bias of Large Language Models when performing stance detection in a zero-shot setting. We automatically annotate posts in pre-existing stance detection datasets with two attributes: dialect or vernacular of a specific group and text complexity/readability, to investigate whether these attributes influence the model's stance detection decisions. Our results show that LLMs exhibit significant stereotypes in stance detection tasks, such as incorrectly associating pro-marijuana views with low text complexity and African American dialect with opposition to Donald Trump.
59. 【2510.20151】BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation
链接:https://arxiv.org/abs/2510.20151
作者:Haoyuan Li,Zhengyuan Shen,Sullam Jeoung,Yueyan Chen,Jiayu Li,Qi Zhu,Shuai Wang,Vassilis Ioannidis,Huzefa Rangwala
类目:Computation and Language (cs.CL)
关键词:semantically meaningful components, diverse domains, components becomes critical, technical reports, reports to generative
备注:
点击查看摘要
Abstract:As structured texts become increasingly complex across diverse domains -- from technical reports to generative AI prompts -- the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL~performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
60. 【2510.20099】AI PB: A Grounded Generative Agent for Personalized Investment Insights
链接:https://arxiv.org/abs/2510.20099
作者:Daewoo Park,Suho Park,Inseok Hong,Hanwool Lee,Junkyu Park,Sangjun Lee,Jeongman An,Hyunbin Loh
类目:Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
关键词:production-scale generative agent, generative agent deployed, real retail finance, production-scale generative, generative agent
备注: Under Review
点击查看摘要
Abstract:We present AI PB, a production-scale generative agent deployed in real retail finance. Unlike reactive chatbots that answer queries passively, AI PB proactively generates grounded, compliant, and user-specific investment insights. It integrates (i) a component-based orchestration layer that deterministically routes between internal and external LLMs based on data sensitivity, (ii) a hybrid retrieval pipeline using OpenSearch and the finance-domain embedding model, and (iii) a multi-stage recommendation mechanism combining rule heuristics, sequential behavioral modeling, and contextual bandits. Operating fully on-premises under Korean financial regulations, the system employs Docker Swarm and vLLM across 24 X NVIDIA H100 GPUs. Through human QA and system metrics, we demonstrate that grounded generation with explicit routing and layered safety can deliver trustworthy AI insights in high-stakes finance.
61. 【2510.20098】Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning
链接:https://arxiv.org/abs/2510.20098
作者:Yajie Li,Albert Galimov,Mitra Datta Ganapaneni,Pujitha Thejaswi,De Meng,Priyanshu Kumar,Saloni Potdar
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Entity Linking, extensive model fine-tuning, large annotated datasets, traditionally relied, Adaptive Routing
备注:
点击查看摘要
Abstract:Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals(both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g. ReFinED) and more expensive targeted LLM-based reasoning respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being as twice as efficient in terms of the number of LLM tokens.
62. 【2510.20095】BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
链接:https://arxiv.org/abs/2510.20095
作者:Ziheng Zhang,Xinyue Ma,Arpita Chowdhury,Elizabeth G. Campolongo,Matthew J. Thompson,Net Zhang,Samuel Stevens,Hilmar Lapp,Tanya Berger-Wolf,Yu Su,Wei-Lun Chao,Jianyang Gu
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:work investigates descriptive, work investigates, additional source, captions, investigates descriptive captions
备注: Project page: [this https URL](https://imageomics.github.io/biocap/)
点击查看摘要
Abstract:This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We complement this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BIOCAP (i.e., BIOCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.
63. 【2510.20091】CreativityPrism: A Holistic Benchmark for Large Language Model Creativity
链接:https://arxiv.org/abs/2510.20091
作者:Zhaoyi Joey Hou,Bowei Alvin Zhang,Yining Lu,Bhiman Kumar Baghel,Anneliese Brei,Ximing Lu,Meng Jiang,Faeze Brahman,Snigdha Chaturvedi,Haw-Shiuan Chang,Daniel Khashabi,Xiang Lorraine Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:human intelligence, hallmark of human, Creativity, evaluation, domains
备注:
点击查看摘要
Abstract:Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains, i.e., divergent thinking, creative writing, and logical reasoning, and twenty evaluation metrics, which measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations - models that perform well on one often excel on the other - whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.
64. 【2510.20075】LLMs can hide text in other text of the same length.ipynb
链接:https://arxiv.org/abs/2510.20075
作者:Antonio Norelli,Michael Bronstein
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
关键词:coherent and plausible, Large Language Models, hidden inside, Large Language, meaningful text
备注: 21 pages, main paper 9 pages
点击查看摘要
Abstract:A meaningful text can be hidden inside another, completely different yet still coherent and plausible, text of the same length. For example, a tweet containing a harsh political critique could be embedded in a tweet that celebrates the same political leader, or an ordinary product review could conceal a secret manuscript. This uncanny state of affairs is now possible thanks to Large Language Models, and in this paper we present a simple and efficient protocol to achieve it. We show that even modest 8-billion-parameter open-source LLMs are sufficient to obtain high-quality results, and a message as long as this abstract can be encoded and decoded locally on a laptop in seconds. The existence of such a protocol demonstrates a radical decoupling of text from authorial intent, further eroding trust in written communication, already shaken by the rise of LLM chatbots. We illustrate this with a concrete scenario: a company could covertly deploy an unfiltered LLM by encoding its answers within the compliant responses of a safe model. This possibility raises urgent questions for AI safety and challenges our understanding of what it means for a Large Language Model to know something.
65. 【2510.20059】Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training
链接:https://arxiv.org/abs/2510.20059
作者:Mehrdad Ghassabi,Sadra Hakim,Hamidreza Baradaran Kashani,Pedram Rostami
类目:Computation and Language (cs.CL)
关键词:medical question answering, employ Reinforcement Learning, question answering, critical for specialized, specialized applications
备注: 6 pages, 4 figures
点击查看摘要
Abstract:Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct preference optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
66. 【2510.20043】From Facts to Folklore: Evaluating Large Language Models on Bengali Cultural Knowledge
链接:https://arxiv.org/abs/2510.20043
作者:Nafis Chowdhury,Moinul Haque,Anika Ahmed,Nazia Tasnim,Md. Istiak Hossain Shihab,Sajjadur Rahman,Farig Sadeque
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:demonstrated remarkable capabilities, progress in NLP, NLP research, range of tasks, research has demonstrated
备注: 4 pages
点击查看摘要
Abstract:Recent progress in NLP research has demonstrated remarkable capabilities of large language models (LLMs) across a wide range of tasks. While recent multilingual benchmarks have advanced cultural evaluation for LLMs, critical gaps remain in capturing the nuances of low-resource cultures. Our work addresses these limitations through a Bengali Language Cultural Knowledge (BLanCK) dataset including folk traditions, culinary arts, and regional dialects. Our investigation of several multilingual language models shows that while these models perform well in non-cultural categories, they struggle significantly with cultural knowledge and performance improves substantially across all models when context is provided, emphasizing context-aware architectures and culturally curated training data.
67. 【2510.20039】Beyond One-Way Influence: Bidirectional Opinion Dynamics in Multi-Turn Human-LLM Interactions
链接:https://arxiv.org/abs/2510.20039
作者:Yuyang Jiang,Longjie Guo,Yuchen Wu,Aylin Caliskan,Tanu Mitra,Hua Shen
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:Large language model, Large language, language model, Large, LLM
备注: 26 pages, 8 figures
点击查看摘要
Abstract:Large language model (LLM)-powered chatbots are increasingly used for opinion exploration. Prior research examined how LLMs alter user views, yet little work extended beyond one-way influence to address how user input can affect LLM responses and how such bi-directional influence manifests throughout the multi-turn conversations. This study investigates this dynamic through 50 controversial-topic discussions with participants (N=266) across three conditions: static statements, standard chatbot, and personalized chatbot. Results show that human opinions barely shifted, while LLM outputs changed more substantially, narrowing the gap between human and LLM stance. Personalization amplified these shifts in both directions compared to the standard setting. Analysis of multi-turn conversations further revealed that exchanges involving participants' personal stories were most likely to trigger stance changes for both humans and LLMs. Our work highlights the risk of over-alignment in human-LLM interaction and the need for careful design of personalized chatbots to more thoughtfully and stably align with users.
68. 【2510.20036】oolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering
链接:https://arxiv.org/abs/2510.20036
作者:Marianne Menglin Liu,Daniel Garcia,Fjona Parllaku,Vikas Upadhyay,Syed Fahad Allam Shah,Dan Roth
类目:Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:solve complex tasks, Large language model, language model, agents rely, complex tasks
备注: Preprint under review
点击查看摘要
Abstract:Large language model (LLM) agents rely on external tools to solve complex tasks, but real-world toolsets often contain redundant tools with overlapping names and descriptions, introducing ambiguity and reducing selection accuracy. LLMs also face strict input context limits, preventing efficient consideration of large toolsets. To address these challenges, we propose ToolScope, which includes: (1) ToolScopeMerger with Auto-Correction to automatically audit and fix tool merges, reducing redundancy, and (2) ToolScopeRetriever to rank and select only the most relevant tools for each query, compressing toolsets to fit within context limits without sacrificing accuracy. Evaluations on three state-of-the-art LLMs and three open-source tool-use benchmarks show gains of 8.38% to 38.6% in tool selection accuracy, demonstrating ToolScope's effectiveness in enhancing LLM tool use.
69. 【2510.20033】Improving Transfer Learning for Sequence Labeling Tasks by Adapting Pre-trained Neural Language Models
链接:https://arxiv.org/abs/2510.20033
作者:David Dukić
类目:Computation and Language (cs.CL)
关键词:doctoral thesis improves, large language models, autoregressive large language, sequence labeling tasks, language models
备注:
点击查看摘要
Abstract:This doctoral thesis improves the transfer learning for sequence labeling tasks by adapting pre-trained neural language models. The proposed improvements in transfer learning involve introducing a multi-task model that incorporates an additional signal, a method based on architectural modifications in autoregressive large language models, and a sequence labeling framework for autoregressive large language models utilizing supervised in-context fine-tuning combined with response-oriented adaptation strategies. The first improvement is given in the context of domain transfer for the event trigger detection task. The domain transfer of the event trigger detection task can be improved by incorporating an additional signal obtained from a domain-independent text processing system into a multi-task model. The second improvement involves modifying the model's architecture. For that purpose, a method is proposed to enable bidirectional information flow across layers of autoregressive large language models. The third improvement utilizes autoregressive large language models as text generators through a generative supervised in-context fine-tuning framework. The proposed model, method, and framework demonstrate that pre-trained neural language models achieve their best performance on sequence labeling tasks when adapted through targeted transfer learning paradigms.
70. 【2510.20002】Forging GEMs: Advancing Greek NLP through Quality-Based Corpus Curation and Specialized Pre-training
链接:https://arxiv.org/abs/2510.20002
作者:Alexandra Apostolopoulou,Konstantinos Kanaris,Athanasios Koursaris,Dimitris Tsakalidis,George Domalis,Ioannis E. Livieris
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:fragmented research landscape, limited context-length models, natural language processing, Greek Embedding Models, morphologically rich
备注:
点击查看摘要
Abstract:The advancement of natural language processing for morphologically rich, moderately-resourced languages like Modern Greek is often hindered by a fragmented research landscape, a lack of architectural diversity and reliance on limited context-length models. This is particularly true in specialized, high-value domains such as law, where existing models are frequently confined to early transformer architectures with a restrictive 512-token window, insufficient for analyzing long legal documents. To address these challenges, this paper presents Greek Embedding Models, a new family of transformer models for Greek language built upon a foundation of extensive, quality-driven data curation. We detail the construction of several large-scale Greek corpora, emphasizing a rigorous, quality-based filtering and preprocessing methodology to create high-value training datasets from both general-domain and specialized legal sources. On this carefully curated foundation, we pre-train and systematically evaluate a diverse suite of modern architectures, which has not previously applied to Greek language, such as ELECTRA, ConvBERT and ModernBERT. Furthermore, we propose the first bilingual Greek-English Embedding Models tailored for the legal domain. The extensive experiments on downstream tasks demonstrate that the new class of models establish the effectiveness of the proposed approach, highlighting that the GEM-RoBERTa and GEM-ConvBERT models significantly outperform existing baselines.
71. 【2510.20001】Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs
链接:https://arxiv.org/abs/2510.20001
作者:Yunpeng Xiao,Carl Yang,Mark Mai,Xiao Hu,Kai Shu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, language models, show promise, clinical
备注: 13 pages, 3 figures
点击查看摘要
Abstract:Large language models (LLMs) show promise for clinical use. They are often evaluated using datasets such as MedQA. However, Many medical datasets, such as MedQA, rely on simplified Question-Answering (Q\A) that underrepresents real-world clinical decision-making. Based on this, we propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions. As the background and questions approach the real clinical environment, the difficulty increases. We summarize the settings of existing datasets and benchmarks along two dimensions. Then we review methods to address clinical decision-making, including training-time and test-time techniques, and summarize when they help. Next, we extend evaluation beyond accuracy to include efficiency, explainability. Finally, we highlight open challenges. Our paradigm clarifies assumptions, standardizes comparisons, and guides the development of clinically meaningful LLMs.
72. 【2510.19996】A Fundamental Algorithm for Dependency Parsing (With Corrections)
链接:https://arxiv.org/abs/2510.19996
作者:Michael A. Covington
类目:Computation and Language (cs.CL)
关键词:natural language sentences, dependency trees, paper presents, presents a fundamental, sentences into dependency
备注: Corrected version of an already widely cited paper
点击查看摘要
Abstract:This paper presents a fundamental algorithm for parsing natural language sentences into dependency trees. Unlike phrase-structure (constituency) parsers, this algorithm operates one word at a time, attaching each word as soon as it can be attached, corresponding to properties claimed for the parser in the human brain. Like phrase-structure parsing, its worst-case complexity is $O(n^3)$, but in human language, the worst case occurs only for small $n$.
73. 【2510.19995】Communication to Completion: Modeling Collaborative Workflows with Intelligent Multi-Agent Communication
链接:https://arxiv.org/abs/2510.19995
作者:Yiming Lu,Xun Wang,Simin Ma,Shujian Liu,Sathish Reddy Indurthi,Song Wang,Haoyun Deng,Fei Liu,Kaiqiang Song
类目:Multiagent Systems (cs.MA); Computation and Language (cs.CL)
关键词:current multi-agent LLM, LLM systems lack, Teamwork in workspace, lack systematic frameworks, diverse communication strategies
备注: 13 pages
点击查看摘要
Abstract:Teamwork in workspace for complex tasks requires diverse communication strategies, but current multi-agent LLM systems lack systematic frameworks for task oriented communication. We introduce Communication to Completion (C2C), a scalable framework that addresses this gap through two key innovations: (1) the Alignment Factor (AF), a novel metric quantifying agent task alignment that directly impacts work efficiency, and (2) a Sequential Action Framework that integrates stepwise execution with intelligent communication decisions. C2C enables agents to make cost aware communication choices, dynamically improving task understanding through targeted interactions. We evaluated C2C on realistic coding workflows across three complexity tiers and team sizes from 5 to 17 agents, comparing against no communication and fixed steps baselines. The results show that C2C reduces the task completion time by about 40% with acceptable communication costs. The framework completes all tasks successfully in standard configurations and maintains effectiveness at scale. C2C establishes both a theoretical foundation for measuring communication effectiveness in multi-agent systems and a practical framework for complex collaborative tasks.
74. 【2510.19988】LLM-Augmented Symbolic NLU System for More Reliable Continuous Causal Statement Interpretation
链接:https://arxiv.org/abs/2510.19988
作者:Xin Lian,Kenneth D. Forbus
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:probabilistic inference makes, inconsistent output structure, large language models, symbolic NLU systems, symbolic NLU
备注: 18 pages, 2 figures
点击查看摘要
Abstract:Despite the broad applicability of large language models (LLMs), their reliance on probabilistic inference makes them vulnerable to errors such as hallucination in generated facts and inconsistent output structure in natural language understanding (NLU) tasks. By contrast, symbolic NLU systems provide interpretable understanding grounded in curated lexicons, semantic resources, and syntactic semantic interpretation rules. They produce relational representations that can be used for accurate reasoning and planning, as well as incremental debuggable learning. However, symbolic NLU systems tend to be more limited in coverage than LLMs and require scarce knowledge representation and linguistics skills to extend and maintain. This paper explores a hybrid approach that integrates the broad-coverage language processing of LLMs with the symbolic NLU capabilities of producing structured relational representations to hopefully get the best of both approaches. We use LLMs for rephrasing and text simplification, to provide broad coverage, and as a source of information to fill in knowledge gaps more automatically. We use symbolic NLU to produce representations that can be used for reasoning and for incremental learning. We evaluate this approach on the task of extracting and interpreting quantities and causal laws from commonsense science texts, along with symbolic- and LLM-only pipelines. Our results suggest that our hybrid method works significantly better than the symbolic-only pipeline.
75. 【2510.19967】LyriCAR: A Difficulty-Aware Curriculum Reinforcement Learning Framework For Controllable Lyric Translation
链接:https://arxiv.org/abs/2510.19967
作者:Le Ren,Xiangjian Zeng,Qingqiang Wu,Ruoxuan Liang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:multiple musical constraints, requires balancing multiple, balancing multiple musical, musical constraints, requires balancing
备注: submitted to ICASSP 2026
点击查看摘要
Abstract:Lyric translation is a challenging task that requires balancing multiple musical constraints. Existing methods often rely on hand-crafted rules and sentence-level modeling, which restrict their ability to internalize musical-linguistic patterns and to generalize effectively at the paragraph level, where cross-line coherence and global rhyme are crucial. In this work, we propose LyriCAR, a novel framework for controllable lyric translation that operates in a fully unsupervised manner. LyriCAR introduces a difficulty-aware curriculum designer and an adaptive curriculum strategy, ensuring efficient allocation of training resources, accelerating convergence, and improving overall translation quality by guiding the model with increasingly complex challenges. Extensive experiments on the EN-ZH lyric translation task show that LyriCAR achieves state-of-the-art results across both standard translation metrics and multi-dimensional reward scores, surpassing strong baselines. Notably, the adaptive curriculum strategy reduces training steps by nearly 40% while maintaining superior performance. Code, data and model can be accessed at this https URL.
76. 【2510.19897】Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation
链接:https://arxiv.org/abs/2510.19897
作者:Jackson Hassell,Dan Zhang,Hannah Kim,Tom Mitchell,Estevam Hruschka
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:pretrained large language, learn target classification, target classification functions, large language models, parameter updates
备注: 11 pages
点击查看摘要
Abstract:We investigate how agents built on pretrained large language models can learn target classification functions from labeled examples without parameter updates. While conventional approaches like fine-tuning are often costly, inflexible, and opaque, we propose a memory-augmented framework that leverages both labeled data and LLM-generated critiques. Our framework uses episodic memory to store instance-level critiques-capturing specific past experiences-and semantic memory to distill these into reusable, task-level guidance. Across a diverse set of tasks, incorporating critiques yields up to a 24.8 percent accuracy improvement over retrieval-based (RAG-style) baselines that rely only on labels. Through extensive empirical evaluation, we uncover distinct behavioral differences between OpenAI and opensource models, particularly in how they handle fact-oriented versus preference-based data. To interpret how models respond to different representations of supervision encoded in memory, we introduce a novel metric, suggestibility. This helps explain observed behaviors and illuminates how model characteristics and memory strategies jointly shape learning dynamics. Our findings highlight the promise of memory-driven, reflective learning for building more adaptive and interpretable LLM agents.
77. 【2510.19895】Large Language Model enabled Mathematical Modeling
链接:https://arxiv.org/abs/2510.19895
作者:Guoyun Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, optimization modeling offers, integration of Large, Large Language, modeling offers
备注:
点击查看摘要
Abstract:The integration of Large Language Models (LLMs) with optimization modeling offers a promising avenue for advancing decision-making in operations research (OR). Traditional optimization methods,such as linear programming, mixed integer programming, and simulation depend heavily on domain expertise to translate real-world problems into solvable mathematical models. While solvers like Gurobi and COPT are powerful, expert input remains essential for defining objectives, constraints, and variables. This research investigates the potential of LLMs, specifically the DeepSeek-R1 model, to bridge this formulation gap using natural language understanding and code generation. Although prior models like GPT-4, Claude, and Bard have shown strong performance in NLP and reasoning tasks, their high token costs and tendency toward hallucinations limit real-world applicability in supply chain contexts. In contrast, DeepSeek-R1, a cost-efficient and high-performing model trained with reinforcement learning, presents a viable alternative. Despite its success in benchmarks such as LiveCodeBench and Math-500, its effectiveness in applied OR scenarios remains under explored. This study systematically evaluates DeepSeek-R1 across four key OR benchmarks: NL4OPT, IndustryOR, EasyLP, and ComplexOR. Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies like LLM-as-a-Judge, Few-shot Learning (FSL), Tool Calling, and a Multi-agent Framework. These techniques aim to reduce hallucinations, enhance formulation accuracy, and better align model outputs with user intent.
78. 【2510.19892】Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities
链接:https://arxiv.org/abs/2510.19892
作者:Nishant Balepur,Dang Nguyen,Dayeon Ki
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Multi-modal large language, model pairwise comparisons, exploit superficial shortcuts, large language models, assess MLM capabilities
备注: Accepted as a Spotlight paper at the EMNLP 2025 Wordplay Workshop
点击查看摘要
Abstract:Multi-modal large language models (MLMs) are often assessed on static, individual benchmarks -- which cannot jointly assess MLM capabilities in a single task -- or rely on human or model pairwise comparisons -- which is highly subjective, expensive, and allows models to exploit superficial shortcuts (e.g., verbosity) to inflate their win-rates. To overcome these issues, we propose game-based evaluations to holistically assess MLM capabilities. Games require multiple abilities for players to win, are inherently competitive, and are governed by fix, objective rules, and makes evaluation more engaging, providing a robust framework to address the aforementioned challenges. We manifest this evaluation specifically through Dixit, a fantasy card game where players must generate captions for a card that trick some, but not all players, into selecting the played card. Our quantitative experiments with five MLMs show Dixit win-rate rankings are perfectly correlated with those on popular MLM benchmarks, while games between human and MLM players in Dixit reveal several differences between agent strategies and areas of improvement for MLM reasoning.
79. 【2510.19886】An Expert-grounded benchmark of General Purpose LLMs in LCA
链接:https://arxiv.org/abs/2510.19886
作者:Artur Donaldson,Bharathan Balaji,Cajetan Oriekezie,Manish Kumar,Laure Patouillard
类目:Computation and Language (cs.CL)
关键词:life cycle assessment, support life cycle, Artificial intelligence, cycle assessment, increasingly being explored
备注:
点击查看摘要
Abstract:Purpose: Artificial intelligence (AI), and in particular large language models (LLMs), are increasingly being explored as tools to support life cycle assessment (LCA). While demonstrations exist across environmental and social domains, systematic evidence on their reliability, robustness, and usability remains limited. This study provides the first expert-grounded benchmark of LLMs in LCA, addressing the absence of standardized evaluation frameworks in a field where no clear ground truth or consensus protocols exist. Methods: We evaluated eleven general-purpose LLMs, spanning both commercial and open-source families, across 22 LCA-related tasks. Seventeen experienced practitioners reviewed model outputs against criteria directly relevant to LCA practice, including scientific accuracy, explanation quality, robustness, verifiability, and adherence to instructions. We collected 168 expert reviews. Results: Experts judged 37% of responses to contain inaccurate or misleading information. Ratings of accuracy and quality of explanation were generally rated average or good on many models even smaller models, and format adherence was generally rated favourably. Hallucination rates varied significantly, with some models producing hallucinated citations at rates of up to 40%. There was no clear-cut distinction between ratings on open-weight versus closed-weight LLMs, with open-weight models outperforming or competing on par with closed-weight models on criteria such as accuracy and quality of explanation. Conclusion: These findings highlight the risks of applying LLMs naïvely in LCA, such as when LLMs are treated as free-form oracles, while also showing benefits especially around quality of explanation and alleviating labour intensiveness of simple tasks. The use of general-purpose LLMs without grounding mechanisms presents ...
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2510.19886 [cs.CL]
(or
arXiv:2510.19886v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2510.19886
Focus to learn more
arXiv-issued DOI via DataCite
Submission history From: Artur Donaldson [view email] [v1]
Wed, 22 Oct 2025 15:56:54 UTC (1,149 KB)
80. 【2510.19879】Automated HIV Screening on Dutch EHR with Large Language Models
链接:https://arxiv.org/abs/2510.19879
作者:Lang Zhou,Amrish Jhingoer,Yinghao Luo,Klaske Vliegenthart--Jongbloed,Carlijn Jordans,Ben Werkhoven,Tom Seinen,Erik van Mulligen,Casper Rokx,Yunlei Li
类目:Computation and Language (cs.CL)
关键词:reducing onward transmission, Electronic Health Records, Efficient screening, onward transmission, screening and early
备注: 28 pages, 6 figures
点击查看摘要
Abstract:Efficient screening and early diagnosis of HIV are critical for reducing onward transmission. Although large scale laboratory testing is not feasible, the widespread adoption of Electronic Health Records (EHRs) offers new opportunities to address this challenge. Existing research primarily focuses on applying machine learning methods to structured data, such as patient demographics, for improving HIV diagnosis. However, these approaches often overlook unstructured text data such as clinical notes, which potentially contain valuable information relevant to HIV risk. In this study, we propose a novel pipeline that leverages a Large Language Model (LLM) to analyze unstructured EHR text and determine a patient's eligibility for further HIV testing. Experimental results on clinical data from Erasmus University Medical Center Rotterdam demonstrate that our pipeline achieved high accuracy while maintaining a low false negative rate.
81. 【2510.19875】Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention
链接:https://arxiv.org/abs/2510.19875
作者:J Rosser,José Luis Redondo García,Gustavo Penha,Konstantina Palla,Hugues Bouchard
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, traditional Mechanistic Interpretability, traditional Mechanistic, Mechanistic Interpretability techniques
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) scale to million-token contexts, traditional Mechanistic Interpretability techniques for analyzing attention scale quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic sparse attention to efficiently analyze long context attention patterns. We present Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time $O(T \log T)$ and linear space $O(T)$, enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement to retain only the top-$k$ key blocks per query while preserving the model's next-token behavior. We apply Stream to long chain-of-thought reasoning traces and identify thought anchors while pruning 97-99\% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90-96\% of interactions and exposes layer-wise routes from the needle to output. Our method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at this https URL.
82. 【2510.19871】From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
链接:https://arxiv.org/abs/2510.19871
作者:Yatai Ji,Teng Wang,Yuying Ge,Zhiheng Liu,Sidi Yang,Ying Shan,Ping Luo
类目:Computation and Language (cs.CL)
关键词:offering bidirectional context, bidirectional context modeling, Discrete diffusion models, Discrete diffusion, vision-language tasks
备注:
点击查看摘要
Abstract:Discrete diffusion models have emerged as a promising direction for vision-language tasks, offering bidirectional context modeling and theoretical parallelization. However, their practical application is severely hindered by a train-inference discrepancy, which leads to catastrophic error cascades: initial token errors during parallel decoding pollute the generation context, triggering a chain reaction of compounding errors and leading to syntactic errors and semantic hallucinations. To address this fundamental challenge, we reframe the generation process from passive denoising to active refining. We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we implement a novel online self-correction loop where the model is explicitly trained to revise its own flawed drafts by learning from an expert's corrections. This mistake-driven learning endows the model with the crucial ability to revisit and refine its already generated output, effectively breaking the error cascade. Extensive experiments demonstrate that ReDiff significantly improves the coherence and factual accuracy of generated content, enabling stable and efficient parallel generation far superior to traditional denoising methods. Our codes and models are available at this https URL.
83. 【2510.19866】An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics
链接:https://arxiv.org/abs/2510.19866
作者:Xincheng Liu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:lesson plans, Claude Sonnet, Toggle, Code Toggle Papers, leading large language
备注: 20 pages, 6 tables
点击查看摘要
Abstract:This study evaluates the pedagogical soundness and usability of AI-generated lesson plans across five leading large language models: ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Grok 4. Beyond model choice, three structured prompt frameworks were tested: TAG (Task, Audience, Goal), RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective, Style, Tone, Audience, Response Format). Fifteen lesson plans were generated for a single high-school physics topic, The Electromagnetic Spectrum. The lesson plans were analyzed through four automated computational metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) standards and curriculum alignment, and (4) cognitive demand of learning objectives. Results indicate that model selection exerted the strongest influence on linguistic accessibility, with DeepSeek producing the most readable teaching plan (FKGL = 8.64) and Claude generating the densest language (FKGL = 19.89). The prompt framework structure most strongly affected the factual accuracy and pedagogical completeness, with the RACE framework yielding the lowest hallucination index and the highest incidental alignment with NGSS curriculum standards. Across all models, the learning objectives in the fifteen lesson plans clustered at the Remember and Understand tiers of Bloom's taxonomy. There were limited higher-order verbs in the learning objectives extracted. Overall, the findings suggest that readability is significantly governed by model design, while instructional reliability and curricular alignment depend more on the prompt framework. The most effective configuration for lesson plans identified in the results was to combine a readability-optimized model with the RACE framework and an explicit checklist of physics concepts, curriculum standards, and higher-order objectives.
Comments:
20 pages, 6 tables
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
ACMclasses:
G.1.10; G.4; I.2.6; I.2.7
Cite as:
arXiv:2510.19866 [cs.CL]
(or
arXiv:2510.19866v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2510.19866
Focus to learn more
arXiv-issued DOI via DataCite
Related DOI:
https://doi.org/10.35542/osf.io/r3xkt_v1
Focus to learn more
DOI(s) linking to related resources
Submission history From: Xincheng Liu [view email] [v1]
Wed, 22 Oct 2025 02:53:06 UTC (323 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics, by Xincheng LiuView PDF
view license
Current browse context: cs.CL
prev
|
next
new
|
recent
| 2025-10
Change to browse by:
cs
cs.AI
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
84. 【2510.19864】SODBench: A Large Language Model Approach to Documenting Spreadsheet Operations
链接:https://arxiv.org/abs/2510.19864
作者:Amila Indika,Igor Molybog
类目:oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Numerous knowledge workers, knowledge workers utilize, Numerous knowledge, workers utilize spreadsheets, workers utilize
备注: 14 pages, 5 figures, 4 tables
点击查看摘要
Abstract:Numerous knowledge workers utilize spreadsheets in business, accounting, and finance. However, a lack of systematic documentation methods for spreadsheets hinders automation, collaboration, and knowledge transfer, which risks the loss of crucial institutional knowledge. This paper introduces Spreadsheet Operations Documentation (SOD), an AI task that involves generating human-readable explanations from spreadsheet operations. Many previous studies have utilized Large Language Models (LLMs) for generating spreadsheet manipulation code; however, translating that code into natural language for SOD is a less-explored area. To address this, we present a benchmark of 111 spreadsheet manipulation code snippets, each paired with a corresponding natural language summary. We evaluate five LLMs, GPT-4o, GPT-4o-mini, LLaMA-3.3-70B, Mixtral-8x7B, and Gemma2-9B, using BLEU, GLEU, ROUGE-L, and METEOR metrics. Our findings suggest that LLMs can generate accurate spreadsheet documentation, making SOD a feasible prerequisite step toward enhancing reproducibility, maintainability, and collaborative workflows in spreadsheets, although there are challenges that need to be addressed.
85. 【2510.19858】DeBERTa-KC: A Transformer-Based Classifier for Knowledge Construction in Online Learning Discourse
链接:https://arxiv.org/abs/2510.19858
作者:Jindi Wang,Yidi Zhang,Zhaoxing Li
类目:Computation and Language (cs.CL)
关键词:study presents DeBERTa-KC, textit, levels in online, study presents, automatic classification
备注:
点击查看摘要
Abstract:This study presents DeBERTa-KC, a transformer-based model for automatic classification of knowledge construction (KC) levels in online science learning discourse. Using comments collected from four popular YouTube science channels (2022--2024), a balanced corpus of 20,000 manually annotated samples was created across four KC categories: \textit{nonKC}, \textit{Share}, \textit{Explore}, and \textit{Negotiate}. The proposed model extends DeBERTa-v3 with Focal Loss, Label Smoothing, and R-Drop regularization to address class imbalance and enhance generalization. A reproducible end-to-end pipeline was implemented, encompassing data extraction, annotation, preprocessing, training, and evaluation. Across 10-fold stratified cross-validation, DeBERTa-KC achieved a macro-F1 of $0.836 \pm 0.008$, significantly out-performing both classical and transformer baselines ($p0.01$). Per-category results indicate strong sensitivity to higher-order epistemic engagement, particularly in \textit{Explore} and \textit{Negotiate} discourse. These findings demonstrate that large language models can effectively capture nuanced indicators of knowledge construction in informal digital learning environments, offering scalable, theory-informed approaches to discourse analysis and the development of automated tools for assessing epistemic engagement.
86. 【2510.19850】Prompt Decorators: A Declarative and Composable Syntax for Reasoning, Formatting, and Control in LLMs
链接:https://arxiv.org/abs/2510.19850
作者:Mostapha Kalami Heris
类目:Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:Large Language Models, Large Language, users lack consistent, lack consistent control, Language Models
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are central to reasoning, writing, and decision-support workflows, yet users lack consistent control over how they reason and express outputs. Conventional prompt engineering relies on verbose natural-language instructions, limiting reproducibility, modularity, and interpretability. This paper introduces Prompt Decorators, a declarative, composable syntax that governs LLM behavior through compact control tokens such as +++Reasoning, +++Tone(style=formal), and +++Import(topic="Systems Thinking"). Each decorator modifies a behavioral dimension, such as reasoning style, structure, or tone, without changing task content. The framework formalizes twenty core decorators organized into two functional families (Cognitive Generative and Expressive Systemic), each further decomposed into subcategories that govern reasoning, interaction, expression, and session-control. It defines a unified syntax, scoping model, and deterministic processing pipeline enabling predictable and auditable behavior composition. By decoupling task intent from execution behavior, Prompt Decorators create a reusable and interpretable interface for prompt design. Illustrative use cases demonstrate improved reasoning transparency, reduced prompt complexity, and standardized model behavior across domains. The paper concludes with implications for interoperability, behavioral consistency, and the development of declarative interfaces for scalable AI systems.
87. 【2510.19838】Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory
链接:https://arxiv.org/abs/2510.19838
作者:Shiqi He,Yue Cui,Xinyu Ma,Yaliang Li,Bolin Ding,Mosharaf Chowdhury
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language models, show strong potential, Autonomous web agents, performing goal-oriented tasks, report generation
备注:
点击查看摘要
Abstract:Autonomous web agents powered by large language models (LLMs) show strong potential for performing goal-oriented tasks such as information retrieval, report generation, and online transactions. These agents mark a key step toward practical embodied reasoning in open web environments. However, existing approaches remain limited in reasoning depth and efficiency: vanilla linear methods fail at multi-step reasoning and lack effective backtracking, while other search strategies are coarse-grained and computationally costly. We introduce Branch-and-Browse, a fine-grained web agent framework that unifies structured reasoning-acting, contextual memory, and efficient execution. It (i) employs explicit subtask management with tree-structured exploration for controllable multi-branch reasoning, (ii) bootstraps exploration through efficient web state replay with background reasoning, and (iii) leverages a page action memory to share explored actions within and across sessions. On the WebArena benchmark, Branch-and-Browse achieves a task success rate of 35.8\% and reduces execution time by up to 40.4\% relative to state-of-the-art methods. These results demonstrate that Branch-and-Browse is a reliable and efficient framework for LLM-based web agents.
88. 【2510.20728】Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems
链接:https://arxiv.org/abs/2510.20728
作者:Xi He,Sirui Lu,Bei Zeng
类目:Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Mathematical Physics (math-ph)
关键词:transversal diagonal gates, co-designs quantum codes, Subset-Sum Linear Programming, prescribed transversal diagonal, diagonal gates
备注: 29 pages, 2 figures
点击查看摘要
Abstract:We present a multi-agent, human-in-the-loop workflow that co-designs quantum codes with prescribed transversal diagonal gates. It builds on the Subset-Sum Linear Programming (SSLP) framework (arXiv:2504.20847), which partitions basis strings by modular residues and enforces $Z$-marginal Knill-Laflamme (KL) equalities via small LPs. The workflow is powered by GPT-5 and implemented within TeXRA (this https URL)-a multi-agent research assistant platform that supports an iterative tool-use loop agent and a derivation-then-edit workflow reasoning agent. We work in a LaTeX-Python environment where agents reason, edit documents, execute code, and synchronize their work to Git/Overleaf. Within this workspace, three roles collaborate: a Synthesis Agent formulates the problem; a Search Agent sweeps/screens candidates and exactifies numerics into rationals; and an Audit Agent independently checks all KL equalities and the induced logical action. As a first step we focus on distance $d=2$ with nondegenerate residues. For code dimension $K\in\{2,3,4\}$ and $n\le6$ qubits, systematic sweeps yield certificate-backed tables cataloging attainable cyclic logical groups-all realized by new codes-e.g., for $K=3$ we obtain order $16$ at $n=6$. From verified instances, Synthesis Agent abstracts recurring structures into closed-form families and proves they satisfy the KL equalities for all parameters. It further demonstrates that SSLP accommodates residue degeneracy by exhibiting a new $((6,4,2))$ code implementing the transversal controlled-phase $diag(1,1,1,i)$. Overall, the workflow recasts diagonal-transversal feasibility as an analytical pipeline executed at scale, combining systematic enumeration with exact analytical reconstruction. It yields reproducible code constructions, supports targeted extensions to larger $K$ and higher distances, and leads toward data-driven classification.
信息检索
1. 【2510.20815】Generative Reasoning Recommendation via LLMs
链接:https://arxiv.org/abs/2510.20815
作者:Minjie Hong,Zetong Zhou,Zirun Guo,Ziang Zhang,Ruofan Hu,Weinan Gan,Jieming Zhu,Zhou Zhao
类目:Information Retrieval (cs.IR)
关键词:large language models, presents significant obstacles, face fundamental challenges, reasoning recommendation models, remarkable reasoning capabilities
备注:
点击查看摘要
Abstract:Despite their remarkable reasoning capabilities across diverse domains, large language models (LLMs) face fundamental challenges in natively functioning as generative reasoning recommendation models (GRRMs), where the intrinsic modeling gap between textual semantics and collaborative filtering signals, combined with the sparsity and stochasticity of user feedback, presents significant obstacles. This work explores how to build GRRMs by adapting pre-trained LLMs, which achieves a unified understanding-reasoning-prediction manner for recommendation tasks. We propose GREAM, an end-to-end framework that integrates three components: (i) Collaborative-Semantic Alignment, which fuses heterogeneous textual evidence to construct semantically consistent, discrete item indices and auxiliary alignment tasks that ground linguistic representations in interaction semantics; (ii) Reasoning Curriculum Activation, which builds a synthetic dataset with explicit Chain-of-Thought supervision and a curriculum that progresses through behavioral evidence extraction, latent preference modeling, intent inference, recommendation formulation, and denoised sequence rewriting; and (iii) Sparse-Regularized Group Policy Optimization (SRPO), which stabilizes post-training via Residual-Sensitive Verifiable Reward and Bonus-Calibrated Group Advantage Estimation, enabling end-to-end optimization under verifiable signals despite sparse successes. GREAM natively supports two complementary inference modes: Direct Sequence Recommendation for high-throughput, low-latency deployment, and Sequential Reasoning Recommendation that first emits an interpretable reasoning chain for causal transparency. Experiments on three datasets demonstrate consistent gains over strong baselines, providing a practical path toward verifiable-RL-driven LLM recommenders.
2. 【2510.20768】RAGRank: Using PageRank to Counter Poisoning in CTI LLM Pipelines
链接:https://arxiv.org/abs/2510.20768
作者:Austin Jia,Avaneesh Ramesh,Zain Shamsi,Daniel Zhang,Alex Liu
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Large Language Model, operationalize Large Language, Cyber Threat Intelligence, Language Model, Large Language
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) has emerged as the dominant architectural pattern to operationalize Large Language Model (LLM) usage in Cyber Threat Intelligence (CTI) systems. However, this design is susceptible to poisoning attacks, and previously proposed defenses can fail for CTI contexts as cyber threat information is often completely new for emerging attacks, and sophisticated threat actors can mimic legitimate formats, terminology, and stylistic conventions. To address this issue, we propose that the robustness of modern RAG defenses can be accelerated by applying source credibility algorithms on corpora, using PageRank as an example. In our experiments, we demonstrate quantitatively that our algorithm applies a lower authority score to malicious documents while promoting trusted content, using the standardized MS MARCO dataset. We also demonstrate proof-of-concept performance of our algorithm on CTI documents and feeds.
3. 【2510.20674】Analyticup E-commerce Product Search Competition Technical Report from Team Tredence_AICOE
链接:https://arxiv.org/abs/2510.20674
作者:Rakshith R,Shubham Sharma,Mohammed Sameer Khan,Ankush Chopra
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:e-commerce search system, search system developed, AICOE team, multilingual e-commerce search, multilingual search query
备注:
点击查看摘要
Abstract:This study presents the multilingual e-commerce search system developed by the Tredence_AICOE team. The competition features two multilingual relevance tasks: Query-Category (QC) Relevance, which evaluates how well a user's search query aligns with a product category, and Query-Item (QI) Relevance, which measures the match between a multilingual search query and an individual product listing. To ensure full language coverage, we performed data augmentation by translating existing datasets into languages missing from the development set, enabling training across all target languages. We fine-tuned Gemma-3 12B and Qwen-2.5 14B model for both tasks using multiple strategies. The Gemma-3 12B (4-bit) model achieved the best QC performance using original and translated data, and the best QI performance using original, translated, and minority class data creation. These approaches secured 4th place on the final leaderboard, with an average F1-score of 0.8857 on the private test set.
4. 【2510.20609】Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets
链接:https://arxiv.org/abs/2510.20609
作者:Timur Galimzyanov,Olga Kolomyttseva,Egor Bogomolov
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Long Code Arena, code-focused generation tasks, realistic compute budgets, study retrieval design, design for code-focused
备注:
点击查看摘要
Abstract:We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations across various context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (Voyager-3 family) consistently beat sparse retrievers, however requiring 100x larger latency. (3) Optimal chunk size scales with available context: 32-64 line chunks work best at small budgets, and whole-file retrieval becomes competitive at 16000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 + word splitting offers the best quality-latency trade-off. Thus, we provide evidence-based recommendations for implementing effective code-oriented RAG systems based on task requirements, model constraints, and computational efficiency.
5. 【2510.20455】Rotate Both Ways: Time-and-Order RoPE for Generative Recommendation
链接:https://arxiv.org/abs/2510.20455
作者:Xiaokai Wei,Jiajun Wu,Daiyao Yi,Reza Shirkavand,Michelle Gong
类目:Information Retrieval (cs.IR)
关键词:typically transformer-based autoregressive, transformer-based autoregressive models, user interaction history, typically transformer-based, transformer-based autoregressive
备注:
点击查看摘要
Abstract:Generative recommenders, typically transformer-based autoregressive models, predict the next item or action from a user's interaction history. Their effectiveness depends on how the model represents where an interaction event occurs in the sequence (discrete index) and when it occurred in wall-clock time. Prevailing approaches inject time via learned embeddings or relative attention biases. In this paper, we argue that RoPE-based approaches, if designed properly, can be a stronger alternative for jointly modeling temporal and sequential information in user behavior sequences. While vanilla RoPE in LLMs considers only token order, generative recommendation requires incorporating both event time and token index. To address this, we propose Time-and-Order RoPE (TO-RoPE), a family of rotary position embedding designs that treat index and time as angle sources shaping the query-key geometry directly. We present three instantiations: early fusion, split-by-dim, and split-by-head. Extensive experiments on both publicly available datasets and a proprietary industrial dataset show that TO-RoPE variants consistently improve accuracy over existing methods for encoding time and index. These results position rotary embeddings as a simple, principled, and deployment-friendly foundation for generative recommendation.
6. 【2510.20276】From Generation to Attribution: Music AI Agent Architectures for the Post-Streaming Era
链接:https://arxiv.org/abs/2510.20276
作者:Wonil Kim,Hyeongseok Wi,Seungsoon Park,Taejun Kim,Sangeun Keum,Keunhyoung Kim,Taewan Kim,Jongmin Jung,Taehyoung Kim,Gaetan Guerrero,Mael Le Goff,Julie Po,Dongjoo Moon,Juhan Nam,Jongpil Lee
类目:Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Sound (cs.SD)
关键词:rapid growth exposes, growth exposes structural, exposes structural gaps, reshaping music creation, economic models
备注: Accepted to the NeurIPS 2025 AI4Music Workshop
点击查看摘要
Abstract:Generative AI is reshaping music creation, but its rapid growth exposes structural gaps in attribution, rights management, and economic models. Unlike past media shifts, from live performance to recordings, downloads, and streaming, AI transforms the entire lifecycle of music, collapsing boundaries between creation, distribution, and monetization. However, existing streaming systems, with opaque and concentrated royalty flows, are ill-equipped to handle the scale and complexity of AI-driven production. We propose a content-based Music AI Agent architecture that embeds attribution directly into the creative workflow through block-level retrieval and agentic orchestration. Designed for iterative, session-based interaction, the system organizes music into granular components (Blocks) stored in BlockDB; each use triggers an Attribution Layer event for transparent provenance and real-time settlement. This framework reframes AI from a generative tool into infrastructure for a Fair AI Media Platform. By enabling fine-grained attribution, equitable compensation, and participatory engagement, it points toward a post-streaming paradigm where music functions not as a static catalog but as a collaborative and adaptive ecosystem.
7. 【2510.20260】Balancing Fine-tuning and RAG: A Hybrid Strategy for Dynamic LLM Recommendation Updates
链接:https://arxiv.org/abs/2510.20260
作者:Changping Meng,Hongyi Ling,Jianling Wang,Yifan Liu,Shuzhou Zhang,Dapeng Hong,Mingyan Gao,Onkar Dalal,Ed Chi,Lichan Hong,Haokai Lu,Ningren Han
类目:Information Retrieval (cs.IR)
关键词:Large Language Models, Large Language, Language Models, empower recommendation systems, empower recommendation
备注: RecSys 2025 Industry Track
点击查看摘要
Abstract:Large Language Models (LLMs) empower recommendation systems through their advanced reasoning and planning capabilities. However, the dynamic nature of user interests and content poses a significant challenge: While initial fine-tuning aligns LLMs with domain knowledge and user preferences, it fails to capture such real-time changes, necessitating robust update mechanisms. This paper investigates strategies for updating LLM-powered recommenders, focusing on the trade-offs between ongoing fine-tuning and Retrieval-Augmented Generation (RAG). Using an LLM-powered user interest exploration system as a case study, we perform a comparative analysis of these methods across dimensions like cost, agility, and knowledge incorporation. We propose a hybrid update strategy that leverages the long-term knowledge adaptation of periodic fine-tuning with the agility of low-cost RAG. We demonstrate through live A/B experiments on a billion-user platform that this hybrid approach yields statistically significant improvements in user satisfaction, offering a practical and cost-effective framework for maintaining high-quality LLM-powered recommender systems.
8. 【2510.20193】Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
链接:https://arxiv.org/abs/2510.20193
作者:Rahul Raja,Arpita Vats
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Question Answering, structured text data, structured metadata, structured text, traditionally relied
备注: In Proceedings of the 2nd ACM Workshop in AI-powered Question and Answering Systems (AIQAM '25), October 27-28, 2025, Dublin, Ireland. ACM, New York, NY, USA, 8 pages. [this https URL](https://doi.org/10.1145/3746274.3760393)
点击查看摘要
Abstract:Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.
9. 【2510.20150】Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning
链接:https://arxiv.org/abs/2510.20150
作者:Yaochen Zhu,Harald Steck,Dawen Liang,Yinhan He,Jundong Li,Nathan Kallus
类目:Information Retrieval (cs.IR)
关键词:Large language models, Large language, recommender system paradigm, language models, paradigm by enabling
备注:
点击查看摘要
Abstract:Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at this https URL.
10. 【2510.19986】Automating Iconclass: LLMs and RAG for Large-Scale Classification of Religious Woodcuts
链接:https://arxiv.org/abs/2510.19986
作者:Drew B. Thomas
类目:Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Language Models, Large Language, Holy Roman Empire, Retrieval-Augmented Generation
备注: 29 pages, 7 figures. First presented at the "Digital Humanities and Artificial Intelligence" conference at the University of Reading on 17 June 2024
点击查看摘要
Abstract:This paper presents a novel methodology for classifying early modern religious images by using Large Language Models (LLMs) and vector databases in combination with Retrieval-Augmented Generation (RAG). The approach leverages the full-page context of book illustrations from the Holy Roman Empire, allowing the LLM to generate detailed descriptions that incorporate both visual and textual elements. These descriptions are then matched to relevant Iconclass codes through a hybrid vector search. This method achieves 87% and 92% precision at five and four levels of classification, significantly outperforming traditional image and keyword-based searches. By employing full-page descriptions and RAG, the system enhances classification accuracy, offering a powerful tool for large-scale analysis of early modern visual archives. This interdisciplinary approach demonstrates the growing potential of LLMs and RAG in advancing research within art history and digital humanities.
计算机视觉
1. 【2510.20822】HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
链接:https://arxiv.org/abs/2510.20822
作者:Yihao Meng,Hao Ouyang,Yue Yu,Qiuyu Wang,Wen Wang,Ka Leong Cheng,Hanlin Wang,Yixuan Li,Cheng Chen,Yanhong Zeng,Yujun Shen,Huamin Qu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generating isolated clips, creating the coherent, essence of storytelling, excel at generating, generating isolated
备注: Project page and code: [this https URL](https://holo-cine.github.io/)
点击查看摘要
Abstract:State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: this https URL.
2. 【2510.20820】LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas
链接:https://arxiv.org/abs/2510.20820
作者:Guocheng Gordon Qian,Ruihang Zhang,Tsai-Shien Chen,Yusuf Dalva,Anujraaj Argo Goyal,Willi Menapace,Ivan Skorokhodov,Meng Dong,Arpit Sahni,Daniil Ostashev,Ju Hu,Sergey Tulyakov,Kuan-Chieh Jackson Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generative models lack, models lack interactive, existing personalized generative, personalized generative models, impressive visual fidelity
备注: 9 pages, preprint
点击查看摘要
Abstract:Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the proposed layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a new complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in multi-subject personalized image generation.
3. 【2510.20819】owards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge
链接:https://arxiv.org/abs/2510.20819
作者:Nimrod Berman,Omkar Joglekar,Eitan Kosman,Dotan Di Castro,Omri Azencot
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:complex data distributions, positioned diffusion models, Denoising Diffusion Bridge, Recent advances, Diffusion Bridge Model
备注:
点击查看摘要
Abstract:Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: this https URL.
4. 【2510.20814】SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution
链接:https://arxiv.org/abs/2510.20814
作者:Ritik Shah,Marco F Duarte
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:causing blurred boundaries, low spatial resolution, sensors capture dense, Hyperspectral sensors capture, capture dense spectra
备注:
点击查看摘要
Abstract:Hyperspectral sensors capture dense spectra per pixel but suffer from low spatial resolution, causing blurred boundaries and mixed-pixel effects. Co-registered companion sensors such as multispectral, RGB, or panchromatic cameras provide high-resolution spatial detail, motivating hyperspectral super-resolution through the fusion of hyperspectral and multispectral images (HSI-MSI). Existing deep learning based methods achieve strong performance but rely on opaque regressors that lack interpretability and often fail when the MSI has very few bands. We propose SpectraMorph, a physics-guided self-supervised fusion framework with a structured latent space. Instead of direct regression, SpectraMorph enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution HSI, and a compact multilayer perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed by linear mixing, with training performed in a self-supervised manner via the MSI sensor's spectral response function. SpectraMorph produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band (pan-chromatic) MSI. Experiments on synthetic and real-world datasets show SpectraMorph consistently outperforming state-of-the-art unsupervised/self-supervised baselines while remaining very competitive against supervised baselines.
5. 【2510.20813】GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation
链接:https://arxiv.org/abs/2510.20813
作者:Guangqi Jiang,Haoran Chang,Ri-Zhao Qiu,Yutong Liang,Mazeyu Ji,Jiyue Zhu,Zhao Dong,Xueyan Zou,Xiaolong Wang
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:paper presents GSWorld, Gaussian Splatting, Gaussian Scene Description, Splatting with physics, presents GSWorld
备注:
点击查看摘要
Abstract:This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates "closing the loop" of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning. Website: this https URL.
6. 【2510.20812】Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
链接:https://arxiv.org/abs/2510.20812
作者:Yuhan Liu,Lianhui Qin,Shengjie Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:fine-grained graphical elements, achieved remarkable progress, densely interleave textual, interleave textual annotations, multimodal understanding
备注:
点击查看摘要
Abstract:Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at this https URL
7. 【2510.20809】Real Deep Research for AI, Robotics and Beyond
链接:https://arxiv.org/abs/2510.20809
作者:Xueyan Zou,Jianglong Ye,Hao Zhang,Xiaoyu Xiang,Mingyu Ding,Zhaojing Yang,Yong Jae Lee,Zhuowen Tu,Sifei Liu,Xiaolong Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:rapid growth, increasingly difficult, Real Deep Research, Fast evolving trends, difficult for researchers
备注: website: [this https URL](https://realdeepresearch.github.io)
点击查看摘要
Abstract:With the rapid growth of research in AI and robotics now producing over 10,000 papers annually it has become increasingly difficult for researchers to stay up to date. Fast evolving trends, the rise of interdisciplinary work, and the need to explore domains beyond one's expertise all contribute to this challenge. To address these issues, we propose a generalizable pipeline capable of systematically analyzing any research area: identifying emerging trends, uncovering cross domain opportunities, and offering concrete starting points for new inquiry. In this work, we present Real Deep Research (RDR) a comprehensive framework applied to the domains of AI and robotics, with a particular focus on foundation models and robotics advancements. We also briefly extend our analysis to other areas of science. The main paper details the construction of the RDR pipeline, while the appendix provides extensive results across each analyzed topic. We hope this work sheds light for researchers working in the field of AI and beyond.
8. 【2510.20807】Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers
链接:https://arxiv.org/abs/2510.20807
作者:Dean L Slack,G Thomas Hudson,Thomas Winterbottom,Noura Al Moubayed
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:autoregressive large language, large language models, visual domain, large language, recent success
备注: 14 pages, 14 figures
点击查看摘要
Abstract:Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time; a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction, utilizing continuous pixel-space representations for video prediction. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful to perform accurate estimations of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter efficient, and interpretable approach.
9. 【2510.20803】ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
链接:https://arxiv.org/abs/2510.20803
作者:Xiaolong Wang,Lixiang Ru,Ziyuan Huang,Kaixiang Ji,Dandan Zheng,Jingdong Chen,Jun Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:AutoRegressive Generation-based paradigm, AutoRegressive Generation-based, Generation-based paradigm, achieving multimodal understanding, achieving multimodal
备注: Accepted to NeurIPS 2025, 18 pages
点击查看摘要
Abstract:We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary points representation or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLM based on image generation, which naturally produces dense masks for target objects. We leverage MLLM to output visual tokens and detokenize them into images using an universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.
10. 【2510.20800】Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples
链接:https://arxiv.org/abs/2510.20800
作者:Shiva Sreeram,Alaa Maalouf,Pratyusha Sharma,Daniela Rus
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:pruning high-order components, chosen LLM weight, LLM weight matrices, boost downstream accuracy, method called
备注:
点击查看摘要
Abstract:Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM's weight matrices can boost downstream accuracy -- without any gradient-based fine-tuning. Yet LASER's exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) Only a small, carefully chosen subset of matrices needs to be inspected -- eliminating the layer-by-layer sweep, (ii) The gradient of each matrix's singular values pinpoints which matrices merit reduction, (iii) Increasing the factorization search space by allowing matrices rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) we discover that evaluating on just 100 samples rather than the full training data -- both for computing the indicative gradients and for measuring the final accuracy -- suffices to further reduce the search time; we explain that as adaptation to downstream tasks is dominated by prompting style, not dataset size. As a result, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets -- entirely without fine-tuning.
11. 【2510.20794】Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature
链接:https://arxiv.org/abs/2510.20794
作者:Lei Cheng,Siyang Cao
类目:Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
关键词:minimizing manual interventions, enhance tracking efficiency, manual interventions, presents a Multi-Object, efficiency while minimizing
备注: accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)
点击查看摘要
Abstract:This paper presents a Multi-Object Tracking (MOT) framework that fuses radar and camera data to enhance tracking efficiency while minimizing manual interventions. Contrary to many studies that underutilize radar and assign it a supplementary role--despite its capability to provide accurate range/depth information of targets in a world 3D coordinate system--our approach positions radar in a crucial role. Meanwhile, this paper utilizes common features to enable online calibration to autonomously associate detections from radar and camera. The main contributions of this work include: (1) the development of a radar-camera fusion MOT framework that exploits online radar-camera calibration to simplify the integration of detection results from these two sensors, (2) the utilization of common features between radar and camera data to accurately derive real-world positions of detected objects, and (3) the adoption of feature matching and category-consistency checking to surpass the limitations of mere position matching in enhancing sensor association accuracy. To the best of our knowledge, we are the first to investigate the integration of radar-camera common features and their use in online calibration for achieving MOT. The efficacy of our framework is demonstrated by its ability to streamline the radar-camera mapping process and improve tracking precision, as evidenced by real-world experiments conducted in both controlled environments and actual traffic scenarios. Code is available at this https URL
12. 【2510.20776】CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image
链接:https://arxiv.org/abs/2510.20776
作者:Binbin Huang,Haobin Duan,Yiqun Zhao,Zibo Zhao,Yi Ma,Shenghua Gao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:work proposes, accurately infers, named Cupid, Cupid, input camera poses
备注: project page at [this https URL](https://cupid3d.github.io)
点击查看摘要
Abstract:This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both input camera poses and 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate Cupid outperforms leading 3D reconstruction methods with an over 3 dB PSNR gain and an over 10% Chamfer Distance reduction, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit this http URL.
13. 【2510.20771】AlphaFlow: Understanding and Improving MeanFlow Models
链接:https://arxiv.org/abs/2510.20771
作者:Huijie Zhang,Aliaksandr Siarohin,Willi Menapace,Michael Vasilkovsky,Sergey Tulyakov,Qing Qu,Ivan Skorokhodov
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:few-step generative modeling, generative modeling trained, fully understood, trajectory flow matching, flow
备注:
点击查看摘要
Abstract:MeanFlow has recently emerged as a powerful framework for few-step generative modeling trained from scratch, but its success is not yet fully understood. In this work, we show that the MeanFlow objective naturally decomposes into two parts: trajectory flow matching and trajectory consistency. Through gradient analysis, we find that these terms are strongly negatively correlated, causing optimization conflict and slow convergence. Motivated by these insights, we introduce $\alpha$-Flow, a broad family of objectives that unifies trajectory flow matching, Shortcut Model, and MeanFlow under one formulation. By adopting a curriculum strategy that smoothly anneals from trajectory flow matching to MeanFlow, $\alpha$-Flow disentangles the conflicting objectives, and achieves better convergence. When trained from scratch on class-conditional ImageNet-1K 256x256 with vanilla DiT backbones, $\alpha$-Flow consistently outperforms MeanFlow across scales and settings. Our largest $\alpha$-Flow-XL/2+ model achieves new state-of-the-art results using vanilla DiT backbones, with FID scores of 2.58 (1-NFE) and 2.15 (2-NFE).
14. 【2510.20766】DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
链接:https://arxiv.org/abs/2510.20766
作者:Noam Issachar,Guy Yariv,Sagie Benaim,Yossi Adi,Dani Lischinski,Raanan Fattal
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains extremely costly, extremely costly due, self-attention mechanism quadratic, mechanism quadratic scaling, Dynamic Position Extrapolation
备注:
点击查看摘要
Abstract:Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high-frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching their frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at this https URL.
15. 【2510.20762】MEIcoder: Decoding Visual Stimuli from Neural Activity by Leveraging Most Exciting Inputs
链接:https://arxiv.org/abs/2510.20762
作者:Jan Sobotka,Luca Baroni,Ján Antolík
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:neural population activity, brain-machine interfaces, neural population, crucial for understanding, understanding the brain
备注: Accepted to NeurIPS 2025
点击查看摘要
Abstract:Decoding visual stimuli from neural population activity is crucial for understanding the brain and for applications in brain-machine interfaces. However, such biological data is often scarce, particularly in primates or humans, where high-throughput recording techniques, such as two-photon imaging, remain challenging or impossible to apply. This, in turn, poses a challenge for deep learning decoding techniques. To overcome this, we introduce MEIcoder, a biologically informed decoding method that leverages neuron-specific most exciting inputs (MEIs), a structural similarity index measure loss, and adversarial training. MEIcoder achieves state-of-the-art performance in reconstructing visual stimuli from single-cell activity in primary visual cortex (V1), especially excelling on small datasets with fewer recorded neurons. Using ablation studies, we demonstrate that MEIs are the main drivers of the performance, and in scaling experiments, we show that MEIcoder can reconstruct high-fidelity natural-looking images from as few as 1,000-2,500 neurons and less than 1,000 training data points. We also propose a unified benchmark with over 160,000 samples to foster future research. Our results demonstrate the feasibility of reliable decoding in early visual system and provide practical insights for neuroscience and neuroengineering applications.
16. 【2510.20754】ACS-SegNet: An Attention-Based CNN-SegFormer Segmentation Network for Tissue Segmentation in Histopathology
链接:https://arxiv.org/abs/2510.20754
作者:Nima Torbati,Anastasia Meshcheryakova,Ramona Woitek,Diana Mechtcheriakova,Amirreza Mahbod
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Automated histopathological image, histopathological image analysis, image analysis plays, Automated histopathological, analysis plays
备注: 5 pages
点击查看摘要
Abstract:Automated histopathological image analysis plays a vital role in computer-aided diagnosis of various diseases. Among developed algorithms, deep learning-based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention-driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual-encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved {\mu}IoU/{\mu}Dice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: this https URL
17. 【2510.20726】AutoScape: Geometry-Consistent Long-Horizon Scene Generation
链接:https://arxiv.org/abs/2510.20726
作者:Jiacheng Chen,Ziyu Jiang,Mingfu Liang,Bingbing Zhuang,Jong-Chyi Su,Sparsh Garg,Ying Wu,Manmohan Chandraker
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:scene generation framework, paper proposes AutoScape, generation framework, paper proposes, driving scene generation
备注: ICCV 2025. Project page: [this https URL](https://auto-scape.github.io)
点击查看摘要
Abstract:This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6\% and 43.0\%, respectively.
18. 【2510.20708】ALICE-LRI: A General Method for Lossless Range Image Generation for Spinning LiDAR Sensors without Calibration Metadata
链接:https://arxiv.org/abs/2510.20708
作者:Samuel Soutullo,Miguel Yermo,David L. Vilariño,Óscar G. Lorenzo,José C. Cabaleiro,Francisco F. Rivera
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:environmental monitoring, autonomous navigation, range images, essential for autonomous, Lossless Range Images
备注:
点击查看摘要
Abstract:3D LiDAR sensors are essential for autonomous navigation, environmental monitoring, and precision mapping in remote sensing applications. To efficiently process the massive point clouds generated by these sensors, LiDAR data is often projected into 2D range images that organize points by their angular positions and distances. While these range image representations enable efficient processing, conventional projection methods suffer from fundamental geometric inconsistencies that cause irreversible information loss, compromising high-fidelity applications. We present ALICE-LRI (Automatic LiDAR Intrinsic Calibration Estimation for Lossless Range Images), the first general, sensor-agnostic method that achieves lossless range image generation from spinning LiDAR point clouds without requiring manufacturer metadata or calibration files. Our algorithm automatically reverse-engineers the intrinsic geometry of any spinning LiDAR sensor by inferring critical parameters including laser beam configuration, angular distributions, and per-beam calibration corrections, enabling lossless projection and complete point cloud reconstruction with zero point loss. Comprehensive evaluation across the complete KITTI and DurLAR datasets demonstrates that ALICE-LRI achieves perfect point preservation, with zero points lost across all point clouds. Geometric accuracy is maintained well within sensor precision limits, establishing geometric losslessness with real-time performance. We also present a compression case study that validates substantial downstream benefits, demonstrating significant quality improvements in practical applications. This paradigm shift from approximate to lossless LiDAR projections opens new possibilities for high-precision remote sensing applications requiring complete geometric preservation.
19. 【2510.20707】Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
链接:https://arxiv.org/abs/2510.20707
作者:Xuyang Liu,Xiyan Gui,Yuchao Zhang,Linfeng Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent large vision-language, large vision-language models, limits deployment scalability, critical memory bottleneck, fundamentally limits deployment
备注: Our code is available at [this https URL](https://github.com/xuyang-liu16/MixKV)
点击查看摘要
Abstract:Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose \texttt{MixKV}, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. \texttt{MixKV} adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that \texttt{MixKV} consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), \texttt{MixKV} improves baseline methods by an average of \textbf{5.1\%} across five multi-modal understanding benchmarks and achieves remarkable gains of \textbf{8.0\%} and \textbf{9.0\%} for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, \texttt{MixKV} extends seamlessly to LLMs with comparable performance gains. Our code is available at \href{this https URL}{\textcolor{citeblue}{this https URL}}.
20. 【2510.20696】Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward
链接:https://arxiv.org/abs/2510.20696
作者:Jing Bi,Guangyu Sun,Ali Vosoughi,Chen Chen,Chenliang Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal large language, Multimodal large, textual reasoning leverage, complex visual tasks, large language models
备注: 5 pages
点击查看摘要
Abstract:Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results highlight future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.
21. 【2510.20673】Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling
链接:https://arxiv.org/abs/2510.20673
作者:Jinhee Kim,Jae Jun An,Kang Eun Jeon,Jong Hwan Ko
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Multi-bit quantization networks, deep neural networks, Multi-bit quantization, supporting multiple precision, multiple precision levels
备注:
点击查看摘要
Abstract:Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time up to 7.88x. Our code is released at this https URL.
22. 【2510.20669】HybridSOMSpikeNet: A Deep Model with Differentiable Soft Self-Organizing Maps and Spiking Dynamics for Waste Classification
链接:https://arxiv.org/abs/2510.20669
作者:Debojyoti Ghosh,Adrijit Goswami
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate waste classification, footprint of urbanization, Accurate waste, vital for achieving, achieving sustainable waste
备注:
点击查看摘要
Abstract:Accurate waste classification is vital for achieving sustainable waste management and reducing the environmental footprint of urbanization. Misclassification of recyclable materials contributes to landfill accumulation, inefficient recycling, and increased greenhouse gas emissions. To address these issues, this study introduces HybridSOMSpikeNet, a hybrid deep learning framework that integrates convolutional feature extraction, differentiable self-organization, and spiking-inspired temporal processing to enable intelligent and energy-efficient waste classification. The proposed model employs a pre-trained ResNet-152 backbone to extract deep spatial representations, followed by a Differentiable Soft Self-Organizing Map (Soft-SOM) that enhances topological clustering and interpretability. A spiking neural head accumulates temporal activations over discrete time steps, improving robustness and generalization. Trained on a ten-class waste dataset, HybridSOMSpikeNet achieved a test accuracy of 97.39%, outperforming several state-of-the-art architectures while maintaining a lightweight computational profile suitable for real-world deployment. Beyond its technical innovations, the framework provides tangible environmental benefits. By enabling precise and automated waste segregation, it supports higher recycling efficiency, reduces contamination in recyclable streams, and minimizes the ecological and operational costs of waste processing. The approach aligns with global sustainability priorities, particularly the United Nations Sustainable Development Goals (SDG 11 and SDG 12), by contributing to cleaner cities, circular economy initiatives, and intelligent environmental management systems.
23. 【2510.20661】UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset
链接:https://arxiv.org/abs/2510.20661
作者:Chen Zhao,En Ci,Yunzhe Xu,Tiehan Fan,Shanyan Guan,Yanhao Ge,Jian Yang,Ying Tai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:notable progress, UHR, large-scale high-quality UHR, Discrete Fourier Transform, UHR scenarios
备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Ultra-high-resolution (UHR) text-to-image (T2I) generation has seen notable progress. However, two key challenges remain : 1) the absence of a large-scale high-quality UHR T2I dataset, and (2) the neglect of tailored training strategies for fine-grained detail synthesis in UHR scenarios. To tackle the first challenge, we introduce \textbf{UltraHR-100K}, a high-quality dataset of 100K UHR images with rich captions, offering diverse content and strong visual fidelity. Each image exceeds 3K resolution and is rigorously curated based on detail richness, content complexity, and aesthetic quality. To tackle the second challenge, we propose a frequency-aware post-training method that enhances fine-detail generation in T2I diffusion models. Specifically, we design (i) \textit{Detail-Oriented Timestep Sampling (DOTS)} to focus learning on detail-critical denoising steps, and (ii) \textit{Soft-Weighting Frequency Regularization (SWFR)}, which leverages Discrete Fourier Transform (DFT) to softly constrain frequency components, encouraging high-frequency detail preservation. Extensive experiments on our proposed UltraHR-eval4K benchmarks demonstrate that our approach significantly improves the fine-grained detail quality and overall fidelity of UHR image generation. The code is available at \href{this https URL}{here}.
24. 【2510.20639】Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging
链接:https://arxiv.org/abs/2510.20639
作者:Ibrahim Ethem Hamamci,Sezgin Er,Suprosanna Shit,Hadrien Reynaud,Dong Yang,Pengfei Guo,Marc Edgar,Daguang Xu,Bernhard Kainz,Bjoern Menze
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:large-scale computed tomography, paired free-text reports, Recent progress, powerful pretrained models, stronger architectures
备注: NeurIPS 2025
点击查看摘要
Abstract:Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512*512*241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: this https URL
25. 【2510.20634】Deep Learning in Dental Image Analysis: A Systematic Review of Datasets, Methodologies, and Emerging Challenges
链接:https://arxiv.org/abs/2510.20634
作者:Zhenhuan Zhou,Jingbo Zhu,Yuchen Zhang,Xiaohang Guan,Peng Wang,Tao Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:optimal treatment planning, achieve accurate diagnosis, Efficient analysis, crucial for dentists, dentists to achieve
备注: 52 pages, 24 figures. Under Review
点击查看摘要
Abstract:Efficient analysis and processing of dental images are crucial for dentists to achieve accurate diagnosis and optimal treatment planning. However, dental imaging inherently poses several challenges, such as low contrast, metallic artifacts, and variations in projection angles. Combined with the subjectivity arising from differences in clinicians' expertise, manual interpretation often proves time-consuming and prone to inconsistency. Artificial intelligence (AI)-based automated dental image analysis (DIA) offers a promising solution to these issues and has become an integral part of computer-aided dental diagnosis and treatment. Among various AI technologies, deep learning (DL) stands out as the most widely applied and influential approach due to its superior feature extraction and representation capabilities. To comprehensively summarize recent progress in this field, we focus on the two fundamental aspects of DL research-datasets and models. In this paper, we systematically review 260 studies on DL applications in DIA, including 49 papers on publicly available dental datasets and 211 papers on DL-based algorithms. We first introduce the basic concepts of dental imaging and summarize the characteristics and acquisition methods of existing datasets. Then, we present the foundational techniques of DL and categorize relevant models and algorithms according to different DIA tasks, analyzing their network architectures, optimization strategies, training methods, and performance. Furthermore, we summarize commonly used training and evaluation metrics in the DIA domain. Finally, we discuss the current challenges of existing research and outline potential future directions. We hope that this work provides a valuable and systematic reference for researchers in this field. All supplementary materials and detailed comparison tables will be made publicly available on GitHub.
26. 【2510.20622】SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding
链接:https://arxiv.org/abs/2510.20622
作者:Yuan Sheng,Yanbin Hao,Chenxu Li,Shuo Wang,Xiangnan He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:temporally scattered content, remains challenging due, Long video understanding, understanding remains challenging, video understanding remains
备注:
点击查看摘要
Abstract:Long video understanding remains challenging due to its complex, diverse, and temporally scattered content. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that leverages LLM reasoning over captions, and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Extensive experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, demonstrating the importance of consensus-driven evidence selection for Video-LLMs.
27. 【2510.20605】OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects
链接:https://arxiv.org/abs/2510.20605
作者:Mark He Huang,Lin Geng Foo,Christian Theobalt,Ying Sun,De Wen Soh
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:video remains challenging, monocular video remains, arbitrary object motion, remains challenging, reliable pose
备注: NeurIPS 2025 (Spotlight)
点击查看摘要
Abstract:Free-moving object reconstruction from monocular video remains challenging, particularly without reliable pose or depth cues and under arbitrary object motion. We introduce OnlineSplatter, a novel online feed-forward framework generating high-quality, object-centric 3D Gaussians directly from RGB frames without requiring camera pose, depth priors, or bundle optimization. Our approach anchors reconstruction using the first frame and progressively refines the object representation through a dense Gaussian primitive field, maintaining constant computational cost regardless of video sequence length. Our core contribution is a dual-key memory module combining latent appearance-geometry keys with explicit directional keys, robustly fusing current frame features with temporally aggregated object states. This design enables effective handling of free-moving objects via spatial-guided memory readout and an efficient sparsification mechanism, ensuring comprehensive yet compact object coverage. Evaluations on real-world datasets demonstrate that OnlineSplatter significantly outperforms state-of-the-art pose-free reconstruction baselines, consistently improving with more observations while maintaining constant memory and runtime.
28. 【2510.20596】Unsupervised Domain Adaptation via Similarity-based Prototypes for Cross-Modality Segmentation
链接:https://arxiv.org/abs/2510.20596
作者:Ziyu Ye,Chen Ju,Chaofan Ma,Xiaoyun Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:achieved great success, Deep learning models, face drastic performance, drastic performance degradation, vision challenges
备注: MICCAI 2021
点击查看摘要
Abstract:Deep learning models have achieved great success on various vision challenges, but a well-trained model would face drastic performance degradation when applied to unseen data. Since the model is sensitive to domain shift, unsupervised domain adaptation attempts to reduce the domain gap and avoid costly annotation of unseen domains. This paper proposes a novel framework for cross-modality segmentation via similarity-based prototypes. In specific, we learn class-wise prototypes within an embedding space, then introduce a similarity constraint to make these prototypes representative for each semantic class while separable from different classes. Moreover, we use dictionaries to store prototypes extracted from different images, which prevents the class-missing problem and enables the contrastive learning of prototypes, and further improves performance. Extensive experiments show that our method achieves better results than other state-of-the-art methods.
29. 【2510.20586】GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models
链接:https://arxiv.org/abs/2510.20586
作者:Muhammad Atif Butt,Alexandra Gomez-Villa,Tao Wu,Javier Vazquez-Corral,Joost Van De Weijer,Kai Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:producing high-quality images, Recent years, unified models producing, models producing high-quality, image generative
备注:
点击查看摘要
Abstract:Recent years have seen impressive advances in text-to-image generation, with image generative or unified models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess color precision. Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. With 44K color-focused prompts covering 400+ colors, it reveals models' true capabilities via perceptual and automated assessments. Evaluations of popular text-to-image models using GenColorBench show performance variations, highlighting which color conventions models understand best and identifying failure modes. Our GenColorBench assessments will guide improvements in precise color generation. The benchmark will be made public upon acceptance.
30. 【2510.20579】Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
链接:https://arxiv.org/abs/2510.20579
作者:Jiahao Meng,Xiangtai Li,Haochen Wang,Yue Tan,Tao Zhang,Lingdong Kong,Yunhai Tong,Anran Wang,Zhiyang Teng,Yujing Wang,Zhuochen Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
关键词:generate textual reasoning, textual reasoning traces, generate textual, video reasoning models, reasoning
备注:
点击查看摘要
Abstract:Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% on the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
31. 【2510.20578】EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence
链接:https://arxiv.org/abs/2510.20578
作者:Ding Zou,Feifan Wang,Mengyu Ge,Siyuan Fan,Zongbing Zhang,Wei Chen,Lingfeng Wang,Zhongyou Hu,Wenrui Yan,Zhengwei Gao,Hao Wang,Weizhao Jin,Yu Zhang,Hainan Zhao,Mingliang Zhang,Xianxian Xi,Yaru Zhang,Wenyuan Li,Zhengguang Gao,Yurui Zhu
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Artificial General Intelligence, robust spatial perception, realization of Artificial, effective task planning, General Intelligence
备注:
点击查看摘要
Abstract:The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unauthentic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augumented Group Relative Policy Optimization (Step-GRPO), which boosts long-horizon task success by integrating preceding steps as Guided Precursors. Furthermore, we incorporate a comprehensive reward system, including a Generative Reward Model (GRM) accelerated at the infrastructure level, to improve training efficiency. For enable thorough validation, we establish a three-part evaluation system encompassing General, Planning, and End-to-End Simulation Benchmarks, highlighted by the proposal and open-sourcing of a novel, challenging simulation environment. Experimental results demonstrate that EmbodiedBrain achieves superior performance across all metrics, establishing a new state-of-the-art for embodied foundation models. Towards paving the way for the next generation of generalist embodied agents, we open-source all of our data, model weight, and evaluating methods, which are available at this https URL.
32. 【2510.20558】From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail
链接:https://arxiv.org/abs/2510.20558
作者:Xiaohan Sun,Carol O'Sullivan
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
关键词:Neural Radiance Fields, levels of detail, viewing distances, crowd character representations, investigate how users
备注:
点击查看摘要
Abstract:In this paper, we investigate how users perceive the visual quality of crowd character representations at different levels of detail (LoD) and viewing distances. Each representation: geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), and 3D Gaussians, exhibits distinct trade-offs between visual fidelity and computational performance. Our qualitative and quantitative results provide insights to guide the design of perceptually optimized LoD strategies for crowd rendering.
33. 【2510.20550】From Cheap to Pro: A Learning-based Adaptive Camera Parameter Network for Professional-Style Imaging
链接:https://arxiv.org/abs/2510.20550
作者:Fuchen Li,Yansong Du,Wenbo Cheng,Xiaoxia Zhou,Sen Yin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:high dynamic range, Consumer-grade camera systems, maintain stable image, complex illumination conditions, color temperature variation
备注: 13 pages. Code and project page will be released
点击查看摘要
Abstract:Consumer-grade camera systems often struggle to maintain stable image quality under complex illumination conditions such as low light, high dynamic range, and backlighting, as well as spatial color temperature variation. These issues lead to underexposure, color casts, and tonal inconsistency, which degrade the performance of downstream vision tasks. To address this, we propose ACamera-Net, a lightweight and scene-adaptive camera parameter adjustment network that directly predicts optimal exposure and white balance from RAW inputs. The framework consists of two modules: ACamera-Exposure, which estimates ISO to alleviate underexposure and contrast loss, and ACamera-Color, which predicts correlated color temperature and gain factors for improved color consistency. Optimized for real-time inference on edge devices, ACamera-Net can be seamlessly integrated into imaging pipelines. Trained on diverse real-world data with annotated references, the model generalizes well across lighting conditions. Extensive experiments demonstrate that ACamera-Net consistently enhances image quality and stabilizes perception outputs, outperforming conventional auto modes and lightweight baselines without relying on additional image enhancement modules.
34. 【2510.20549】Deep Learning-Powered Visual SLAM Aimed at Assisting Visually Impaired Navigation
链接:https://arxiv.org/abs/2510.20549
作者:Marziyeh Bamdad,Hans-Peter Hutter,Alireza Darvishy
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:challenging lighting remains, lighting remains, remains an open, SLAM technologies, open challenge
备注: 8 pages, 7 figures, 4 tables
点击查看摘要
Abstract:Despite advancements in SLAM technologies, robust operation under challenging conditions such as low-texture, motion-blur, or challenging lighting remains an open challenge. Such conditions are common in applications such as assistive navigation for the visually impaired. These challenges undermine localization accuracy and tracking stability, reducing navigation reliability and safety. To overcome these limitations, we present SELM-SLAM3, a deep learning-enhanced visual SLAM framework that integrates SuperPoint and LightGlue for robust feature extraction and matching. We evaluated our framework using TUM RGB-D, ICL-NUIM, and TartanAir datasets, which feature diverse and challenging scenarios. SELM-SLAM3 outperforms conventional ORB-SLAM3 by an average of 87.84% and exceeds state-of-the-art RGB-D SLAM systems by 36.77%. Our framework demonstrates enhanced performance under challenging conditions, such as low-texture scenes and fast motion, providing a reliable platform for developing navigation aids for the visually impaired.
35. 【2510.20539】Blur2seq: Blind Deblurring and Camera Trajectory Estimation from a Single Camera Motion-blurred Image
链接:https://arxiv.org/abs/2510.20539
作者:Guillermo Carbajal,Andrés Almansa,Pablo Musé
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Motion blur caused, Projective Motion Blur, rotational movements, remains a major, camera motion
备注:
点击查看摘要
Abstract:Motion blur caused by camera shake, particularly under large or rotational movements, remains a major challenge in image restoration. We propose a deep learning framework that jointly estimates the latent sharp image and the underlying camera motion trajectory from a single blurry image. Our method leverages the Projective Motion Blur Model (PMBM), implemented efficiently using a differentiable blur creation module compatible with modern networks. A neural network predicts a full 3D rotation trajectory, which guides a model-based restoration network trained end-to-end. This modular architecture provides interpretability by revealing the camera motion that produced the blur. Moreover, this trajectory enables the reconstruction of the sequence of sharp images that generated the observed blurry image. To further refine results, we optimize the trajectory post-inference via a reblur loss, improving consistency between the blurry input and the restored output. Extensive experiments show that our method achieves state-of-the-art performance on both synthetic and real datasets, particularly in cases with severe or spatially variant blur, where end-to-end deblurring networks struggle. Code and trained models are available at this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:
arXiv:2510.20539 [cs.CV]
(or
arXiv:2510.20539v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.20539
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
36. 【2510.20531】Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis
链接:https://arxiv.org/abs/2510.20531
作者:Lixiong Qin,Yang Zhang,Mei Wang,Jiani Hu,Weihong Deng,Weiran Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Multimodal Large Language, Large Language, Explainable DeepFake Analysis, Language Models
备注: 25 pages, 9 figures, 17 tables
点击查看摘要
Abstract:The advancement of Multimodal Large Language Models (MLLMs) has bridged the gap between vision and language tasks, enabling the implementation of Explainable DeepFake Analysis (XDFA). However, current methods suffer from a lack of fine-grained awareness: the description of artifacts in data annotation is unreliable and coarse-grained, and the models fail to support the output of connections between textual forgery explanations and the visual evidence of artifacts, as well as the input of queries for arbitrary facial regions. As a result, their responses are not sufficiently grounded in Face Visual Context (Facext). To address this limitation, we propose the Fake-in-Facext (FiFa) framework, with contributions focusing on data annotation and model construction. We first define a Facial Image Concept Tree (FICT) to divide facial images into fine-grained regional concepts, thereby obtaining a more reliable data annotation pipeline, FiFa-Annotator, for forgery explanation. Based on this dedicated data annotation, we introduce a novel Artifact-Grounding Explanation (AGE) task, which generates textual forgery explanations interleaved with segmentation masks of manipulated artifacts. We propose a unified multi-task learning architecture, FiFa-MLLM, to simultaneously support abundant multimodal inputs and outputs for fine-grained Explainable DeepFake Analysis. With multiple auxiliary supervision tasks, FiFa-MLLM can outperform strong baselines on the AGE task and achieve SOTA performance on existing XDFA datasets. The code and data will be made open-source at this https URL.
37. 【2510.20519】Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning
链接:https://arxiv.org/abs/2510.20519
作者:Xiaohan Lan,Fanfan Liu,Haibo Qiu,Siqi Yang,Delian Ruan,Peng Shi,Lin Ma
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:achieving significant performance, significant performance gains, Inspired by recent, advancements in LLM, achieving significant
备注:
点击查看摘要
Abstract:Inspired by recent advancements in LLM reasoning, the field of multimodal reasoning has seen remarkable progress, achieving significant performance gains on intricate tasks such as mathematical problem-solving. Despite this progress, current multimodal large reasoning models exhibit two key limitations. They tend to employ computationally expensive reasoning even for simple queries, leading to inefficiency. Furthermore, this focus on specialized reasoning often impairs their broader, more general understanding capabilities. In this paper, we propose Metis-HOME: a Hybrid Optimized Mixture-of-Experts framework designed to address this trade-off. Metis-HOME enables a ''Hybrid Thinking'' paradigm by structuring the original dense model into two distinct expert branches: a thinking branch tailored for complex, multi-step reasoning, and a non-thinking branch optimized for rapid, direct inference on tasks like general VQA and OCR. A lightweight, trainable router dynamically allocates queries to the most suitable expert. We instantiate Metis-HOME by adapting the Qwen2.5-VL-7B into an MoE architecture. Comprehensive evaluations reveal that our approach not only substantially enhances complex reasoning abilities but also improves the model's general capabilities, reversing the degradation trend observed in other reasoning-specialized models. Our work establishes a new paradigm for building powerful and versatile MLLMs, effectively resolving the prevalent reasoning-vs-generalization dilemma.
38. 【2510.20512】EchoDistill: Bidirectional Concept Distillation for One-Step Diffusion Personalization
链接:https://arxiv.org/abs/2510.20512
作者:Yixiong Yang,Tao Wu,Senmao Li,Shiqi Yang,Yaxing Wang,Joost van de Weijer,Kai Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent advances, teacher model, one-step diffusion model, model, advances in accelerating
备注: Project page available at [this https URL](https://liulisixin.github.io/EchoDistill-page/)
点击查看摘要
Abstract:Recent advances in accelerating text-to-image (T2I) diffusion models have enabled the synthesis of high-fidelity images even in a single step. However, personalizing these models to incorporate novel concepts remains a challenge due to the limited capacity of one-step models to capture new concept distributions effectively. We propose a bidirectional concept distillation framework, EchoDistill, to enable one-step diffusion personalization (1-SDP). Our approach involves an end-to-end training process where a multi-step diffusion model (teacher) and a one-step diffusion model (student) are trained simultaneously. The concept is first distilled from the teacher model to the student, and then echoed back from the student to the teacher. During the EchoDistill, we share the text encoder between the two models to ensure consistent semantic understanding. Following this, the student model is optimized with adversarial losses to align with the real image distribution and with alignment losses to maintain consistency with the teacher's output. Furthermore, we introduce the bidirectional echoing refinement strategy, wherein the student model leverages its faster generation capability to feedback to the teacher model. This bidirectional concept distillation mechanism not only enhances the student ability to personalize novel concepts but also improves the generative quality of the teacher model. Our experiments demonstrate that this collaborative framework significantly outperforms existing personalization methods over the 1-SDP setup, establishing a novel paradigm for rapid and effective personalization in T2I diffusion models.
39. 【2510.20482】Reliable and Reproducible Demographic Inference for Fairness in Face Analysis
链接:https://arxiv.org/abs/2510.20482
作者:Alexandre Fournier-Montgieux,Hervé Le Borgne,Adrian Popescu,Bertrand Luvison
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:face analysis systems, analysis systems, typically depends, depends on automatic, relies on predefined
备注:
点击查看摘要
Abstract:Fairness evaluation in face analysis systems (FAS) typically depends on automatic demographic attribute inference (DAI), which itself relies on predefined demographic segmentation. However, the validity of fairness auditing hinges on the reliability of the DAI process. We begin by providing a theoretical motivation for this dependency, showing that improved DAI reliability leads to less biased and lower-variance estimates of FAS fairness. To address this, we propose a fully reproducible DAI pipeline that replaces conventional end-to-end training with a modular transfer learning approach. Our design integrates pretrained face recognition encoders with non-linear classification heads. We audit this pipeline across three dimensions: accuracy, fairness, and a newly introduced notion of robustness, defined via intra-identity consistency. The proposed robustness metric is applicable to any demographic segmentation scheme. We benchmark the pipeline on gender and ethnicity inference across multiple datasets and training setups. Our results show that the proposed method outperforms strong baselines, particularly on ethnicity, which is the more challenging attribute. To promote transparency and reproducibility, we will publicly release the training dataset metadata, full codebase, pretrained models, and evaluation toolkit. This work contributes a reliable foundation for demographic inference in fairness auditing.
40. 【2510.20470】Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
链接:https://arxiv.org/abs/2510.20470
作者:Kun Ouyang,Yuanxin Liu,Linli Yao,Yishuo Cai,Hao Zhou,Jie Zhou,Fandong Meng,Xu Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:large language models, multimodal large language, requires multi-step deduction, remains a major, language models
备注:
点击查看摘要
Abstract:Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.
41. 【2510.20468】ransferable Black-Box One-Shot Forging of Watermarks via Image Preference Models
链接:https://arxiv.org/abs/2510.20468
作者:Tomáš Souček,Sylvestre-Alvise Rebuffi,Pierre Fernandez,Nikola Jovanović,Hady Elsahar,Valeriu Lacatusu,Tuan Tran,Alexandre Mourachko
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
关键词:increased legal pressure, Recent years, legal pressure, surge in interest, interest in digital
备注: NeurIPS 2025
点击查看摘要
Abstract:Recent years have seen a surge in interest in digital content watermarking techniques, driven by the proliferation of generative models and increased legal pressure. With an ever-growing percentage of AI-generated content available online, watermarking plays an increasingly important role in ensuring content authenticity and attribution at scale. There have been many works assessing the robustness of watermarking to removal attacks, yet, watermark forging, the scenario when a watermark is stolen from genuine content and applied to malicious content, remains underexplored. In this work, we investigate watermark forging in the context of widely used post-hoc image watermarking. Our contributions are as follows. First, we introduce a preference model to assess whether an image is watermarked. The model is trained using a ranking loss on purely procedurally generated images without any need for real watermarks. Second, we demonstrate the model's capability to remove and forge watermarks by optimizing the input image through backpropagation. This technique requires only a single watermarked image and works without knowledge of the watermarking model, making our attack much simpler and more practical than attacks introduced in related work. Third, we evaluate our proposed method on a variety of post-hoc image watermarking models, demonstrating that our approach can effectively forge watermarks, questioning the security of current watermarking approaches. Our code and further resources are publicly available.
42. 【2510.20438】Dynamic Weight Adjustment for Knowledge Distillation: Leveraging Vision Transformer for High-Accuracy Lung Cancer Detection and Real-Time Deployment
链接:https://arxiv.org/abs/2510.20438
作者:Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:leveraging dynamic fuzzy, dynamic fuzzy logic-driven, fuzzy logic-driven knowledge, logic-driven knowledge distillation, lung cancer
备注:
点击查看摘要
Abstract:This paper presents the FuzzyDistillViT-MobileNet model, a novel approach for lung cancer (LC) classification, leveraging dynamic fuzzy logic-driven knowledge distillation (KD) to address uncertainty and complexity in disease diagnosis. Unlike traditional models that rely on static KD with fixed weights, our method dynamically adjusts the distillation weight using fuzzy logic, enabling the student model to focus on high-confidence regions while reducing attention to ambiguous areas. This dynamic adjustment improves the model ability to handle varying uncertainty levels across different regions of LC images. We employ the Vision Transformer (ViT-B32) as the instructor model, which effectively transfers knowledge to the student model, MobileNet, enhancing the student generalization capabilities. The training process is further optimized using a dynamic wait adjustment mechanism that adapts the training procedure for improved convergence and performance. To enhance image quality, we introduce pixel-level image fusion improvement techniques such as Gamma correction and Histogram Equalization. The processed images (Pix1 and Pix2) are fused using a wavelet-based fusion method to improve image resolution and feature preservation. This fusion method uses the wavedec2 function to standardize images to a 224x224 resolution, decompose them into multi-scale frequency components, and recursively average coefficients at each level for better feature representation. To address computational efficiency, Genetic Algorithm (GA) is used to select the most suitable pre-trained student model from a pool of 12 candidates, balancing model performance with computational cost. The model is evaluated on two datasets, including LC25000 histopathological images (99.16% accuracy) and IQOTH/NCCD CT-scan images (99.54% accuracy), demonstrating robustness across different imaging domains.
43. 【2510.20393】Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval
链接:https://arxiv.org/abs/2510.20393
作者:Qing Wang,Chong-Wah Ngo,Yu Cao,Ee-Peng Lim
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Existing approaches, details textually documented, food image, implicit assumption, textually documented
备注: ACM Multimedia 2025
点击查看摘要
Abstract:Existing approaches for image-to-recipe retrieval have the implicit assumption that a food image can fully capture the details textually documented in its recipe. However, a food image only reflects the visual outcome of a cooked dish and not the underlying cooking process. Consequently, learning cross-modal representations to bridge the modality gap between images and recipes tends to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased to capture the dominant visual elements, resulting in difficulty in ranking similar recipes with subtle differences in use of ingredients and cooking methods. The bias in representation learning is expected to be more severe when the training data is mixed of images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images, while explicitly injecting these elements into cross-modal representation learning to mitigate biases. Experiments are conducted on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset. The results indicate that the proposed causal representation learning is capable of uncovering subtle ingredients and cooking actions and achieves impressive retrieval performance on both monolingual and multilingual multicultural datasets.
44. 【2510.20385】Positional Encoding Field
链接:https://arxiv.org/abs/2510.20385
作者:Yunpeng Bai,Haoxiang Li,Qixing Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion Transformers, dominant architecture, combine Transformer scalability, DiTs combine Transformer, Diffusion
备注: 8 pages, 9 figures
点击查看摘要
Abstract:Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D space. Our PE-Field-augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.
45. 【2510.20349】Synthetic Data for Robust Runway Detection
链接:https://arxiv.org/abs/2510.20349
作者:Estelle Chigot,Dennis G. Wilson,Meriem Ghrib,Fabrice Jimenez,Thomas Oberlin
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep vision models, Deep vision, possibly critical applications, integrated in industrial, industrial and possibly
备注:
点击查看摘要
Abstract:Deep vision models are now mature enough to be integrated in industrial and possibly critical applications such as autonomous navigation. Yet, data collection and labeling to train such models requires too much efforts and costs for a single company or product. This drawback is more significant in critical applications, where training data must include all possible conditions including rare scenarios. In this perspective, generating synthetic images is an appealing solution, since it allows a cheap yet reliable covering of all the conditions and environments, if the impact of the synthetic-to-real distribution shift is mitigated. In this article, we consider the case of runway detection that is a critical part in autonomous landing systems developed by aircraft manufacturers. We propose an image generation approach based on a commercial flight simulator that complements a few annotated real images. By controlling the image generation and the integration of real and synthetic data, we show that standard object detection models can achieve accurate prediction. We also evaluate their robustness with respect to adverse conditions, in our case nighttime images, that were not represented in the real data, and show the interest of using a customized domain adaptation strategy.
46. 【2510.20348】AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models
链接:https://arxiv.org/abs/2510.20348
作者:Seunghoon Lee,Jeongwoo Choi,Byunggwan Son,Jaehyeon Moon,Jeimin Jeon,Bumsub Ham
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:denoising steps, diffusion models, PTQ, diffusion, denoising
备注: Accepted to NeurIPS 2025
点击查看摘要
Abstract:We present in this paper a novel post-training quantization (PTQ) method, dubbed AccuQuant, for diffusion models. We show analytically and empirically that quantization errors for diffusion models are accumulated over denoising steps in a sampling process. To alleviate the error accumulation problem, AccuQuant minimizes the discrepancies between outputs of a full-precision diffusion model and its quantized version within a couple of denoising steps. That is, it simulates multiple denoising steps of a diffusion sampling process explicitly for quantization, accounting the accumulated errors over multiple denoising steps, which is in contrast to previous approaches to imitating a training process of diffusion models, namely, minimizing the discrepancies independently for each step. We also present an efficient implementation technique for AccuQuant, together with a novel objective, which reduces a memory complexity significantly from $\mathcal{O}(n)$ to $\mathcal{O}(1)$, where $n$ is the number of denoising steps. We demonstrate the efficacy and efficiency of AccuQuant across various tasks and diffusion models on standard benchmarks.
47. 【2510.20335】Dino-Diffusion Modular Designs Bridge the Cross-Domain Gap in Autonomous Parking
链接:https://arxiv.org/abs/2510.20335
作者:Zixuan Wu,Hengyuan Zhang,Ting-Hsuan Chen,Yuliang Guo,David Paz,Xinyu Huang,Liu Ren
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:driving safety, critical pillar, pillar of driving, Parking, Abstract
备注: Code is at [this https URL](https://github.com/ChampagneAndfragrance/Dino_Diffusion_Parking_Official)
点击查看摘要
Abstract:Parking is a critical pillar of driving safety. While recent end-to-end (E2E) approaches have achieved promising in-domain results, robustness under domain shifts (e.g., weather and lighting changes) remains a key challenge. Rather than relying on additional data, in this paper, we propose Dino-Diffusion Parking (DDP), a domain-agnostic autonomous parking pipeline that integrates visual foundation models with diffusion-based planning to enable generalized perception and robust motion planning under distribution shifts. We train our pipeline in CARLA at regular setting and transfer it to more adversarial settings in a zero-shot fashion. Our model consistently achieves a parking success rate above 90% across all tested out-of-distribution (OOD) scenarios, with ablation studies confirming that both the network architecture and algorithmic design significantly enhance cross-domain performance over existing baselines. Furthermore, testing in a 3D Gaussian splatting (3DGS) environment reconstructed from a real-world parking lot demonstrates promising sim-to-real transfer.
48. 【2510.20331】AnyPcc: Compressing Any Point Cloud with a Single Universal Model
链接:https://arxiv.org/abs/2510.20331
作者:Kangli Wang,Qianxi Yi,Yuqi Ye,Shihao Li,Wei Gao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Generalization remains, deep learning-based point, learning-based point cloud, remains a critical, critical challenge
备注: 11 pages, 5 figures
点击查看摘要
Abstract:Generalization remains a critical challenge for deep learning-based point cloud geometry compression. We argue this stems from two key limitations: the lack of robust context models and the inefficient handling of out-of-distribution (OOD) data. To address both, we introduce AnyPcc, a universal point cloud compression framework. AnyPcc first employs a Universal Context Model that leverages priors from both spatial and channel-wise grouping to capture robust contextual dependencies. Second, our novel Instance-Adaptive Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and implicit compression paradigms. It fine-tunes a small subset of network weights for each instance and incorporates them into the bitstream, where the marginal bit cost of the weights is dwarfed by the resulting savings in geometry compression. Extensive experiments on a benchmark of 15 diverse datasets confirm that AnyPcc sets a new state-of-the-art in point cloud compression. Our code and datasets will be released to encourage reproducible research.
49. 【2510.20322】HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models
链接:https://arxiv.org/abs/2510.20322
作者:Zelin Peng,Zhengqin Xu,Qingyang Liu,Xiaokang Yang,Wei Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multi-modal large language, Multi-modal large, transformative approach, approach for aligning, large language models
备注: Accepted by NeurIPS2025
点击查看摘要
Abstract:Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they widely equip with, e.g., CLIP and SAM, which lack the alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed as HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with Möbius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently improves both existing pre-training and fine-tuning MLLMs clearly with less than 1\% additional parameters.
50. 【2510.20291】A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization
链接:https://arxiv.org/abs/2510.20291
作者:LinFeng Li,Jian Zhao,Zepeng Yang,Yuhang Song,Bojun Lin,Tianle Zhang,Yuchen Yuan,Chi Zhang,Xuelong Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Cross-Modal Drone Navigation, Drone Navigation, solution to RoboSense, present a winning, winning solution
备注:
点击查看摘要
Abstract:We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task retrieves the most relevant geo-referenced image from a large multi-platform corpus (satellite/drone/ground) given a natural-language query. Two obstacles are severe inter-platform heterogeneity and a domain gap between generic training descriptions and platform-specific test queries. We mitigate these with a domain-aligned preprocessing pipeline and a Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite augmentation, and removal of orientation words; (ii) an LLM-based caption refinement pipeline to align textual semantics with the distinct visual characteristics of each platform. Using BGE-M3 (text) and EVA-CLIP (image), we train three platform experts using a progressive two-stage, hard-negative mining strategy to enhance discriminative power, and fuse their scores at inference. The system tops the official leaderboard, demonstrating robust cross-modal geo-localization under heterogeneous viewpoints.
51. 【2510.20287】Breakdance Video classification in the age of Generative AI
链接:https://arxiv.org/abs/2510.20287
作者:Sauptik Dhar,Naveen Ramakrishnan,Michelle Munson
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Vision Language, Large Vision, Vision Language models, sports use-cases recently, Vision Language
备注: 11 pages
点击查看摘要
Abstract:Large Vision Language models have seen huge application in several sports use-cases recently. Most of these works have been targeted towards a limited subset of popular sports like soccer, cricket, basketball etc; focusing on generative tasks like visual question answering, highlight generation. This work analyzes the applicability of the modern video foundation models (both encoder and decoder) for a very niche but hugely popular dance sports - breakdance. Our results show that Video Encoder models continue to outperform state-of-the-art Video Language Models for prediction tasks. We provide insights on how to choose the encoder model and provide a thorough analysis into the workings of a finetuned decoder model for breakdance video classification.
52. 【2510.20286】UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning
链接:https://arxiv.org/abs/2510.20286
作者:Liangyu Chen,Hanzhang Zhou,Chenglin Cai,Jianan Zhang,Panrong Tong,Quyu Kong,Xu Zhang,Chen Liu,Yuqi Liu,Wenxuan Wang,Yue Wang,Qin Jin,Steven Hoi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:GUI agents, maps natural-language instructions, GUI grounding, capability of GUI, GUI
备注:
点击查看摘要
Abstract:GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior works largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a substantial 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released in this https URL.
53. 【2510.20285】DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering
链接:https://arxiv.org/abs/2510.20285
作者:Jiayi Zou,Chaofan Chen,Bing-Kun Bao,Changsheng Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Video Question Answering, Egocentric Video Question, answering questions based, egocentric video understanding, Video Question
备注:
点击查看摘要
Abstract:Egocentric Video Question Answering (Egocentric VideoQA) plays an important role in egocentric video understanding, which refers to answering questions based on first-person videos. Although existing methods have made progress through the paradigm of pre-training and fine-tuning, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC$^3$) framework, which contains an egocentric videoqa baseline, a counterfactual sample construction module and a counterfactual sample-involved contrastive optimization. Specifically, We first develop a counterfactual sample construction module to generate positive and negative samples for textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, We feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieve 52.51\% and 46.04\% on the \textit{normal} and \textit{indirect} splits of EgoTaskQA, and 13.2\% on QAEGO4D, both reaching the state-of-the-art performance.
54. 【2510.20284】Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition
链接:https://arxiv.org/abs/2510.20284
作者:Haodong Yang,Zhongling Huang,Shaojie Guo,Zhe Zhang,Gong Cheng,Junwei Han
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Synthetic Aperture Radar, complex-valued Synthetic Aperture, Deep learning models, Aperture Radar, Synthetic Aperture
备注:
点击查看摘要
Abstract:Deep learning models for complex-valued Synthetic Aperture Radar (CV-SAR) image recognition are fundamentally constrained by a representation trilemma under data-limited and domain-shift scenarios: the concurrent, yet conflicting, optimization of generalization, interpretability, and efficiency. Our work is motivated by the premise that the rich electromagnetic scattering features inherent in CV-SAR data hold the key to resolving this trilemma, yet they are insufficiently harnessed by conventional data-driven models. To this end, we introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. The first stage performs a physics-guided compression, wherein a novel dictionary processor adaptively embeds physical priors, enabling a compact unfolding network to efficiently extract sparse, physically-grounded signatures. A subsequent aggregation module enriches these representations, followed by a final semantic compression stage that utilizes a compact classification head with self-distillation to learn maximally task-relevant and discriminative embeddings. We instantiate KINN in both CNN (0.7M) and Vision Transformer (0.95M) variants. Extensive evaluations on five SAR benchmarks confirm that KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios and tangible interpretability, thereby providing an effective solution to the representation trilemma and offering a new path for trustworthy AI in SAR image analysis.
55. 【2510.20281】Causal Debiasing for Visual Commonsense Reasoning
链接:https://arxiv.org/abs/2510.20281
作者:Jiayi Zou,Gengyun Jia,Bing-Kun Bao
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Visual Commonsense Reasoning, Commonsense Reasoning, providing explanations based, Visual Commonsense, refers to answering
备注:
点击查看摘要
Abstract:Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
56. 【2510.20268】GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection
链接:https://arxiv.org/abs/2510.20268
作者:Guangyu Dai,Dong Chen,Siliang Tang,Yueting Zhuang
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Video anomaly detection, detects anomalous frames, continuous surveillance videos, anomaly detection, Video anomaly
备注:
点击查看摘要
Abstract:Video anomaly detection (VAD) is a challenging task that detects anomalous frames in continuous surveillance videos. Most previous work utilizes the spatio-temporal correlation of visual features to distinguish whether there are abnormalities in video snippets. Recently, some works attempt to introduce multi-modal information, like text feature, to enhance the results of video anomaly detection. However, these works merely incorporate text features into video snippets in a coarse manner, overlooking the significant amount of redundant information that may exist within the video snippets. Therefore, we propose to leverage the diversity among multi-modal information to further refine the extracted features, reducing the redundancy in visual features, and we propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD). Specifically, we generate more grained multi-modal feature based on the video snippet, which summarizes the main content, and text features based on the captions of original video will be introduced to further enhance the visual features of highlighted portions. Experiments show that the proposed GMFVAD achieves state-of-the-art performance on four mainly datasets. Ablation experiments also validate that the improvement of GMFVAD is due to the reduction of redundant information.
57. 【2510.20267】Real-Time Currency Detection and Voice Feedback for Visually Impaired Individuals
链接:https://arxiv.org/abs/2510.20267
作者:Saraf Anzum Shreya,MD. Abu Ismail Siddique,Sharaf Tasnim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:visually impaired, visually impaired individuals, Technologies, impaired individuals, impaired
备注: 20 pages, 5 tables, 8 figues
点击查看摘要
Abstract:Technologies like smartphones have become an essential in our daily lives. It has made accessible to everyone including visually impaired individuals. With the use of smartphone cameras, image capturing and processing have become more convenient. With the use of smartphones and machine learning, the life of visually impaired can be made a little easier. Daily tasks such as handling money without relying on someone can be troublesome for them. For that purpose this paper presents a real-time currency detection system designed to assist visually impaired individuals. The proposed model is trained on a dataset containing 30 classes of notes and coins, representing 3 types of currency: US dollar (USD), Euro (EUR), and Bangladeshi taka (BDT). Our approach uses a YOLOv8 nano model with a custom detection head featuring deep convolutional layers and Squeeze-and-Excitation blocks to enhance feature extraction and detection accuracy. Our model has achieved a higher accuracy of 97.73%, recall of 95.23%, f1-score of 95.85% and a mean Average Precision at IoU=0.5 (mAP50(B)) of 97.21\%. Using the voice feedback after the detection would help the visually impaired to identify the currency. This paper aims to create a practical and efficient currency detection system to empower visually impaired individuals independent in handling money.
58. 【2510.20261】Kinaema: a recurrent sequence model for memory and pose in motion
链接:https://arxiv.org/abs/2510.20261
作者:Mert Bulent Sariyildiz,Philippe Weinzaepfel,Guillaume Bono,Gianluca Monaci,Christian Wolf
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:spatially aware robots, find their bearings, key aspect, aspect of spatially, spatially aware
备注: 10 pages + references + checklist + appendix, 29 pages total
点击查看摘要
Abstract:One key aspect of spatially aware robots is the ability to "find their bearings", ie. to correctly situate themselves in previously seen spaces. In this work, we focus on this particular scenario of continuous robotics operations, where information observed before an actual episode start is exploited to optimize efficiency. We introduce a new model, Kinaema, and agent, capable of integrating a stream of visual observations while moving in a potentially large scene, and upon request, processing a query image and predicting the relative position of the shown space with respect to its current position. Our model does not explicitly store an observation history, therefore does not have hard constraints on context length. It maintains an implicit latent memory, which is updated by a transformer in a recurrent way, compressing the history of sensor readings into a compact representation. We evaluate the impact of this model in a new downstream task we call "Mem-Nav". We show that our large-capacity recurrent model maintains a useful representation of the scene, navigates to goals observed before the actual episode start, and is computationally efficient, in particular compared to classical transformers with attention over an observation history.
59. 【2510.20256】Calibrating Multimodal Consensus for Emotion Recognition
链接:https://arxiv.org/abs/2510.20256
作者:Guowei Zhong,Junjie Li,Huaiyu Zhu,Ruohong Huan,Yun Pan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:made substantial progress, Multimodal Emotion Recognition, Emotion Recognition, Multimodal Emotion, recent years
备注:
点击查看摘要
Abstract:In recent years, Multimodal Emotion Recognition (MER) has made substantial progress. Nevertheless, most existing approaches neglect the semantic inconsistencies that may arise across modalities, such as conflicting emotional cues between text and visual inputs. Besides, current methods are often dominated by the text modality due to its strong representational capacity, which can compromise recognition accuracy. To address these challenges, we propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels, enabling unimodal pretraining in a self-supervised fashion. It then employs a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for multimodal finetuning, thereby mitigating text dominance and guiding the fusion process toward a more reliable consensus. Experimental results demonstrate that CMC achieves performance on par with or superior to state-of-the-art methods across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and exhibits notable advantages in scenarios with semantic inconsistencies on CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible at this https URL.
60. 【2510.20247】Seeing the Unseen: Mask-Driven Positional Encoding and Strip-Convolution Context Modeling for Cross-View Object Geo-Localization
链接:https://arxiv.org/abs/2510.20247
作者:Shuhan Hu,Yiru Li,Yuanyuan Li,Yingying Zhu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enables high-precision object, geo-localization enables high-precision, urban management, autonomous driving, disaster response
备注:
点击查看摘要
Abstract:Cross-view object geo-localization enables high-precision object localization through cross-view matching, with critical applications in autonomous driving, urban management, and disaster response. However, existing methods rely on keypoint-based positional encoding, which captures only 2D coordinates while neglecting object shape information, resulting in sensitivity to annotation shifts and limited cross-view matching capability. To address these limitations, we propose a mask-based positional encoding scheme that leverages segmentation masks to capture both spatial coordinates and object silhouettes, thereby upgrading the model from "location-aware" to "object-aware." Furthermore, to tackle the challenge of large-span objects (e.g., elongated buildings) in satellite imagery, we design a context enhancement module. This module employs horizontal and vertical strip convolutional kernels to extract long-range contextual features, enhancing feature discrimination among strip-like objects. Integrating MPE and CEM, we present EDGeo, an end-to-end framework for robust cross-view object geo-localization. Extensive experiments on two public datasets (CVOGL and VIGOR-Building) demonstrate that our method achieves state-of-the-art performance, with a 3.39% improvement in localization accuracy under challenging ground-to-satellite scenarios. This work provides a robust positional encoding paradigm and a contextual modeling framework for advancing cross-view geo-localization research.
61. 【2510.20244】Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding
链接:https://arxiv.org/abs/2510.20244
作者:Minseok Kang,Minhyeok Lee,Minjung Kim,Donghyeong Kim,Sangyoun Lee
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:natural language query, localize temporal segments, aims to localize, segments in long, language query
备注: Comments: 28 pages, including appendix. 5 figures. Full version of the NeurIPS 2025 paper
点击查看摘要
Abstract:Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent advances have been progressed by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during crossmodal attention, disregarding their distinct semantic roles. To validate the limitations of this approach, we conduct controlled experiments demonstrating that VTG models overly rely on [EOS]-driven global semantics while failing to effectively utilize word-level signals, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local semantics by routing the [EOS] token through a sentence-level path and clustering word tokens into phrase-level units for localized grounding. Our method introduces (1) tokenrole- aware cross modal interaction strategies that align video features with sentence-level and phrase-level semantics in a structurally disentangled manner, and (2) a joint modeling framework that not only improves global sentence-level alignment but also enhances finegrained temporal grounding by leveraging structured phrase-aware context. This design allows the model to capture both coarse and localized semantics, enabling more expressive and context-aware video grounding. DualGround achieves state-of-the-art performance on both Moment Retrieval and Highlight Detection tasks across QVHighlights and Charades- STA benchmarks, demonstrating the effectiveness of disentangled semantic modeling in video-language alignment.
62. 【2510.20238】COS3D: Collaborative Open-Vocabulary 3D Segmentation
链接:https://arxiv.org/abs/2510.20238
作者:Runsong Zhu,Ka-Hei Hui,Zhengzhe Liu,Qianyi Wu,Weiliang Tang,Shi Qiu,Pheng-Ann Heng,Chi-Wing Fu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:challenging task, requiring a mutual, fundamental yet challenging, mutual understanding, language field
备注: NeurIPS 2025. The code is publicly available at \href{ [this https URL](https://github.com/Runsong123/COS3D) }{ [this https URL](https://github.com/Runsong123/COS3D) }
点击查看摘要
Abstract:Open-vocabulary 3D segmentation is a fundamental yet challenging task, requiring a mutual understanding of both segmentation and language. However, existing Gaussian-splatting-based methods rely either on a single 3D language field, leading to inferior segmentation, or on pre-computed class-agnostic segmentations, suffering from error accumulation. To address these limitations, we present COS3D, a new collaborative prompt-segmentation framework that contributes to effectively integrating complementary language and segmentation cues throughout its entire pipeline. We first introduce the new concept of collaborative field, comprising an instance field and a language field, as the cornerstone for collaboration. During training, to effectively construct the collaborative field, our key idea is to capture the intrinsic relationship between the instance field and language field, through a novel instance-to-language feature mapping and designing an efficient two-stage training strategy. During inference, to bridge distinct characteristics of the two fields, we further design an adaptive language-to-instance prompt refinement, promoting high-quality prompt-segmentation inference. Extensive experiments not only demonstrate COS3D's leading performance over existing methods on two widely-used benchmarks but also show its high potential to various applications,~\ie, novel image-based 3D segmentation, hierarchical segmentation, and robotics. The code is publicly available at \href{this https URL}{this https URL}.
63. 【2510.20229】Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
链接:https://arxiv.org/abs/2510.20229
作者:Ge Zheng,Jiaye Qian,Jiajin Tang,Sibei Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, Large Vision-Language, Vision-Language Models, made significant progress, progress in recent
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel "induce-detect-suppress" framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential object-level hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs' longer responses.
64. 【2510.20217】EditInfinity: Image Editing with Binary-Quantized Generative Models
链接:https://arxiv.org/abs/2510.20217
作者:Jiahuan Wang,Yuxin Chen,Jun Yu,Guangming Lu,Wenjie Pei
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Adapting pretrained diffusion-based, demonstrated remarkable potential, negligible tuning overhead, Adapting pretrained, image editing
备注: 28 pages, 13 figures, accepted by The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
点击查看摘要
Abstract:Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of VQ-based generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose \emph{EditInfinity}, which adapts \emph{Infinity}, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our \emph{EditInfinity} to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark across "add", "change", and "delete" editing operations, demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: this https URL.
65. 【2510.20214】owards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection
链接:https://arxiv.org/abs/2510.20214
作者:Talha Ilyas,Duong Nhu,Allison Thomas,Arie Levin,Lim Wei Yap,Shu Gong,David Vera Anaya,Yiwen Jiang,Deval Mehta,Ritesh Warty,Vinayak Smith,Maya Reddy,Euan Wallace,Wenlong Cheng,Zongyuan Ge,Faezeh Marzbanrad
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate fetal movement, assessing prenatal health, abnormal movement patterns, Accurate fetal, essential for assessing
备注: This is the preprint version of the manuscript submitted to IEEE Journal of Biomedical and Health Informatics (JBHI) for review
点击查看摘要
Abstract:Accurate fetal movement (FM) detection is essential for assessing prenatal health, as abnormal movement patterns can indicate underlying complications such as placental dysfunction or fetal distress. Traditional methods, including maternal perception and cardiotocography (CTG), suffer from subjectivity and limited accuracy. To address these challenges, we propose Contrastive Ultrasound Video Representation Learning (CURL), a novel self-supervised learning framework for FM detection from extended fetal ultrasound video recordings. Our approach leverages a dual-contrastive loss, incorporating both spatial and temporal contrastive learning, to learn robust motion representations. Additionally, we introduce a task-specific sampling strategy, ensuring the effective separation of movement and non-movement segments during self-supervised training, while enabling flexible inference on arbitrarily long ultrasound recordings through a probabilistic fine-tuning approach. Evaluated on an in-house dataset of 92 subjects, each with 30-minute ultrasound sessions, CURL achieves a sensitivity of 78.01% and an AUROC of 81.60%, demonstrating its potential for reliable and objective FM analysis. These results highlight the potential of self-supervised contrastive learning for fetal movement analysis, paving the way for improved prenatal monitoring and clinical decision-making.
66. 【2510.20212】FlowCycle: Pursuing Cycle-Consistent Flows for Text-based Editing
链接:https://arxiv.org/abs/2510.20212
作者:Yanghao Wang,Zhen Wang,Long Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabled remarkable progress, Recent advances, advances in pre-trained, flow models, text-based image editing
备注:
点击查看摘要
Abstract:Recent advances in pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches always adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an ``intermediate state'' and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they primarily focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. Extensive ablations have demonstrated that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.
67. 【2510.20206】RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
链接:https://arxiv.org/abs/2510.20206
作者:Bingjie Gao,Qianli Ma,Xiaoxue Wu,Shuai Yang,Guanzhou Lan,Haonan Zhao,Jiaxuan Chen,Qingyang Liu,Yu Qiao,Xinyuan Chen,Yaohui Wang,Li Niu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Prompt design plays, prompt optimization, potential of diffusion-based, design plays, plays a crucial
备注:
点击查看摘要
Abstract:Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present \textbf{RAPO++}, a cross-stage prompt optimization framework that unifies training-data--aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In \textbf{Stage 1}, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. \textbf{Stage 2} introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback -- including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow -- yielding progressively improved video generation quality. \textbf{Stage 3} leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at this https URL.
68. 【2510.20196】A Structured Review and Quantitative Profiling of Public Brain MRI Datasets for Foundation Model Development
链接:https://arxiv.org/abs/2510.20196
作者:Minh Sao Khue Luu,Margaret V. Benedichuk,Ekaterina I. Roppert,Roman M. Kenzhin,Bair N. Tuchinov
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:MRI depends critically, factors remain scarce, brain MRI depends, brain MRI, brain MRI foundation
备注:
点击查看摘要
Abstract:The development of foundation models for brain MRI depends critically on the scale, diversity, and consistency of available data, yet systematic assessments of these factors remain scarce. In this study, we analyze 54 publicly accessible brain MRI datasets encompassing over 538,031 to provide a structured, multi-level overview tailored to foundation model development. At the dataset level, we characterize modality composition, disease coverage, and dataset scale, revealing strong imbalances between large healthy cohorts and smaller clinical populations. At the image level, we quantify voxel spacing, orientation, and intensity distributions across 15 representative datasets, demonstrating substantial heterogeneity that can influence representation learning. We then perform a quantitative evaluation of preprocessing variability, examining how intensity normalization, bias field correction, skull stripping, spatial registration, and interpolation alter voxel statistics and geometry. While these steps improve within-dataset consistency, residual differences persist between datasets. Finally, feature-space case study using a 3D DenseNet121 shows measurable residual covariate shift after standardized preprocessing, confirming that harmonization alone cannot eliminate inter-dataset bias. Together, these analyses provide a unified characterization of variability in public brain MRI resources and emphasize the need for preprocessing-aware and domain-adaptive strategies in the design of generalizable brain MRI foundation models.
69. 【2510.20193】Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
链接:https://arxiv.org/abs/2510.20193
作者:Rahul Raja,Arpita Vats
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Question Answering, structured text data, structured metadata, structured text, traditionally relied
备注: In Proceedings of the 2nd ACM Workshop in AI-powered Question and Answering Systems (AIQAM '25), October 27-28, 2025, Dublin, Ireland. ACM, New York, NY, USA, 8 pages. [this https URL](https://doi.org/10.1145/3746274.3760393)
点击查看摘要
Abstract:Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.
70. 【2510.20189】SPAN: Continuous Modeling of Suspicion Progression for Temporal Intention Localization
链接:https://arxiv.org/abs/2510.20189
作者:Xinyi Hu,Yuran Wang,Yue Li,Wenxuan Liu,Zheng Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Temporal Intention Localization, Intention Localization, improve security monitoring, Progression Analysis Network, identifying varying levels
备注:
点击查看摘要
Abstract:Temporal Intention Localization (TIL) is crucial for video surveillance, focusing on identifying varying levels of suspicious intentions to improve security monitoring. However, existing discrete classification methods fail to capture the continuous nature of suspicious intentions, limiting early intervention and explainability. In this paper, we propose the Suspicion Progression Analysis Network (SPAN), which shifts from discrete classification to continuous regression, enabling the capture of fluctuating and evolving suspicious intentions. We reveal that suspicion exhibits long-term dependencies and cumulative effects, similar to Temporal Point Process (TPP) theory. Based on these insights, we define a suspicion score formula that models continuous changes while accounting for temporal characteristics. We also introduce Suspicion Coefficient Modulation, which adjusts suspicion coefficients using multimodal information to reflect the varying impacts of suspicious actions. Additionally, the Concept-Anchored Mapping method is proposed to link suspicious actions to predefined intention concepts, offering insights into both the actions and their potential underlying intentions. Extensive experiments on the HAI dataset show that SPAN significantly outperforms existing methods, reducing MSE by 19.8% and improving average mAP by 1.78%. Notably, SPAN achieves a 2.74% mAP gain in low-frequency cases, demonstrating its superior ability to capture subtle behavioral changes. Compared to discrete classification systems, our continuous suspicion modeling approach enables earlier detection and proactive intervention, greatly enhancing system explainability and practical utility in security applications.
71. 【2510.20182】Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories
链接:https://arxiv.org/abs/2510.20182
作者:Aaron Appelle,Jerome P. Lynch
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large-scale video generation, demonstrated high visual, high visual realism, general-purpose world simulators, spurring interest
备注: Preprint, under review
点击查看摘要
Abstract:Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people. However, the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird's-eye view trajectories from pixel-space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.
72. 【2510.20178】PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching
链接:https://arxiv.org/abs/2510.20178
作者:Yun Wang,Junjie Hu,Qiaole Dong,Yongjian Zhang,Yanwei Fu,Tin Lun Lam,Dapeng Wu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:inconsistent depth estimation, depth estimation disrupts, depth estimation, inconsistent depth, estimation disrupts
备注:
点击查看摘要
Abstract:Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty in modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a \textbf{P}ick-and-\textbf{P}lay \textbf{M}emory (PPM) construction module for dynamic \textbf{Stereo} matching, dubbed as \textbf{PPMStereo}. PPM consists of a `pick' process that identifies the most relevant frames and a `play' process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. % Notably, PPMStereo achieves 0.62/1.11 TEPE on the Sintel clean/final (17.3\% \ 9.02\% improvements over BiDAStereo) with fewer computational costs. Codes are available at \textcolor{blue}{this https URL}.
73. 【2510.20165】IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks
链接:https://arxiv.org/abs/2510.20165
作者:Insu Jeon,Wonkwang Lee,Myeongjang Pyeon,Gunhee Kim
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:GAN-based unsupervised model, disentangled representation learning, representation learning, GAN-based unsupervised, unsupervised model
备注: Published in the Proceedings of the Thirty Fifth AAAI Conference on Artificial Intelligence (AAAI 2021), paper number 7926
点击查看摘要
Abstract:We propose a new GAN-based unsupervised model for disentangled representation learning. The new model is discovered in an attempt to utilize the Information Bottleneck (IB) framework to the optimization of GAN, thereby named IB-GAN. The architecture of IB-GAN is partially similar to that of InfoGAN but has a critical difference; an intermediate layer of the generator is leveraged to constrain the mutual information between the input and the generated output. The intermediate stochastic layer can serve as a learnable latent distribution that is trained with the generator jointly in an end-to-end fashion. As a result, the generator of IB-GAN can harness the latent space in a disentangled and interpretable manner. With the experiments on dSprites and Color-dSprites dataset, we demonstrate that IB-GAN achieves competitive disentanglement scores to those of state-of-the-art \b{eta}-VAEs and outperforms InfoGAN. Moreover, the visual quality and the diversity of samples generated by IB-GAN are often better than those by \b{eta}-VAEs and Info-GAN in terms of FID score on CelebA and 3D Chairs dataset.
74. 【2510.20162】OMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning
链接:https://arxiv.org/abs/2510.20162
作者:Xudong Yan,Songhe Feng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Compositional Zero-Shot Learning, attribute-object compositions based, Compositional Zero-Shot, aims to recognize, recognize novel attribute-object
备注: Accepted to NeurIPS 2025
点击查看摘要
Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual knowledge from historical images for inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at this https URL .
75. 【2510.20158】Monocular Visual 8D Pose Estimation for Articulated Bicycles and Cyclists
链接:https://arxiv.org/abs/2510.20158
作者:Eduardo R. Corral-Soto,Yang Liu,Yuan Ren,Bai Dongfeng,Liu Bingbing
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vulnerable Road Users, Autonomous Driving, Road Users, Vulnerable Road, crossing intention classification
备注:
点击查看摘要
Abstract:In Autonomous Driving, cyclists belong to the safety-critical class of Vulnerable Road Users (VRU), and accurate estimation of their pose is critical for cyclist crossing intention classification, behavior prediction, and collision avoidance. Unlike rigid objects, articulated bicycles are composed of movable rigid parts linked by joints and constrained by a kinematic structure. 6D pose methods can estimate the 3D rotation and translation of rigid bicycles, but 6D becomes insufficient when the steering/pedals angles of the bicycle vary. That is because: 1) varying the articulated pose of the bicycle causes its 3D bounding box to vary as well, and 2) the 3D box orientation is not necessarily aligned to the orientation of the steering which determines the actual intended travel direction. In this work, we introduce a method for category-level 8D pose estimation for articulated bicycles and cyclists from a single RGB image. Besides being able to estimate the 3D translation and rotation of a bicycle from a single image, our method also estimates the rotations of its steering handles and pedals with respect to the bicycle body frame. These two new parameters enable the estimation of a more fine-grained bicycle pose state and travel direction. Our proposed model jointly estimates the 8D pose and the 3D Keypoints of articulated bicycles, and trains with a mix of synthetic and real image data to generalize on real images. We include an evaluation section where we evaluate the accuracy of our estimated 8D pose parameters, and our method shows promising results by achieving competitive scores when compared against state-of-the-art category-level 6D pose estimators that use rigid canonical object templates for matching.
76. 【2510.20155】PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding
链接:https://arxiv.org/abs/2510.20155
作者:Penghao Wang,Yiyang He,Xin Lv,Yukai Zhou,Lan Xu,Jingyi Yu,Jiayuan Gu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:advancing computer vision, computer vision, fundamental to advancing, advancing computer, Understanding objects
备注: NeurIPS 2025 DB Track. Project page: [this https URL](https://authoritywang.github.io/partnext)
点击查看摘要
Abstract:Understanding objects at the level of their constituent parts is fundamental to advancing computer vision, graphics, and robotics. While datasets like PartNet have driven progress in 3D part understanding, their reliance on untextured geometries and expert-dependent annotation limits scalability and usability. We introduce PartNeXt, a next-generation dataset addressing these gaps with over 23,000 high-quality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. We benchmark PartNeXt on two tasks: (1) class-agnostic part segmentation, where state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts, and (2) 3D part-centric question answering, a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary part grounding. Additionally, training Point-SAM on PartNeXt yields substantial gains over PartNet, underscoring the dataset's superior quality and diversity. By combining scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new avenues for research in structured 3D understanding.
77. 【2510.20134】Revisiting Logit Distributions for Reliable Out-of-Distribution Detection
链接:https://arxiv.org/abs/2510.20134
作者:Jiachen Liang,Ruibing Hou,Minyang Hu,Hong Chang,Shiguang Shan,Xilin Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:deep learning models, open-world applications, critical for ensuring, ensuring the reliability, reliability of deep
备注: Accepted by NeurIPS 2025
点击查看摘要
Abstract:Out-of-distribution (OOD) detection is critical for ensuring the reliability of deep learning models in open-world applications. While post-hoc methods are favored for their efficiency and ease of deployment, existing approaches often underexploit the rich information embedded in the model's logits space. In this paper, we propose LogitGap, a novel post-hoc OOD detection method that explicitly exploits the relationship between the maximum logit and the remaining logits to enhance the separability between in-distribution (ID) and OOD samples. To further improve its effectiveness, we refine LogitGap by focusing on a more compact and informative subset of the logit space. Specifically, we introduce a training-free strategy that automatically identifies the most informative logits for scoring. We provide both theoretical analysis and empirical evidence to validate the effectiveness of our approach. Extensive experiments on both vision-language and vision-only models demonstrate that LogitGap consistently achieves state-of-the-art performance across diverse OOD detection scenarios and benchmarks. Code is available at this https URL.
78. 【2510.20132】Inverse Image-Based Rendering for Light Field Generation from Single Images
链接:https://arxiv.org/abs/2510.20132
作者:Hyunjun Jung,Hae-Gon Jeon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:supported realistic renderings, concept of light-fields, light-fields computed, computed from multiple, regular grids
备注:
点击查看摘要
Abstract:A concept of light-fields computed from multiple view images on regular grids has proven its benefit for scene representations, and supported realistic renderings of novel views and photographic effects such as refocusing and shallow depth of field. In spite of its effectiveness of light flow computations, obtaining light fields requires either computational costs or specialized devices like a bulky camera setup and a specialized microlens array. In an effort to broaden its benefit and applicability, in this paper, we propose a novel view synthesis method for light field generation from only single images, named inverse image-based rendering. Unlike previous attempts to implicitly rebuild 3D geometry or to explicitly represent objective scenes, our method reconstructs light flows in a space from image pixels, which behaves in the opposite way to image-based rendering. To accomplish this, we design a neural rendering pipeline to render a target ray in an arbitrary viewpoint. Our neural renderer first stores the light flow of source rays from the input image, then computes the relationships among them through cross-attention, and finally predicts the color of the target ray based on these relationships. After the rendering pipeline generates the first novel view from a single input image, the generated out-of-view contents are updated to the set of source rays. This procedure is iteratively performed while ensuring the consistent generation of occluded contents. We demonstrate that our inverse image-based rendering works well with various challenging datasets without any retraining or finetuning after once trained on synthetic dataset, and outperforms relevant state-of-the-art novel view synthesis methods.
79. 【2510.20126】Physics-Guided Fusion for Robust 3D Tracking of Fast Moving Small Objects
链接:https://arxiv.org/abs/2510.20126
作者:Prithvi Raj Singh,Raju Gottumukkala,Anthony S. Maida,Alan B. Barhorst,Vijaya Gopu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:objects remains underexplored, tiny objects remains, remains underexplored, general object detection, fast-moving tiny objects
备注: 13 pages, 6 figures
点击查看摘要
Abstract:While computer vision has advanced considerably for general object detection and tracking, the specific problem of fast-moving tiny objects remains underexplored. This paper addresses the significant challenge of detecting and tracking rapidly moving small objects using an RGB-D camera. Our novel system combines deep learning-based detection with physics-based tracking to overcome the limitations of existing approaches. Our contributions include: (1) a comprehensive system design for object detection and tracking of fast-moving small objects in 3D space, (2) an innovative physics-based tracking algorithm that integrates kinematics motion equations to handle outliers and missed detections, and (3) an outlier detection and correction module that significantly improves tracking performance in challenging scenarios such as occlusions and rapid direction changes. We evaluated our proposed system on a custom racquetball dataset. Our evaluation shows our system surpassing kalman filter based trackers with up to 70\% less Average Displacement Error. Our system has significant applications for improving robot perception on autonomous platforms and demonstrates the effectiveness of combining physics-based models with deep learning approaches for real-time 3D detection and tracking of challenging small objects.
80. 【2510.20108】Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning
链接:https://arxiv.org/abs/2510.20108
作者:Gabriel Y. Arteaga,Marius Aasan,Rwiddhi Chakraborty,Martine Hjelkrem-Tan,Thalles Silva,Michael Kampffmeyer,Adín Ramírez Rivera
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Prototypical self-supervised learning, Prototypical self-supervised, self-supervised learning methods, multiple prototypes converge, methods consistently suffer
备注:
点击查看摘要
Abstract:Prototypical self-supervised learning methods consistently suffer from partial prototype collapse, where multiple prototypes converge to nearly identical representations. This undermines their central purpose -- providing diverse and informative targets to guide encoders toward rich representations -- and has led practitioners to over-parameterize prototype sets or add ad-hoc regularizers, which mitigate symptoms rather than address the root cause. We empirically trace the collapse to the joint optimization of encoders and prototypes, which encourages a type of shortcut learning: early in training prototypes drift toward redundant representations that minimize loss without necessarily enhancing representation diversity. To break the joint optimization, we introduce a fully decoupled training strategy that learns prototypes and encoders under separate objectives. Concretely, we model prototypes as a Gaussian mixture updated with an online EM-style procedure, independent of the encoder's loss. This simple yet principled decoupling eliminates prototype collapse without explicit regularization and yields consistently diverse prototypes and stronger downstream performance.
81. 【2510.20095】BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
链接:https://arxiv.org/abs/2510.20095
作者:Ziheng Zhang,Xinyue Ma,Arpita Chowdhury,Elizabeth G. Campolongo,Matthew J. Thompson,Net Zhang,Samuel Stevens,Hilmar Lapp,Tanya Berger-Wolf,Yu Su,Wei-Lun Chao,Jianyang Gu
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:work investigates descriptive, work investigates, additional source, captions, investigates descriptive captions
备注: Project page: [this https URL](https://imageomics.github.io/biocap/)
点击查看摘要
Abstract:This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We complement this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BIOCAP (i.e., BIOCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.
82. 【2510.20093】StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback
链接:https://arxiv.org/abs/2510.20093
作者:Jiho Park,Sieun Choi,Jaeyoon Seo,Jihie Kim
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:synthesizing pixel-based human-drawn, pixel-based human-drawn sketches, abstract expression, generated images, diffusion models
备注: Under review at IEEE Access. Author-submitted preprint. Not the IEEE-published version
点击查看摘要
Abstract:Although recent advancements in diffusion models have significantly enriched the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative example of abstract expression. To combat these challenges, we propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity, achieving better alignment with prompts compared to the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge, the first dataset comprising instance-level sketches paired with captions and question-answer pairs, thereby addressing the limitations of existing datasets that rely on image-label pairs. Our code and dataset will be made publicly available upon acceptance.
83. 【2510.20092】Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency
链接:https://arxiv.org/abs/2510.20092
作者:Hao Yu,Haoyu Chen,Yan Jiang,Wei Peng,Zhaodong Sun,Samuel Kaski,Guoying Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:modern vision backbones, cornerstone of modern, Conv, traditional Convolutions, Self-attention
备注:
点击查看摘要
Abstract:Self-attention (SA) has become the cornerstone of modern vision backbones for its powerful expressivity over traditional Convolutions (Conv). However, its quadratic complexity remains a critical bottleneck for practical applications. Given that Conv offers linear complexity and strong visual priors, continuing efforts have been made to promote the renaissance of Conv. However, a persistent performance chasm remains, highlighting that these modernizations have not yet captured the intrinsic expressivity that defines SA. In this paper, we re-examine the design of the CNNs, directed by a key question: what principles give SA its edge over Conv? As a result, we reveal two fundamental insights that challenge the long-standing design intuitions in prior research (e.g., Receptive field). The two findings are: (1) \textit{Adaptive routing}: SA dynamically regulates positional information flow according to semantic content, whereas Conv employs static kernels uniformly across all positions. (2) \textit{Lateral inhibition}: SA induces score competition among token weighting, effectively suppressing redundancy and sharpening representations, whereas Conv filters lack such inhibitory dynamics and exhibit considerable redundancy. Based on this, we propose \textit{Attentive Convolution} (ATConv), a principled reformulation of the convolutional operator that intrinsically injects these principles. Interestingly, with only $3\times3$ kernels, ATConv consistently outperforms various SA mechanisms in fundamental vision tasks. Building on ATConv, we introduce AttNet, a CNN family that can attain \textbf{84.4\%} ImageNet-1K Top-1 accuracy with only 27M parameters. In diffusion-based image generation, replacing all SA with the proposed $3\times 3$ ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 in 400k steps with faster sampling. Code is available at: this http URL.
84. 【2510.20087】Endoshare: A Source Available Solution to De-Identify and Manage Surgical Videos
链接:https://arxiv.org/abs/2510.20087
作者:Lorenzo Arboit,Dennis N. Schneider,Britty Baby,Vinkle Srivastav,Pietro Mascagni,Nicolas Padoy
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Video-based assessment, advance surgical training, surgical data science, quality improvement, data science
备注: 13 pages, 6 figures. Source-available software: [this https URL](https://camma-public.github.io/Endoshare/)
点击查看摘要
Abstract:Video-based assessment and surgical data science can advance surgical training, research, and quality improvement. However, widespread use remains limited by heterogeneous recording formats and privacy concerns associated with video sharing. We present Endoshare, a source-available, cross-platform application for merging, standardizing, and de-identifying endoscopic videos in minimally invasive surgery. Development followed the software development life cycle with iterative, user-centered feedback. During the analysis phase, an internal survey of clinicians and computer scientists based on ten usability heuristics identified key requirements that guided a privacy-by-design architecture. In the testing phase, an external clinician survey combined the same heuristics with Technology Acceptance Model constructs to assess usability and adoption, complemented by benchmarking across different hardware configurations. Four clinicians and four computer scientists initially tested the prototype, reporting high usability (4.68 +/- 0.40/5 and 4.03 +/- 0.51/5), with the lowest score (4.00 +/- 0.93/5) relating to label clarity. After refinement, the testing phase surveyed ten surgeons who reported high perceived usefulness (5.07 +/- 1.75/7), ease of use (5.15 +/- 1.71/7), heuristic usability (4.38 +/- 0.48/5), and strong recommendation (9.20 +/- 0.79/10). Processing time varied with processing mode, video duration (both p = 0.001), and machine computational power (p = 0.041). Endoshare provides a transparent, user-friendly pipeline for standardized, privacy-preserving surgical video management. Compliance certification and broader interoperability validation are needed to establish it as a deployable alternative to proprietary systems. The software is available at this https URL
85. 【2510.20077】Data-Adaptive Transformed Bilateral Tensor Low-Rank Representation for Clustering
链接:https://arxiv.org/abs/2510.20077
作者:Hui Chen,Xinjie Wang,Xianchao Xiu,Wanquan Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated significant success, Tensor low-rank representation, demonstrated significant, significant success, TLRR
备注:
点击查看摘要
Abstract:Tensor low-rank representation (TLRR) has demonstrated significant success in image clustering. However, most existing methods rely on fixed transformations and suffer from poor robustness to noise. In this paper, we propose a novel transformed bilateral tensor low-rank representation model called TBTLRR, which introduces a data-adaptive tensor nuclear norm by learning arbitrary unitary transforms, allowing for more effective capture of global correlations. In addition, by leveraging the bilateral structure of latent tensor data, TBTLRR is able to exploit local correlations between image samples and features. Furthermore, TBTLRR integrates the $\ell_{1/2}$-norm and Frobenius norm regularization terms for better dealing with complex noise in real-world scenarios. To solve the proposed nonconvex model, we develop an efficient optimization algorithm inspired by the alternating direction method of multipliers (ADMM) and provide theoretical convergence. Extensive experiments validate its superiority over the state-of-the-art methods in clustering. The code will be available at this https URL.
86. 【2510.20071】Filter-Based Reconstruction of Images from Events
链接:https://arxiv.org/abs/2510.20071
作者:Bernd Pfrommer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:graphics processing units, neural networks deployed, processing units, FIlter Based Asynchronous, typically approached
备注:
点击查看摘要
Abstract:Reconstructing an intensity image from the events of a moving event camera is a challenging task that is typically approached with neural networks deployed on graphics processing units. This paper presents a much simpler, FIlter Based Asynchronous Reconstruction method (FIBAR). First, intensity changes signaled by events are integrated with a temporal digital IIR filter. To reduce reconstruction noise, stale pixels are detected by a novel algorithm that regulates a window of recently updated pixels. Arguing that for a moving camera, the absence of events at a pixel location likely implies a low image gradient, stale pixels are then blurred with a Gaussian filter. In contrast to most existing methods, FIBAR is asynchronous and permits image read-out at an arbitrary time. It runs on a modern laptop CPU at about 42(140) million events/s with (without) spatial filtering enabled. A few simple qualitative experiments are presented that show the difference in image reconstruction between FIBAR and a neural network-based approach (FireNet). FIBAR's reconstruction is noisier than neural network-based methods and suffers from ghost images. However, it is sufficient for certain tasks such as the detection of fiducial markers. Code is available at this https URL
87. 【2510.20042】Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models
链接:https://arxiv.org/abs/2510.20042
作者:Huichan Seo,Sieun Choi,Minki Hong,Yi Zhou,Junseo Kim,Lukman Ismaila,Naome Etori,Mehul Agarwal,Zhixuan Liu,Jihie Kim,Jean Oh
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:produce striking visuals, models produce striking, misrepresent culture, produce striking, striking visuals
备注: 28 pages, 8 figures. Submitted to the Second Conference of the International Association for Safe and Ethical Artificial Intelligence (IASEAI '26)
点击查看摘要
Abstract:Generative image models produce striking visuals yet often misrepresent culture. Prior work has examined cultural bias mainly in text-to-image (T2I) systems, leaving image-to-image (I2I) editors underexplored. We bridge this gap with a unified evaluation across six countries, an 8-category/36-subcategory schema, and era-aware prompts, auditing both T2I generation and I2I editing under a standardized protocol that yields comparable diagnostics. Using open models with fixed settings, we derive cross-country, cross-era, and cross-category evaluations. Our framework combines standard automatic metrics, a culture-aware retrieval-augmented VQA, and expert human judgments collected from native reviewers. To enable reproducibility, we release the complete image corpus, prompts, and configurations. Our study reveals three findings: (1) under country-agnostic prompts, models default to Global-North, modern-leaning depictions that flatten cross-country distinctions; (2) iterative I2I editing erodes cultural fidelity even when conventional metrics remain flat or improve; and (3) I2I models apply superficial cues (palette shifts, generic props) rather than era-consistent, context-aware changes, often retaining source identity for Global-South targets. These results highlight that culture-sensitive edits remain unreliable in current systems. By releasing standardized data, prompts, and human evaluation protocols, we provide a reproducible, culture-centered benchmark for diagnosing and tracking cultural bias in generative image models.
88. 【2510.20029】BrainPuzzle: Hybrid Physics and Data-Driven Reconstruction for Transcranial Ultrasound Tomography
链接:https://arxiv.org/abs/2510.20029
作者:Shengyu Chen,Shihang Feng,Yi Luo,Xiaowei Jia,Youzuo Lin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains challenging due, coupling large probes, imaging remains challenging, Ultrasound brain imaging, brain imaging remains
备注: 13 pages
点击查看摘要
Abstract:Ultrasound brain imaging remains challenging due to the large difference in sound speed between the skull and brain tissues and the difficulty of coupling large probes to the skull. This work aims to achieve quantitative transcranial ultrasound by reconstructing an accurate speed-of-sound (SoS) map of the brain. Traditional physics-based full-waveform inversion (FWI) is limited by weak signals caused by skull-induced attenuation, mode conversion, and phase aberration, as well as incomplete spatial coverage since full-aperture arrays are clinically impractical. In contrast, purely data-driven methods that learn directly from raw ultrasound data often fail to model the complex nonlinear and nonlocal wave propagation through bone, leading to anatomically plausible but quantitatively biased SoS maps under low signal-to-noise and sparse-aperture conditions. To address these issues, we propose BrainPuzzle, a hybrid two-stage framework that combines physical modeling with machine learning. In the first stage, reverse time migration (time-reversal acoustics) is applied to multi-angle acquisitions to produce migration fragments that preserve structural details even under low SNR. In the second stage, a transformer-based super-resolution encoder-decoder with a graph-based attention unit (GAU) fuses these fragments into a coherent and quantitatively accurate SoS image. A partial-array acquisition strategy using a movable low-count transducer set improves feasibility and coupling, while the hybrid algorithm compensates for the missing aperture. Experiments on two synthetic datasets show that BrainPuzzle achieves superior SoS reconstruction accuracy and image completeness, demonstrating its potential for advancing quantitative ultrasound brain imaging.
89. 【2510.20027】Extreme Views: 3DGS Filter for Novel View Synthesis from Out-of-Distribution Camera Poses
链接:https://arxiv.org/abs/2510.20027
作者:Damian Bowness,Charalambos Poullis
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:Gaussian Splatting, noise commonly occurs, camera positions significantly, training data distribution, substantial visual noise
备注:
点击查看摘要
Abstract:When viewing a 3D Gaussian Splatting (3DGS) model from camera positions significantly outside the training data distribution, substantial visual noise commonly occurs. These artifacts result from the lack of training data in these extrapolated regions, leading to uncertain density, color, and geometry predictions from the model. To address this issue, we propose a novel real-time render-aware filtering method. Our approach leverages sensitivity scores derived from intermediate gradients, explicitly targeting instabilities caused by anisotropic orientations rather than isotropic variance. This filtering method directly addresses the core issue of generative uncertainty, allowing 3D reconstruction systems to maintain high visual fidelity even when users freely navigate outside the original training viewpoints. Experimental evaluation demonstrates that our method substantially improves visual quality, realism, and consistency compared to existing Neural Radiance Field (NeRF)-based approaches such as BayesRays. Critically, our filter seamlessly integrates into existing 3DGS rendering pipelines in real-time, unlike methods that require extensive post-hoc retraining or fine-tuning. Code and results at this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Cite as:
arXiv:2510.20027 [cs.CV]
(or
arXiv:2510.20027v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2510.20027
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Damian Bowness [view email] [v1]
Wed, 22 Oct 2025 21:09:16 UTC (37,590 KB)
90. 【2510.20016】A Unified Detection Pipeline for Robust Object Detection in Fisheye-Based Traffic Surveillance
链接:https://arxiv.org/abs/2510.20016
作者:Neema Jakisa Owor,Joshua Kofi Asamoah,Tanner Wambui Muturi,Anneliese Jakisa Owor,Blessing Agyei Kyem,Andrews Danyo,Yaw Adu-Gyamfi,Armstrong Aboah
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:single vantage point, capturing large fields, Fisheye cameras offer, wide-area traffic surveillance, vantage point
备注: The paper was accepted at ICCV 2025 and published in CVF database
点击查看摘要
Abstract:Fisheye cameras offer an efficient solution for wide-area traffic surveillance by capturing large fields of view from a single vantage point. However, the strong radial distortion and nonuniform resolution inherent in fisheye imagery introduce substantial challenges for standard object detectors, particularly near image boundaries where object appearance is severely degraded. In this work, we present a detection framework designed to operate robustly under these conditions. Our approach employs a simple yet effective pre and post processing pipeline that enhances detection consistency across the image, especially in regions affected by severe distortion. We train several state-of-the-art detection models on the fisheye traffic imagery and combine their outputs through an ensemble strategy to improve overall detection accuracy. Our method achieves an F1 score of0.6366 on the 2025 AI City Challenge Track 4, placing 8thoverall out of 62 teams. These results demonstrate the effectiveness of our framework in addressing issues inherent to fisheye imagery.
91. 【2510.20011】Improving Predictive Confidence in Medical Imaging via Online Label Smoothing
链接:https://arxiv.org/abs/2510.20011
作者:Kushan Choudhury,Shubhrodeep Roy,Ankur Chanda,Shubhajit Biswas,Somenath Kuiry
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:convolutional neural networks, Deep learning models, achieved impressive results, neural networks, Deep learning
备注: Accepted and presented in International Conference on Advancing Science and Technologies in Health Science
点击查看摘要
Abstract:Deep learning models, especially convolutional neural networks, have achieved impressive results in medical image classification. However, these models often produce overconfident predictions, which can undermine their reliability in critical healthcare settings. While traditional label smoothing offers a simple way to reduce such overconfidence, it fails to consider relationships between classes by treating all non-target classes equally. In this study, we explore the use of Online Label Smoothing (OLS), a dynamic approach that adjusts soft labels throughout training based on the model's own prediction patterns. We evaluate OLS on the large-scale RadImageNet dataset using three widely used architectures: ResNet-50, MobileNetV2, and VGG-19. Our results show that OLS consistently improves both Top-1 and Top-5 classification accuracy compared to standard training methods, including hard labels, conventional label smoothing, and teacher-free knowledge distillation. In addition to accuracy gains, OLS leads to more compact and well-separated feature embeddings, indicating improved representation learning. These findings suggest that OLS not only strengthens predictive performance but also enhances calibration, making it a practical and effective solution for developing trustworthy AI systems in the medical imaging domain.
92. 【2510.19986】Automating Iconclass: LLMs and RAG for Large-Scale Classification of Religious Woodcuts
链接:https://arxiv.org/abs/2510.19986
作者:Drew B. Thomas
类目:Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Language Models, Large Language, Holy Roman Empire, Retrieval-Augmented Generation
备注: 29 pages, 7 figures. First presented at the "Digital Humanities and Artificial Intelligence" conference at the University of Reading on 17 June 2024
点击查看摘要
Abstract:This paper presents a novel methodology for classifying early modern religious images by using Large Language Models (LLMs) and vector databases in combination with Retrieval-Augmented Generation (RAG). The approach leverages the full-page context of book illustrations from the Holy Roman Empire, allowing the LLM to generate detailed descriptions that incorporate both visual and textual elements. These descriptions are then matched to relevant Iconclass codes through a hybrid vector search. This method achieves 87% and 92% precision at five and four levels of classification, significantly outperforming traditional image and keyword-based searches. By employing full-page descriptions and RAG, the system enhances classification accuracy, offering a powerful tool for large-scale analysis of early modern visual archives. This interdisciplinary approach demonstrates the growing potential of LLMs and RAG in advancing research within art history and digital humanities.
93. 【2510.19981】FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking
链接:https://arxiv.org/abs/2510.19981
作者:Martha Teiko Teye,Ori Maoz,Matthias Rottmann
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:modular camera-LiDAR multi-object, camera-LiDAR multi-object tracking, multi-object tracking framework, builds on existing, detectors by introducing
备注:
点击查看摘要
Abstract:We propose FutrTrack, a modular camera-LiDAR multi-object tracking framework that builds on existing 3D detectors by introducing a transformer-based smoother and a fusion-driven tracker. Inspired by query-based tracking frameworks, FutrTrack employs a multimodal two-stage transformer refinement and tracking pipeline. Our fusion tracker integrates bounding boxes with multimodal bird's-eye-view (BEV) fusion features from multiple cameras and LiDAR without the need for an explicit motion model. The tracker assigns and propagates identities across frames, leveraging both geometric and semantic cues for robust re-identification under occlusion and viewpoint changes. Prior to tracking, we refine sequences of bounding boxes with a temporal smoother over a moving window to refine trajectories, reduce jitter, and improve spatial consistency. Evaluated on nuScenes and KITTI, FutrTrack demonstrates that query-based transformer tracking methods benefit significantly from multimodal sensor features compared with previous single-sensor approaches. With an aMOTA of 74.7 on the nuScenes test set, FutrTrack achieves strong performance on 3D MOT benchmarks, reducing identity switches while maintaining competitive accuracy. Our approach provides an efficient framework for improving transformer-based trackers to compete with other neural-network-based methods even with limited data and without pretraining.
94. 【2510.19955】ransformed Multi-view 3D Shape Features with Contrastive Learning
链接:https://arxiv.org/abs/2510.19955
作者:Márcus Vinícius Lobo Costa,Sherlon Almeida da Silva,Bárbara Caroline Benato,Leo Sampaio Ferraz Ribeiro,Moacir Antonelli Ponti
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Convolutional Neural Networks, Neural Networks, paper addresses, addresses the challenges, Convolutional Neural
备注:
点击查看摘要
Abstract:This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods struggle with recognizing 3D objects from 2D images, often requiring extensive labeled data and relying on Convolutional Neural Networks (CNNs) that may overlook crucial shape relationships. Our work demonstrates that Vision Transformers (ViTs) based architectures, when paired with modern contrastive objectives, achieve promising results in multi-view 3D analysis on our downstream tasks, unifying contrastive and 3D shape understanding pipelines. For example, supervised contrastive losses reached about 90.6% accuracy on ModelNet10. The use of ViTs and contrastive learning, leveraging ViTs' ability to understand overall shapes and contrastive learning's effectiveness, overcomes the need for extensive labeled data and the limitations of CNNs in capturing crucial shape relationships. The success stems from capturing global shape semantics via ViTs and refining local discriminative features through contrastive optimization. Importantly, our approach is empirical, as it is grounded on extensive experimental evaluation to validate the effectiveness of combining ViTs with contrastive objectives for 3D representation learning.
95. 【2510.19917】FINDER: Feature Inference on Noisy Datasets using Eigenspace Residuals
链接:https://arxiv.org/abs/2510.19917
作者:Trajan Murphy,Akshunna S. Dogra,Hanfeng Gu,Caleb Meredith,Mark Kon,Julio Enrique Castrillion-Candas
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
关键词:small sample sizes, key research frontier, faulty data collection, regimes with low, noise ratios
备注: 30 pages, 11 figures, 8 tables. Code available at [this https URL](https://github.com/MathePhysics/FINDER)
点击查看摘要
Abstract:''Noisy'' datasets (regimes with low signal to noise ratios, small sample sizes, faulty data collection, etc) remain a key research frontier for classification methods with both theoretical and practical implications. We introduce FINDER, a rigorous framework for analyzing generic classification problems, with tailored algorithms for noisy datasets. FINDER incorporates fundamental stochastic analysis ideas into the feature learning and inference stages to optimally account for the randomness inherent to all empirical datasets. We construct ''stochastic features'' by first viewing empirical datasets as realizations from an underlying random field (without assumptions on its exact distribution) and then mapping them to appropriate Hilbert spaces. The Kosambi-Karhunen-Loéve expansion (KLE) breaks these stochastic features into computable irreducible components, which allow classification over noisy datasets via an eigen-decomposition: data from different classes resides in distinct regions, identified by analyzing the spectrum of the associated operators. We validate FINDER on several challenging, data-deficient scientific domains, producing state of the art breakthroughs in: (i) Alzheimer's Disease stage classification, (ii) Remote sensing detection of deforestation. We end with a discussion on when FINDER is expected to outperform existing methods, its failure modes, and other limitations.
96. 【2510.19840】Fourier-Based GAN Fingerprint Detection using ResNet50
链接:https://arxiv.org/abs/2510.19840
作者:Sai Teja Erukude,Viswa Chaitanya Marella,Suhasnadh Reddy Veluru
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Generative Adversarial Networks, Generative Adversarial, reliable content authenticity, requiring reliable content, produced from Generative
备注: 6 pages. Published in IEEE
点击查看摘要
Abstract:The rapid rise of photorealistic images produced from Generative Adversarial Networks (GANs) poses a serious challenge for image forensics and industrial systems requiring reliable content authenticity. This paper uses frequency-domain analysis combined with deep learning to solve the problem of distinguishing StyleGAN-generated images from real ones. Specifically, a two-dimensional Discrete Fourier Transform (2D DFT) was applied to transform images into the Fourier domain, where subtle periodic artifacts become detectable. A ResNet50 neural network is trained on these transformed images to differentiate between real and synthetic ones. The experiments demonstrate that the frequency-domain model achieves a 92.8 percent and an AUC of 0.95, significantly outperforming the equivalent model trained on raw spatial-domain images. These results indicate that the GAN-generated images have unique frequency-domain signatures or "fingerprints". The method proposed highlights the industrial potential of combining signal processing techniques and deep learning to enhance digital forensics and strengthen the trustworthiness of industrial AI systems.
97. 【2402.01555】SLYKLatent: A Learning Framework for Gaze Estimation Using Deep Facial Feature Learning
链接:https://arxiv.org/abs/2402.01555
作者:Samuel Adebayo,Joost C. Dessing,Seán McLoone
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
关键词:enhancing gaze estimation, addressing appearance instability, appearance instability challenges, test domain generalization, covariant shifts
备注:
点击查看摘要
Abstract:In this research, we present SLYKLatent, a novel approach for enhancing gaze estimation by addressing appearance instability challenges in datasets due to aleatoric uncertainties, covariant shifts, and test domain generalization. SLYKLatent utilizes Self-Supervised Learning for initial training with facial expression datasets, followed by refinement with a patch-based tri-branch network and an inverse explained variance-weighted training loss function. Our evaluation on benchmark datasets achieves a 10.9% improvement on Gaze360, supersedes top MPIIFaceGaze results with 3.8%, and leads on a subset of ETH-XGaze by 11.6%, surpassing existing methods by significant margins. Adaptability tests on RAF-DB and Affectnet show 86.4% and 60.9% accuracies, respectively. Ablation studies confirm the effectiveness of SLYKLatent's novel components.
98. 【2510.20266】GUSL-Dehaze: A Green U-Shaped Learning Approach to Image Dehazing
链接:https://arxiv.org/abs/2510.20266
作者:Mahtab Movaheddrad,Laurence Palmer,C.-C. Jay Kuo
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:single hazy input, hazy input, restoration task, task that aims, aims to recover
备注:
点击查看摘要
Abstract:Image dehazing is a restoration task that aims to recover a clear image from a single hazy input. Traditional approaches rely on statistical priors and the physics-based atmospheric scattering model to reconstruct the haze-free image. While recent state-of-the-art methods are predominantly based on deep learning architectures, these models often involve high computational costs and large parameter sizes, making them unsuitable for resource-constrained devices. In this work, we propose GUSL-Dehaze, a Green U-Shaped Learning approach to image dehazing. Our method integrates a physics-based model with a green learning (GL) framework, offering a lightweight, transparent alternative to conventional deep learning techniques. Unlike neural network-based solutions, GUSL-Dehaze completely avoids deep learning. Instead, we begin with an initial dehazing step using a modified Dark Channel Prior (DCP), which is followed by a green learning pipeline implemented through a U-shaped architecture. This architecture employs unsupervised representation learning for effective feature extraction, together with feature-engineering techniques such as the Relevant Feature Test (RFT) and the Least-Squares Normal Transform (LNT) to maintain a compact model size. Finally, the dehazed image is obtained via a transparent supervised learning strategy. GUSL-Dehaze significantly reduces parameter count while ensuring mathematical interpretability and achieving performance on par with state-of-the-art deep learning models.
99. 【2510.20012】AI Pose Analysis and Kinematic Profiling of Range-of-Motion Variations in Resistance Training
链接:https://arxiv.org/abs/2510.20012
作者:Adam Diamant
类目:Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV)
关键词:pose estimation pipeline, enable precise quantification, AI-based pose estimation, study develops, pose estimation
备注:
点击查看摘要
Abstract:This study develops an AI-based pose estimation pipeline to enable precise quantification of movement kinematics in resistance training. Using video data from Wolf et al. (2025), which compared lengthened partial (pROM) and full range-of-motion (fROM) training across eight upper-body exercises in 26 participants, 280 recordings were processed to extract frame-level joint-angle trajectories. After filtering and smoothing, per-set metrics were derived, including range of motion (ROM), tempo, and concentric/eccentric phase durations. A random-effects meta-analytic model was applied to account for within-participant and between-exercise variability. Results show that pROM repetitions were performed with a smaller ROM and shorter overall durations, particularly during the eccentric phase of movement. Variance analyses revealed that participant-level differences, rather than exercise-specific factors, were the primary driver of variation, although there is substantial evidence of heterogeneous treatment effects. We then introduce a novel metric, \%ROM, which is the proportion of full ROM achieved during pROM, and demonstrate that this definition of lengthened partials remains relatively consistent across exercises. Overall, these findings suggest that lengthened partials differ from full ROM training not only in ROM, but also in execution dynamics and consistency, highlighting the potential of AI-based methods for advancing research and improving resistance training prescription.




