本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新488篇论文,其中:

  • 自然语言处理49
  • 信息检索10
  • 计算机视觉78

自然语言处理

1. 【2511.02834】Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything

链接https://arxiv.org/abs/2511.02834

作者:Huawei Lin,Yunzhi Shi,Tong Geng,Weijie Zhao,Wei Wang,Ravender Pal Singh

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:large aligned datasets, shown strong capabilities, fixed modality pairs, require costly fine-tuning, Multimodal large language

备注: 16 pages, 7 figures, 14 tables. Under Review

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available. %We release an open-source implementation to support continued research on scalable and reliable omni-modal reasoning.

2. 【2511.02833】In Good GRACEs: Principled Teacher Selection for Knowledge Distillation

链接https://arxiv.org/abs/2511.02833

作者:Abhishek Panigrahi,Bingbin Liu,Sadhika Malladi,Sham Kakade,Surbhi Goel

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:train smaller capable, combination requires expensive, student-task combination requires, specific student-task combination, Knowledge distillation

备注

点击查看摘要

Abstract:Knowledge distillation is an efficient strategy to use data generated by large "teacher" language models to train smaller capable "student" models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student's gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using the GRACE-selected teacher can improve the performance by up to 7.4% over naively using the best-performing teacher. Further, GRACE can provide guidance on crucial design choices in distillation, including (1) the best temperature to use when generating from the teacher, (2) the best teacher to use given a size constraint, and (3) the best teacher to use within a specific model family. Altogether, our findings demonstrate that GRACE can efficiently and effectively identify a strongly compatible teacher for a given student and provide fine-grained guidance on how to perform distillation.

3. 【2511.02817】Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

链接https://arxiv.org/abs/2511.02817

作者:Amanda Bertsch,Adithya Pratapa,Teruko Mitamura,Graham Neubig,Matthew R. Gormley

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:context lengths continue, full context length, model context lengths, lengths continue, continue to grow

备注: Preprint

点击查看摘要

Abstract:As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.

4. 【2511.02805】MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

链接https://arxiv.org/abs/2511.02805

作者:Qianhao Yuan,Jie Lou,Zichao Li,Jiawei Chen,Yaojie Lu,Hongyu Lin,Le Sun,Debing Zhang,Xianpei Han

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Typical search agents, entire interaction history, Typical search, producing long, resulting in high

备注: Project page: [this https URL](https://github.com/icip-cas/MemSearcher)

点击查看摘要

Abstract:Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimize reasoning, search strategies, and memory management of MemSearcher Agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks: +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct relative average gains. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at this https URL

5. 【2511.02795】Can LLMs subtract numbers?

链接https://arxiv.org/abs/2511.02795

作者:Mayank Jobanputra,Nils Philipp Walter,Maitrey Mehta,Blerta Veseli,Evan Parker Kelly Chapple,Yifan Wang,Sneha Chetani,Ellie Pavlick,Antonio Vergari,Vera Demberg

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:large language models, present a systematic, systematic study, large language, subtraction

备注: Work-in-progress; MathNLP non-archival presentation

点击查看摘要

Abstract:We present a systematic study of subtraction in large language models (LLMs). While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four families on addition and subtraction problems. Our experiments reveal that subtraction accuracy lags behind addition by a wide margin. We find that the errors for ($a-b$) are concentrated in cases where ($ab$). In such cases, LLMs frequently produce the correct magnitude but omit the negative sign. Probing analyses show that LLMs internally encode whether results should be negative, yet this information is often not reflected in generated outputs. We further test well-known techniques such as few-shot learning and instruction-tuning to see if they can improve the LLMs' performance. Our results suggest that while few-shot prompting yields modest gains, the instruction-tuned models achieve near-perfect accuracies in generating the negative sign. Together, these findings provide a clearer characterization of the limitations and recoverability of LLMs' arithmetic capabilities in subtraction.

6. 【2511.02778】VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

链接https://arxiv.org/abs/2511.02778

作者:Kevin Qinghong Lin,Yuhao Zheng,Hangyu Ran,Dantong Zhu,Dongxing Mao,Linjie Li,Philip Torr,Alex Jinpeng Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:agent era, SVG code, precise and executable, executable medium, Code

备注: Project page: [this https URL](https://csu-jpg.github.io/VCode) Github: [this https URL](https://github.com/CSU-JPG/VCode)

点击查看摘要

Abstract:Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at this https URL.

7. 【2511.02770】Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval

链接https://arxiv.org/abs/2511.02770

作者:Hung-Ting Chen,Xiang Liu,Shauli Ravfogel,Eunsol Choi

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:emph, query, relevant documents, query vectors, text retrievers generate

备注

点击查看摘要

Abstract:Most text retrievers generate \emph{one} query vector to retrieve relevant documents. Yet, the conditional distribution of relevant documents for the query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers. All retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, \emph{A}utoregressive \emph{M}ulti-\emph{E}mbedding \emph{R}etriever (AMER). Our model autoregressively generates multiple query vectors, and all the predicted query vectors are used to retrieve documents from the corpus. We show that on the synthetic vectorized data, the proposed method could capture multiple target distributions perfectly, showing 4x better performance than single embedding model. We also fine-tune our model on real-world multi-answer retrieval datasets and evaluate in-domain. AMER presents 4 and 21\% relative gains over single-embedding baselines on two datasets we evaluate on. Furthermore, we consistently observe larger gains on the subset of dataset where the embeddings of the target documents are less similar to each other. We demonstrate the potential of using a multi-query vector retriever and open up a new direction for future work.

8. 【2511.02755】Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning

链接https://arxiv.org/abs/2511.02755

作者:Bowen Jin,TJ Collins,Donghan Yu,Mert Cemri,Shenao Zhang,Mengyu Li,Jay Tang,Tian Qin,Zhiyang Xu,Jiarui Lu,Guoli Yin,Jiawei Han,Zirui Wang

类目:Computation and Language (cs.CL)

关键词:Large language models, exhibit complementary strengths, Large language, models collaborate efficiently, specialized models collaborate

备注: 14 pages

点击查看摘要

Abstract:Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to have adapted behavior with different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings, while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.

9. 【2511.02752】AI Diffusion in Low Resource Language Countries

链接https://arxiv.org/abs/2511.02752

作者:Amit Misra,Syed Waqas Zamir,Wassim Hamidouche,Inbal Becker-Reshef,Juan Lavista Ferres

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:Artificial intelligence, adoption remains uneven, unprecedented speed, remains uneven, Large Language Models

备注: 9 pages, 4 tables. Also available at [this https URL](https://aka.ms/AI_Diffusion_Low_Resource_Language_Countries)

点击查看摘要

Abstract:Artificial intelligence (AI) is diffusing globally at unprecedented speed, but adoption remains uneven. Frontier Large Language Models (LLMs) are known to perform poorly on low-resource languages due to data scarcity. We hypothesize that this performance deficit reduces the utility of AI, thereby slowing adoption in Low-Resource Language Countries (LRLCs). To test this, we use a weighted regression model to isolate the language effect from socioeconomic and demographic factors, finding that LRLCs have a share of AI users that is approximately 20% lower relative to their baseline. These results indicate that linguistic accessibility is a significant, independent barrier to equitable AI diffusion.

10. 【2511.02734】CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

链接https://arxiv.org/abs/2511.02734

作者:Jiayu Liu,Cheng Qian,Zhaochen Su,Qing Zong,Shijue Huang,Bingxiang He,Yi R. Fung

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Model, overlooking resource efficiency, Large Language, Current evaluations, emphasize task completion

备注

点击查看摘要

Abstract:Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents' ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents' economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and necessitate agents to adapt in real time. Evaluating leading open-sourced and proprietary models on CostBench reveals a substantial gap in cost-aware planning: agents frequently fail to identify cost-optimal solutions in static settings, with even GPT-5 achieving less than 75% exact match rate on the hardest tasks, and performance further dropping by around 40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.

11. 【2511.02721】PragExTra: A Multilingual Corpus of Pragmatic Explicitation in Translation

链接https://arxiv.org/abs/2511.02721

作者:Doreen Osmelak,Koel Dutta Chowdhury,Uliana Sentsova,Cristina España-Bonet,Josef van Genabith

类目:Computation and Language (cs.CL)

关键词:make implicit cultural, implicit cultural meanings, cultural meanings explicit, enrich texts, texts with background

备注

点击查看摘要

Abstract:Translators often enrich texts with background details that make implicit cultural meanings explicit for new audiences. This phenomenon, known as pragmatic explicitation, has been widely discussed in translation theory but rarely modeled computationally. We introduce PragExTra, the first multilingual corpus and detection framework for pragmatic explicitation. The corpus covers eight language pairs from TED-Multi and Europarl and includes additions such as entity descriptions, measurement conversions, and translator remarks. We identify candidate explicitation cases through null alignments and refined using active learning with human annotation. Our results show that entity and system-level explicitations are most frequent, and that active learning improves classifier accuracy by 7-8 percentage points, achieving up to 0.88 accuracy and 0.82 F1 across languages. PragExTra establishes pragmatic explicitation as a measurable, cross-linguistic phenomenon and takes a step towards building culturally aware machine translation. Keywords: translation, multilingualism, explicitation

12. 【2511.02687】he Collaboration Gap

链接https://arxiv.org/abs/2511.02687

作者:Tim R. Davidson,Adam Fourney,Saleema Amershi,Robert West,Eric Horvitz,Ece Kamar

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:agent-based systems composed, development suggests, increasingly rely, rely on agent-based, composed of independently

备注

点击查看摘要

Abstract:The trajectory of AI development suggests that we will increasingly rely on agent-based systems composed of independently developed agents with different information, privileges, and tools. The success of these systems will critically depend on effective collaboration among these heterogeneous agents, even under partial observability. Despite intense interest, few empirical studies have evaluated such agent-agent collaboration at scale. We propose a collaborative maze-solving benchmark that (i) isolates collaborative capabilities, (ii) modulates problem complexity, (iii) enables scalable automated grading, and (iv) imposes no output-format constraints, preserving ecological plausibility. Using this framework, we evaluate 32 leading open- and closed-source models in solo, homogeneous, and heterogeneous pairings. Our results reveal a "collaboration gap": models that perform well solo often degrade substantially when required to collaborate. Collaboration can break down dramatically; for instance, small distilled models that solve mazes well alone may fail almost completely in certain pairings. We find that starting with the stronger agent often improves outcomes, motivating a "relay inference" approach where the stronger agent leads before handing off to the weaker one, closing much of the gap. Our findings argue for (1) collaboration-aware evaluation, (2) training strategies developed to enhance collaborative capabilities, and (3) interaction design that reliably elicits agents' latent skills, guidance that applies to AI-AI and human-AI collaboration.

13. 【2511.02681】Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes

链接https://arxiv.org/abs/2511.02681

作者:Mohammadsajad Alipour,Mohammad Mohammadi Amiri

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large language models, Large language, increasingly prevalent, prevalent across diverse, Large

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. However, even storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recently, studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be utilized for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, that selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed methods lead to significant storage efficiency and superior accuracy within the same memory budget compared to employing the low-rank approximation or sparsification individually.

14. 【2511.02626】Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis, Solution, and Interpretation

链接https://arxiv.org/abs/2511.02626

作者:Renfei Dang,Peng Hu,Changjiang Gao,Shujian Huang

类目:Computation and Language (cs.CL)

关键词:Previous studies show, Previous studies, large language models, knowledge, fine-tuning can lead

备注

点击查看摘要

Abstract:Previous studies show that introducing new knowledge during large language models (LLMs) fine-tuning can lead to the generation of erroneous output when tested on known information, thereby triggering factual hallucinations. However, existing studies have not deeply investigated the specific manifestations and underlying mechanisms of these hallucinations. Our work addresses this gap by designing a controlled dataset Biography-Reasoning, and conducting a fine-grained analysis across multiple knowledge types and two task types, including knowledge question answering (QA) and knowledge reasoning tasks. We find that when fine-tuned on a dataset in which a specific knowledge type consists entirely of new knowledge, LLMs exhibit significantly increased hallucination tendencies. This suggests that the high unfamiliarity of a particular knowledge type, rather than the overall proportion of new knowledge, is a stronger driver of hallucinations, and these tendencies can even affect other knowledge types in QA tasks. To mitigate such factual hallucinations, we propose KnownPatch, which patches a small number of known knowledge samples in the later stages of training, effectively alleviating new-knowledge-induced hallucinations. Through attention analysis, we find that learning new knowledge reduces the model's attention to key entities in the question, thus causing excessive focus on the surrounding context, which may increase the risk of hallucination. Moreover, the attention pattern can propagate to similar contexts, facilitating the spread of hallucinations to textually similar questions. Our method effectively mitigates the disruption of new knowledge learning to the model's attention on key entities, accompanied by improved performance.

15. 【2511.02623】he Realignment Problem: When Right becomes Wrong in LLMs

链接https://arxiv.org/abs/2511.02623

作者:Aakash Sen Sharma,Debdeep Sanyal,Vivek Srivastava,Shirish Karande,Murari Mandal

类目:Computation and Language (cs.CL)

关键词:Large Language Models, practice produces static, Large Language, current practice produces, produces static

备注: 23 Pages

点击查看摘要

Abstract:The alignment of Large Language Models (LLMs) with human values is central to their safe deployment, yet current practice produces static, brittle, and costly-to-maintain models that fail to keep pace with evolving norms and policies. This misalignment, which we term the Alignment-Reality Gap, poses a growing challenge for reliable long-term use. Existing remedies are inadequate: large-scale re-annotation is economically prohibitive, and standard unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework for principled unlearning that reconceives re-alignment as a programmatic policy application problem. TRACE programmatically triages existing preference data against a new policy, identifies high-impact conflicts via a alignment impact score, and applies a hybrid optimization that cleanly inverts, discards, or preserves preferences while safeguarding model performance. Empirical results show that TRACE achieves robust re-alignment across diverse model families (Qwen2.5-7B, Gemma-2-9B, Llama-3.1-8B). On both synthetic benchmarks and the PKU-SafeRLHF dataset under complex policy shift, TRACE enforces new principles without degrading general capabilities. Our work establishes a scalable, dynamic, and cost-effective paradigm for maintaining LLM alignment, providing a foundation for sustainable and responsible AI deployment.

16. 【2511.02607】UniChange: Unifying Change Detection with Multimodal Large Language Model

链接https://arxiv.org/abs/2511.02607

作者:Xu Zhang,Danyang Li,Xiaohang Dong,Tianhao Wu,Hualong Yu,Jianye Wang,Qicheng Li,Xiang Li

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:land cover dynamics, analyzing land cover, cover dynamics, Change detection, monitoring and analyzing

备注

点击查看摘要

Abstract:Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at this https URL.

17. 【2511.02603】CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency

链接https://arxiv.org/abs/2511.02603

作者:Ehsan Aghazadeh,Ahmad Ghasemi,Hedyeh Beyhaghi,Hossein Pishro-Nik

类目:Computation and Language (cs.CL)

关键词:Large language models, queried multiple times, Large language, majority vote, multiple times

备注: Efficient Reasoning @ NeurIPS2025

点击查看摘要

Abstract:Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls and can fail when the correct answer is rare. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models. CGES adaptively halts sampling once the posterior mass of a candidate exceeds a threshold. We provide theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals. Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69 percent (for example, from 16.0 to 4.9) while matching the accuracy of self-consistency within 0.06 percentage points.

18. 【2511.02599】Next Token Knowledge Tracing: Exploiting Pretrained LLM Representations to Decode Student Behaviour

链接https://arxiv.org/abs/2511.02599

作者:Max Norris,Kobi Gal,Sahan Bulathwela

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Modelling student knowledge, Token Knowledge Tracing, Knowledge Tracing, Modelling student, key challenge

备注

点击查看摘要

Abstract:Modelling student knowledge is a key challenge when leveraging AI in education, with major implications for personalised learning. The Knowledge Tracing (KT) task aims to predict how students will respond to educational questions in learning environments, based on their prior interactions. Existing KT models typically use response correctness along with metadata like skill tags and timestamps, often overlooking the question text, which is an important source of pedagogical insight. This omission poses a lost opportunity while limiting predictive performance. We propose Next Token Knowledge Tracing (NTKT), a novel approach that reframes KT as a next-token prediction task using pretrained Large Language Models (LLMs). NTKT represents both student histories and question content as sequences of text, allowing LLMs to learn patterns in both behaviour and language. Our series of experiments significantly improves performance over state-of-the-art neural KT models and generalises much better to cold-start questions and users. These findings highlight the importance of question content in KT and demonstrate the benefits of leveraging pretrained representations of LLMs to model student learning more effectively.

19. 【2511.02587】he Analysis of Lexical Errors in Machine Translation from English into Romanian

链接https://arxiv.org/abs/2511.02587

作者:Angela Stamatie(Dumitran)

类目:Computation and Language (cs.CL)

关键词:World Health Organization, patient information leaflet, include official information, Health Organization, Gavi Organization

备注: Doctoral thesis

点击查看摘要

Abstract:The research explores error analysis in the performance of translating by Machine Translation from English into Romanian, and it focuses on lexical errors found in texts which include official information, provided by the World Health Organization (WHO), the Gavi Organization, by the patient information leaflet (the information about the active ingredients of the vaccines or the medication, the indications, the dosage instructions, the storage instructions, the side effects and warning, etc.). All of these texts are related to Covid-19 and have been translated by Google Translate, a multilingual Machine Translation that was created by Google. In the last decades, Google has actively worked to develop a more accurate and fluent automatic translation system. This research, specifically focused on improving Google Translate, aims to enhance the overall quality of Machine Translation by achieving better lexical selection and by reducing errors. The investigation involves a comprehensive analysis of 230 texts that have been translated from English into Romanian.

20. 【2511.02537】Smart-Hiring: An Explainable end-to-end Pipeline for CV Information Extraction and Job Matching

链接https://arxiv.org/abs/2511.02537

作者:Kenza Khelkhal,Dihia Lanasri

类目:Computation and Language (cs.CL)

关键词:Natural Language Processing, effort consuming, processes often involve, involve the manual, manual screening

备注

点击查看摘要

Abstract:Hiring processes often involve the manual screening of hundreds of resumes for each job, a task that is time and effort consuming, error-prone, and subject to human bias. This paper presents Smart-Hiring, an end-to-end Natural Language Processing (NLP) pipeline de- signed to automatically extract structured information from unstructured resumes and to semantically match candidates with job descriptions. The proposed system combines document parsing, named-entity recognition, and contextual text embedding techniques to capture skills, experience, and qualifications. Using advanced NLP technics, Smart-Hiring encodes both resumes and job descriptions in a shared vector space to compute similarity scores between candidates and job postings. The pipeline is modular and explainable, allowing users to inspect extracted entities and matching rationales. Experiments were conducted on a real-world dataset of resumes and job descriptions spanning multiple professional domains, demonstrating the robustness and feasibility of the proposed approach. The system achieves competitive matching accuracy while preserving a high degree of interpretability and transparency in its decision process. This work introduces a scalable and practical NLP frame- work for recruitment analytics and outlines promising directions for bias mitigation, fairness-aware modeling, and large-scale deployment of data-driven hiring solutions.

21. 【2511.02495】DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding

链接https://arxiv.org/abs/2511.02495

作者:Zixuan Liu,Siavash H. Khajavi,Guangkai Jiang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:demonstrated strong performance, Recent advances, demonstrated strong, strong performance, Recent

备注: Advances in Neural Information Processing Systems 2025 (NeurIPS 2025), Poster, [this https URL](https://neurips.cc/virtual/2025/loc/san-diego/poster/121400)

点击查看摘要

Abstract:Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce DetectiumFire, a large-scale, multi-modal dataset comprising of 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community. The dataset is available at this https URL

22. 【2511.02458】Prompting for Policy: Forecasting Macroeconomic Scenarios with Synthetic LLM Personas

链接https://arxiv.org/abs/2511.02458

作者:Giulia Iadisernia,Carolina Camassa

类目:Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); General Economics (econ.GN)

关键词:Large Language Model, improves Large Language, prompting improves Large, Language Model, Large Language

备注: 9 pages, 8-pages appendix, accepted at ICAIF 25

点击查看摘要

Abstract:We evaluate whether persona-based prompting improves Large Language Model (LLM) performance on macroeconomic forecasting tasks. Using 2,368 economics-related personas from the PersonaHub corpus, we prompt GPT-4o to replicate the ECB Survey of Professional Forecasters across 50 quarterly rounds (2013-2025). We compare the persona-prompted forecasts against the human experts panel, across four target variables (HICP, core HICP, GDP growth, unemployment) and four forecast horizons. We also compare the results against 100 baseline forecasts without persona descriptions to isolate its effect. We report two main findings. Firstly, GPT-4o and human forecasters achieve remarkably similar accuracy levels, with differences that are statistically significant yet practically modest. Our out-of-sample evaluation on 2024-2025 data demonstrates that GPT-4o can maintain competitive forecasting performance on unseen events, though with notable differences compared to the in-sample period. Secondly, our ablation experiment reveals no measurable forecasting advantage from persona descriptions, suggesting these prompt components can be omitted to reduce computational costs without sacrificing accuracy. Our results provide evidence that GPT-4o can achieve competitive forecasting accuracy even on out-of-sample macroeconomic events, if provided with relevant context data, while revealing that diverse prompts produce remarkably homogeneous forecasts compared to human panels.

23. 【2511.02451】Merging Continual Pretraining Models for Domain-Specialized LLMs: A Case Study in Finance

链接https://arxiv.org/abs/2511.02451

作者:Kentaro Ueda,François Portet,Hirohiko Suwa,Keiichi Yasumoto

类目:Computation and Language (cs.CL)

关键词:domain-specific Continual Pre-training, requiring diverse skills, mathematical reasoning, specialized domains, requiring diverse

备注

点击查看摘要

Abstract:While LLMs excel at general tasks, they struggle in specialized domains like finance, requiring diverse skills in domain knowledge, mathematical reasoning, and multilingual processing. Merging domain-specific Continual Pre-training (CPT) "experts" offers a practical alternative to costly and unstable multi-skill training. However, unlike established Supervised Fine-Tuning (SFT) model-based merging, CPT model merging remains largely unexplored. We address this gap by creating financial LLMs from experts in finance, math, and Japanese. We propose a three-stage evaluation focusing on knowledge recovery, complementarity, and emergence, and assess three merging methods (Task Arithmetic, TIES, and DARE-TIES) on a comprehensive financial benchmark curated from 18 tasks across 8 established datasets. Results show that merging an expert with its base model recovers general knowledge lost during CPT, while merging experts improves performance and can yield emergent cross-domain skills. Among the methods, Task Arithmetic performs strongly but is hyperparameter-sensitive, whereas TIES is more robust. Our findings also suggest that while model similarity correlates with merging success, emergent skills depend on more complex factors. This work presents the first foundational analysis of CPT model merging, establishing a principled framework and providing clear guidance for building multi-skill LLMs from existing assets.

24. 【2511.02376】AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

链接https://arxiv.org/abs/2511.02376

作者:Aashray Reddy,Andrew Zagula,Nicholas Saban

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, elicit harmful outputs, adversarial prompts elicit, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns a 24 percent improvement over single turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.

25. 【2511.02374】AyurParam: A State-of-the-Art Bilingual Language Model for Ayurveda

链接https://arxiv.org/abs/2511.02374

作者:Mohd Nauman,Sravan Gvm,Vijay Devane,Shyam Pawar,Viraj Thakur,Kundeshwar Pundalik,Piyush Sawarkar,Rohit Saluja,Maunendra Desarkar,Ganesh Ramakrishnan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Current large language, require deep cultural, Current large, general-purpose tasks, excel at broad

备注

点击查看摘要

Abstract:Current large language models excel at broad, general-purpose tasks, but consistently underperform when exposed to highly specialized domains that require deep cultural, linguistic, and subject-matter expertise. In particular, traditional medical systems such as Ayurveda embody centuries of nuanced textual and clinical knowledge that mainstream LLMs fail to accurately interpret or apply. We introduce AyurParam-2.9B, a domain-specialized, bilingual language model fine-tuned from Param-1-2.9B using an extensive, expertly curated Ayurveda dataset spanning classical texts and clinical guidance. AyurParam's dataset incorporates context-aware, reasoning, and objective-style QA in both English and Hindi, with rigorous annotation protocols for factual precision and instructional clarity. Benchmarked on BhashaBench-Ayur, AyurParam not only surpasses all open-source instruction-tuned models in its size class (1.5--3B parameters), but also demonstrates competitive or superior performance compared to much larger models. The results from AyurParam highlight the necessity for authentic domain adaptation and high-quality supervision in delivering reliable, culturally congruent AI for specialized medical knowledge.

26. 【2511.02366】LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context

链接https://arxiv.org/abs/2511.02366

作者:Yudong Li,Zhongliang Yang,Kejiang Chen,Wenxuan Wang,Tianxin Zhang,Sifang Wan,Kecheng Wang,Haitian Li,Xu Wang,Lefan Cheng,Youdan Yang,Baocheng Chen,Ziyu Liu,Yufei Sun,Liyan Wu,Wenya Wen,Xingchi Gu,Peiru Yang

类目:Computation and Language (cs.CL)

关键词:Chinese-language LLM application, LLM application scenarios, continuously updated safety, specifically for Chinese-language, Chinese-language LLM

备注

点击查看摘要

Abstract:In this work, we propose LiveSecBench, a dynamic and continuously updated safety benchmark specifically for Chinese-language LLM application scenarios. LiveSecBench evaluates models across six critical dimensions (Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety) rooted in the Chinese legal and social frameworks. This benchmark maintains relevance through a dynamic update schedule that incorporates new threat vectors, such as the planned inclusion of Text-to-Image Generation Safety and Agentic Safety in the next update. For now, LiveSecBench (v251030) has evaluated 18 LLMs, providing a landscape of AI safety in the context of Chinese language. The leaderboard is publicly accessible at this https URL.

27. 【2511.02360】CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning

链接https://arxiv.org/abs/2511.02360

作者:Jizheng Ma,Xiaofei Zhou,Yanlong Song,Han Yan

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:numerous thought processes, exist numerous thought, human cognition, verbal expression, exist numerous

备注

点击查看摘要

Abstract:In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language model that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.

28. 【2511.02358】Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation

链接https://arxiv.org/abs/2511.02358

作者:Wongyu Kim,Hochang Lee,Sanghak Lee,Yoonsung Kim,Jaehyun Park

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:find relevant documents, Large Language Model, Query augmentation, Language Model, Query

备注: Accepted to MMGenSR Workshop (CIKM 2025)

点击查看摘要

Abstract:Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduces a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.

29. 【2511.02347】LTD-Bench: Evaluating Large Language Models by Letting Them Draw

链接https://arxiv.org/abs/2511.02347

作者:Liuhao Lin,Ke Li,Zihan Xu,Yuchen Shi,Yulei Qin,Yan Zhang,Xing Sun,Rongrong Ji

类目:Computation and Language (cs.CL)

关键词:Current evaluation paradigms, opaque numerical metrics, Current evaluation, critical blind spot, relying on opaque

备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research--relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concept--a fundamental limitation that undermines their potential as genuine world models. Furthermore, LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.

30. 【2511.02304】Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

链接https://arxiv.org/abs/2511.02304

作者:Beyazit Yalcinkaya,Marcell Vazquez-Chanlatte,Ameesh Shah,Hanna Krasowski,Sanjit A. Seshia

类目:Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)

关键词:temporal objectives, centralized training, study the problem, decentralized execution, learning multi-task

备注

点击查看摘要

Abstract:We study the problem of learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized execution. In this setting, using automata to represent tasks enables the decomposition of complex tasks into simpler sub-tasks that can be assigned to agents. However, existing approaches remain sample-inefficient and are limited to the single-task case. In this work, we present Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning (ACC-MARL), a framework for learning task-conditioned, decentralized team policies. We identify the main challenges to ACC-MARL's feasibility in practice, propose solutions, and prove the correctness of our approach. We further show that the value functions of learned policies can be used to assign tasks optimally at test time. Experiments show emergent task-aware, multi-step coordination among agents, e.g., pressing a button to unlock a door, holding the door, and short-circuiting tasks.

31. 【2511.02303】Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation

链接https://arxiv.org/abs/2511.02303

作者:Zhiwei Zhang,Xiaomin Li,Yudi Lin,Hui Liu,Ramraj Chandradevan,Linlin Wu,Minhua Lin,Fali Wang,Xianfeng Tang,Qi He,Suhang Wang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, achieved strong results, trained with reinforcement

备注

点击查看摘要

Abstract:Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of multi-agent framework for complex reasoning tasks.

32. 【2511.02288】Link prediction Graph Neural Networks for structure recognition of Handwritten Mathematical Expressions

链接https://arxiv.org/abs/2511.02288

作者:Cuong Tuan Nguyen,Ngoc Tuan Nguyen,Triet Hoang Minh Dao,Huy Minh Nhat,Huy Truong Dinh

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Handwritten Mathematical Expression, Mathematical Expression, Handwritten Mathematical, Graph Neural Network, capture spatial dependencies

备注: accepted for ICDAR2025-WML

点击查看摘要

Abstract:We propose a Graph Neural Network (GNN)-based approach for Handwritten Mathematical Expression (HME) recognition by modeling HMEs as graphs, where nodes represent symbols and edges capture spatial dependencies. A deep BLSTM network is used for symbol segmentation, recognition, and spatial relation classification, forming an initial primitive graph. A 2D-CFG parser then generates all possible spatial relations, while the GNN-based link prediction model refines the structure by removing unnecessary connections, ultimately forming the Symbol Label Graph. Experimental results demonstrate the effectiveness of our approach, showing promising performance in HME structure recognition.

33. 【2511.02280】SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning

链接https://arxiv.org/abs/2511.02280

作者:Fangxun Shu,Yongjie Ye,Yue Liao,Zijian Kang,Weijie Yin,Jiacong Wang,Xiao Liang,Shuicheng Yan,Chao Feng

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:multimodal large language, large language models, reinforcement learning, large language, introduce SAIL-RL

备注

点击查看摘要

Abstract:We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at this https URL.

34. 【2511.02246】Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results

链接https://arxiv.org/abs/2511.02246

作者:Jonathan Liu,Haoling Qiu,Jonathan Lasko,Damianos Karakos,Mahsa Yarmohammadi,Mark Dredze

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

关键词:Recent research, research has shown, biases are prevalent, prevalent in everyday, everyday use-cases

备注

点击查看摘要

Abstract:Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation use multiple LLMs as evaluators in order to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency. Our code and dataset are available here: this https URL.

35. 【2511.02234】An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM

链接https://arxiv.org/abs/2511.02234

作者:Jiawei Liu,Enis Berk Çoban,Zarina Schevchenko,Hao Tang,Zhigang Zhu,Michael I Mandel,Johanna Devaney

类目:Multimedia (cs.MM); Computation and Language (cs.CL); Sound (cs.SD)

关键词:Multi-modal Large Language, Large Language Models, Multi-modal Large, involves concatenating non-textual, concatenating non-textual information

备注

点击查看摘要

Abstract:Standard training for Multi-modal Large Language Models (MLLMs) involves concatenating non-textual information, like vision or audio, with a text prompt. This approach may not encourage deep integration of modalities, limiting the model's ability to leverage the core language model's reasoning capabilities. This work examined the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt. Using the Listen, Think, and Understand (LTU) model as a testbed, we conduct an experiment using the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our newly created reasoning benchmark for audio-based semantic reasoning focusing on synonym and hypernym recognition. Our findings show that while even zero-shot interleaved prompting improves performance on our reasoning tasks, a small amount of fine-tuning using interleaved training prompts improves the results further, however, at the expense of the MLLM's audio labeling ability.

36. 【2511.02213】IG-Pruning: Input-Guided Block Pruning for Large Language Models

链接https://arxiv.org/abs/2511.02213

作者:Kangyu Qiao,Shaolei Zhang,Yang Feng

类目:Computation and Language (cs.CL)

关键词:large language models, growing computational demands, large language, language models, increasingly critical

备注: Accepted to EMNLP 2025. Code is available at [this https URL](https://github.com/ictnlp/IG-Pruning)

点击查看摘要

Abstract:With the growing computational demands of large language models (LLMs), efficient inference has become increasingly critical for practical deployment. Depth pruning has emerged as a promising approach for reducing the computational costs of large language models by removing transformer layers. However, existing methods typically rely on fixed block masks, which can lead to suboptimal performance across different tasks and inputs. In this paper, we propose IG-Pruning, a novel input-aware block-wise pruning method that dynamically selects layer masks at inference time. Our approach consists of two stages: (1) Discovering diverse mask candidates through semantic clustering and L0 optimization, and (2) Implementing efficient dynamic pruning without the need for extensive training. Experimental results demonstrate that our method consistently outperforms state-of-the-art static depth pruning methods, making it particularly suitable for resource-constrained deployment scenarios.

37. 【2511.02208】raining Proactive and Personalized LLM Agents

链接https://arxiv.org/abs/2511.02208

作者:Weiwei Sun,Xuhui Zhou,Weihua Du,Xingyao Wang,Sean Welleck,Graham Neubig,Maarten Sap,Yiming Yang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:user preferences, existing work focuses, work focuses primarily, diverse user preferences, real-world agents require

备注

点击查看摘要

Abstract:While existing work focuses primarily on task success, we argue that effective real-world agents require optimizing three dimensions: productivity (task completion), proactivity (asking essential questions), and personalization (adapting to diverse user preferences). We introduce UserVille, an interactive environment with LLM-based user simulators enabling diverse, configurable user preferences. Leveraging UserVille, we introduce PPP, a multi-objective reinforcement learning approach that jointly optimizes all three dimensions: Productivity, Proactivity, and Personalization. Experiments on software engineering and deep research tasks show that agents trained with PPP achieve substantial improvements over strong baselines such as GPT-5 (+21.6 on average), demonstrating the ability to ask strategic clarifying questions, adapt to unseen user preferences, and improve task success through better interaction. This work demonstrates that explicitly optimizing for user-centered interaction is critical for building practical and effective AI agents.

38. 【2511.02194】Personalized Decision Modeling: Utility Optimization or Textualized-Symbolic Reasoning

链接https://arxiv.org/abs/2511.02194

作者:Yibo Zhao,Yang Zhao,Hongru Du,Hao Frank Yang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:population optimal predictions, individual decision-making process, high-stakes scenarios, diverge from population, Adaptive Textual-symbolic Human-centric

备注

点击查看摘要

Abstract:Decision-making models for individuals, particularly in high-stakes scenarios like vaccine uptake, often diverge from population optimal predictions. This gap arises from the uniqueness of the individual decision-making process, shaped by numerical attributes (e.g., cost, time) and linguistic influences (e.g., personal preferences and constraints). Developing upon Utility Theory and leveraging the textual-reasoning capabilities of Large Language Models (LLMs), this paper proposes an Adaptive Textual-symbolic Human-centric Reasoning framework (ATHENA) to address the optimal information integration. ATHENA uniquely integrates two stages: First, it discovers robust, group-level symbolic utility functions via LLM-augmented symbolic discovery; Second, it implements individual-level semantic adaptation, creating personalized semantic templates guided by the optimal utility to model personalized choices. Validated on real-world travel mode and vaccine choice tasks, ATHENA consistently outperforms utility-based, machine learning, and other LLM-based models, lifting F1 score by at least 6.5% over the strongest cutting-edge models. Further, ablation studies confirm that both stages of ATHENA are critical and complementary, as removing either clearly degrades overall predictive performance. By organically integrating symbolic utility modeling and semantic adaptation, ATHENA provides a new scheme for modeling human-centric decisions. The project page can be found at this https URL.

39. 【2511.02135】Rethinking LLM Human Simulation: When a Graph is What You Need

链接https://arxiv.org/abs/2511.02135

作者:Joseph Suh,Suhong Moon,Serina Chang

类目:Computation and Language (cs.CL)

关键词:applications ranging, ranging from survey, Large language models, domain-grounded models suffice, simulation

备注: Code: [this https URL](https://github.com/schang-lab/gems)

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to simulate humans, with applications ranging from survey prediction to decision-making. However, are LLMs strictly necessary, or can smaller, domain-grounded models suffice? We identify a large class of simulation problems in which individuals make choices among discrete options, where a graph neural network (GNN) can match or surpass strong LLM baselines despite being three orders of magnitude smaller. We introduce Graph-basEd Models for human Simulation (GEMS), which casts discrete choice simulation tasks as a link prediction problem on graphs, leveraging relational knowledge while incorporating language representations only when needed. Evaluations across three key settings on three simulation datasets show that GEMS achieves comparable or better accuracy than LLMs, with far greater efficiency, interpretability, and transparency, highlighting the promise of graph-based modeling as a lightweight alternative to LLMs for human simulation. Our code is available at this https URL.

40. 【2511.02119】InsurAgent: A Large Language Model-Empowered Agent for Simulating Individual Behavior in Purchasing Flood Insurance

链接https://arxiv.org/abs/2511.02119

作者:Ziheng Geng,Jiachen Liu,Ran Cao,Lu Cheng,Dan M. Frangopol,Minghui Cheng

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:mitigate disaster-related losses, United States remain, Flood insurance, disaster-related losses, effective strategy

备注

点击查看摘要

Abstract:Flood insurance is an effective strategy for individuals to mitigate disaster-related losses. However, participation rates among at-risk populations in the United States remain strikingly low. This gap underscores the need to understand and model the behavioral mechanisms underlying insurance decisions. Large language models (LLMs) have recently exhibited human-like intelligence across wide-ranging tasks, offering promising tools for simulating human decision-making. This study constructs a benchmark dataset to capture insurance purchase probabilities across factors. Using this dataset, the capacity of LLMs is evaluated: while LLMs exhibit a qualitative understanding of factors, they fall short in estimating quantitative probabilities. To address this limitation, InsurAgent, an LLM-empowered agent comprising five modules including perception, retrieval, reasoning, action, and memory, is proposed. The retrieval module leverages retrieval-augmented generation (RAG) to ground decisions in empirical survey data, achieving accurate estimation of marginal and bivariate probabilities. The reasoning module leverages LLM common sense to extrapolate beyond survey data, capturing contextual information that is intractable for traditional models. The memory module supports the simulation of temporal decision evolutions, illustrated through a roller coaster life trajectory. Overall, InsurAgent provides a valuable tool for behavioral modeling and policy analysis.

41. 【2511.02109】Deep Value Benchmark: Measuring Whether Models Generalize Deep values or Shallow Preferences

链接https://arxiv.org/abs/2511.02109

作者:Joshua Ashkinaze,Hua Shen,Sai Avula,Eric Gilbert,Ceren Budak

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:learn fundamental human, Deep Value Benchmark, learn fundamental, evaluation framework, framework that directly

备注: NeurIPS 2025 (Spotlight)

点击查看摘要

Abstract:We introduce the Deep Value Benchmark (DVB), an evaluation framework that directly tests whether large language models (LLMs) learn fundamental human values or merely surface-level preferences. This distinction is critical for AI alignment: Systems that capture deeper values are likely to generalize human intentions robustly, while those that capture only superficial patterns in preference data risk producing misaligned behavior. The DVB uses a novel experimental design with controlled confounding between deep values (e.g., moral principles) and shallow features (e.g., superficial attributes). In the training phase, we expose LLMs to human preference data with deliberately correlated deep and shallow features -- for instance, where a user consistently prefers (non-maleficence, formal language) options over (justice, informal language) alternatives. The testing phase then breaks these correlations, presenting choices between (justice, formal language) and (non-maleficence, informal language) options. This design allows us to precisely measure a model's Deep Value Generalization Rate (DVGR) -- the probability of generalizing based on the underlying value rather than the shallow feature. Across 9 different models, the average DVGR is just 0.30. All models generalize deep values less than chance. Larger models have a (slightly) lower DVGR than smaller models. We are releasing our dataset, which was subject to three separate human validation experiments. DVB provides an interpretable measure of a core feature of alignment.

42. 【2511.02089】LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS

链接https://arxiv.org/abs/2511.02089

作者:Stefan F. Schouten,Peter Bloem

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:represent binary features, large language models, language models represent, models represent binary, Contrast-Consistent Search

备注: Accepted to the Mechanistic Interpretability Workshop at NeurIPS 2025

点击查看摘要

Abstract:Contrast-Consistent Search (CCS) is an unsupervised probing method able to test whether large language models represent binary features, such as sentence truth, in their internal activations. While CCS has shown promise, its two-term objective has been only partially understood. In this work, we revisit CCS with the aim of clarifying its mechanisms and extending its applicability. We argue that what should be optimized for, is relative contrast consistency. Building on this insight, we reformulate CCS as an eigenproblem, yielding closed-form solutions with interpretable eigenvalues and natural extensions to multiple variables. We evaluate these approaches across a range of datasets, finding that they recover similar performance to CCS, while avoiding problems around sensitivity to random initialization. Our results suggest that relativizing contrast consistency not only improves our understanding of CCS but also opens pathways for broader probing and mechanistic interpretability methods.

43. 【2511.02044】Regularization Through Reasoning: Systematic Improvements in Language Model Classification via Explanation-Enhanced Fine-Tuning

链接https://arxiv.org/abs/2511.02044

作者:Vivswan Shah,Randy Cogill,Hanwei Yue,Gopinath Chennupati,Rinat Khaziev

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:typically maps inputs, maps inputs directly, classification typically maps, typically maps, maps inputs

备注

点击查看摘要

Abstract:Fine-tuning LLMs for classification typically maps inputs directly to labels. We ask whether attaching brief explanations to each label during fine-tuning yields better models. We evaluate conversational response quality along three axes: naturalness, comprehensiveness, and on-topic adherence, each rated on 5-point scales. Using ensemble-generated data from multiple LLMs, we fine-tune a 7B-parameter model and test across six diverse conversational datasets. Across 18 dataset, task settings, label-plus-explanation training outperforms label-only baselines. A central and unexpected result concerns random tokens. We replace human-written explanations with text that is syntactically incoherent yet vocabulary-aligned with the originals (e.g., shuffled or bag-of-words variants). Despite lacking semantics, these pseudo-explanations still improve accuracy over label-only training and often narrow much of the gap to true explanations. The effect persists across datasets and training seeds, indicating that gains arise less from meaning than from structure: the extra token budget encourages richer intermediate computation and acts as a regularizer that reduces over-confident shortcuts. Internal analyses support this view: explanation-augmented models exhibit higher activation entropy in intermediate layers alongside sharper predictive mass at the output layer, consistent with increased deliberation before decision. Overall, explanation-augmented fine-tuning, whether with genuine rationales or carefully constructed random token sequences, improves accuracy and reliability for LLM classification while clarifying how token-level scaffolding shapes computation during inference.

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as:
arXiv:2511.02044 [cs.LG]

(or
arXiv:2511.02044v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2511.02044

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
44. 【2511.02017】apOut: A Bandit-Based Approach to Dynamic Speculative Decoding

链接https://arxiv.org/abs/2511.02017

作者:Aditya Sridhar,Nish Sinnadurai,Sean Lie,Vithursan Thangarasa

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:decoding accelerates LLMs, generate tokens autoregressively, Speculative decoding accelerates, larger target model, accelerates LLMs

备注: 9 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach's effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.

45. 【2511.01892】Retrieval-Augmented Multimodal Depression Detection

链接https://arxiv.org/abs/2511.01892

作者:Ruibo Hou,Shiyu Teng,Jiaqing Liu,Shurong Chai,Yinhao Li,Lanfen Lin,Yen-Wei Chen

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Multimodal deep learning, Multimodal deep, video signals, shown promise, promise in depression

备注: Accepted in IEEE EMBC 2025

点击查看摘要

Abstract:Multimodal deep learning has shown promise in depression detection by integrating text, audio, and video signals. Recent work leverages sentiment analysis to enhance emotional understanding, yet suffers from high computational cost, domain mismatch, and static knowledge limitations. To address these issues, we propose a novel Retrieval-Augmented Generation (RAG) framework. Given a depression-related text, our method retrieves semantically relevant emotional content from a sentiment dataset and uses a Large Language Model (LLM) to generate an Emotion Prompt as an auxiliary modality. This prompt enriches emotional representation and improves interpretability. Experiments on the AVEC 2019 dataset show our approach achieves state-of-the-art performance with CCC of 0.593 and MAE of 3.95, surpassing previous transfer learning and multi-task learning baselines.

46. 【2511.01891】Multi-Personality Generation of LLMs at Decoding-time

链接https://arxiv.org/abs/2511.01891

作者:Rongxin Chen,Yunfan Li,Yige Yuan,Bingbing Xu,Huawei Shen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:enabling simultaneous embodiment, multiple personalization attributes, enabling simultaneous, personalization attributes, fundamental challenge

备注: WSDM2026

点击查看摘要

Abstract:Multi-personality generation for LLMs, enabling simultaneous embodiment of multiple personalization attributes, is a fundamental challenge. Existing retraining-based approaches are costly and poorly scalable, while decoding-time methods often rely on external models or heuristics, limiting flexibility and robustness. In this paper, we propose a novel Multi-Personality Generation (MPG) framework under the decoding-time combination paradigm. It flexibly controls multi-personality without relying on scarce multi-dimensional models or extra training, leveraging implicit density ratios in single-dimensional models as a "free lunch" to reformulate the task as sampling from a target strategy aggregating these ratios. To implement MPG efficiently, we design Speculative Chunk-level based Rejection sampling (SCR), which generates responses in chunks and parallelly validates them via estimated thresholds within a sliding window. This significantly reduces computational overhead while maintaining high-quality generation. Experiments on MBTI personality and Role-Playing demonstrate the effectiveness of MPG, showing improvements up to 16%-18%. Code and data are available at this https URL .

47. 【2511.01884】CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization

链接https://arxiv.org/abs/2511.01884

作者:Zijian Zhang,Rong Wang,Shiyang Li,Yuebo Luo,Mingyi Hong,Caiwen Ding

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)

关键词:large-scale LLM training, Developing efficient CUDA, efficient CUDA kernels, LLM training, large-scale LLM

备注

点击查看摘要

Abstract:Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code generation. Existing methods for automatic kernel generation, however, often produce low-efficiency kernels, incur high computational overhead, and fail to generalize across settings. In this work, we propose CudaForge, a training-free multi-agent workflow for CUDA kernel generation and optimization. Our workflow is inspired by the iterative workflow of human experts, which contains steps such as developing initial kernels, testing correctness, analyzing hardware feedback, and iterative improvement. More specifically, CudaForge employs two LLM agents: a Coder and a Judge, that iteratively generate, correct, and optimize CUDA kernels, while integrating hardware feedback such as Nsight Compute (NCU) metrics. In extensive evaluations, we show that CudaForge, by leveraging base models like OpenAI-o3, achieves 97.6\% correctness of generated kernels and an average 1.68$\times$ speedup over PyTorch baselines, substantially surpassing state-of-the-art models including OpenAI-o3 and Kevin on KernelBench. Beyond accuracy and speed, CudaForge demonstrates strong generalization across GPUs (A100, RTX 6000, 4090, 3090) and base models (OpenAI-o3, GPT-5, gpt-oss-120B, Claude-Sonnet-4, QwQ-32B), while maintaining high efficiency. In particular, generating an optimized kernel takes about 26.5 minutes on one RTX6000 and incurs about \$ 0.3 API cost, which is significantly cheaper than existing agentic work that costs 6 H100 hours and \$ 5 API cost per kernel. Our results highlight that multi-agent, training-free workflows can enable cost-effective, generalizable, and high-performance CUDA kernel optimization. Code available at this https URL

48. 【2404.13765】SciDaSynth: Interactive Structured Data Extraction from Scientific Literature with Large Language Model

链接https://arxiv.org/abs/2404.13765

作者:Xingbo Wang,Samantha L. Huey,Rui Sheng,Saurabh Mehta,Fei Wang

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

关键词:advancing scientific knowledge, supporting evidence-based decision-making, scientific literature, advancing scientific, scientific knowledge

备注: Preprint version of the paper accepted to Campbell Systematic Reviews. Code is available at [this https URL](https://github.com/xingbow/SciDaEx)

点击查看摘要

Abstract:The explosion of scientific literature has made the efficient and accurate extraction of structured data a critical component for advancing scientific knowledge and supporting evidence-based decision-making. However, existing tools often struggle to extract and structure multimodal, varied, and inconsistent information across documents into standardized formats. We introduce SciDaSynth, a novel interactive system powered by large language models (LLMs) that automatically generates structured data tables according to users' queries by integrating information from diverse sources, including text, tables, and figures. Furthermore, SciDaSynth supports efficient table data validation and refinement, featuring multi-faceted visual summaries and semantic grouping capabilities to resolve cross-document data inconsistencies. A within-subjects study with nutrition and NLP researchers demonstrates SciDaSynth's effectiveness in producing high-quality structured data more efficiently than baseline methods. We discuss design implications for human-AI collaborative systems supporting data extraction tasks. The system code is available at this https URL

49. 【2511.02069】Complete asymptotic type-token relationship for growing complex systems with inverse power-law count rankings

链接https://arxiv.org/abs/2511.02069

作者:Pablo Rosillo-Rodes,Laurent Hébert-Dufresne,Peter Sheridan Dodds

类目:Physics and Society (physics.soc-ph); Computation and Language (cs.CL)

关键词:exhibit statistical regularities, statistical regularities involving, regularities involving power-law, involving power-law relationships, growth dynamics

备注: 5 pages, 2 figures

点击查看摘要

Abstract:The growth dynamics of complex systems often exhibit statistical regularities involving power-law relationships. For real finite complex systems formed by countable tokens (animals, words) as instances of distinct types (species, dictionary entries), an inverse power-law scaling $S \sim r^{-\alpha}$ between type count $S$ and type rank $r$, widely known as Zipf's law, is widely observed to varying degrees of fidelity. A secondary, summary relationship is Heaps' law, which states that the number of types scales sublinearly with the total number of observed tokens present in a growing system. Here, we propose an idealized model of a growing system that (1) deterministically produces arbitrary inverse power-law count rankings for types, and (2) allows us to determine the exact asymptotics of the type-token relationship. Our argument improves upon and remedies earlier work. We obtain a unified asymptotic expression for all values of $\alpha$, which corrects the special cases of $\alpha = 1$ and $\alpha \gg 1$. Our approach relies solely on the form of count rankings, avoids unnecessary approximations, and does not involve any stochastic mechanisms or sampling processes. We thereby demonstrate that a general type-token relationship arises solely as a consequence of Zipf's law.

信息检索

1. 【2511.02770】Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval

链接https://arxiv.org/abs/2511.02770

作者:Hung-Ting Chen,Xiang Liu,Shauli Ravfogel,Eunsol Choi

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:emph, query, relevant documents, query vectors, text retrievers generate

备注

点击查看摘要

Abstract:Most text retrievers generate \emph{one} query vector to retrieve relevant documents. Yet, the conditional distribution of relevant documents for the query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers. All retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, \emph{A}utoregressive \emph{M}ulti-\emph{E}mbedding \emph{R}etriever (AMER). Our model autoregressively generates multiple query vectors, and all the predicted query vectors are used to retrieve documents from the corpus. We show that on the synthetic vectorized data, the proposed method could capture multiple target distributions perfectly, showing 4x better performance than single embedding model. We also fine-tune our model on real-world multi-answer retrieval datasets and evaluate in-domain. AMER presents 4 and 21\% relative gains over single-embedding baselines on two datasets we evaluate on. Furthermore, we consistently observe larger gains on the subset of dataset where the embeddings of the target documents are less similar to each other. We demonstrate the potential of using a multi-query vector retriever and open up a new direction for future work.

2. 【2511.02711】Relational Deep Dive: Error-Aware Queries Over Unstructured Data

链接https://arxiv.org/abs/2511.02711

作者:Daren Chao,Kaiwen Chen,Naiqing Guan,Nick Koudas

类目:Databases (cs.DB); Information Retrieval (cs.IR)

关键词:demand structured representations, significant extraction challenge, queries demand structured, structured representations, creating a significant

备注

点击查看摘要

Abstract:Unstructured data is pervasive, but analytical queries demand structured representations, creating a significant extraction challenge. Existing methods like RAG lack schema awareness and struggle with cross-document alignment, leading to high error rates. We propose ReDD (Relational Deep Dive), a framework that dynamically discovers query-specific schemas, populates relational tables, and ensures error-aware extraction with provable guarantees. ReDD features a two-stage pipeline: (1) Iterative Schema Discovery (ISD) identifies minimal, joinable schemas tailored to each query, and (2) Tabular Data Population (TDP) extracts and corrects data using lightweight classifiers trained on LLM hidden states. A main contribution of ReDD is SCAPE, a statistically calibrated method for error detection with coverage guarantees, and SCAPE-HYB, a hybrid approach that optimizes the trade-off between accuracy and human correction costs. Experiments across diverse datasets demonstrate ReDD's effectiveness, reducing data extraction errors from up to 30% to below 1% while maintaining high schema completeness (100% recall) and precision. ReDD's modular design enables fine-grained control over accuracy-cost trade-offs, making it a robust solution for high-stakes analytical queries over unstructured corpora.

3. 【2511.02571】Average Precision at Cutoff k under Random Rankings: Expectation and Variance

链接https://arxiv.org/abs/2511.02571

作者:Tetiana Manzhos,Tetiana Ianevych,Olga Melnyk

类目:Information Retrieval (cs.IR); Probability (math.PR)

关键词:retrieval platforms rely, Recommender systems, information retrieval platforms, engagement and satisfaction, platforms rely

备注: 17 pages, 2 tables, 2 figures

点击查看摘要

Abstract:Recommender systems and information retrieval platforms rely on ranking algorithms to present the most relevant items to users, thereby improving engagement and satisfaction. Assessing the quality of these rankings requires reliable evaluation metrics. Among them, Mean Average Precision at cutoff k (MAP@k) is widely used, as it accounts for both the relevance of items and their positions in the list. In this paper, the expectation and variance of Average Precision at k (AP@k) are derived since they can be used as biselines for MAP@k. Here, we covered two widely used evaluation models: offline and online. The expectation establishes the baseline, indicating the level of MAP@k that can be achieved by pure chance. The variance complements this baseline by quantifying the extent of random fluctuations, enabling a more reliable interpretation of observed scores.

Comments:
17 pages, 2 tables, 2 figures

Subjects:

Information Retrieval (cs.IR); Probability (math.PR)

MSC classes:
Primary 60E05, 60C05, Secondary 62R07, 68T05

Cite as:
arXiv:2511.02571 [cs.IR]

(or
arXiv:2511.02571v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2511.02571

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
4. 【2511.02358】Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation

链接https://arxiv.org/abs/2511.02358

作者:Wongyu Kim,Hochang Lee,Sanghak Lee,Yoonsung Kim,Jaehyun Park

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:find relevant documents, Large Language Model, Query augmentation, Language Model, Query

备注: Accepted to MMGenSR Workshop (CIKM 2025)

点击查看摘要

Abstract:Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduces a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.

5. 【2511.02296】Library and Culture: A Scientometric Analysis and Visualization of Research Trends

链接https://arxiv.org/abs/2511.02296

作者:Auwalu Abdullahi Umar,Muneer Ahmad,Dr M Sadik Batcha

类目:Digital Libraries (cs.DL); Information Retrieval (cs.IR)

关键词:preserving and maintaining, maintaining history, history and traditional, traditional culture, publications

备注: 8 pages, 3 figures, Research Article

点击查看摘要

Abstract:The significance of libraries in preserving and maintaining history and traditional culture cannot be overlooked. It is from this purpose that libraries are to envisage in their programmes cultural activities which must be collected, documented and preserved for posterity. The usefulness of preserved information lies in the fact that the generation to come will be able to establish their identity. This will also assist them with a foundation to build from. This study focus on the growth and development of Library and Culture research in forms of publications reflected in Web of Science database, during the span of 2010-2019. A total 890 publications were found and the highest 124 (13.93%) publications published in this http URL analysis maps comprehensively the parameters of total output, growth of output, authorship, institution wise and country-level collaboration patterns, major contributors (individuals, top publication sources, institutions, and countries). It exposed that the most prolific author is Lo P secured first place by contributing 4 (0.45%) publications, followed by Bressan V 3 (0.34%) publications in Library and Culture literature. Journal of Academic Librarianship produced the highest number of records 29 (3.26%) followed by Australian Library Journal having contributed 21 (2.36%).It is identified the domination of Wuhan University; School Information Management had contributed 6 (0.67%) of total research output. Authors from USA published the highest number of publications with a total of 244 (27.42%), followed by UK and Australia with 118 (13.26%) and 76 (8.54%) publications were produced respectively.

6. 【2511.02275】Research Output on Alopecia Areata Disease: A Scientometric Analysis of Publications from 2010 to 2019

链接https://arxiv.org/abs/2511.02275

作者:Muneer Ahmad,M Sadik Batcha

类目:Digital Libraries (cs.DL); Information Retrieval (cs.IR)

关键词:Alopecia Areata Disease, Areata Disease, Alopecia Areata, Areata Disease researchers, global perspective

备注: 16 pages, 3 figures, Research Paper

点击查看摘要

Abstract:The present study is undertaken to find out the publication trends on Alopecia Areata Disease during 2010-2019 from the global perspective. The study mainly focus on distribution of research output, top journals for publications, most prolific authors, authorship pattern, and citations pattern on Alopecia Areata Disease. The results indicate that highest growth rate of publications occurred during the year 2019. Columbia University topped the scene among all institutes. The maximum publications were more than four authored publications. Christiano AM and Clynes R were found to be the most prolific authors. It is also found that most of the prolific authors (by number of publications) do appear in highly cited publications list. Alopecia Areata Disease researchers mostly preferred using article publications to communicate their findings.

7. 【2511.02181】KGBridge: Knowledge-Guided Prompt Learning for Non-overlapping Cross-Domain Recommendation

链接https://arxiv.org/abs/2511.02181

作者:Yuhan Wang,Qing Xie,Zhifeng Bao,Mengzi Tang,Lin Li,Yongjian Liu

类目:Information Retrieval (cs.IR)

关键词:organize relational information, unified semantic foundation, Knowledge Graphs, structured knowledge bases, bases that organize

备注: 13 pages, 4 figures

点击查看摘要

Abstract:Knowledge Graphs (KGs), as structured knowledge bases that organize relational information across diverse domains, provide a unified semantic foundation for cross-domain recommendation (CDR). By integrating symbolic knowledge with user-item interactions, KGs enrich semantic representations, support reasoning, and enhance model interpretability. Despite this potential, existing KG-based methods still face major challenges in CDR, particularly under non-overlapping user scenarios. These challenges arise from: (C1) sensitivity to KG sparsity and popularity bias, (C2) dependence on overlapping users for domain alignment and (C3) lack of explicit disentanglement between transferable and domain-specific knowledge, which limit effective and stable knowledge transfer. To this end, we propose KGBridge, a knowledge-guided prompt learning framework for cross-domain sequential recommendation under non-overlapping user scenarios. KGBridge comprises two core components: a KG-enhanced Prompt Encoder, which models relation-level semantics as soft prompts to provide structured and dynamic priors for user sequence modeling (addressing C1), and a Two-stage Training Paradigm, which combines cross-domain pretraining and privacy-preserving fine-tuning to enable knowledge transfer without user overlap (addressing C2). By combining relation-aware semantic control with correspondence-driven disentanglement, KGBridge explicitly separates and balances domain-shared and domain-specific semantics, thereby maintaining complementarity and stabilizing adaptation during fine-tuning (addressing C3). Extensive experiments on benchmark datasets demonstrate that KGBridge consistently outperforms state-of-the-art baselines and remains robust under varying KG sparsity, highlighting its effectiveness in mitigating structural imbalance and semantic entanglement in KG-enhanced cross-domain recommendation.

8. 【2511.02113】Enhancing Multimodal Recommendations with Vision-Language Models and Information-Aware Fusion

链接https://arxiv.org/abs/2511.02113

作者:Hai-Dang Kieu,Min Xu,Thanh Trung Huynh,Dung D. Le

类目:Information Retrieval (cs.IR)

关键词:incorporating rich content, rich content sources, gains representation quality, significant gains representation, shown that incorporating

备注

点击查看摘要

Abstract:Recent advances in multimodal recommendation (MMR) have shown that incorporating rich content sources such as images and text can lead to significant gains representation quality. However, existing methods often rely on coarse visual features and uncontrolled fusion, leading to redundant or misaligned representations. As a result, visual encoders often fail to capture salient, item-relevant semantics, limiting their contribution in multimodal fusion. From an information-theoretic perspective, effective fusion should balance the unique, shared, and redundant information across modalities, preserving complementary cues while avoiding correlation bias. This paper presents VLIF, a vision-language and information-theoretic fusion framework that enhances multimodal recommendation through two key components. (i) A VLM-based visual enrichment module generates fine-grained, title-guided descriptions to transform product images into semantically aligned representations. (ii) An information-aware fusion module, inspired by Partial Information Decomposition (PID), disentangles redundant and synergistic signals across modalities for controlled integration. Experiments on three Amazon datasets demonstrate that VLIF consistently outperforms recent multimodal baselines and substantially strengthens the contribution of visual features.

9. 【2511.02052】Solving cold start in news recommendations: a RippleNet-based system for large scale media outlet

链接https://arxiv.org/abs/2511.02052

作者:Karol Radziszewski,Michał Szpunar,Piotr Ociepka,Mateusz Buczyński

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Poland largest online, online media platforms, largest online media, scalable recommender system, recommender system implementation

备注

点击查看摘要

Abstract:We present a scalable recommender system implementation based on RippleNet, tailored for the media domain with a production deployment in this http URL, one of Poland's largest online media platforms. Our solution addresses the cold-start problem for newly published content by integrating content-based item embeddings into the knowledge propagation mechanism of RippleNet, enabling effective scoring of previously unseen items. The system architecture leverages Amazon SageMaker for distributed training and inference, and Apache Airflow for orchestrating data pipelines and model retraining workflows. To ensure high-quality training data, we constructed a comprehensive golden dataset consisting of user and item features and a separate interaction table, all enabling flexible extensions and integration of new signals.

10. 【2511.02002】InteracSPARQL: An Interactive System for SPARQL Query Refinement Using Natural Language Explanations

链接https://arxiv.org/abs/2511.02002

作者:Xiangru Jian,Zhengyuan Dong,M. Tamer Özsu

类目:Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:intricate data structures, querying semantic web, semantic web data, understanding intricate data, data structures

备注: Working paper

点击查看摘要

Abstract:In recent years, querying semantic web data using SPARQL has remained challenging, especially for non-expert users, due to the language's complex syntax and the prerequisite of understanding intricate data structures. To address these challenges, we propose InteracSPARQL, an interactive SPARQL query generation and refinement system that leverages natural language explanations (NLEs) to enhance user comprehension and facilitate iterative query refinement. InteracSPARQL integrates LLMs with a rule-based approach to first produce structured explanations directly from SPARQL abstract syntax trees (ASTs), followed by LLM-based linguistic refinements. Users can interactively refine queries through direct feedback or LLM-driven self-refinement, enabling the correction of ambiguous or incorrect query components in real time. We evaluate InteracSPARQL on standard benchmarks, demonstrating significant improvements in query accuracy, explanation clarity, and overall user satisfaction compared to baseline approaches. Our experiments further highlight the effectiveness of combining rule-based methods with LLM-driven refinements to create more accessible and robust SPARQL interfaces.

计算机视觉

1. 【2511.02832】WIST2: Scalable, Portable, and Holistic Humanoid Data Collection System

链接https://arxiv.org/abs/2511.02832

作者:Yanjie Ze,Siheng Zhao,Weizhuo Wang,Angjoo Kanazawa,Rocky Duan,Pieter Abbeel,Guanya Shi,Jiajun Wu,C. Karen Liu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Large-scale data, language models, driven breakthroughs, Large-scale, models

备注: Website: [this https URL](https://yanjieze.com/TWIST2)

点击查看摘要

Abstract:Large-scale data has driven breakthroughs in robotics, from language models to vision-language-action models in bimanual manipulation. However, humanoid robotics lacks equally effective data collection frameworks. Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid teleoperation and data collection system that preserves full whole-body control while advancing scalability. Our system leverages PICO4U VR for obtaining real-time whole-body human motions, with a custom 2-DoF robot neck (cost around $250) for egocentric vision, enabling holistic human-to-humanoid control. We demonstrate long-horizon dexterous and mobile humanoid skills and we can collect 100 demonstrations in 15 minutes with an almost 100% success rate. Building on this pipeline, we propose a hierarchical visuomotor policy framework that autonomously controls the full humanoid body based on egocentric vision. Our visuomotor policy successfully demonstrates whole-body dexterous manipulation and dynamic kicking tasks. The entire system is fully reproducible and open-sourced at this https URL . Our collected dataset is also open-sourced at this https URL .

2. 【2511.02830】Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks

链接https://arxiv.org/abs/2511.02830

作者:Dmitrii Pozdeev,Alexey Artemov,Ananta R. Bhattarai,Artem Sevastopolsky

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enabling high-quality dense, high-quality dense correspondences, Vision Transformer network, human head images, propose DenseMarks

备注: Project page: [this https URL](https://diddone.github.io/densemarks/) .Video: [this https URL](https://youtu.be/o8DOOYFW0gI) .21 pages, 13 figures, 2 tables

点击查看摘要

Abstract:We propose DenseMarks - a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.

3. 【2511.02826】PLUTO-4: Frontier Pathology Foundation Models

链接https://arxiv.org/abs/2511.02826

作者:Harshith Padigela,Shima Nofallah,Atchuth Naveen Chilaparasetti,Ryun Han,Andrew Walker,Judy Shen,Chintan Shah,Blake Martin,Aashish Sood,Elliot Miller,Ben Glass,Andy Beck,Harsha Pokkalla,Syed Ashar Javed

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrated strong transfer, strong transfer capabilities, large-scale pathology image, pathology image corpora, Foundation models trained

备注

点击查看摘要

Abstract:Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including patch-level classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4's potential to transform real-world applications as a backbone for translational research and diagnostic use cases.

4. 【2511.02791】AI-Generated Image Detection: An Empirical Study and Future Research Directions

链接https://arxiv.org/abs/2511.02791

作者:Nusrat Tasnim,Kutub Uddin,Khalid Mahmood Malik

类目:Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)

关键词:raising significant challenges, biometric system resulting, social engineering attacks, raising significant, significant challenges

备注

点击查看摘要

Abstract:The threats posed by AI-generated media, particularly deepfakes, are now raising significant challenges for multimedia forensics, misinformation detection, and biometric system resulting in erosion of public trust in the legal system, significant increase in frauds, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods (scratch, frozen, and fine-tuned) and seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We also further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.

5. 【2511.02779】When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

链接https://arxiv.org/abs/2511.02779

作者:Yiyang Zhou,Haoqin Tu,Zijun Wang,Zeyu Wang,Niklas Muennighoff,Fan Nie,Yejin Choi,James Zou,Chaorui Deng,Shen Yan,Haoqi Fan,Cihang Xie,Huaxiu Yao,Qinghao Ye

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:designed to evaluate, scenarios where generating, MIRA, intermediate visual images, generating intermediate visual

备注: 28 pages, 15 figures

点击查看摘要

Abstract:We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To solve this, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high-quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.

6. 【2511.02778】VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

链接https://arxiv.org/abs/2511.02778

作者:Kevin Qinghong Lin,Yuhao Zheng,Hangyu Ran,Dantong Zhu,Dongxing Mao,Linjie Li,Philip Torr,Alex Jinpeng Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:agent era, SVG code, precise and executable, executable medium, Code

备注: Project page: [this https URL](https://csu-jpg.github.io/VCode) Github: [this https URL](https://github.com/CSU-JPG/VCode)

点击查看摘要

Abstract:Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at this https URL.

7. 【2511.02777】PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction Editing

链接https://arxiv.org/abs/2511.02777

作者:Antonio Oroz,Matthias Nießner,Tobias Kirschstein

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:severe view occlusions, inherently challenging due, https URL Video, present PercHead, method for single-image

备注: Project Page: [this https URL](https://antoniooroz.github.io/PercHead/) Video: [this https URL](https://www.youtube.com/watch?v=4hFybgTk4kE)

点击查看摘要

Abstract:We present PercHead, a method for single-image 3D head reconstruction and semantic 3D editing - two tasks that are inherently challenging due to severe view occlusions, weak perceptual supervision, and the ambiguity of editing in 3D space. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. The model employs a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention. Rendering is performed using Gaussian Splatting. At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles compared to established baselines. Furthermore, this base model can be seamlessly extended for semantic 3D editing by swapping the encoder and finetuning the network. In this variant, we disentangle geometry and style through two distinct input modalities: a segmentation map to control geometry and either a text prompt or a reference image to specify appearance. We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI, where users can effortlessly sculpt geometry by drawing segmentation maps and stylize appearance via natural language or image prompts. Project Page: this https URL Video: this https URL

Comments:
Project Page: this https URL Video: this https URL

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2511.02777 [cs.CV]

(or
arXiv:2511.02777v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2511.02777

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
8. 【2511.02767】Dynamic Reflections: Probing Video Representations with Text Alignment

链接https://arxiv.org/abs/2511.02767

作者:Tyler Zhu,Tengda Han,Leonidas Guibas,Viorica Pătrăucean,Maks Ovsjanikov

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:diverse data types, modalities has recently, recently been shown, shown to provide, structural similarities

备注: 21 pages, 12 figures

点击查看摘要

Abstract:The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Secondly, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. Project page can be found at this https URL

9. 【2511.02720】LLEXICORP: End-user Explainability of Convolutional Neural Networks

链接https://arxiv.org/abs/2511.02720

作者:Vojtěch Kůr,Adam Bajger,Adam Kukučka,Marek Hradil,Vít Musil,Tomáš Brázdil

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Convolutional neural networks, underpin many modern, Convolutional neural, Concept relevance propagation, modern computer vision

备注

点击查看摘要

Abstract:Convolutional neural networks (CNNs) underpin many modern computer vision systems. With applications ranging from common to critical areas, a need to explain and understand the model and its decisions (XAI) emerged. Prior works suggest that in the top layers of CNNs, the individual channels can be attributed to classifying human-understandable concepts. Concept relevance propagation (CRP) methods can backtrack predictions to these channels and find images that most activate these channels. However, current CRP workflows are largely manual: experts must inspect activation images to name the discovered concepts and must synthesize verbose explanations from relevance maps, limiting the accessibility of the explanations and their scalability. To address these issues, we introduce Large Language model EXplaIns COncept Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a multimodal large language model. Our approach automatically assigns descriptive names to concept prototypes and generates natural-language explanations that translate quantitative relevance distributions into intuitive narratives. To ensure faithfulness, we craft prompts that teach the language model the semantics of CRP through examples and enforce a separation between naming and explanation tasks. The resulting text can be tailored to different audiences, offering low-level technical descriptions for experts and high-level summaries for non-technical stakeholders. We qualitatively evaluate our method on various images from ImageNet on a VGG16 model. Our findings suggest that integrating concept-based attribution methods with large language models can significantly lower the barrier to interpreting deep neural networks, paving the way for more transparent AI systems.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2511.02720 [cs.CV]

(or
arXiv:2511.02720v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2511.02720

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
10. 【2511.02712】VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

链接https://arxiv.org/abs/2511.02712

作者:Zhicheng Zhang,Weicheng Wang,Yongjie Zhu,Wenyu Qin,Pengfei Wan,Di Zhang,Jufeng Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:video large language, gathered significant attention, large language models, recent studies, driven by advancements

备注: 41 pages, 26 figures

点击查看摘要

Abstract:Understanding and predicting emotion from videos has gathered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce a emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.

11. 【2511.02685】Modality-Transition Representation Learning for Visible-Infrared Person Re-Identification

链接https://arxiv.org/abs/2511.02685

作者:Chao Yuan,Zanwu Liu,Guiwei Zhang,Haoxuan Xu,Yujian Zhao,Guanglin Niu,Bo Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visible-infrared person re-identification, Visible-infrared person, technique could associate, associate the pedestrian, practical scenarios

备注

点击查看摘要

Abstract:Visible-infrared person re-identification (VI-ReID) technique could associate the pedestrian images across visible and infrared modalities in the practical scenarios of background illumination changes. However, a substantial gap inherently exists between these two modalities. Besides, existing methods primarily rely on intermediate representations to align cross-modal features of the same person. The intermediate feature representations are usually create by generating intermediate images (kind of data enhancement), or fusing intermediate features (more parameters, lack of interpretability), and they do not make good use of the intermediate features. Thus, we propose a novel VI-ReID framework via Modality-Transition Representation Learning (MTRL) with a middle generated image as a transmitter from visible to infrared modals, which are fully aligned with the original visible images and similar to the infrared modality. After that, using a modality-transition contrastive loss and a modality-query regularization loss for training, which could align the cross-modal features more effectively. Notably, our proposed framework does not need any additional parameters, which achieves the same inference speed to the backbone while improving its performance on VI-ReID task. Extensive experimental results illustrate that our model significantly and consistently outperforms existing SOTAs on three typical VI-ReID datasets.

12. 【2511.02652】Differentiable Hierarchical Visual Tokenization

链接https://arxiv.org/abs/2511.02652

作者:Marius Aasan,Martine Hjelkrem-Tan,Nico Catalano,Changkyu Choi,Adín Ramírez Rivera

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Transformers rely, Vision Transformers, fixed patch tokens, Transformers rely, rely on fixed

备注: NeurIPS 2025 Spotlight

点击查看摘要

Abstract:Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.

13. 【2511.02650】Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

链接https://arxiv.org/abs/2511.02650

作者:Tianfan Peng,Yuntao Du,Pengzhou Ji,Shijie Dong,Kailin Jiang,Mingchuan Ma,Yijun Tian,Jinhe Bi,Qian Li,Wei Du,Feng Xiao,Lizhen Cui

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:severe inference inefficiency, inference inefficiency due, Large multimodal models, large number, visual tokens introduced

备注

点击查看摘要

Abstract:Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.

14. 【2511.02645】Robust Face Liveness Detection for Biometric Authentication using Single Image

链接https://arxiv.org/abs/2511.02645

作者:Poulami Raha,Yeongnam Chae

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Face Recognition Systems, adopted in security, technologies are widely, widely adopted, Face recognition

备注

点击查看摘要

Abstract:Biometric technologies are widely adopted in security, legal, and financial systems. Face recognition can authenticate a person based on the unique facial features such as shape and texture. However, recent works have demonstrated the vulnerability of Face Recognition Systems (FRS) towards presentation attacks. Using spoofing (aka.,presentation attacks), a malicious actor can get illegitimate access to secure systems. This paper proposes a novel light-weight CNN framework to identify print/display, video and wrap attacks. The proposed robust architecture provides seamless liveness detection ensuring faster biometric authentication (1-2 seconds on CPU). Further, this also presents a newly created 2D spoof attack dataset consisting of more than 500 videos collected from 60 subjects. To validate the effectiveness of this architecture, we provide a demonstration video depicting print/display, video and wrap attack detection approaches. The demo can be viewed in the following link: this https URL

15. 【2511.02607】UniChange: Unifying Change Detection with Multimodal Large Language Model

链接https://arxiv.org/abs/2511.02607

作者:Xu Zhang,Danyang Li,Xiaohang Dong,Tianhao Wu,Hualong Yu,Jianye Wang,Qicheng Li,Xiang Li

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:land cover dynamics, analyzing land cover, cover dynamics, Change detection, monitoring and analyzing

备注

点击查看摘要

Abstract:Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at this https URL.

16. 【2511.02591】Zero-Shot Multi-Animal Tracking in the Wild

链接https://arxiv.org/abs/2511.02591

作者:Jan Frederik Meier,Timo Lüddecke

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:understanding animal ecology, ecology and behavior, crucial for understanding, understanding animal, animal ecology

备注

点击查看摘要

Abstract:Multi-animal tracking is crucial for understanding animal ecology and behavior. However, it remains a challenging task due to variations in habitat, motion patterns, and species appearance. Traditional approaches typically require extensive model fine-tuning and heuristic design for each application scenario. In this work, we explore the potential of recent vision foundation models for zero-shot multi-animal tracking. By combining a Grounding Dino object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics, we develop a tracking framework that can be applied to new datasets without any retraining or hyperparameter adaptation. Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate strong and consistent performance across diverse species and environments. The code is available at this https URL.

17. 【2511.02580】AUE: Training-free Noise Transplant and Cultivation Diffusion Model

链接https://arxiv.org/abs/2511.02580

作者:Daichi Nagai,Ryugo Morita,Shunsuke Kitada,Hitoshi Iyatomi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

关键词:Cultivation Diffusion Model, flattened image remains, Training-free Noise Transplantation, Transplantation and Cultivation, remarkable success

备注: 13 pages, 8 figures, 3 tables. The first two authors contributed equally. Project Page: [this https URL](https://iyatomilab.github.io/TAUE)

点击查看摘要

Abstract:Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.

18. 【2511.02565】A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding

链接https://arxiv.org/abs/2511.02565

作者:Jingyu Lu,Haonan Wang,Qixiang Zhang,Xiaomeng Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:holds great potential, reconstruct continuous visual, continuous visual experiences, subject-specific training, holds great

备注: 9 pages main text with 6 figures (excluding references), supplementary material included

点击查看摘要

Abstract:Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7\% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.

19. 【2511.02564】Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification

链接https://arxiv.org/abs/2511.02564

作者:Md Rashidunnabi,Kailash A. Hambarde,Vasco Lopes,Joao C. Neves,Hugo Proenca

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Video-based person re-identification, extreme viewpoint shifts, Video-based person, aerial-ground surveillance, person re-identification

备注

点击查看摘要

Abstract:Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) that enforces cross-view identity coherence using a contrastive learning paradigm. Despite adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency. The source code is available at this https URL

20. 【2511.02563】he Urban Vision Hackathon Dataset and Models: Towards Image Annotations and Accurate Vision Models for Indian Traffic

链接https://arxiv.org/abs/2511.02563

作者:Akash Sharma,Chinmay Mhatre,Sankalp Gawali,Ruthvik Bokkasam,Brij Kishore,Vishwajeet Pattanaik,Tarun Rambha,Abdul R. Pinjari,Vijay Kovvali,Anirban Chakraborty,Punit Rathore,Raghu Krishnapuram,Yogesh Simmhan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Bengaluru Safe-City CCTV, Light Commercial Vehicles, India, Safe-City CCTV cameras, Majority Voting

备注

点击查看摘要

Abstract:This report describes the UVH-26 dataset, the first public release by AIM@IISc of a large-scale dataset of annotated traffic-camera images from India. The dataset comprises 26,646 high-resolution (1080p) images sampled from 2800 Bengaluru's Safe-City CCTV cameras over a 4-week period, and subsequently annotated through a crowdsourced hackathon involving 565 college students from across India. In total, 1.8 million bounding boxes were labeled across 14 vehicle classes specific to India: Cycle, 2-Wheeler (Motorcycle), 3-Wheeler (Auto-rickshaw), LCV (Light Commercial Vehicles), Van, Tempo-traveller, Hatchback, Sedan, SUV, MUV, Mini-bus, Bus, Truck and Other. Of these, 283k-316k consensus ground truth bounding boxes and labels were derived for distinct objects in the 26k images using Majority Voting and STAPLE algorithms. Further, we train multiple contemporary detectors, including YOLO11-S/X, RT-DETR-S/X, and DAMO-YOLO-T/L using these datasets, and report accuracy based on mAP50, mAP75 and mAP50:95. Models trained on UVH-26 achieve 8.4-31.5% improvements in mAP50:95 over equivalent baseline models trained on COCO dataset, with RT-DETR-X showing the best performance at 0.67 (mAP50:95) as compared to 0.40 for COCO-trained weights for common classes (Car, Bus, and Truck). This demonstrates the benefits of domain-specific training data for Indian traffic scenarios. The release package provides the 26k images with consensus annotations based on Majority Voting (UVH-26-MV) and STAPLE (UVH-26-ST) and the 6 fine-tuned YOLO and DETR models on each of these datasets. By capturing the heterogeneity of Indian urban mobility directly from operational traffic-camera streams, UVH-26 addresses a critical gap in existing global benchmarks, and offers a foundation for advancing detection, classification, and deployment of intelligent transportation systems in emerging nations with complex traffic conditions.

21. 【2511.02560】SigmaCollab: An Application-Driven Dataset for Physically Situated Collaboration

链接https://arxiv.org/abs/2511.02560

作者:Dan Bohus,Sean Andrist,Ann Paradiso,Nick Saw,Tim Schoonbeek,Maia Stiber

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:dataset enabling research, dataset enabling, enabling research, introduce SigmaCollab, set

备注

点击查看摘要

Abstract:We introduce SigmaCollab, a dataset enabling research on physically situated human-AI collaboration. The dataset consists of a set of 85 sessions in which untrained participants were guided by a mixed-reality assistive AI agent in performing procedural tasks in the physical world. SigmaCollab includes a set of rich, multimodal data streams, such as the participant and system audio, egocentric camera views from the head-mounted device, depth maps, head, hand and gaze tracking information, as well as additional annotations performed post-hoc. While the dataset is relatively small in size (~ 14 hours), its application-driven and interactive nature brings to the fore novel research challenges for human-AI collaboration, and provides more realistic testing grounds for various AI models operating in this space. In future work, we plan to use the dataset to construct a set of benchmarks for physically situated collaboration in mixed-reality task assistive scenarios. SigmaCollab is available at this https URL.

22. 【2511.02558】Forecasting Future Anatomies: Longitudianl Brain Mri-to-Mri Prediction

链接https://arxiv.org/abs/2511.02558

作者:Ali Farki,Elaheh Moradi,Deepika Koundal,Jussi Tohka

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

关键词:magnetic resonance image, baseline magnetic resonance, Alzheimer disease, Predicting future brain, studying neurodegenerative diseases

备注

点击查看摘要

Abstract:Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.

23. 【2511.02541】Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data

链接https://arxiv.org/abs/2511.02541

作者:Jessica Plassmann,Nicolas Schuler,Georg von Freymann,Michael Schuth

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:offering high sensitivity, non-destructive testing method, full-field inspection capabilities, detecting subsurface defects, offering high

备注: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18 DECEMBER 2025

点击查看摘要

Abstract:Shearography is a non-destructive testing method for detecting subsurface defects, offering high sensitivity and full-field inspection capabilities. However, its industrial adoption remains limited due to the need for expert interpretation. To reduce reliance on labeled data and manual evaluation, this study explores unsupervised learning methods for automated anomaly detection in shearographic images. Three architectures are evaluated: a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model. All models are trained solely on defect-free data. A controlled dataset was developed using a custom specimen with reproducible defect patterns, enabling systematic acquisition of shearographic measurements under both ideal and realistic deformation conditions. Two training subsets were defined: one containing only undistorted, defect-free samples, and one additionally including globally deformed, yet defect-free, data. The latter simulates practical inspection conditions by incorporating deformation-induced fringe patterns that may obscure localized anomalies. The models are evaluated in terms of binary classification and, for the student-teacher model, spatial defect localization. Results show that the student-teacher approach achieves superior classification robustness and enables precise localization. Compared to the autoencoder-based models, it demonstrates improved separability of feature representations, as visualized through t-SNE embeddings. Additionally, a YOLOv8 model trained on labeled defect data serves as a reference to benchmark localization quality. This study underscores the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments.

24. 【2511.02510】LiteVoxel: Low-memory Intelligent Thresholding for Efficient Voxel Rasterization

链接https://arxiv.org/abs/2511.02510

作者:Jee Won Lee,Jongseong Brad Choi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:optimization-based scene reconstruction, brittle pruning heuristics, underfit low-frequency content, Sparse-voxel rasterization, differentiable alternative

备注

点击查看摘要

Abstract:Sparse-voxel rasterization is a fast, differentiable alternative for optimization-based scene reconstruction, but it tends to underfit low-frequency content, depends on brittle pruning heuristics, and can overgrow in ways that inflate VRAM. We introduce LiteVoxel, a self-tuning training pipeline that makes SV rasterization both steadier and lighter. Our loss is made low-frequency aware via an inverse-Sobel reweighting with a mid-training gamma-ramp, shifting gradient budget to flat regions only after geometry stabilize. Adaptation replaces fixed thresholds with a depth-quantile pruning logic on maximum blending weight, stabilized by EMA-hysteresis guards and refines structure through ray-footprint-based, priority-driven subdivision under an explicit growth budget. Ablations and full-system results across Mip-NeRF 360 (6scenes) and Tanks Temples (3scenes) datasets show mitigation of errors in low-frequency regions and boundary instability while keeping PSNR/SSIM, training time, and FPS comparable to a strong SVRaster pipeline. Crucially, LiteVoxel reduces peak VRAM by ~40%-60% and preserves low-frequency detail that prior setups miss, enabling more predictable, memory-efficient training without sacrificing perceptual quality.

25. 【2511.02507】Keeping it Local, Tiny and Real: Automated Report Generation on Edge Computing Devices for Mechatronic-Based Cognitive Systems

链接https://arxiv.org/abs/2511.02507

作者:Nicolas Schuler,Lea Dewald,Jürgen Graf

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:integrated Artificial Intelligence, Deep Learning enable, Learning enable hardware-based, Artificial Intelligence, enable hardware-based cognitive

备注: 6 pages, 4 figures, 1 table; accepted for MECATRONICS-REM 2025 International Conference, PARIS, FRANCE December 3-5 2025

点击查看摘要

Abstract:Recent advancements in Deep Learning enable hardware-based cognitive systems, that is, mechatronic systems in general and robotics in particular with integrated Artificial Intelligence, to interact with dynamic and unstructured environments. While the results are impressive, the application of such systems to critical tasks like autonomous driving as well as service and care robotics necessitate the evaluation of large amount of heterogeneous data. Automated report generation for Mobile Robotics can play a crucial role in facilitating the evaluation and acceptance of such systems in various domains. In this paper, we propose a pipeline for generating automated reports in natural language utilizing various multi-modal sensors that solely relies on local models capable of being deployed on edge computing devices, thus preserving the privacy of all actors involved and eliminating the need for external services. In particular, we evaluate our implementation on a diverse dataset spanning multiple domains including indoor, outdoor and urban environments, providing quantitative as well as qualitative evaluation results. Various generated example reports and other supplementary materials are available via a public repository.

26. 【2511.02505】ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing

链接https://arxiv.org/abs/2511.02505

作者:Yaosen Chen,Wei Wang,Xuming Wen,Han Yang,Yanru Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:convey information, involving the sequencing, evoke emotions, video editing, crucial step

备注

点击查看摘要

Abstract:Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator's unique artistic expression in shot this http URL address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos. Project page: this https URL

27. 【2511.02503】Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes

链接https://arxiv.org/abs/2511.02503

作者:Robinson Umeike,Neil Getty,Yin Xiangyu,Yi Jiang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:show great potential, Language Models, show great, great potential, automation of workflows

备注

点击查看摘要

Abstract:The automation of workflows in advanced microscopy is a key goal where foundation models like Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.

28. 【2511.02495】DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding

链接https://arxiv.org/abs/2511.02495

作者:Zixuan Liu,Siavash H. Khajavi,Guangkai Jiang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:demonstrated strong performance, Recent advances, demonstrated strong, strong performance, Recent

备注: Advances in Neural Information Processing Systems 2025 (NeurIPS 2025), Poster, [this https URL](https://neurips.cc/virtual/2025/loc/san-diego/poster/121400)

点击查看摘要

Abstract:Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce DetectiumFire, a large-scale, multi-modal dataset comprising of 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community. The dataset is available at this https URL

29. 【2511.02489】Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization

链接https://arxiv.org/abs/2511.02489

作者:Tao Liu,Kan Ren,Qian Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:low-altitude economy, patrol systems, rapid growth, crucial for measurement, measurement and tracking

备注: 20 pages, Submitted to IEEE TIM

点击查看摘要

Abstract:With the rapid growth of the low-altitude economy, UAVs have become crucial for measurement and tracking in patrol systems. However, in GNSS-denied areas, satellite-based localization methods are prone to failure. This paper presents a cross-view UAV localization framework that performs map matching via object detection, aimed at effectively addressing cross-temporal, cross-view, heterogeneous aerial image matching. In typical pipelines, UAV visual localization is formulated as an image-retrieval problem: features are extracted to build a localization map, and the pose of a query image is estimated by matching it to a reference database with known poses. Because publicly available UAV localization datasets are limited, many approaches recast localization as a classification task and rely on scene labels in these datasets to ensure accuracy. Other methods seek to reduce cross-domain differences using polar-coordinate reprojection, perspective transformations, or generative adversarial networks; however, they can suffer from misalignment, content loss, and limited realism. In contrast, we leverage modern object detection to accurately extract salient instances from UAV and satellite images, and integrate a graph neural network to reason about inter-image and intra-image node relationships. Using a fine-grained, graph-based node-similarity metric, our method achieves strong retrieval and localization performance. Extensive experiments on public and real-world datasets show that our approach handles heterogeneous appearance differences effectively and generalizes well, making it applicable to scenarios with larger modality gaps, such as infrared-visible image matching. Our dataset will be publicly available at the following URL: this https URL.

30. 【2511.02483】OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control

链接https://arxiv.org/abs/2511.02483

作者:Xilong Zhou,Jianchun Chen,Pramod Rao,Timo Teufel,Linjie Lyu,Tigran Minasian,Oleksandr Sotnychenko,Xiaoxiao Long,Marc Habermann,Christian Theobalt

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:large-scale dataset comprising, precisely controlled lighting, multiple viewpoints, controlled lighting conditions, dataset comprising

备注

点击查看摘要

Abstract:We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on the synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data. The full dataset, along with all post-processing workflows, will be publicly released at this https URL.

31. 【2511.02473】MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer

链接https://arxiv.org/abs/2511.02473

作者:Taiga Yamane,Satoshi Suzuki,Ryo Masumura,Shotaro Tora

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recognize human actions, multiple camera views, Multi-view action recognition, obstacles or crowds, aims to recognize

备注: Selected as Best Industry Paper Award at ICIP2024

点击查看摘要

Abstract:Multi-view action recognition aims to recognize human actions using multiple camera views and deals with occlusion caused by obstacles or crowds. In this task, cooperation among views, which generates a joint representation by combining multiple views, is vital. Previous studies have explored promising cooperation methods for improving performance. However, since their methods focus only on the task setting of recognizing a single action from an entire video, they are not applicable to the recently popular spatio-temporal action recognition~(STAR) setting, in which each person's action is recognized sequentially. To address this problem, this paper proposes a multi-view action recognition method for the STAR setting, called MVAFormer. In MVAFormer, we introduce a novel transformer-based cooperation module among views. In contrast to previous studies, which utilize embedding vectors with lost spatial information, our module utilizes the feature map for effective cooperation in the STAR setting, which preserves the spatial information. Furthermore, in our module, we divide the self-attention for the same and different views to model the relationship between multiple views effectively. The results of experiments using a newly collected dataset demonstrate that MVAFormer outperforms the comparison baselines by approximately $4.4$ points on the F-measure.

32. 【2511.02468】HAGI++: Head-Assisted Gaze Imputation and Generation

链接https://arxiv.org/abs/2511.02468

作者:Chuhan Jiao,Zhiming Hu,Andreas Bulling

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词:Mobile eye tracking, eye tracking plays, capturing human visual, human visual attention, Mobile eye

备注: Extended version of our UIST'25 paper "HAGI: Head-Assisted Gaze Imputation for Mobile Eye Trackers"

点击查看摘要

Abstract:Mobile eye tracking plays a vital role in capturing human visual attention across both real-world and extended reality (XR) environments, making it an essential tool for applications ranging from behavioural research to human-computer interaction. However, missing values due to blinks, pupil detection errors, or illumination changes pose significant challenges for further gaze data analysis. To address this challenge, we introduce HAGI++ - a multi-modal diffusion-based approach for gaze data imputation that, for the first time, uses the integrated head orientation sensors to exploit the inherent correlation between head and eye movements. HAGI++ employs a transformer-based diffusion model to learn cross-modal dependencies between eye and head representations and can be readily extended to incorporate additional body movements. Extensive evaluations on the large-scale Nymeria, Ego-Exo4D, and HOT3D datasets demonstrate that HAGI++ consistently outperforms conventional interpolation methods and deep learning-based time-series imputation baselines in gaze imputation. Furthermore, statistical analyses confirm that HAGI++ produces gaze velocity distributions that closely match actual human gaze behaviour, ensuring more realistic gaze imputations. Moreover, by incorporating wrist motion captured from commercial wearable devices, HAGI++ surpasses prior methods that rely on full-body motion capture in the extreme case of 100% missing gaze data (pure gaze generation). Our method paves the way for more complete and accurate eye gaze recordings in real-world settings and has significant potential for enhancing gaze-based analysis and interaction across various application domains.

33. 【2511.02462】KAO: Kernel-Adaptive Optimization in Diffusion for Satellite Image

链接https://arxiv.org/abs/2511.02462

作者:Teerapong Panboonyuen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:accurately restoring missing, robust image analysis, Satellite image inpainting, Satellite image, Massachusetts Roads Dataset

备注: 18 pages

点击查看摘要

Abstract:Satellite image inpainting is a crucial task in remote sensing, where accurately restoring missing or occluded regions is essential for robust image analysis. In this paper, we propose KAO, a novel framework that utilizes Kernel-Adaptive Optimization within diffusion models for satellite image inpainting. KAO is specifically designed to address the challenges posed by very high-resolution (VHR) satellite datasets, such as DeepGlobe and the Massachusetts Roads Dataset. Unlike existing methods that rely on preconditioned models requiring extensive retraining or postconditioned models with significant computational overhead, KAO introduces a Latent Space Conditioning approach, optimizing a compact latent space to achieve efficient and accurate inpainting. Furthermore, we incorporate Explicit Propagation into the diffusion process, facilitating forward-backward fusion, which improves the stability and precision of the method. Experimental results demonstrate that KAO sets a new benchmark for VHR satellite image restoration, providing a scalable, high-performance solution that balances the efficiency of preconditioned models with the flexibility of postconditioned models.

34. 【2511.02427】From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics

链接https://arxiv.org/abs/2511.02427

作者:Nicolas Schuler,Lea Dewald,Nick Baldig,Jürgen Graf

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Video Understanding, Commonsense Reasoning, make rational decisions, highly challenging tasks, Visual Language Models

备注: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18 DECEMBER 2025

点击查看摘要

Abstract:Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in its environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: this https URL

35. 【2511.02417】Synthetic Crop-Weed Image Generation and its Impact on Model Generalization

链接https://arxiv.org/abs/2511.02417

作者:Garen Boyadjian(INRAE),Cyrille Pierre(INRAE),Johann Laconte(INRAE, UR TSCF),Riccardo Bertoglio(INRAE)

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Precise semantic segmentation, Precise semantic, agricultural weeding robots, weeding robots, Precise

备注

点击查看摘要

Abstract:Precise semantic segmentation of crops and weeds is necessary for agricultural weeding robots. However, training deep learning models requires large annotated datasets, which are costly to obtain in real fields. Synthetic data can reduce this burden, but the gap between simulated and real images remains a challenge. In this paper, we present a pipeline for procedural generation of synthetic crop-weed images using Blender, producing annotated datasets under diverse conditions of plant growth, weed density, lighting, and camera angle. We benchmark several state-of-the-art segmentation models on synthetic and real datasets and analyze their cross-domain generalization. Our results show that training on synthetic images leads to a sim-to-real gap of 10%, surpassing previous state-of-the-art methods. Moreover, synthetic data demonstrates good generalization properties, outperforming real datasets in cross-domain scenarios. These findings highlight the potential of synthetic agricultural datasets and support hybrid strategies for more efficient model training.

36. 【2511.02415】ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension

链接https://arxiv.org/abs/2511.02415

作者:Duo Xu,Hao Cheng,Xin Lin,Zhen Xie,Hao Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multimodal large language, understanding tasks demand, tasks demand advanced, demand advanced visual, advanced visual recognition

备注: 23 pages, EMNLP25 Accepted

点击查看摘要

Abstract:Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning codes that simulate real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K QA pairs for training, along with 2,871 high-quality evaluation samples for enabling practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.

37. 【2511.02411】IllumFlow: Illumination-Adaptive Low-Light Enhancement via Conditional Rectified Flow and Retinex Decomposition

链接https://arxiv.org/abs/2511.02411

作者:Wenyang Wei,Yang yang,Xixi Jia,Xiangchu Feng,Weiwei Wang,Renzhen Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:conditional Rectified Flow, synergizes conditional Rectified, Retinex theory, Rectified Flow, CRF

备注

点击查看摘要

Abstract:We present IllumFlow, a novel framework that synergizes conditional Rectified Flow (CRF) with Retinex theory for low-light image enhancement (LLIE). Our model addresses low-light enhancement through separate optimization of illumination and reflectance components, effectively handling both lighting variations and noise. Specifically, we first decompose an input image into reflectance and illumination components following Retinex theory. To model the wide dynamic range of illumination variations in low-light images, we propose a conditional rectified flow framework that represents illumination changes as a continuous flow field. While complex noise primarily resides in the reflectance component, we introduce a denoising network, enhanced by flow-derived data augmentation, to remove reflectance noise and chromatic aberration while preserving color fidelity. IllumFlow enables precise illumination adaptation across lighting conditions while naturally supporting customizable brightness enhancement. Extensive experiments on low-light enhancement and exposure correction demonstrate superior quantitative and qualitative performance over existing methods.

38. 【2511.02404】Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs

链接https://arxiv.org/abs/2511.02404

作者:Arya Shah,Vaibhav Tripathi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:ocular anatomy, differ in ocular, Centered Kernel Alignment, Felis Catus, Representational Similarity Analysis

备注

点击查看摘要

Abstract:Cats and humans differ in ocular anatomy. Most notably, Felis Catus (domestic cats) have vertically elongated pupils linked to ambush predation; yet, how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF $\approx0.814$, mean CKA-linear $\approx0.745$, mean RSA $\approx0.698$), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA $\approx0.53$ at block8; ViT-L/16 $\approx0.47$ at block14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. Results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed Transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge. We release our code and dataset for reference and reproducibility.

39. 【2511.02397】A Novel Grouping-Based Hybrid Color Correction Algorithm for Color Point Clouds

链接https://arxiv.org/abs/2511.02397

作者:Kuo-Liang Chung,Ting-Chung Tang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:color point clouds, Color, target points, color point, point clouds

备注

点击查看摘要

Abstract:Color consistency correction for color point clouds is a fundamental yet important task in 3D rendering and compression applications. In the past, most previous color correction methods aimed at correcting color for color images. The purpose of this paper is to propose a grouping-based hybrid color correction algorithm for color point clouds. Our algorithm begins by estimating the overlapping rate between the aligned source and target point clouds, and then adaptively partitions the target points into two groups, namely the close proximity group Gcl and the moderate proximity group Gmod, or three groups, namely Gcl, Gmod, and the distant proximity group Gdist, when the estimated overlapping rate is low or high, respectively. To correct color for target points in Gcl, a K-nearest neighbors based bilateral interpolation (KBI) method is proposed. To correct color for target points in Gmod, a joint KBI and the histogram equalization (JKHE) method is proposed. For target points in Gdist, a histogram equalization (HE) method is proposed for color correction. Finally, we discuss the grouping-effect free property and the ablation study in our algorithm. The desired color consistency correction benefit of our algorithm has been justified through 1086 testing color point cloud pairs against the state-of-the-art methods. The C++ source code of our algorithm can be accessed from the website: this https URL.

40. 【2511.02395】Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds

链接https://arxiv.org/abs/2511.02395

作者:Leon Schwarzer,Matthias Zeller,Daniel Casado Herraez,Simon Dierl,Michael Heidingsfeld,Cyrill Stachniss

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:reliable autonomous mobile, autonomous mobile systems, Moving object segmentation, SLAM or path, object segmentation

备注: Accepted for publication at IEEE International Conference on Intelligent Transportation Systems (ITSC 2025), 8 pages, 3 figures

点击查看摘要

Abstract:Moving object segmentation is a crucial task for safe and reliable autonomous mobile systems like self-driving cars, improving the reliability and robustness of subsequent tasks like SLAM or path planning. While the segmentation of camera or LiDAR data is widely researched and achieves great results, it often introduces an increased latency by requiring the accumulation of temporal sequences to gain the necessary temporal context. Radar sensors overcome this problem with their ability to provide a direct measurement of a point's Doppler velocity, which can be exploited for single-scan moving object segmentation. However, radar point clouds are often sparse and noisy, making data annotation for use in supervised learning very tedious, time-consuming, and cost-intensive. To overcome this problem, we address the task of self-supervised moving object segmentation of sparse and noisy radar point clouds. We follow a two-step approach of contrastive self-supervised representation learning with subsequent supervised fine-tuning using limited amounts of annotated data. We propose a novel clustering-based contrastive loss function with cluster refinement based on dynamic points removal to pretrain the network to produce motion-aware representations of the radar data. Our method improves label efficiency after fine-tuning, effectively boosting state-of-the-art performance by self-supervised pretraining.

41. 【2511.02384】RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning

链接https://arxiv.org/abs/2511.02384

作者:Jiahe Song,Chuang Wang,Bowen Jiang,Yinfan Wang,Hao Zheng,Xingjian Wei,Chengjin Liu,Junyuan Gao,Yubin Wang,Lijun Wu,Jiang Wu,Qian Yu,Conghui He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large-scale chemical reaction, Large-scale chemical, chemical reaction, chemical Reaction Diagram, Reaction Diagram Parsing

备注

点击查看摘要

Abstract:Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.

42. 【2511.02360】CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning

链接https://arxiv.org/abs/2511.02360

作者:Jizheng Ma,Xiaofei Zhou,Yanlong Song,Han Yan

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:numerous thought processes, exist numerous thought, human cognition, verbal expression, exist numerous

备注

点击查看摘要

Abstract:In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language model that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.

43. 【2511.02349】M3PD Dataset: Dual-view Photoplethysmography (PPG) Using Front-and-rear Cameras of Smartphones in Lab and Clinical Settings

链接https://arxiv.org/abs/2511.02349

作者:Jiankai Tang,Tao Zhang,Jia Li,Yiru Zhang,Mingyu Zhang,Kegang Wang,Yuming Hao,Bolin Wang,Haiyang Li,Xingyao Wang,Yuanchun Shi,Yuntao Wang,Sichong Qian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Portable physiological monitoring, require specialized equipment, impose impractical postures, Portable physiological, physiological monitoring

备注

点击查看摘要

Abstract:Portable physiological monitoring is essential for early detection and management of cardiovascular disease, but current methods often require specialized equipment that limits accessibility or impose impractical postures that patients cannot maintain. Video-based photoplethysmography on smartphones offers a convenient noninvasive alternative, yet it still faces reliability challenges caused by motion artifacts, lighting variations, and single-view constraints. Few studies have demonstrated reliable application to cardiovascular patients, and no widely used open datasets exist for cross-device accuracy. To address these limitations, we introduce the M3PD dataset, the first publicly available dual-view mobile photoplethysmography dataset, comprising synchronized facial and fingertip videos captured simultaneously via front and rear smartphone cameras from 60 participants (including 47 cardiovascular patients). Building on this dual-view setting, we further propose F3Mamba, which fuses the facial and fingertip views through Mamba-based temporal modeling. The model reduces heart-rate error by 21.9 to 30.2 percent over existing single-view baselines while improving robustness in challenging real-world scenarios. Data and code: this https URL.

44. 【2511.02335】GAFD-CC: Global-Aware Feature Decoupling with Confidence Calibration for OOD Detection

链接https://arxiv.org/abs/2511.02335

作者:Kun Zou,Yongheng Xu,Jianxing Yu,Yan Pan,Jian Yin,Hanjiang Lai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:OOD detection, OOD, real-world applications, paramount to ensuring, ensuring the reliability

备注

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is paramount to ensuring the reliability and robustness of learning models in real-world applications. Existing post-hoc OOD detection methods detect OOD samples by leveraging their features and logits information without retraining. However, they often overlook the inherent correlation between features and logits, which is crucial for effective OOD detection. To address this limitation, we propose Global-Aware Feature Decoupling with Confidence Calibration (GAFD-CC). GAFD-CC aims to refine decision boundaries and increase discriminative performance. Firstly, it performs global-aware feature decoupling guided by classification weights. This involves aligning features with the direction of global classification weights to decouple them. From this, GAFD-CC extracts two types of critical information: positively correlated features that promote in-distribution (ID)/OOD boundary refinement and negatively correlated features that suppress false positives and tighten these boundaries. Secondly, it adaptively fuses these decoupled features with multi-scale logit-based confidence for comprehensive and robust OOD detection. Extensive experiments on large-scale benchmarks demonstrate GAFD-CC's competitive performance and strong generalization ability compared to those of state-of-the-art methods.

45. 【2511.02329】Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization

链接https://arxiv.org/abs/2511.02329

作者:Shaohan Li,Yunpeng Shi,Gilad Lerman

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Numerical Analysis (math.NA); Methodology (stat.ME)

关键词:camera location estimation, framework for estimating, location estimation, estimating camera poses, camera location

备注: NeurIPS 2025 spotlight paper

点击查看摘要

Abstract:We introduce Cycle-Sync, a robust and global framework for estimating camera poses (both rotations and locations). Our core innovation is a location solver that adapts message-passing least squares (MPLS) -- originally developed for group synchronization -- to camera location estimation. We modify MPLS to emphasize cycle-consistent information, redefine cycle consistencies using estimated distances from previous iterations, and incorporate a Welsch-type robust loss. We establish the strongest known deterministic exact-recovery guarantee for camera location estimation, showing that cycle consistency alone -- without access to inter-camera distances -- suffices to achieve the lowest sample complexity currently known. To further enhance robustness, we introduce a plug-and-play outlier rejection module inspired by robust subspace recovery, and we fully integrate cycle consistency into MPLS for rotation synchronization. Our global approach avoids the need for bundle adjustment. Experiments on synthetic and real datasets show that Cycle-Sync consistently outperforms leading pose estimators, including full structure-from-motion pipelines with bundle adjustment.

46. 【2511.02293】3D Point Cloud Object Detection on Edge Devices for Split Computing

链接https://arxiv.org/abs/2511.02293

作者:Taisuke Noguchi,Takuya Azumi

类目:Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous driving technology, rapidly advancing, key component, autonomous driving, driving technology

备注: 6 pages. This version includes minor lstlisting configuration adjustments for successful compilation. No changes to content or layout. Originally published at ACM/IEEE RAGE 2024

点击查看摘要

Abstract:The field of autonomous driving technology is rapidly advancing, with deep learning being a key component. Particularly in the field of sensing, 3D point cloud data collected by LiDAR is utilized to run deep neural network models for 3D object detection. However, these state-of-the-art models are complex, leading to longer processing times and increased power consumption on edge devices. The objective of this study is to address these issues by leveraging Split Computing, a distributed machine learning inference method. Split Computing aims to lessen the computational burden on edge devices, thereby reducing processing time and power consumption. Furthermore, it minimizes the risk of data breaches by only transmitting intermediate data from the deep neural network model. Experimental results show that splitting after voxelization reduces the inference time by 70.8% and the edge device execution time by 90.0%. When splitting within the network, the inference time is reduced by up to 57.1%, and the edge device execution time is reduced by up to 69.5%.

47. 【2511.02288】Link prediction Graph Neural Networks for structure recognition of Handwritten Mathematical Expressions

链接https://arxiv.org/abs/2511.02288

作者:Cuong Tuan Nguyen,Ngoc Tuan Nguyen,Triet Hoang Minh Dao,Huy Minh Nhat,Huy Truong Dinh

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Handwritten Mathematical Expression, Mathematical Expression, Handwritten Mathematical, Graph Neural Network, capture spatial dependencies

备注: accepted for ICDAR2025-WML

点击查看摘要

Abstract:We propose a Graph Neural Network (GNN)-based approach for Handwritten Mathematical Expression (HME) recognition by modeling HMEs as graphs, where nodes represent symbols and edges capture spatial dependencies. A deep BLSTM network is used for symbol segmentation, recognition, and spatial relation classification, forming an initial primitive graph. A 2D-CFG parser then generates all possible spatial relations, while the GNN-based link prediction model refines the structure by removing unnecessary connections, ultimately forming the Symbol Label Graph. Experimental results demonstrate the effectiveness of our approach, showing promising performance in HME structure recognition.

48. 【2511.02280】SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning

链接https://arxiv.org/abs/2511.02280

作者:Fangxun Shu,Yongjie Ye,Yue Liao,Zijian Kang,Weijie Yin,Jiacong Wang,Xiao Liang,Shuicheng Yan,Chao Feng

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:multimodal large language, large language models, reinforcement learning, large language, introduce SAIL-RL

备注

点击查看摘要

Abstract:We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at this https URL.

49. 【2511.02277】Are Euler angles a useful rotation parameterisation for pose estimation with Normalizing Flows?

链接https://arxiv.org/abs/2511.02277

作者:Giorgos Sfikas,Konstantina Nikolaidou,Foteini Papadopoulou,George Retsinas,Anastasios L. Kesidis

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Computer Vision, Object pose estimation, central importance, inherent object symmetries, Normalizing Flows model

备注: BMVC 2025 workshop proceedings (Smart Cameras for Smarter Autonomous Vehicles Robots)

点击查看摘要

Abstract:Object pose estimation is a task that is of central importance in 3D Computer Vision. Given a target image and a canonical pose, a single point estimate may very often be sufficient; however, a probabilistic pose output is related to a number of benefits when pose is not unambiguous due to sensor and projection constraints or inherent object symmetries. With this paper, we explore the usefulness of using the well-known Euler angles parameterisation as a basis for a Normalizing Flows model for pose estimation. Isomorphic to spatial rotation, 3D pose has been parameterized in a number of ways, either in or out of the context of parameter estimation. We explore the idea that Euler angles, despite their shortcomings, may lead to useful models in a number of aspects, compared to a model built on a more complex parameterisation.

50. 【2511.02271】Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework

链接https://arxiv.org/abs/2511.02271

作者:Yucheng Song,Yifan Ge,Junhao Li,Zhining Liao,Zhifang Liao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Medical Report Generation, automatically generates reports, Report Generation, modern medical diagnostics, reduce radiologists' burden

备注

点击查看摘要

Abstract:Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.

51. 【2511.02247】Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency

链接https://arxiv.org/abs/2511.02247

作者:Hao Li,Daiwei Lu,Jesse d'Almeida,Dilara Isik,Ehsan Khodapanah Aghdam,Nick DiSanto,Ayberk Acar,Susheela Sharma,Jie Ying Wu,Robert J. Webster III,Ipek Oguz

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous medical robots, guide autonomous medical, Monocular depth estimation, Monocular depth, depth

备注

点击查看摘要

Abstract:Monocular depth estimation (MDE) is a critical task to guide autonomous medical robots. However, obtaining absolute (metric) depth from an endoscopy camera in surgical scenes is difficult, which limits supervised learning of depth on real endoscopic images. Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames and train depth networks using these translated images with their corresponding depth maps. However a domain gap often remains between real and translated synthetic images. In this paper, we present a latent feature alignment method to improve absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Our methods are agnostic to the image translation process and focus on the depth estimation itself. Specifically, the depth network takes translated synthetic and real endoscopic frames as input and learns latent domain-invariant features via adversarial learning and directional feature consistency. The evaluation is conducted on endoscopic videos of central airway phantoms with manually aligned absolute depth maps. Compared to state-of-the-art MDE methods, our approach achieves superior performance on both absolute and relative depth metrics, and consistently improves results across various backbones and pretrained weights. Our code is available at this https URL.

52. 【2511.02228】Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimer's Disease Diagnosis

链接https://arxiv.org/abs/2511.02228

作者:Delin Ma,Menghui Zhou,Jun Qi,Yun Yang,Po Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:slowing disease progression, Alzheimer disease, disease progression, form of dementia, slowing disease

备注

点击查看摘要

Abstract:Alzheimer's disease (AD) is the most prevalent form of dementia, and its early diagnosis is essential for slowing disease progression. Recent studies on multimodal neuroimaging fusion using MRI and PET have achieved promising results by integrating multi-scale complementary features. However, most existing approaches primarily emphasize cross-modal complementarity while overlooking the diagnostic importance of modality-specific features. In addition, the inherent distributional differences between modalities often lead to biased and noisy representations, degrading classification performance. To address these challenges, we propose a Collaborative Attention and Consistent-Guided Fusion framework for MRI and PET based AD diagnosis. The proposed model introduces a learnable parameter representation (LPR) block to compensate for missing modality information, followed by a shared encoder and modality-independent encoders to preserve both shared and specific representations. Furthermore, a consistency-guided mechanism is employed to explicitly align the latent distributions across modalities. Experimental results on the ADNI dataset demonstrate that our method achieves superior diagnostic performance compared with existing fusion strategies.

53. 【2511.02215】Can Foundation Models Revolutionize Mobile AR Sparse Sensing?

链接https://arxiv.org/abs/2511.02215

作者:Yiqin Zhao,Tian Guo

类目:Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)

关键词:Sparse sensing, mobile sparse sensing, long faced, faced a fundamental, fundamental trade-off

备注

点击查看摘要

Abstract:Mobile sensing systems have long faced a fundamental trade-off between sensing quality and efficiency due to constraints in computation, power, and other limitations. Sparse sensing, which aims to acquire and process only a subset of sensor data, has been a key strategy for maintaining performance under such constraints. However, existing sparse sensing methods often suffer from reduced accuracy, as missing information across space and time introduces uncertainty into many sensing systems. In this work, we investigate whether foundation models can change the landscape of mobile sparse sensing. Using real-world mobile AR data, our evaluations demonstrate that foundation models offer significant improvements in geometry-aware image warping, a central technique for enabling accurate reuse of cross-frame information. Furthermore, our study demonstrates the scalability of foundation model-based sparse sensing and shows its leading performance in 3D scene reconstruction. Collectively, our study reveals critical aspects of the promises and the open challenges of integrating foundation models into mobile sparse sensing systems.

54. 【2511.02210】Estimation of Segmental Longitudinal Strain in Transesophageal Echocardiography by Deep Learning

链接https://arxiv.org/abs/2511.02210

作者:Anders Austlid Taskén,Thierry Judge,Erik Andreas Rye Berg,Jinyang Yu,Bjørnar Grenne,Frank Lindseth,Svend Aakhus,Pierre-Marc Jodoin,Nicolas Duchateau,Olivier Bernard,Gabriel Kiss

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

关键词:Segmental longitudinal strain, important prognostic indicator, Segmental longitudinal, left ventricle, regional LV dysfunction

备注: 13 pages, IEEE Journal of Biomedical and Health Informatics

点击查看摘要

Abstract:Segmental longitudinal strain (SLS) of the left ventricle (LV) is an important prognostic indicator for evaluating regional LV dysfunction, in particular for diagnosing and managing myocardial ischemia. Current techniques for strain estimation require significant manual intervention and expertise, limiting their efficiency and making them too resource-intensive for monitoring purposes. This study introduces the first automated pipeline, autoStrain, for SLS estimation in transesophageal echocardiography (TEE) using deep learning (DL) methods for motion estimation. We present a comparative analysis of two DL approaches: TeeFlow, based on the RAFT optical flow model for dense frame-to-frame predictions, and TeeTracker, based on the CoTracker point trajectory model for sparse long-sequence predictions. As ground truth motion data from real echocardiographic sequences are hardly accessible, we took advantage of a unique simulation pipeline (SIMUS) to generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with ground truth myocardial motion to train and evaluate both models. Our evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a mean distance error in motion estimation of 0.65 mm on a synTEE test dataset. Clinical validation on 16 patients further demonstrated that SLS estimation with our autoStrain pipeline aligned with clinical references, achieving a mean difference (95\% limits of agreement) of 1.09% (-8.90% to 11.09%). Incorporation of simulated ischemia in the synTEE data improved the accuracy of the models in quantifying abnormal deformation. Our findings indicate that integrating AI-driven motion estimation with TEE can significantly enhance the precision and efficiency of cardiac function assessment in clinical settings.

Comments:
13 pages, IEEE Journal of Biomedical and Health Informatics

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Cite as:
arXiv:2511.02210 [cs.CV]

(or
arXiv:2511.02210v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2511.02210

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Related DOI:

https://doi.org/10.1109/JBHI.2025.3605793

Focus to learn more

            DOI(s) linking to related resources</p>
55. 【2511.02207】Object-Centric 3D Gaussian Splatting for Strawberry Plant Reconstruction and Phenotyping

链接https://arxiv.org/abs/2511.02207

作者:Jiajia Li,Keyi Zhu,Qianwen Zhang,Dong Chen,Qi Sun,Zhaojian Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:United States, total fruit production, economically significant fruits, annual farm-gate sales, significant fruits

备注: 11 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Strawberries are among the most economically significant fruits in the United States, generating over $2 billion in annual farm-gate sales and accounting for approximately 13% of the total fruit production value. Plant phenotyping plays a vital role in selecting superior cultivars by characterizing plant traits such as morphology, canopy structure, and growth dynamics. However, traditional plant phenotyping methods are time-consuming, labor-intensive, and often destructive. Recently, neural rendering techniques, notably Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have emerged as powerful frameworks for high-fidelity 3D reconstruction. By capturing a sequence of multi-view images or videos around a target plant, these methods enable non-destructive reconstruction of complex plant architectures. Despite their promise, most current applications of 3DGS in agricultural domains reconstruct the entire scene, including background elements, which introduces noise, increases computational costs, and complicates downstream trait analysis. To address this limitation, we propose a novel object-centric 3D reconstruction framework incorporating a preprocessing pipeline that leverages the Segment Anything Model v2 (SAM-2) and alpha channel background masking to achieve clean strawberry plant reconstructions. This approach produces more accurate geometric representations while substantially reducing computational time. With a background-free reconstruction, our algorithm can automatically estimate important plant traits, such as plant height and canopy width, using DBSCAN clustering and Principal Component Analysis (PCA). Experimental results show that our method outperforms conventional pipelines in both accuracy and efficiency, offering a scalable and non-destructive solution for strawberry plant phenotyping.

56. 【2511.02206】Language-Enhanced Generative Modeling for PET Synthesis from MRI and Blood Biomarkers

链接https://arxiv.org/abs/2511.02206

作者:Zhengjie Zhang,Xiaoxie Mao,Qihao Guo,Shaoting Zhang,Qi Huang,Mu Zhou,Fang Xie,Mianxin Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:diagnosis heavily relies, positron emission tomography, amyloid-beta positron emission, synthetic PET, PET

备注: 31 pages, 8 figures

点击查看摘要

Abstract:Background: Alzheimer's disease (AD) diagnosis heavily relies on amyloid-beta positron emission tomography (Abeta-PET), which is limited by high cost and limited accessibility. This study explores whether Abeta-PET spatial patterns can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566 participants. A language-enhanced generative model, driven by a large language model (LLM) and multimodal information fusion, was developed to synthesize PET images. Synthesized images were evaluated for image quality, diagnostic consistency, and clinical applicability within a fully automated diagnostic pipeline. Findings: The synthetic PET images closely resemble real PET scans in both structural details (SSIM = 0.920 +/- 0.003) and regional patterns (Pearson's r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic PET, we developed a fully automatic AD diagnostic pipeline integrating PET synthesis and classification. The synthetic PET-based model (AUC = 0.78) outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while combining synthetic PET and BBMs further improved performance (AUC = 0.79). Ablation analysis supports the advantages of LLM integration and prompt engineering. Interpretation: Our language-enhanced generative model synthesizes realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial pattern assessment and improving the diagnostic workflow for Alzheimer's disease.

57. 【2511.02205】OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning

链接https://arxiv.org/abs/2511.02205

作者:Kevin Valencia,Thilina Balasooriya,Xihaier Luo,Shinjae Yoo,David Keetae Park

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:real-world experimental data, test time, cross-modally correlated, shrinking the usable, Multimodal spatiotemporal learning

备注: 25 pages, 12 figures, 8 tables

点击查看摘要

Abstract:Multimodal spatiotemporal learning on real-world experimental data is constrained by two challenges: within-modality measurements are sparse, irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of available modalities varies across space and time, shrinking the usable record unless models can adapt to arbitrary subsets at train and test time. We propose OmniField, a continuity-aware framework that learns a continuous neural field conditioned on available modalities and iteratively fuses cross-modal context. A multimodal crosstalk block architecture paired with iterative cross-modal refinement aligns signals prior to the decoder, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding or surrogate preprocessing. Extensive evaluations show that OmniField consistently outperforms eight strong multimodal spatiotemporal baselines. Under heavy simulated sensor noise, performance remains close to clean-input levels, highlighting robustness to corrupted measurements.

58. 【2511.02193】MM-UNet: Morph Mamba U-shaped Convolutional Networks for Retinal Vessel Segmentation

链接https://arxiv.org/abs/2511.02193

作者:Jiawen Liu,Yuanbo Zeng,Jiaming Liang,Yizhen Yang,Yiheng Zhang,Enhui Cai,Xiaoqi Sheng,Hongmin Cai

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:health status indicators, Accurate detection, retinal vessel segmentation, retinal vessels plays, ocular diseases

备注: This paper was accepted by IEEE BIBM 2025 conference

点击查看摘要

Abstract:Accurate detection of retinal vessels plays a critical role in reflecting a wide range of health status indicators in the clinical diagnosis of ocular diseases. Recently, advances in deep learning have led to a surge in retinal vessel segmentation methods, which have significantly contributed to the quantitative analysis of vascular morphology. However, retinal vasculature differs significantly from conventional segmentation targets in that it consists of extremely thin and branching structures, whose global morphology varies greatly across images. These characteristics continue to pose challenges to segmentation precision and robustness. To address these issues, we propose MM-UNet, a novel architecture tailored for efficient retinal vessel segmentation. The model incorporates Morph Mamba Convolution layers, which replace pointwise convolutions to enhance branching topological perception through morph, state-aware feature sampling. Additionally, Reverse Selective State Guidance modules integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency. Extensive experiments conducted on two public retinal vessel segmentation datasets demonstrate the superior performance of the proposed method in segmentation accuracy. Compared to the existing approaches, MM-UNet achieves F1-score gains of 1.64 $\%$ on DRIVE and 1.25 $\%$ on STARE, demonstrating its effectiveness and advancement. The project code is public via this https URL.

59. 【2511.02182】Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models

链接https://arxiv.org/abs/2511.02182

作者:Jinhwan Seo,Yoonki Cho,Junhyug Noh,Sung-eui Yoon

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Perception Test Challenge, Video Question Answering, Grounded Video Question, address Grounded Video, Question Answering

备注: 1st place winner of Grounded Videoqa track at the ICCV2025 Perception Test

点击查看摘要

Abstract:In this technical report, we introduce a framework to address Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning \ QA, (2) Spatio-temporal Grounding and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. To this end, we achieve the HOTA score of 0.4968, which marks a significant improvement over the previous year's winning score of 0.2704 on GVQA task.

60. 【2511.02180】Autobiasing Event Cameras for Flickering Mitigation

链接https://arxiv.org/abs/2511.02180

作者:Mehdi Sefidgar Dilmaghani,Waseem Shariff,Cian Ryan,Joe Lemley,Peter Corcoran

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Understanding and mitigating, mitigating flicker effects, flicker effects caused, event cameras, diverse environments

备注

点击查看摘要

Abstract:Understanding and mitigating flicker effects caused by rapid variations in light intensity is critical for enhancing the performance of event cameras in diverse environments. This paper introduces an innovative autonomous mechanism for tuning the biases of event cameras, effectively addressing flicker across a wide frequency range -25 Hz to 500 Hz. Unlike traditional methods that rely on additional hardware or software for flicker filtering, our approach leverages the event cameras inherent bias settings. Utilizing a simple Convolutional Neural Networks -CNNs, the system identifies instances of flicker in a spatial space and dynamically adjusts specific biases to minimize its impact. The efficacy of this autobiasing system was robustly tested using a face detector framework under both well-lit and low-light conditions, as well as across various frequencies. The results demonstrated significant improvements: enhanced YOLO confidence metrics for face detection, and an increased percentage of frames capturing detected faces. Moreover, the average gradient, which serves as an indicator of flicker presence through edge detection, decreased by 38.2 percent in well-lit conditions and by 53.6 percent in low-light conditions. These findings underscore the potential of our approach to significantly improve the functionality of event cameras in a range of adverse lighting scenarios.

61. 【2511.02144】Fast Measuring Pavement Crack Width by Cascading Principal Component Analysis

链接https://arxiv.org/abs/2511.02144

作者:Zhicheng Wang,Junbiao Pang

类目:Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)

关键词:guiding maintenance interventions, assessing structural integrity, Accurate quantification, Principal Component Analysis, crack width plays

备注

点击查看摘要

Abstract:Accurate quantification of pavement crack width plays a pivotal role in assessing structural integrity and guiding maintenance interventions. However, achieving precise crack width measurements presents significant challenges due to: (1) the complex, non-uniform morphology of crack boundaries, which limits the efficacy of conventional approaches, and (2) the demand for rapid measurement capabilities from arbitrary pixel locations to facilitate comprehensive pavement condition evaluation. To overcome these limitations, this study introduces a cascaded framework integrating Principal Component Analysis (PCA) and Robust PCA (RPCA) for efficient crack width extraction from digital images. The proposed methodology comprises three sequential stages: (1) initial crack segmentation using established detection algorithms to generate a binary representation, (2) determination of the primary orientation axis for quasi-parallel cracks through PCA, and (3) extraction of the Main Propagation Axis (MPA) for irregular crack geometries using RPCA. Comprehensive evaluations were conducted across three publicly available datasets, demonstrating that the proposed approach achieves superior performance in both computational efficiency and measurement accuracy compared to existing state-of-the-art techniques.

62. 【2511.02142】From Instance Segmentation to 3D Growth Trajectory Reconstruction in Planktonic Foraminifera

链接https://arxiv.org/abs/2511.02142

作者:Huahua Lin,Xiaohao Cai,Mark Nixon,James M. Mulqueeney,Thomas H. G. Ezard

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:marine protists characterized, intricate chambered shells, present environmental conditions, Planktonic foraminifera, marine protists

备注

点击查看摘要

Abstract:Planktonic foraminifera, marine protists characterized by their intricate chambered shells, serve as valuable indicators of past and present environmental conditions. Understanding their chamber growth trajectory provides crucial insights into organismal development and ecological adaptation under changing environments. However, automated tracing of chamber growth from imaging data remains largely unexplored, with existing approaches relying heavily on manual segmentation of each chamber, which is time-consuming and subjective. In this study, we propose an end-to-end pipeline that integrates instance segmentation, a computer vision technique not extensively explored in foraminifera, with a dedicated chamber ordering algorithm to automatically reconstruct three-dimensional growth trajectories from high-resolution computed tomography scans. We quantitatively and qualitatively evaluate multiple instance segmentation methods, each optimized for distinct spatial features of the chambers, and examine their downstream influence on growth-order reconstruction accuracy. Experimental results on expert-annotated datasets demonstrate that the proposed pipeline substantially reduces manual effort while maintaining biologically meaningful accuracy. Although segmentation models exhibit under-segmentation in smaller chambers due to reduced voxel fidelity and subtle inter-chamber connectivity, the chamber-ordering algorithm remains robust, achieving consistent reconstruction of developmental trajectories even under partial segmentation. This work provides the first fully automated and reproducible pipeline for digital foraminiferal growth analysis, establishing a foundation for large-scale, data-driven ecological studies.

63. 【2511.02097】A Step Toward World Models: A Survey on Robotic Manipulation

链接https://arxiv.org/abs/2511.02097

作者:Peng-Fei Zhang,Ying Cheng,Xiaofan Sun,Shijie Wang,Lei Zhu,Heng Tao Shen

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Autonomous agents, world models, operate in complex, uncertain environments, performing tasks

备注: 24 pages, 5 figures

点击查看摘要

Abstract:Autonomous agents are increasingly expected to operate in complex, dynamic, and uncertain environments, performing tasks such as manipulation, navigation, and decision-making. Achieving these capabilities requires agents to understand the underlying mechanisms and dynamics of the world, moving beyond purely reactive control or simple replication of observed states. This motivates the development of world models as internal representations that encode environmental states, capture dynamics, and enable prediction, planning, and reasoning. Despite growing interest, the definition, scope, architectures, and essential capabilities of world models remain ambiguous. In this survey, rather than directly imposing a fixed definition and limiting our scope to methods explicitly labeled as world models, we examine approaches that exhibit the core capabilities of world models through a review of methods in robotic manipulation. We analyze their roles across perception, prediction, and control, identify key challenges and solutions, and distill the core components, capabilities, and functions that a real world model should possess. Building on this analysis, we aim to outline a roadmap for developing generalizable and practical world models for robotics.

64. 【2511.02086】Markerless Augmented Reality Registration for Surgical Guidance: A Multi-Anatomy Clinical Accuracy Study

链接https://arxiv.org/abs/2511.02086

作者:Yue Yang,Fabian Necker,Christoph Leuze,Michelle Chen,Andrey Finegersh,Jake Lee,Vasu Divi,Bruce Daniel,Brian Hargreaves,Jie Ying Wu,Fred M Baik

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:markerless augmented reality, Articulated HAnd Tracking, real-life operative settings, augmented reality, head-mounted display

备注

点击查看摘要

Abstract:Purpose: In this paper, we develop and clinically evaluate a depth-only, markerless augmented reality (AR) registration pipeline on a head-mounted display, and assess accuracy across small or low-curvature anatomies in real-life operative settings. Methods: On HoloLens 2, we align Articulated HAnd Tracking (AHAT) depth to Computed Tomography (CT)-derived skin meshes via (i) depth-bias correction, (ii) brief human-in-the-loop initialization, (iii) global and local registration. We validated the surface-tracing error metric by comparing "skin-to-bone" relative distances to CT ground truth on leg and foot models, using an AR-tracked tool. We then performed seven intraoperative target trials (feet x2, ear x3, leg x2) during the initial stage of fibula free-flap harvest and mandibular reconstruction surgery, and collected 500+ data per trial. Results: Preclinical validation showed tight agreement between AR-traced and CT distances (leg: median |Delta d| 0.78 mm, RMSE 0.97 mm; feet: 0.80 mm, 1.20 mm). Clinically, per-point error had a median of 3.9 mm. Median errors by anatomy were 3.2 mm (feet), 4.3 mm (ear), and 5.3 mm (lower leg), with 5 mm coverage 92-95%, 84-90%, and 72-86%, respectively. Feet vs. lower leg differed significantly (Delta median ~1.1 mm; p 0.001). Conclusion: A depth-only, markerless AR pipeline on HMDs achieved ~3-4 mm median error across feet, ear, and lower leg in live surgical settings without fiducials, approaching typical clinical error thresholds for moderate-risk tasks. Human-guided initialization plus global-to-local registration enabled accurate alignment on small or low-curvature targets, improving the clinical readiness of markerless AR guidance.

65. 【2511.02046】xt-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis

链接https://arxiv.org/abs/2511.02046

作者:Soham Joshi,Shwet Kamal Mishra,Viswanath Gopalakrishnan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Visual Question Answering, Answering tasks pertaining, Question Answering tasks, involves skilful human, skilful human annotation

备注: First two authors contributed equally

点击查看摘要

Abstract:Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene-text from a given image. We propose a pipeline for automated synthesis for text-VQA dataset that can produce faithful QA pairs, and which scales up with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.

66. 【2511.02027】StrengthSense: A Dataset of IMU Signals Capturing Everyday Strength-Demanding Activities

链接https://arxiv.org/abs/2511.02027

作者:Zeyu Yang,Clayton Souza Leite,Yu Xiao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Tracking strength-demanding activities, monitoring muscular strength, Tracking strength-demanding, muscular strength, strength-demanding activities

备注

点击查看摘要

Abstract:Tracking strength-demanding activities with wearable sensors like IMUs is crucial for monitoring muscular strength, endurance, and power. However, there is a lack of comprehensive datasets capturing these activities. To fill this gap, we introduce \textit{StrengthSense}, an open dataset that encompasses IMU signals capturing 11 strength-demanding activities, such as sit-to-stand, climbing stairs, and mopping. For comparative purposes, the dataset also includes 2 non-strength demanding activities. The dataset was collected from 29 healthy subjects utilizing 10 IMUs placed on limbs and the torso, and was annotated using video recordings as references. This paper provides a comprehensive overview of the data collection, pre-processing, and technical validation. We conducted a comparative analysis between the joint angles estimated by IMUs and those directly extracted from video to verify the accuracy and reliability of the sensor data. Researchers and developers can utilize \textit{StrengthSense} to advance the development of human activity recognition algorithms, create fitness and health monitoring applications, and more.

67. 【2511.02014】owards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images

链接https://arxiv.org/abs/2511.02014

作者:Tuan Truong,Guillermo Jimenez Perez,Pedro Osorio,Matthias Lenga

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Protected Health Information, Health Information, Protected Health, safeguarding patient privacy, Optical Character Recognition

备注: Submitted to ISBI 2026

点击查看摘要

Abstract:The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Model (LMM) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed and open-sourced LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMM exhibits superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are well readable with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.

68. 【2511.01998】Locally-Supervised Global Image Restoration

链接https://arxiv.org/abs/2511.01998

作者:Benjamin Walder,Daniel Toader,Robert Nuster,Günther Paltauf,Peter Burgholzer,Gregor Langer,Lukas Krainer,Markus Haltmeier

类目:Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)

关键词:learning-based framework, address the problem, ground truth data, incomplete measurements, truth data

备注

点击查看摘要

Abstract:We address the problem of image reconstruction from incomplete measurements, encompassing both upsampling and inpainting, within a learning-based framework. Conventional supervised approaches require fully sampled ground truth data, while self-supervised methods allow incomplete ground truth but typically rely on random sampling that, in expectation, covers the entire image. In contrast, we consider fixed, deterministic sampling patterns with inherently incomplete coverage, even in expectation. To overcome this limitation, we exploit multiple invariances of the underlying image distribution, which theoretically allows us to achieve the same reconstruction performance as fully supervised approaches. We validate our method on optical-resolution image upsampling in photoacoustic microscopy (PAM), demonstrating competitive or superior results while requiring substantially less ground truth data.

69. 【2511.01990】Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1, Sentinel-2, and Planetscope for end-users

链接https://arxiv.org/abs/2511.01990

作者:Saurabh Kaushik,Lalit Maurya,Elizabeth Tellman,ZhiJie Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:improving flood inundation, enable fast, satellite imagery, fast and reliable, reliable extraction

备注

点击查看摘要

Abstract:Geo-Foundational Models (GFMs) enable fast and reliable extraction of spatiotemporal information from satellite imagery, improving flood inundation mapping by leveraging location and time embeddings. Despite their potential, it remains unclear whether GFMs outperform traditional models like U-Net. A systematic comparison across sensors and data availability scenarios is still lacking, which is an essential step to guide end-users in model selection. To address this, we evaluate three GFMs, Prithvi 2.0, Clay V1.5, DOFA, and UViT (a Prithvi variant), against TransNorm, U-Net, and Attention U-Net using PlanetScope, Sentinel-1, and Sentinel-2. We observe competitive performance among all GFMs, with only 2-5% variation between the best and worst models across sensors. Clay outperforms others on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). In leave-one-region-out cross-validation across five regions, Clay shows slightly better performance across all sensors (mIoU: 0.72(0.04), 0.66(0.07), 0.51(0.08)) compared to Prithvi (0.70(0.05), 0.64(0.09), 0.49(0.13)) and DOFA (0.67(0.07), 0.64(0.04), 0.49(0.09)) for PlanetScope, Sentinel-2, and Sentinel-1, respectively. Across all 19 sites, leave-one-region-out cross-validation reveals a 4% improvement by Clay compared to U-Net. Visual inspection highlights Clay's superior ability to retain fine details. Few-shot experiments show Clay achieves 0.64 mIoU on PlanetScope with just five training images, outperforming Prithvi (0.24) and DOFA (0.35). In terms of computational time, Clay is a better choice due to its smaller model size (26M parameters), making it ~3x faster than Prithvi (650M) and 2x faster than DOFA (410M). Contrary to previous findings, our results suggest GFMs offer small to moderate improvements in flood mapping accuracy at lower computational cost and labeling effort compared to traditional U-Net.

70. 【2511.01932】Deciphering Personalization: Towards Fine-Grained Explainability in Natural Language for Personalized Image Generation Models

链接https://arxiv.org/abs/2511.01932

作者:Haoming Wang,Wei Gao

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:individual users' heterogeneous, Image generation models, meet the individual, individual users', users' heterogeneous

备注

点击查看摘要

Abstract:Image generation models are usually personalized in practical uses in order to better meet the individual users' heterogeneous needs, but most personalized models lack explainability about how they are being personalized. Such explainability can be provided via visual features in generated images, but is difficult for human users to understand. Explainability in natural language is a better choice, but the existing approaches to explainability in natural language are limited to be coarse-grained. They are unable to precisely identify the multiple aspects of personalization, as well as the varying levels of personalization in each aspect. To address such limitation, in this paper we present a new technique, namely \textbf{FineXL}, towards \textbf{Fine}-grained e\textbf{X}plainability in natural \textbf{L}anguage for personalized image generation models. FineXL can provide natural language descriptions about each distinct aspect of personalization, along with quantitative scores indicating the level of each aspect of personalization. Experiment results show that FineXL can improve the accuracy of explainability by 56\%, when different personalization scenarios are applied to multiple types of image generation models.

71. 【2511.01915】Challenging DINOv3 Foundation Model under Low Inter-Class Variability: A Case Study on Fetal Brain Ultrasound

链接https://arxiv.org/abs/2511.01915

作者:Edoardo Conti,Riccardo Rosati,Lorenzo Federici,Adriano Mancini,Maria Chiara Fiorentin

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:foundation models, inter-class variability conditions, vision foundation models, fetal ultrasound, Purpose

备注

点击查看摘要

Abstract:Purpose: This study provides the first comprehensive evaluation of foundation models in fetal ultrasound (US) imaging under low inter-class variability conditions. While recent vision foundation models such as DINOv3 have shown remarkable transferability across medical domains, their ability to discriminate anatomically similar structures has not been systematically investigated. We address this gap by focusing on fetal brain standard planes--transthalamic (TT), transventricular (TV), and transcerebellar (TC)--which exhibit highly overlapping anatomical features and pose a critical challenge for reliable biometric assessment. Methods: To ensure a fair and reproducible evaluation, all publicly available fetal ultrasound datasets were curated and aggregated into a unified multicenter benchmark, FetalUS-188K, comprising more than 188,000 annotated images from heterogeneous acquisition settings. DINOv3 was pretrained in a self-supervised manner to learn ultrasound-aware representations. The learned features were then evaluated through standardized adaptation protocols, including linear probing with frozen backbone and full fine-tuning, under two initialization schemes: (i) pretraining on FetalUS-188K and (ii) initialization from natural-image DINOv3 weights. Results: Models pretrained on fetal ultrasound data consistently outperformed those initialized on natural images, with weighted F1-score improvements of up to 20 percent. Domain-adaptive pretraining enabled the network to preserve subtle echogenic and structural cues crucial for distinguishing intermediate planes such as TV. Conclusion: Results demonstrate that generic foundation models fail to generalize under low inter-class variability, whereas domain-specific pretraining is essential to achieve robust and clinically reliable representations in fetal brain ultrasound imaging.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Cite as:
arXiv:2511.01915 [cs.CV]

(or
arXiv:2511.01915v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2511.01915

Focus to learn more

              arXiv-issued DOI via DataCite

Submission history From: Maria Chiara Fiorentino [view email] [v1]
Sat, 1 Nov 2025 13:37:22 UTC (2,827 KB)

72. 【2511.01914】FlyBot-VLA Technical Report

链接https://arxiv.org/abs/2511.01914

作者:Yuan Zhang,Chenyu Xue,Wenjie Xu,Chao Ji,Jiajia wu,Jia Pan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词:latent action model, action model, VLM, action, action representation framework

备注

点击查看摘要

Abstract:We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our frame-work, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community

73. 【2511.02717】An unscented Kalman filter method for real time input-parameter-state estimation

链接https://arxiv.org/abs/2511.02717

作者:Marios Impraimakis,Andrew W. Smyth

类目:ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS); Systems and Control (eess.SY)

关键词:unscented Kalman filter, unscented Kalman, Kalman filter, filter is examined, linear and nonlinear

备注: author-accepted manuscript (AAM) published in Mechanical Systems and Signal Processing

点击查看摘要

Abstract:The input-parameter-state estimation capabilities of a novel unscented Kalman filter is examined herein on both linear and nonlinear systems. The unknown input is estimated in two stages within each time step. Firstly, the predicted dynamic states and the system parameters provide an estimation of the input. Secondly, the corrected with measurements states and parameters provide a final estimation. Importantly, it is demonstrated using the perturbation analysis that, a system with at least a zero or a non-zero known input can potentially be uniquely identified. This output-only methodology allows for a better understanding of the system compared to classical output-only parameter identification strategies, given that all the dynamic states, the parameters, and the input are estimated jointly and in real-time.

74. 【2511.02576】Resource-efficient Automatic Refinement of Segmentations via Weak Supervision from Light Feedback

链接https://arxiv.org/abs/2511.02576

作者:Alix de Langlais,Benjamin Billot,Théo Aguilar Vidal,Marc-Olivier Gauci,Hervé Delingette

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Delineating anatomical regions, Delineating anatomical, medical image analysis, anatomical regions, key task

备注

点击查看摘要

Abstract:Delineating anatomical regions is a key task in medical image analysis. Manual segmentation achieves high accuracy but is labor-intensive and prone to variability, thus prompting the development of automated approaches. Recently, a breadth of foundation models has enabled automated segmentations across diverse anatomies and imaging modalities, but these may not always meet the clinical accuracy standards. While segmentation refinement strategies can improve performance, current methods depend on heavy user interactions or require fully supervised segmentations for training. Here, we present SCORE (Segmentation COrrection from Regional Evaluations), a weakly supervised framework that learns to refine mask predictions only using light feedback during training. Specifically, instead of relying on dense training image annotations, SCORE introduces a novel loss that leverages region-wise quality scores and over/under-segmentation error labels. We demonstrate SCORE on humerus CT scans, where it considerably improves initial predictions from TotalSegmentator, and achieves performance on par with existing refinement methods, while greatly reducing their supervision requirements and annotation time. Our code is available at: this https URL.

75. 【2511.02426】A Kullback-Leibler divergence method for input-system-state identification

链接https://arxiv.org/abs/2511.02426

作者:Marios Impraimakis

类目:ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Systems and Control (eess.SY)

关键词:Kalman filter framework, Kalman filter, initial parameter set, Kullback-Leibler divergence, plausible results

备注: 32 pages, 17 figures, published in Journal of Sound and Vibration

点击查看摘要

Abstract:The capability of a novel Kullback-Leibler divergence method is examined herein within the Kalman filter framework to select the input-parameter-state estimation execution with the most plausible results. This identification suffers from the uncertainty related to obtaining different results from different initial parameter set guesses, and the examined approach uses the information gained from the data in going from the prior to the posterior distribution to address the issue. Firstly, the Kalman filter is performed for a number of different initial parameter sets providing the system input-parameter-state estimation. Secondly, the resulting posterior distributions are compared simultaneously to the initial prior distributions using the Kullback-Leibler divergence. Finally, the identification with the least Kullback-Leibler divergence is selected as the one with the most plausible results. Importantly, the method is shown to select the better performed identification in linear, nonlinear, and limited information applications, providing a powerful tool for system monitoring.

76. 【2511.02400】MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization

链接https://arxiv.org/abs/2511.02400

作者:Yalda Zafari,Hongyi Pan,Gorkem Durak,Ulas Bagci,Essam A. Rashed,Mohamed Mabrok

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:reliable artificial intelligence, clinically reliable artificial, artificial intelligence, clinically reliable, reliable artificial

备注

点击查看摘要

Abstract:The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise the generalizability of the model, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection, image processing (including laterality and intensity correction), and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available from: this https URL.

77. 【2511.02212】High-Resolution Magnetic Particle Imaging System Matrix Recovery Using a Vision Transformer with Residual Feature Network

链接https://arxiv.org/abs/2511.02212

作者:Abuobaida M.Khair,Wenjing Jiang,Yousuf Babiker M. Osman,Wenjun Xia,Xiaopeng Ma

类目:Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Magnetic Particle Imaging, Residual Feature Network, Feature Network, Particle Imaging, Vision Transformer

备注

点击查看摘要

Abstract:This study presents a hybrid deep learning framework, the Vision Transformer with Residual Feature Network (VRF-Net), for recovering high-resolution system matrices in Magnetic Particle Imaging (MPI). MPI resolution often suffers from downsampling and coil sensitivity variations. VRF-Net addresses these challenges by combining transformer-based global attention with residual convolutional refinement, enabling recovery of both large-scale structures and fine details. To reflect realistic MPI conditions, the system matrix is degraded using a dual-stage downsampling strategy. Training employed paired-image super-resolution on the public Open MPI dataset and a simulated dataset incorporating variable coil sensitivity profiles. For system matrix recovery on the Open MPI dataset, VRF-Net achieved nRMSE = 0.403, pSNR = 39.08 dB, and SSIM = 0.835 at 2x scaling, and maintained strong performance even at challenging scale 8x (pSNR = 31.06 dB, SSIM = 0.717). For the simulated dataset, VRF-Net achieved nRMSE = 4.44, pSNR = 28.52 dB, and SSIM = 0.771 at 2x scaling, with stable performance at higher scales. On average, it reduced nRMSE by 88.2%, increased pSNR by 44.7%, and improved SSIM by 34.3% over interpolation and CNN-based methods. In image reconstruction of Open MPI phantoms, VRF-Net further reduced reconstruction error to nRMSE = 1.79 at 2x scaling, while preserving structural fidelity (pSNR = 41.58 dB, SSIM = 0.960), outperforming existing methods. These findings demonstrate that VRF-Net enables sharper, artifact-free system matrix recovery and robust image reconstruction across multiple scales, offering a promising direction for future in vivo applications.

78. 【2511.02065】Opto-Electronic Convolutional Neural Network Design Via Direct Kernel Optimization

链接https://arxiv.org/abs/2511.02065

作者:Ali Almuallem,Harshana Weligampola,Abhiram Gnanasambandam,Wei Xu,Dilshan Godaliyadda,Hamid R. Sheikh,Stanley H. Chan,Qi Guo

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:networks integrate optical, integrate optical front-ends, neural networks integrate, energy-efficient vision, back-ends to enable

备注

点击查看摘要

Abstract:Opto-electronic neural networks integrate optical front-ends with electronic back-ends to enable fast and energy-efficient vision. However, conventional end-to-end optimization of both the optical and electronic modules is limited by costly simulations and large parameter spaces. We introduce a two-stage strategy for designing opto-electronic convolutional neural networks (CNNs): first, train a standard electronic CNN, then realize the optical front-end implemented as a metasurface array through direct kernel optimization of its first convolutional layer. This approach reduces computational and memory demands by hundreds of times and improves training stability compared to end-to-end optimization. On monocular depth estimation, the proposed two-stage design achieves twice the accuracy of end-to-end training under the same training time and resource constraints.